# InternBootcamp Evaluation Guide
To quickly evaluate a model's performance across different bootcamp environments, use the `run_eval.py` script. It supports multiple configuration options and is flexible enough to accommodate a variety of testing needs.
---
#### Example Execution Command
Below is a complete example command demonstrating how to run the evaluation script:
```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
--url http://127.0.0.1:8000/v1 \
--api_key EMPTY \
--model_name r1_32B \
--test_dir /path/to/test_dir \
--max_concurrent_requests 128 \
--template r1 \
--max_tokens 32768 \
--temperature 0 \
--timeout 6000 \
--api_mode completion \
--max_retries 16 \
--max_retrying_delay 60
```
---
#### Parameter Description
Here are the main parameters supported by the script and their meanings:
| Parameter Name | Type | Example Value | Description |
|----------------------------|--------|----------------------------------|-----------------------------------------------------------------------------|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL for the OpenAI API. |
| `--api_key` | str | `EMPTY` | API key required to access the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | The name of the model used, e.g., `r1_32B` or other custom model names. |
| `--test_dir` | str | `/path/to/test_dir` | Path to the directory containing test data (should include JSONL files). |
| `--max_concurrent_requests`| int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g., `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens generated by the model. |
| `--temperature` | float | `0` | Controls randomness in text generation; lower values yield more deterministic results. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; options are `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt content; only effective when `api_mode` is `chat_completion`. |
| `--max_retries`            | int    | `16`                             | Maximum number of retries per failed request.                                |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries in seconds. |
##### Parameter Relationships
- `--sys_prompt` is only effective if `--api_mode` is set to `chat_completion`.
- `--template` is only effective if `--api_mode` is set to `completion`.
- Valid values for `--template` include: `r1`, `qwen`, `internthinker`, `chatml` (from predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the default system prompt from the template will be used (if any).
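The relationships above can be sketched as a small resolution function. This is an illustrative sketch only, not the script's actual code: the `TEMPLATE_MAP` entries and default prompts below are invented placeholders, and only the `r1`/`chatml` keys and the mode names come from the source.

```python
# Hypothetical sketch of how --api_mode, --template, and --sys_prompt interact.
# TEMPLATE_MAP contents here are placeholders, not the real templates.
TEMPLATE_MAP = {
    "r1": {"sys_prompt": "You are a helpful assistant."},
    "chatml": {"sys_prompt": ""},
}

def resolve_sys_prompt(api_mode, template=None, sys_prompt=None):
    """Return the effective system prompt for a run."""
    if api_mode == "chat_completion":
        # --sys_prompt only takes effect in chat_completion mode.
        return sys_prompt or ""
    # In completion mode, fall back to the template's default prompt (if any).
    tpl = TEMPLATE_MAP.get(template, {})
    return tpl.get("sys_prompt", "")
```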
---
#### Output Results
Evaluation results will be saved under the directory:
`examples/unittests/output/{model_name}_{test_dir}_{timestamp}`
The output includes:
1. **Detailed Results**:
   - Results for each input JSONL file are saved in `output/details/`, named after the original file.
- Each record contains the following fields:
- `id`: Sample ID.
- `prompt`: Input prompt.
- `output`: Model-generated output.
- `output_len`: Length of the output in tokens.
- `ground_truth`: Ground truth answer.
     - `score`: Score computed by the bootcamp's `verify_score` method.
     - `extracted_output`: Answer extracted from the output by the `extract_output` method.
2. **Metadata**:
- Metadata including average score and average output length per bootcamp is saved in `output/meta.jsonl`.
3. **Summary Report**:
- A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
- Average score and output length per bootcamp.
- Overall average score and output length across all bootcamps.
4. **Progress Log**:
- Progress information is logged in `output/progress.log`, showing real-time progress and estimated remaining time for each dataset.
5. **Parameter Configuration**:
- The full configuration used in the current run is saved in `output/eval_args.json` for experiment reproducibility.
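As an example of working with the output, the per-file averages in `meta.jsonl` can be recomputed from any details file using the fields listed above. This is a minimal sketch under the assumption that each line of a details file is a JSON object with at least `score` and `output_len`; it is not the script's own aggregation code.

```python
import json

def summarize(jsonl_lines):
    """Compute the average score and output length over one details file.

    `jsonl_lines` is an iterable of JSON strings, one record per line,
    each assumed to carry `score` and `output_len` fields.
    """
    records = [json.loads(line) for line in jsonl_lines]
    n = len(records)
    return {
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_output_len": sum(r["output_len"] for r in records) / n,
    }
```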
---
#### Notes
1. **Concurrency Settings**:
- Adjust `--max_concurrent_requests` based on machine capabilities and the size of the test set to avoid resource exhaustion due to excessive concurrency.
2. **URL Health Check**:
- Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
- If the service is not ready, it will wait up to 60 minutes (default), retrying every 60 seconds.
3. **Error Handling Mechanism**:
- Each request can be retried up to `--max_retries` times using exponential backoff (up to `--max_retrying_delay` seconds).
- If all retries fail, the script raises an exception and terminates processing of the current sample.
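The health check in note 2 amounts to polling the service until the requested model name appears, up to a deadline. The sketch below is illustrative only: `list_models` stands in for whatever call the script makes against the API's model listing, and the function name is hypothetical.

```python
import time

def wait_for_model(list_models, model_name, max_wait_s=3600, interval_s=60):
    """Poll until `model_name` is registered with the service.

    `list_models` is a stand-in for a call that returns the names of
    currently registered models. Returns True once the model appears,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if model_name in list_models():
            return True
        time.sleep(interval_s)
    return False
```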
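The retry behavior in note 3 can be sketched as a generic helper: retry up to `max_retries` times, doubling the delay each attempt but capping it at `max_retrying_delay` seconds, and re-raise once the budget is exhausted. Names mirror the CLI flags; the helper itself is an assumption, not the script's actual implementation.

```python
import time

def with_retries(fn, max_retries=16, max_retrying_delay=60):
    """Call `fn` with exponential backoff, re-raising after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # all retries exhausted; caller handles the failure
            # Double the delay each attempt, capped at max_retrying_delay.
            time.sleep(min(2 ** attempt, max_retrying_delay))
```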
---
#### Example Output Directory Structure
After execution, the output directory structure looks like this:
```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```
---