# InternBootcamp Evaluation Guide
To quickly evaluate a model's performance across different bootcamp environments, use the `run_eval.py` script. It supports multiple configuration options and is flexible enough to accommodate a variety of testing needs.
---
#### Example Execution Command
Below is a complete example command demonstrating how to run the evaluation script:
```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
--url http://127.0.0.1:8000/v1 \
--api_key EMPTY \
--model_name r1_32B \
--test_dir /path/to/test_dir \
--max_concurrent_requests 128 \
--template r1 \
--max_tokens 32768 \
--temperature 0 \
--timeout 6000 \
--api_mode completion \
--max_retries 16 \
--max_retrying_delay 60
```
---
#### Parameter Description
Here are the main parameters supported by the script and their meanings:
| Parameter Name | Type | Example Value | Description |
|----------------------------|--------|----------------------------------|-----------------------------------------------------------------------------|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL for the OpenAI API. |
| `--api_key` | str | `EMPTY` | API key required to access the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | The name of the model used, e.g., `r1_32B` or other custom model names. |
| `--test_dir` | str | `/path/to/test_dir` | Path to the directory containing test data (should include JSONL files). |
| `--max_concurrent_requests`| int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g., `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens generated by the model. |
| `--temperature` | float | `0` | Controls randomness in text generation; lower values yield more deterministic results. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; options are `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt content; only effective when `api_mode` is `chat_completion`. |
| `--max_retries`            | int    | `16`                             | Maximum number of retries per failed request.                                |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries in seconds. |
##### Parameter Relationships
- `--sys_prompt` is only effective if `--api_mode` is set to `chat_completion`.
- `--template` is only effective if `--api_mode` is set to `completion`.
- Valid values for `--template` include: `r1`, `qwen`, `internthinker`, `chatml` (from predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the default system prompt from the template will be used (if any).
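The relationships above can be sketched as a small resolution function. This is an illustrative sketch only, not the script's actual code: the `TEMPLATE_MAP` entries and default prompts below are invented placeholders, and only the `r1`/`chatml` keys and the mode names come from the source.

```python
# Hypothetical sketch of how --api_mode, --template, and --sys_prompt interact.
# TEMPLATE_MAP contents here are placeholders, not the real templates.
TEMPLATE_MAP = {
    "r1": {"sys_prompt": "You are a helpful assistant."},
    "chatml": {"sys_prompt": ""},
}

def resolve_sys_prompt(api_mode, template=None, sys_prompt=None):
    """Return the effective system prompt for a run."""
    if api_mode == "chat_completion":
        # --sys_prompt only takes effect in chat_completion mode.
        return sys_prompt or ""
    # In completion mode, fall back to the template's default prompt (if any).
    tpl = TEMPLATE_MAP.get(template, {})
    return tpl.get("sys_prompt", "")
```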
---
#### Output Results
Evaluation results will be saved under the directory:
`examples/unittests/output/{model_name}_{test_dir}_{timestamp}`
The output includes:
1. **Detailed Results**:
   - Results for each input JSONL file are saved in `output/details/`, named after the original file.
- Each record contains the following fields:
- `id`: Sample ID.
- `prompt`: Input prompt.
- `output`: Model-generated output.
- `output_len`: Length of the output in tokens.
- `ground_truth`: Ground truth answer.
     - `score`: Score computed by the bootcamp's `verify_score` method.
     - `extracted_output`: Answer extracted from the output by the `extract_output` method.
2. **Metadata**:
- Metadata including average score and average output length per bootcamp is saved in `output/meta.jsonl`.
3. **Summary Report**:
- A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
- Average score and output length per bootcamp.
- Overall average score and output length across all bootcamps.
4. **Progress Log**:
- Progress information is logged in `output/progress.log`, showing real-time progress and estimated remaining time for each dataset.
5. **Parameter Configuration**:
- The full configuration used in the current run is saved in `output/eval_args.json` for experiment reproducibility.
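As an example of working with the output, the per-file averages in `meta.jsonl` can be recomputed from any details file using the fields listed above. This is a minimal sketch under the assumption that each line of a details file is a JSON object with at least `score` and `output_len`; it is not the script's own aggregation code.

```python
import json

def summarize(jsonl_lines):
    """Compute the average score and output length over one details file.

    `jsonl_lines` is an iterable of JSON strings, one record per line,
    each assumed to carry `score` and `output_len` fields.
    """
    records = [json.loads(line) for line in jsonl_lines]
    n = len(records)
    return {
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_output_len": sum(r["output_len"] for r in records) / n,
    }
```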
---
#### Notes
1. **Concurrency Settings**:
- Adjust `--max_concurrent_requests` based on machine capabilities and the size of the test set to avoid resource exhaustion due to excessive concurrency.
2. **URL Health Check**:
- Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
- If the service is not ready, it will wait up to 60 minutes (default), retrying every 60 seconds.
3. **Error Handling Mechanism**:
- Each request can be retried up to `--max_retries` times using exponential backoff (up to `--max_retrying_delay` seconds).
- If all retries fail, the script raises an exception and terminates processing of the current sample.
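The health check in note 2 amounts to polling the service until the requested model name appears, up to a deadline. The sketch below is illustrative only: `list_models` stands in for whatever call the script makes against the API's model listing, and the function name is hypothetical.

```python
import time

def wait_for_model(list_models, model_name, max_wait_s=3600, interval_s=60):
    """Poll until `model_name` is registered with the service.

    `list_models` is a stand-in for a call that returns the names of
    currently registered models. Returns True once the model appears,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if model_name in list_models():
            return True
        time.sleep(interval_s)
    return False
```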
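The retry behavior in note 3 can be sketched as a generic helper: retry up to `max_retries` times, doubling the delay each attempt but capping it at `max_retrying_delay` seconds, and re-raise once the budget is exhausted. Names mirror the CLI flags; the helper itself is an assumption, not the script's actual implementation.

```python
import time

def with_retries(fn, max_retries=16, max_retrying_delay=60):
    """Call `fn` with exponential backoff, re-raising after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # all retries exhausted; caller handles the failure
            # Double the delay each attempt, capped at max_retrying_delay.
            time.sleep(min(2 ** attempt, max_retrying_delay))
```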
---
#### Example Output Directory Structure
After execution, the output directory structure looks like this:
```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```
---