# InternBootcamp Evaluation Guide
To quickly evaluate the performance of a model in different bootcamp environments, you can use the `run_eval.py` script. The script supports multiple configuration options and is flexible enough to accommodate various testing needs.

---

#### Example Execution Command

Below is a complete example command demonstrating how to run the evaluation script:
```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
    --url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --model_name r1_32B \
    --test_dir /path/to/test_dir \
    --max_concurrent_requests 128 \
    --template r1 \
    --max_tokens 32768 \
    --temperature 0 \
    --timeout 6000 \
    --api_mode completion \
    --max_retries 16 \
    --max_retrying_delay 60
```
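The `--max_concurrent_requests` cap can be pictured as a global semaphore gating in-flight requests. The sketch below is illustrative only (the function names are hypothetical, not the script's actual internals), with a short sleep standing in for the HTTP call:

```python
import asyncio

async def bounded_request(sem: asyncio.Semaphore, sample_id: int) -> str:
    """Acquire the global semaphore before issuing a (simulated) request."""
    async with sem:
        await asyncio.sleep(0.01)  # stands in for the actual API call
        return f"result-{sample_id}"

async def run_all(num_samples: int, max_concurrent_requests: int) -> list[str]:
    # One semaphore shared by all tasks enforces the global concurrency limit.
    sem = asyncio.Semaphore(max_concurrent_requests)
    tasks = [bounded_request(sem, i) for i in range(num_samples)]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(num_samples=8, max_concurrent_requests=2))
print(len(results))  # 8
```

At most two simulated requests run at a time here; scale the limit to your machine and test-set size.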
---

#### Parameter Description

Here are the main parameters supported by the script and their meanings:
| Parameter Name | Type | Example Value | Description |
|-----------------------------|-------|-----------------------------------|-------------|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL for the OpenAI API. |
| `--api_key` | str | `EMPTY` | API key required to access the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | Name of the model used, e.g., `r1_32B` or another custom model name. |
| `--test_dir` | str | `/path/to/test_dir` | Path to the directory containing test data (should include JSONL files). |
| `--max_concurrent_requests` | int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g., `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens generated by the model. |
| `--temperature` | float | `0` | Controls randomness in text generation; lower values yield more deterministic results. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; options are `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt content; only effective when `api_mode` is `chat_completion`. |
| `--max_retries` | int | `16` | Number of retries per failed request. |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries in seconds. |

##### Parameter Relationships
- `--sys_prompt` is only effective if `--api_mode` is set to `chat_completion`.
- `--template` is only effective if `--api_mode` is set to `completion`.
- Valid values for `--template` include: `r1`, `qwen`, `internthinker`, `chatml` (from the predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the default system prompt from the template will be used (if any).
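In completion mode, the selected template wraps the raw prompt in the model's chat markup before the request is sent. The actual strings live in the script's `TEMPLATE_MAP`; the sketch below only illustrates the idea with a hypothetical ChatML-style entry:

```python
# Hypothetical illustration of a completion-mode template; the real
# TEMPLATE_MAP in run_eval.py defines the actual template strings.
TEMPLATE_MAP = {
    "chatml": (
        "<|im_start|>system\n{system}<|im_end|>\n"
        "<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
}

def apply_template(name: str, prompt: str,
                   system: str = "You are a helpful assistant.") -> str:
    """Fill the named template with a system prompt and the user prompt."""
    return TEMPLATE_MAP[name].format(system=system, prompt=prompt)

text = apply_template("chatml", "What is 2 + 2?")
print(text)
```

The formatted string ends with the assistant turn opener, so the completion endpoint continues generation from the assistant's position.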
---
---

#### Output Results

Evaluation results will be saved under the directory:

`examples/unittests/output/{model_name}_{test_dir}_{timestamp}`

The output includes:
1. **Detailed Results**:
   - Each JSONL file's result is saved in `output/details/`, named after the original JSONL file.
   - Each record contains the following fields:
     - `id`: Sample ID.
     - `prompt`: Input prompt.
     - `output`: Model-generated output.
     - `output_len`: Length of the output in tokens.
     - `ground_truth`: Ground-truth answer.
     - `score`: Score calculated by the `verify_score` method.
     - `extracted_output`: Output extracted via the `extract_output` method.
2. **Metadata**:
   - Metadata, including the average score and average output length per bootcamp, is saved in `output/meta.jsonl`.

3. **Summary Report**:
   - A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
     - Average score and output length per bootcamp.
     - Overall average score and output length across all bootcamps.

4. **Progress Log**:
   - Progress information is logged in `output/progress.log`, showing real-time progress and the estimated remaining time for each dataset.

5. **Parameter Configuration**:
   - The full configuration used in the current run is saved in `output/eval_args.json` for experiment reproducibility.
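The per-bootcamp averages reported in `meta.jsonl` can be reproduced from the detail files. A minimal aggregation sketch, using only the record fields listed above (the aggregation code itself is illustrative, not the script's implementation):

```python
import json
from pathlib import Path

def summarize(details_file: Path) -> dict:
    """Average score and output length over one details JSONL file."""
    records = [json.loads(line)
               for line in details_file.read_text().splitlines() if line]
    n = len(records)
    return {
        "file": details_file.name,
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_output_len": sum(r["output_len"] for r in records) / n,
    }

# Demo with a small synthetic details file:
tmp = Path("demo_details.jsonl")
tmp.write_text(
    '{"id": 0, "score": 1.0, "output_len": 100}\n'
    '{"id": 1, "score": 0.0, "output_len": 300}\n'
)
summary = summarize(tmp)
print(summary)  # avg_score 0.5, avg_output_len 200.0
tmp.unlink()
```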
---

#### Notes

1. **Concurrency Settings**:
   - Adjust `--max_concurrent_requests` based on machine capabilities and the size of the test set to avoid resource exhaustion from excessive concurrency.

2. **URL Health Check**:
   - Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
   - If the service is not ready, it waits up to 60 minutes by default, retrying every 60 seconds.

3. **Error Handling Mechanism**:
   - Each request can be retried up to `--max_retries` times using exponential backoff (capped at `--max_retrying_delay` seconds).
   - If all retries fail, the script raises an exception and terminates processing of the current sample.
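The retry behaviour described above amounts to exponential backoff with a cap. A sketch of the resulting delay schedule (the base delay and doubling factor are assumptions; only `--max_retries` and `--max_retrying_delay` correspond to the script's options):

```python
def backoff_delays(max_retries: int, max_retrying_delay: int,
                   base: int = 1) -> list[int]:
    """Delay before each retry: doubles per attempt, capped at max_retrying_delay."""
    return [min(base * (2 ** attempt), max_retrying_delay)
            for attempt in range(max_retries)]

delays = backoff_delays(max_retries=8, max_retrying_delay=60)
print(delays)  # [1, 2, 4, 8, 16, 32, 60, 60]
```

The cap keeps a long retry chain from waiting arbitrarily long between attempts while still easing pressure on an overloaded service.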
---

#### Example Output Directory Structure

After execution, the output directory structure looks like this:
```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```