# InternBootcamp Evaluation Guide

To quickly evaluate the performance of a model in different bootcamp environments, use the `run_eval.py` script. The script supports multiple configuration options and is flexible enough to accommodate a variety of testing needs.

---

#### Example Execution Command

Below is a complete example command demonstrating how to run the evaluation script:

```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
    --url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --model_name r1_32B \
    --test_dir /path/to/test_dir \
    --max_concurrent_requests 128 \
    --template r1 \
    --max_tokens 32768 \
    --temperature 0 \
    --timeout 6000 \
    --api_mode completion \
    --max_retries 16 \
    --max_retrying_delay 60 \
    --resume
```

---

#### Parameter Description

The main parameters supported by the script and their meanings:

| Parameter Name | Type | Example Value | Description |
|-----------------------------|-------|-----------------------------------|-------------|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL of the OpenAI-compatible API. |
| `--api_key` | str | `EMPTY` | API key for the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | Name of the model to evaluate, e.g. `r1_32B` or another custom model name. |
| `--test_dir` | str | `/path/to/test_dir` | Directory containing the test data (should include JSONL files). |
| `--max_concurrent_requests` | int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g. `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens the model may generate. |
| `--temperature` | float | `0` | Sampling temperature; lower values yield more deterministic output. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; either `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt; only effective when `--api_mode` is `chat_completion`. |
| `--max_retries` | int | `16` | Maximum number of retries per failed request. |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries, in seconds. |
| `--resume` | bool | `true` | Resume from a previous run. |
| `--check_model_url` | bool | `true` | Check that the model service URL is reachable before starting the evaluation. |

##### Parameter Relationships

- `--sys_prompt` is only effective when `--api_mode` is set to `chat_completion`.
- `--template` is only effective when `--api_mode` is set to `completion`.
- Valid values for `--template` are `r1`, `qwen`, `internthinker`, and `chatml` (from the predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the template's default system prompt is used (if any).

---

#### Output Results

Evaluation results are saved under the directory:

`examples/unittests/output/{model_name}_{test_dir}_{timestamp}`

The output includes:

1. **Detailed Results**:
   - The result for each JSONL file is saved in `output/details/`, named after the original JSONL file.
   - Each record contains the following fields:
     - `id`: Sample ID.
     - `prompt`: Input prompt.
     - `output`: Model-generated output.
     - `output_len`: Length of the output in tokens.
     - `ground_truth`: Ground-truth answer.
     - `score`: Score computed by the `verify_score` method.
     - `extracted_output`: Output extracted by the `extract_output` method.
2. **Metadata**:
   - Per-bootcamp metadata, including average score and average output length, is saved in `output/meta.jsonl`.
3. **Summary Report**:
   - A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
     - Average score and output length per bootcamp.
     - Overall average score and output length across all bootcamps.
4. **Progress Log**:
   - Progress information is logged to `output/progress.log`, showing real-time progress and the estimated remaining time for each dataset.
5. **Parameter Configuration**:
   - The full configuration of the current run is saved to `output/eval_args.json` for experiment reproducibility.

---

#### Notes

1. **Concurrency Settings**:
   - Adjust `--max_concurrent_requests` to your machine's capabilities and the size of the test set, to avoid resource exhaustion from excessive concurrency.
2. **URL Health Check**:
   - Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
   - If the service is not ready, the script waits up to 60 minutes (by default), retrying every 60 seconds.
3. **Error Handling Mechanism**:
   - Each failed request is retried up to `--max_retries` times with exponential backoff, capped at `--max_retrying_delay` seconds per delay.
   - If all retries fail, the script raises an exception and stops processing the current sample.

---

#### Example Output Directory Structure

After execution, the output directory structure looks like this:

```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```

---
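The retry behavior described in the Notes (up to `--max_retries` attempts with exponential backoff capped at `--max_retrying_delay` seconds) can be sketched roughly as follows. This is an illustrative approximation, not the script's actual implementation; the helper name `call_with_retries` is hypothetical.

```python
import random
import time


def call_with_retries(make_request, max_retries=16, max_retrying_delay=60):
    """Illustrative sketch of retry-with-exponential-backoff (hypothetical helper).

    `make_request` is any zero-argument callable. After each failure the
    delay doubles, but each sleep is capped at `max_retrying_delay` seconds.
    """
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries:
                raise  # all retries exhausted: give up on this sample
            # Sleep with slight jitter, capped at the configured maximum.
            time.sleep(min(delay, max_retrying_delay) * random.uniform(0.5, 1.0))
            delay *= 2
```

With the example settings (`--max_retries 16`, `--max_retrying_delay 60`), a persistently failing request would back off 1 s, 2 s, 4 s, ... up to the 60-second cap before finally raising.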
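Because each file under `output/details/` is plain JSONL with the record fields listed under "Detailed Results", you can post-process it with a few lines of standard-library Python. The sketch below computes the same kind of per-file averages that `output/meta.jsonl` reports; the function name `summarize_details` is hypothetical and the field names follow the record layout documented above.

```python
import json


def summarize_details(jsonl_path):
    """Average the `score` and `output_len` fields over one details file.

    Illustrative post-processing sketch; mirrors the per-bootcamp averages
    described for `output/meta.jsonl`, but is not the script's own code.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    n = len(records)
    return {
        "num_samples": n,
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_output_len": sum(r["output_len"] for r in records) / n,
    }
```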
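The URL health check from the Notes can be approximated against any OpenAI-compatible server, which exposes registered models at `GET {url}/models`. The sketch below polls that endpoint until the requested model appears or the wait budget runs out; it assumes an OpenAI-compatible `/models` response shape, and the function name `wait_for_model` is hypothetical, not part of `run_eval.py`.

```python
import json
import time
from urllib.request import Request, urlopen


def wait_for_model(url, model_name, api_key="EMPTY",
                   max_wait_minutes=60, poll_seconds=60):
    """Poll `{url}/models` until `model_name` is registered (illustrative sketch).

    Returns True once the model shows up, False if the deadline passes.
    Assumes the OpenAI-compatible response shape: {"data": [{"id": ...}, ...]}.
    """
    deadline = time.monotonic() + max_wait_minutes * 60
    while time.monotonic() < deadline:
        try:
            req = Request(f"{url.rstrip('/')}/models",
                          headers={"Authorization": f"Bearer {api_key}"})
            with urlopen(req, timeout=10) as resp:
                data = json.load(resp)
            if any(m.get("id") == model_name for m in data.get("data", [])):
                return True
        except OSError:
            pass  # service not up yet; keep polling
        time.sleep(poll_seconds)
    return False
```

With the documented defaults this polls every 60 seconds for up to 60 minutes, matching the behavior described under "URL Health Check".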