Eval N completions per prompt (#374)

* feat: Add support for generating multiple completions per prompt
* feat: Track best and mean scores for multiple completions per prompt
* feat: Add checkpoint and resume functionality to evaluation script
Andreas Köpf, 2025-03-15 16:39:36 +01:00 (committed by GitHub)
parent 1d410cc600
commit 424ee6751a
12 changed files with 426 additions and 126 deletions


@@ -56,6 +56,7 @@ default_seed: 42 # Default seed for all datasets
max_tokens: 32768 # Maximum generation length (optional)
temperature: 0.6 # Generation temperature (optional)
top_p: 0.95 # Top-p sampling parameter (optional)
completions_per_prompt: 1 # Number of completions to generate per prompt (each is a separate API call) (optional)
system_prompt_id: "default" # Use a predefined system prompt by ID (optional)
# system_prompt: "Your custom system prompt here" # Or specify a custom system prompt directly
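
To make the new option concrete: with `completions_per_prompt: 4`, the evaluator issues four independent API calls for each prompt and then aggregates the per-completion scores. Below is a minimal sketch of that loop, assuming an OpenAI-compatible client; the model name, base URL, and the `score_completion` grading helper are hypothetical placeholders, not the repository's actual code:

```python
from statistics import mean
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; adjust base_url/api_key for your provider.
client = OpenAI(base_url="https://api.together.xyz/v1", api_key="your-api-key")

def eval_prompt(prompt: str, n: int = 4) -> dict:
    """Generate n completions for one prompt (one API call each) and aggregate."""
    scores = []
    for _ in range(n):  # completions_per_prompt: each completion is a separate call
        resp = client.chat.completions.create(
            model="your-model",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.6,
            top_p=0.95,
        )
        text = resp.choices[0].message.content
        scores.append(score_completion(text))  # hypothetical grading helper
    return {"best_score": max(scores), "mean_score": mean(scores)}
```

Tracking both the best and the mean score (per the commit messages above) distinguishes a model that occasionally solves a task from one that solves it reliably.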
@@ -160,6 +161,22 @@ You can specify a different API base URL if needed:
```bash
python eval.py --config example_config.yaml --base-url "https://api.together.xyz/v1" --api-key "your-together-api-key"
```
### Resuming Interrupted Evaluations
If an evaluation is interrupted (e.g., due to a network issue or system crash), you can resume it from where it left off:
```bash
python eval.py --config example_config.yaml --resume results/model_name_20250315_123045/
```
This will:
1. Load the checkpoint from the specified directory
2. Skip datasets that have already been completed
3. Continue with the remaining datasets
4. Produce the same final output as if the evaluation had run without interruption
The checkpoint system automatically saves progress after each dataset completes, so you can safely interrupt and resume evaluations at any time.
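
As a rough sketch of how per-dataset checkpointing like this can be implemented (illustrative only; the `checkpoint.json` file name and the `evaluate_dataset` helper are assumptions, not necessarily what `eval.py` does):

```python
import json
from pathlib import Path

def load_checkpoint(run_dir: Path) -> set[str]:
    """Return names of datasets already completed in this run directory."""
    ckpt = run_dir / "checkpoint.json"  # assumed file name
    if ckpt.exists():
        return set(json.loads(ckpt.read_text())["completed_datasets"])
    return set()

def save_checkpoint(run_dir: Path, completed: set[str]) -> None:
    """Persist progress; called after each dataset finishes."""
    (run_dir / "checkpoint.json").write_text(
        json.dumps({"completed_datasets": sorted(completed)}, indent=2)
    )

def run_eval(run_dir: Path, datasets: list[str]) -> None:
    completed = load_checkpoint(run_dir)    # on resume: skip finished datasets
    for name in datasets:
        if name in completed:
            continue
        evaluate_dataset(name)              # hypothetical per-dataset evaluation
        completed.add(name)
        save_checkpoint(run_dir, completed)  # safe interruption point
```

Saving only after a dataset fully completes keeps the checkpoint trivially consistent: a run killed mid-dataset simply redoes that dataset on resume.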
The results will be stored in a directory named after the model and timestamp, containing:
- `summary.json` - Summary of all results