Eval N completions per prompt (#374)

* feat: Add support for generating multiple completions per prompt
* feat: Track best and mean scores for multiple completions per prompt
* feat: Add checkpoint and resume functionality to evaluation script
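
As a rough illustration of the second bullet, the sketch below shows one way best and mean scores could be tracked when a prompt has N scored completions. The function name `summarize_completions` and the keys `best_score` and `mean_score` are hypothetical and are not taken from this commit's code.

```python
from statistics import mean


def summarize_completions(scores):
    """Summarize the scores of N completions generated for one prompt.

    Hypothetical helper: the name and the returned keys are assumptions
    made for illustration, not the evaluation script's actual API.
    """
    if not scores:
        raise ValueError("need at least one completion score")
    return {
        "best_score": max(scores),   # highest score among the N completions
        "mean_score": mean(scores),  # average score over the N completions
    }


# Example: four completions for a single prompt
summary = summarize_completions([0.0, 1.0, 0.5, 0.5])
print(summary)  # {'best_score': 1.0, 'mean_score': 0.5}
```

Reporting both values separates "did any completion solve the prompt" (best) from "how reliably does the model solve it" (mean).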
Andreas Köpf 2025-03-15 16:39:36 +01:00 committed by GitHub
parent 1d410cc600
commit 424ee6751a
12 changed files with 426 additions and 126 deletions


@@ -89,6 +89,7 @@ categories:
- dataset: rubiks_cube
- category: games
datasets:
- dataset: boxnet
- dataset: countdown
- dataset: emoji_mystery
- dataset: futoshiki