Eval N completions per prompt (#374)

* feat: Add support for generating multiple completions per prompt
* feat: Track best and mean scores for multiple completions per prompt
* feat: Add checkpoint and resume functionality to evaluation script
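
A minimal sketch of the best/mean tracking described in the bullets above, assuming each of the N completions for a prompt is scored independently by a scoring callable. The names `summarize_completions`, `score_fn`, `best_score`, and `mean_score` are hypothetical and are not taken from the evaluation script itself.

```python
from statistics import mean
from typing import Callable, Sequence


def summarize_completions(
    completions: Sequence[str],
    score_fn: Callable[[str], float],
) -> dict:
    """Score every completion for one prompt; report the best and mean score."""
    if not completions:
        raise ValueError("need at least one completion per prompt")
    # One score per completion, in the order the completions were generated.
    scores = [score_fn(c) for c in completions]
    return {
        "scores": scores,
        "best_score": max(scores),   # best-of-N result
        "mean_score": mean(scores),  # average over all N completions
    }
```

With an exact-match scorer and the completions `["4", "5", "4"]` for an expected answer of `"4"`, the best score is 1.0 while the mean reflects that only two of the three completions were correct.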
Andreas Köpf 2025-03-15 16:39:36 +01:00 committed by GitHub
parent bd13b1b92a
commit bfa5f8078b
12 changed files with 426 additions and 126 deletions

@@ -89,6 +89,7 @@ categories:
- dataset: rubiks_cube
- category: games
datasets:
- dataset: boxnet
- dataset: countdown
- dataset: emoji_mystery
- dataset: futoshiki