mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00
Eval N completions per prompt (#374)
* feat: Add support for generating multiple completions per prompt
* feat: Track best and mean scores for multiple completions per prompt
* feat: Add checkpoint and resume functionality to evaluation script
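The best and mean scores mentioned in the commit message could be computed per prompt roughly as follows. This is a minimal sketch, not the evaluation script's actual code: the function name `summarize_scores` and the keys `best_score` and `mean_score` are hypothetical, and it assumes each completion has already been scored as a float.

```python
from statistics import mean


def summarize_scores(scores):
    """Aggregate the scores of the N completions generated for one prompt.

    Hypothetical helper: the name and the returned keys are illustrative
    assumptions, not identifiers taken from the repository.
    """
    if not scores:
        raise ValueError("need at least one completion score")
    return {
        "best_score": max(scores),   # highest score among the completions
        "mean_score": mean(scores),  # average score across the completions
    }


# Example: four completions for a single prompt.
summary = summarize_scores([0.0, 1.0, 0.5, 0.5])
```

Reporting both values separates how often a single sample succeeds (mean) from whether any of the N samples succeeds (best).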
parent bd13b1b92a
commit bfa5f8078b
12 changed files with 426 additions and 126 deletions
@@ -89,6 +89,7 @@ categories:
      - dataset: rubiks_cube
  - category: games
    datasets:
      - dataset: boxnet
      - dataset: countdown
      - dataset: emoji_mystery
      - dataset: futoshiki