Rich Jones
11c9790a25
[Env] Game of Life Halting Prediction ( #272 )
...
This is a variant of the Game of Life task that tests the model's explanatory reasoning about the board rather than its ability to run the simulation algorithm. The idea is that a model with good explanatory reasoning can see that a game will not halt without simulating it into the future.
The task presents a GoL board and asks the model to predict whether the board will halt (die, all cells zero) after n steps. Sometimes the board is made up of 'oscillators', isolated structures that never die. Other times it is filled with non-oscillators, structures that always die after a few steps. The model should deduce which case the presented board falls into.
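For reference, the ground-truth label for such a board can be obtained by brute-force simulation. The sketch below is illustrative only and is not the code from this commit: the names `step` and `halts_within` are hypothetical, and it assumes the standard B3/S23 rules on a finite board whose edges do not wrap around, which may differ from the environment's actual topology.

```python
from typing import List

Board = List[List[int]]


def step(board: Board) -> Board:
    """Advance the board by one generation (B3/S23, non-wrapping edges: an assumption)."""
    rows, cols = len(board), len(board[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live cells in the 3x3 neighbourhood, excluding the cell itself.
            neighbours = sum(
                board[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            # Birth on exactly 3 neighbours, survival on 2 or 3.
            if neighbours == 3 or (board[r][c] == 1 and neighbours == 2):
                nxt[r][c] = 1
    return nxt


def halts_within(board: Board, n: int) -> bool:
    """Return True if every cell is dead after at most n generations."""
    for _ in range(n):
        if not any(any(row) for row in board):
            return True  # an empty board stays empty, so stop early
        board = step(board)
    return not any(any(row) for row in board)


# A vertical blinker is an oscillator and never dies.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
# Two adjacent cells each have a single neighbour and die in one step.
pair = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
```

Here `halts_within(blinker, 10)` is false and `halts_within(pair, 1)` is true. The task rewards a model that can reach the same verdicts by recognising the structures instead of running the loop.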
2025-03-07 10:05:12 +01:00
Andreas Koepf
fa1bf7910a
update gallery, pypi release, bump version
2025-03-05 23:45:45 +01:00
joesharratt1229
1893691c57
updated algorithmics dataset ( #269 )
...
* updated algorithmic datasets
* added changes to symbolic and power
* updated power function test
2025-03-05 23:32:53 +01:00
Zafir Stojanovski
f843ac1b82
shortest path curriculum ( #271 )
2025-03-05 22:46:10 +01:00
Zafir Stojanovski
a048084009
largest island curriculum ( #270 )
2025-03-05 22:45:35 +01:00
Zafir Stojanovski
3d9bb382aa
feat(env): Count Bits Curriculum ( #267 )
...
* add min n
* count bits
2025-03-05 22:44:04 +01:00
Zafir Stojanovski
84158df1c7
feat(env): Course Schedule Curriculum ( #266 )
...
* course schedule curriculum
* update levels
* update comments
* lint
2025-03-05 22:42:46 +01:00
joesharratt1229
2c524c0c6f
Added puzzle24 closes #208 ( #268 )
...
* added puzzle24
2025-03-05 22:36:37 +01:00
Oliver Stanley
3286a68361
First version of CodeI/O reasoning data ( #264 )
...
* notebook for prepping first set of raw code files
* updated codeio processing notebook for repo-level processing
* fix for edge case in codeio scoring
* Add reformat notebook
* filtering pass
* add non-determinism filtering
* Tweak CodeIODataset & include first real data
* add basic codeio test, metadata
2025-03-05 22:34:11 +01:00
joesharratt1229
7458dbc95d
Fixed countdown score_answer ( #265 )
...
* fixed countdown score ans
* checked solution uses all numbers
2025-03-05 22:30:12 +01:00
Zafir Stojanovski
3c544aba20
feat(env): Mahjong Puzzle Curriculum ( #263 )
...
* mahjong curriculum
* typo
* update levels
2025-03-05 22:28:02 +01:00
Zafir Stojanovski
19ca54da72
feat(env): NQueens Curriculum ( #262 )
...
* curriculum & tests
2025-03-05 15:05:17 +01:00
Andreas Köpf
b2904ccab9
Minor question template & score_answer improvements ( #261 )
...
* math prompt improvements
* ignore brackets in complex_arithmetic results
* improve additional instruction in prompt of polynomial_equations
* more strict tests for score_answer in polynomial_equations
* simplify special reward handling
* fix test_intermediate_integration
* fix sokoban dataset
* add common dataset score_answer consistency test
2025-03-04 21:55:09 +01:00
joesharratt1229
bf24999bb0
implemented family_relationships score ans ( #260 )
2025-03-04 21:37:57 +01:00
vncntt
478646622e
should exit if API key isn't defined ( #259 )
...
* should exit if open-router and no api key
2025-03-04 09:45:36 +01:00
Rich Jones
e3b7365f50
Game of Life partial scoring and rule-clarification ( #258 )
...
* partial scoring and rule clarification
* better ql scoring
* word seq reverse typos
2025-03-03 22:22:39 +01:00
joesharratt1229
340d6a7ab9
updated for config by dataset ( #257 )
...
* updated for config by dataset
* updated read me
2025-03-03 21:58:32 +01:00
Andreas Köpf
07388767a2
Reduce precision from 28 to 6 in DecimalArithmeticDataset ( #256 )
2025-03-03 21:57:08 +01:00
Andreas Köpf
17f87476a3
add Chain of Draft and direct system prompt styles ( #255 )
2025-03-03 21:56:31 +01:00
Zafir Stojanovski
2f9d94c1e7
fix: Unify Prompts ( #254 )
...
* remove cot
* fix prompt template
* fix pool matrix
* spiral matrix fixed
2025-03-03 21:55:53 +01:00
joesharratt1229
976e1710a6
small change to word sequence reversal prompt ( #252 )
...
corrected answer format
2025-03-02 17:34:35 +01:00
vncntt
8992037ecc
fixed problems in knights_knaves ( #251 )
...
* remove unnecessary variables
* added depth logic
* add depth tests
2025-03-02 08:47:54 +01:00
Andreas Köpf
ece6990709
Remove strip from ProceduralDataset::core score_answer() ( #250 )
...
* remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True)
* test: Move test_extract_answer() from test_dataset.py to test_utils.py
* refactor: Improve decimal reward computation with more flexible comparison
* fix: Implement rounding for format_number when round_if_needed is True
* test: Add test case for compute_decimal_reward with sign and zeros
2025-03-02 08:46:36 +01:00
Andreas Köpf
16a4ea1193
Revert "log error message on bad api response ( #243 )" ( #249 )
...
This reverts commit 27e66ba6dd.
2025-03-01 23:56:42 +01:00
Andreas Köpf
1b1c04bb70
feat: Add category property to ProceduralDataset to extract category name ( #248 )
2025-03-01 23:11:40 +01:00
Zafir Stojanovski
1bc9f6f09f
fix manipulate matrix ( #247 )
2025-03-01 23:00:29 +01:00
Rich Jones
80aafda8e5
more dynamic scoring for jumble ( #246 )
2025-03-01 18:50:59 +01:00
Zafir Stojanovski
78c92d7056
Mahjong Puzzle ( #241 )
...
* mahjong
2025-03-01 16:27:26 +01:00
Andreas Köpf
dbd2ac723e
Add base_url and api_key command line args for eval.py script ( #244 )
...
* feat: Add base URL command line parameter to eval.py script
* feat: Add API key parameter and CLI option to AsyncModelEvaluator
2025-02-28 18:32:58 +01:00
Rich Jones
27e66ba6dd
log error message on bad api response ( #243 )
2025-02-28 15:32:27 +01:00
Andreas Köpf
59922486c6
Eval sampling settings for generation (temperature, top-p, max_tokens) ( #242 )
...
* feat: Add sampling parameters to eval configuration and API call
* feat: Add support for system_prompt_id and optional system_prompt configuration
2025-02-28 11:48:37 +01:00
Andreas Koepf
d83e53115a
fix prompt for arc_1d
2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)
82e79d672e
feat: Add system prompt to dataset results and summary output
2025-02-28 00:26:06 +01:00
Andreas Köpf
0b108efac1
Generate eval config tool ( #240 )
...
* feat: Add generate_config.py script to create eval configurations
2025-02-27 21:40:53 +01:00
Andreas Köpf
1ea9a657a7
Eval script consolidation ( #238 )
...
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
2025-02-27 17:39:14 +01:00
Andreas Köpf
bd745ae959
Merge pull request #237 from open-thought/rich/richmorevalfixes2
...
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
ca5372dcc1
rm typo
2025-02-27 13:44:33 +01:00
Rich Jones
9a8e398f22
fix graph color example template
2025-02-27 13:43:01 +01:00
Andreas Köpf
ba9d625ef4
Merge pull request #186 from zafstojano/feat/codeio
...
feat(env): CodeIO
2025-02-27 12:18:13 +01:00
Andreas Köpf
ed90fff3fa
Merge pull request #220 from open-thought/rich/cubeinstructions
...
Make Rubiks Cube Output Format More Explicit
2025-02-27 12:16:09 +01:00
Andreas Köpf
1cc6eded6a
Merge pull request #236 from open-thought/rich/moreevalfixes
...
Trivial Fixes
2025-02-27 12:14:43 +01:00
Rich Jones
a1b1272e8d
sm fixes
2025-02-27 11:54:04 +01:00
Rich Jones
b2b2311329
seed test config
2025-02-27 10:44:28 +01:00
Rich Jones
9daaccc208
expand more
2025-02-27 10:41:30 +01:00
Zafir Stojanovski
2c566f76ea
final tweaks
2025-02-27 08:38:34 +01:00
Andreas Köpf
6ceb03f224
Merge pull request #233 from open-thought/llama-3.3-70_eval_config
...
Llama 3.3 70 eval config
2025-02-26 22:56:33 +01:00
Andreas Koepf
4cd5bd42c3
verify that OPENROUTER_API_KEY env var is set
2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)
a92dcd4a75
feat: Add comprehensive unit tests for parse_string_to_complex() method
2025-02-26 21:44:32 +01:00
Andreas Koepf
726ba114dc
add llama-3.3-70b-instruct eval yaml files
2025-02-26 20:54:07 +01:00
Zafir Stojanovski
4a59d13100
update timeout
2025-02-26 20:27:43 +01:00