1131 commits

Author SHA1 Message Date
joesharratt1229
e304b20e24 added Decimal curriculum (#280)
* added decimal curricula

* added chain sum decimal curriculum

* register DecimalArithmeticCurriculum & DecimalChainSumCurriculum

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 23:02:57 +01:00
Zafir Stojanovski
dc657b5ed4 feat(env): Binary Matrix Curriculum (#279)
* binary matrix curriculum

* register BinaryMatrixCurriculum

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 22:58:47 +01:00
joesharratt1229
98def56bb4 added basic arith curricula (#276)
* added basic arith curricula
* register BasicArithmeticCurriculum

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 22:54:49 +01:00
Oliver Stanley
35c32cd5e7 Tolerant scoring for CodeI/O based on edit distances (#277)
* add zss dep

* codeio edit distance-based scoring

* edit distance tweaks
2025-03-07 22:49:35 +01:00
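The idea of tolerant, edit-distance-based scoring from the commit above can be illustrated with a minimal sketch. The commit adds the `zss` dependency (tree edit distance); the version below swaps in plain string Levenshtein distance to stay dependency-free, and the names `levenshtein` and `tolerant_score` are hypothetical, not taken from the repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def tolerant_score(answer: str, expected: str) -> float:
    """Hypothetical scoring rule: 1.0 for an exact match, otherwise
    partial credit that shrinks as the edit distance grows."""
    if answer == expected:
        return 1.0
    longest = max(len(answer), len(expected))
    return 1.0 - levenshtein(answer, expected) / longest
```

A near-miss such as `tolerant_score("[1, 2, 3]", "[1, 2, 4]")` then earns most of the credit instead of zero, which is the point of tolerant scoring.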
Zafir Stojanovski
dfc28c94d6 feat(env): Binary Alternation Curriculum (#278)
* binary alternation

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 22:44:32 +01:00
Andreas Koepf
ce55d528ad register MahjongPuzzleCurriculum 2025-03-07 19:17:04 +01:00
Zafir Stojanovski
0fb90ce8c4 feat(env): Leg Counting Curriculum (#275)
* leg counting curriculum

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 19:15:18 +01:00
Zafir Stojanovski
a48ff14507 add difficulty where possible (#274) 2025-03-07 19:01:26 +01:00
Andreas Koepf
8790c6be00 update gallery 2025-03-07 16:24:47 +01:00
Andreas Koepf
178b0bd22e remove data/ from main .gitignore 2025-03-07 16:16:40 +01:00
Andreas Koepf
2b1f7ce5ee use relative import for reasoning_gym.data 2025-03-07 15:56:45 +01:00
Andreas Köpf
c2263979bc Basic curriculum (#198)
* feat: Add optional curriculum support to dataset registration and creation
* docs: Add docstrings to create_curriculum() and register_dataset()
* feat: Add curriculum configuration classes for CurriculumExperiment
* feat: Add weight parameter to CurriculumAttributeConfig and use in DatasetSpec
* refactor: Simplify CurriculumAttributeConfig with "*" attribute level support
* test: Add unit tests for CurriculumExperiment class
* feat: Add from_yaml() method to CurriculumExperimentConfig with unit test
2025-03-07 11:22:12 +01:00
Rich Jones
34889d0517 Add Modulo Grid Task (#273)
* add modulo_grid dataset
* ensure the pattern is mathematical, not just spatial

---------

Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-07 11:11:41 +01:00
Rich Jones
11c9790a25 [Env] Game of Life Halting Prediction (#272)
This is a variant of the Game of Life task which, rather than testing algorithmic simulation, tests the model's ability to reason explanatorily about the board. The idea is that a model with good explanatory reasoning can see that a game will not halt without simulating it into the future.

The task presents a GoL board and asks the model to predict whether the board will halt (die, all cells zero) after n steps. Sometimes the board is made up of 'oscillators', isolated structures which never die. Other times it is filled with non-oscillators, structures which always die after a few steps. The model should deduce which case the presented board is.
2025-03-07 10:05:12 +01:00
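The ground truth for the halting question in the commit above can be obtained by brute-force simulation. The following is a minimal sketch, assuming the standard B3/S23 rules and a grid whose outside cells count as dead (the repository may handle borders differently); the names `step` and `halts_within` are hypothetical.

```python
from typing import List

Board = List[List[int]]


def step(board: Board) -> Board:
    """Apply one Game of Life generation (B3/S23); cells outside the grid are dead."""
    rows, cols = len(board), len(board[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours in the 3x3 window, excluding the cell itself.
            n = sum(
                board[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            nxt[r][c] = 1 if n == 3 or (board[r][c] == 1 and n == 2) else 0
    return nxt


def halts_within(board: Board, n: int) -> bool:
    """Return True if all cells are dead at some generation 0..n."""
    for _ in range(n + 1):
        if not any(any(row) for row in board):
            return True
        board = step(board)
    return False


# A blinker is an oscillator and never dies; two adjacent cells die in one step.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
pair = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
```

Here `halts_within(blinker, 20)` is `False` and `halts_within(pair, 1)` is `True`. The task asks the model to reach the same verdict by recognising the structures rather than by running this loop.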
Andreas Koepf
fa1bf7910a update gallery, pypi release, bump version 2025-03-05 23:45:45 +01:00
joesharratt1229
1893691c57 updated algorithmics dataset (#269)
* updated algorithmic datasets
* added changes to symbolic and power
* updated power function test
2025-03-05 23:32:53 +01:00
Zafir Stojanovski
f843ac1b82 shortest path curriculum (#271) 2025-03-05 22:46:10 +01:00
Zafir Stojanovski
a048084009 largest island curriculum (#270) 2025-03-05 22:45:35 +01:00
Zafir Stojanovski
3d9bb382aa feat(env): Count Bits Curriculum (#267)
* add min n

* count bits
2025-03-05 22:44:04 +01:00
Zafir Stojanovski
84158df1c7 feat(env): Course Schedule Curriculum (#266)
* course schedule curriculum

* update levels

* update comments

* lint
2025-03-05 22:42:46 +01:00
joesharratt1229
2c524c0c6f Added puzzle24 closes #208 (#268)
* added puzzle24
2025-03-05 22:36:37 +01:00
Oliver Stanley
3286a68361 First version of CodeI/O reasoning data (#264)
* notebook for prepping first set of raw code files
* updated codeio processing notebook for repo-level processing
* fix for edge case in codeio scoring
* Add reformat notebook
* filtering pass
* add non-determinism filtering
* Tweak CodeIODataset & include first real data
* add basic codeio test, metadata
2025-03-05 22:34:11 +01:00
joesharratt1229
7458dbc95d Fixed countdown score_answer (#265)
* fixed countdown score ans
* checked solution uses all numbers
2025-03-05 22:30:12 +01:00
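The check named in the commit above (a countdown solution must use all of the given numbers) could look roughly like the sketch below. It is an illustration under assumptions, not the repository's `score_answer`: the function name `check_countdown`, the boolean return value, and the restriction to `+ - * /` with integer literals are all hypothetical.

```python
import ast
import operator
from collections import Counter
from typing import List

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST, used: List[int]) -> float:
    """Evaluate a restricted arithmetic AST and record every integer literal seen."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, used), _eval(node.right, used))
    raise ValueError("unsupported expression")


def check_countdown(expr: str, numbers: List[int], target: int) -> bool:
    """True only if expr uses each given number exactly once and equals target."""
    used: List[int] = []
    try:
        value = _eval(ast.parse(expr, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    # Counter comparison treats the numbers as a multiset, so duplicates matter.
    return Counter(used) == Counter(numbers) and abs(value - target) < 1e-9
```

With numbers `[25, 5, 3]` and target `60`, the expression `"(25 - 5) * 3"` passes, while `"20 * 3"` reaches the target but is rejected because it does not use the given numbers.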
Zafir Stojanovski
3c544aba20 feat(env): Mahjong Puzzle Curriculum (#263)
* mahjong curriculum

* typo

* update levels
2025-03-05 22:28:02 +01:00
Zafir Stojanovski
19ca54da72 feat(env): NQueens Curriculum (#262)
* curriculum & tests
2025-03-05 15:05:17 +01:00
Andreas Köpf
b2904ccab9 Minor question template & score_answer improvements (#261)
* math prompt improvements
* ignore brackets in complex_arithmetic results
* improve additional instruction in prompt of polynomial_equations
* more strict tests for score_answer in polynomial_equations
* simplify special reward handling
* fix test_intermediate_integration
* fix sokoban dataset
* add common dataset score_answer consistency test
2025-03-04 21:55:09 +01:00
joesharratt1229
bf24999bb0 implemented family_relationships score ans (#260) 2025-03-04 21:37:57 +01:00
vncntt
478646622e should exit if API key isn't defined (#259)
* should exit if open-router and no api key
2025-03-04 09:45:36 +01:00
Rich Jones
e3b7365f50 Game of Life partial scoring and rule-clarification (#258)
* partial scoring and rule clarification
* better ql scoring
* word seq reverse typos
2025-03-03 22:22:39 +01:00
joesharratt1229
340d6a7ab9 updated for config by dataset (#257)
* updated for config by dataset

* updated read me
2025-03-03 21:58:32 +01:00
Andreas Köpf
07388767a2 Reduce precision from 28 to 6 in DecimalArithmeticDataset (#256) 2025-03-03 21:57:08 +01:00
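For context on the numbers in the commit above: 28 significant digits is the default precision of Python's `decimal` context. Assuming the commit refers to that setting (the source does not say so explicitly), the effect of lowering it to 6 can be sketched as follows; the helper `divide` is hypothetical.

```python
from decimal import Decimal, localcontext


def divide(a: str, b: str, prec: int) -> Decimal:
    """Divide two decimal strings using `prec` significant digits."""
    with localcontext() as ctx:
        ctx.prec = prec  # applies only inside this block
        return Decimal(a) / Decimal(b)


wide = divide("1", "3", 28)   # Decimal('0.3333333333333333333333333333')
narrow = divide("1", "3", 6)  # Decimal('0.333333')
```

Shorter results like `narrow` are easier to match against a model's answer than a 28-digit expansion.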
Andreas Köpf
17f87476a3 add Chain of Draft and direct system prompt styles (#255) 2025-03-03 21:56:31 +01:00
Zafir Stojanovski
2f9d94c1e7 fix: Unify Prompts (#254)
* remove cot
* fix prompt template
* fix pool matrix
* spiral matrix fixed
2025-03-03 21:55:53 +01:00
joesharratt1229
976e1710a6 small change to word sequence reversal prompt (#252)
corrected answer format
2025-03-02 17:34:35 +01:00
vncntt
8992037ecc fixed problems in knights_knaves (#251)
* remove unnecessary variables

* added depth logic

* add depth tests
2025-03-02 08:47:54 +01:00
Andreas Köpf
ece6990709 Remove strip from ProceduralDataset::core score_answer() (#250)
* remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True)
* test: Move test_extract_answer() from test_dataset.py to test_utils.py
* refactor: Improve decimal reward computation with more flexible comparison
* fix: Implement rounding for format_number when round_if_needed is True
* test: Add test case for compute_decimal_reward with sign and zeros
2025-03-02 08:46:36 +01:00
Andreas Köpf
16a4ea1193 Revert "log error message on bad api response (#243)" (#249)
This reverts commit 27e66ba6dd.
2025-03-01 23:56:42 +01:00
Andreas Köpf
1b1c04bb70 feat: Add category property to ProceduralDataset to extract category name (#248) 2025-03-01 23:11:40 +01:00
Zafir Stojanovski
1bc9f6f09f fix manipulate matrix (#247) 2025-03-01 23:00:29 +01:00
Rich Jones
80aafda8e5 more dynamic scoring for jumble (#246) 2025-03-01 18:50:59 +01:00
Zafir Stojanovski
78c92d7056 Mahjong Puzzle (#241)
* mahjong
2025-03-01 16:27:26 +01:00
Andreas Köpf
dbd2ac723e Add base_url and api_key command line args for eval.py script (#244)
* feat: Add base URL command line parameter to eval.py script
* feat: Add API key parameter and CLI option to AsyncModelEvaluator
2025-02-28 18:32:58 +01:00
Rich Jones
27e66ba6dd log error message on bad api response (#243) 2025-02-28 15:32:27 +01:00
Andreas Köpf
59922486c6 Eval sampling settings for generation (temperature, top-p, max_tokens) (#242)
* feat: Add sampling parameters to eval configuration and API call
* feat: Add support for system_prompt_id and optional system_prompt configuration
2025-02-28 11:48:37 +01:00
Andreas Koepf
d83e53115a fix prompt for arc_1d 2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)
82e79d672e feat: Add system prompt to dataset results and summary output 2025-02-28 00:26:06 +01:00
Andreas Köpf
0b108efac1 Generate eval config tool (#240)
* feat: Add generate_config.py script to create eval configurations
2025-02-27 21:40:53 +01:00
Andreas Köpf
1ea9a657a7 Eval script consolidation (#238)
The script now supports:
   - YAML and JSON configurations
   - Dataset-specific parameters
   - Overriding configuration via command line
   - Detailed logging and error handling
2025-02-27 17:39:14 +01:00
Andreas Köpf
bd745ae959 Merge pull request #237 from open-thought/rich/richmorevalfixes2
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
ca5372dcc1 rm typo 2025-02-27 13:44:33 +01:00