reasoning-gym

mirror of https://github.com/open-thought/reasoning-gym.git synced 2026-04-19 12:58:07 +00:00

Author	SHA1	Message	Date
Andreas Koepf	fb06038e88	update gallery	2025-03-07 16:24:47 +01:00
Andreas Koepf	2802066233	remove data/ from main .gitignore	2025-03-07 16:16:40 +01:00
Andreas Koepf	c504efc2c3	use relative import for reasoning_gym.data	2025-03-07 15:56:45 +01:00
Andreas Köpf	c69bc5d4e6	Basic curriculum (#198 ) * feat: Add optional curriculum support to dataset registration and creation * docs: Add docstrings to create_curriculum() and register_dataset() * feat: Add curriculum configuration classes for CurriculumExperiment * feat: Add weight parameter to CurriculumAttributeConfig and use in DatasetSpec * refactor: Simplify CurriculumAttributeConfig with "" attribute level support test: Add unit tests for CurriculumExperiment class * feat: Add from_yaml() method to CurriculumExperimentConfig with unit test	2025-03-07 11:22:12 +01:00
Rich Jones	cbfdf097a0	Add Modulo Grid Task (#273 ) * add modulo_grid dataset * ensure the pattern is mathematical, not just spatial --------- Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>	2025-03-07 11:11:41 +01:00
Rich Jones	07dc01ad87	[Env] Game of Life Halting Prediction (#272 ) This is a variant of the Game of Life task, which rather than trying to test the algorithmic simulation, tests the ability of the model to do explanatory reasoning of the board. The idea is that a model with good explanatory reasoning will be able to see that a game will not halt without simulating it into the future. The task presents a GoL board, and the model is asked to predict if the board will halt (die, all cells zero) after n steps. Sometimes, the board will be made up of 'oscillators', isolated structures which never die. Other times, it is filled with non-oscillators, structures which will always die after a few steps. The model should deduce which case the presented board is.	2025-03-07 10:05:12 +01:00
Andreas Koepf	862617b7e0	update gallery, pypi release, bump version	2025-03-05 23:45:45 +01:00
joesharratt1229	d9638df79c	updated algorithmics dataset (#269 ) * updated algorithmic datasets * added changes to symbolic and power * updated power function test	2025-03-05 23:32:53 +01:00
Zafir Stojanovski	f426db90ec	shortest path curriculum (#271 )	2025-03-05 22:46:10 +01:00
Zafir Stojanovski	5bac641650	largest island curriculum (#270 )	2025-03-05 22:45:35 +01:00
Zafir Stojanovski	9bb6d028a3	feat(env): Count Bits Curriculum (#267 ) * add min n * count bits	2025-03-05 22:44:04 +01:00
Zafir Stojanovski	8ccc4d7b0c	feat(env): Course Schedule Curriculum (#266 ) * course schedule curriculum * update levels * update comments * lint	2025-03-05 22:42:46 +01:00
joesharratt1229	f3ee9a91a2	Added puzzle24 closes #208 (#268 ) * added puzzle24	2025-03-05 22:36:37 +01:00
Oliver Stanley	d1e505a8e9	First version of CodeI/O reasoning data (#264 ) * notebook for prepping first set of raw code files * updated codeio processing notebook for repo-level processing * fix for edge case in codeio scoring * Add reformat notebook * filtering pass * add non-determinism filtering * Tweak CodeIODataset & include first real data * add basic codeio test, metadata	2025-03-05 22:34:11 +01:00
joesharratt1229	e30be066ec	Fixed `countdown` `score_answer` (#265 ) * fixed countdown score ans * checked solution uses all numbers	2025-03-05 22:30:12 +01:00
Zafir Stojanovski	d0a42116fb	feat(env): Mahjong Puzzle Curriculum (#263 ) * mahjong curriculum * typo * update levels	2025-03-05 22:28:02 +01:00
Zafir Stojanovski	8ecc723607	feat(env): NQueens Curriculum (#262 ) * curriculum & tests	2025-03-05 15:05:17 +01:00
Andreas Köpf	5d7fbac0ad	Minor question template & score_answer improvements (#261 ) * math prompt improvements * ignore brackets in complex_arithmetic results * improve additional instruction in prompt of polynomial_equations * more strict tests for score_answer in polynomial_equations * simplify special reward handling * fix test_intermediate_integration * fix sokoban dataset * add common dataset score_answer consistency test	2025-03-04 21:55:09 +01:00
joesharratt1229	061282e373	implemented family_relationships score ans (#260 )	2025-03-04 21:37:57 +01:00
vncntt	3672b231f1	should exit if API key isn't defined (#259 ) * should exit if open-router and no api key	2025-03-04 09:45:36 +01:00
Rich Jones	0ba6119850	Game of Life partial scoring and rule-clarification (#258 ) * partial scoring and rule clarification * better ql scoring * word seq reverse typos	2025-03-03 22:22:39 +01:00
joesharratt1229	6770ee3eef	updated for config by dataset (#257 ) * updated for config by dataset * updated read me	2025-03-03 21:58:32 +01:00
Andreas Köpf	c0cf237474	Reduce precision from 28 to 6 in DecimalArithmeticDataset (#256 )	2025-03-03 21:57:08 +01:00
Andreas Köpf	68ecdca2bb	add Chain of Draft and direct system prompt styles (#255 )	2025-03-03 21:56:31 +01:00
Zafir Stojanovski	01e1c8f9af	fix: Unify Prompts (#254 ) * remove cot * fix prompt template * fix pool matrix * spiral matrix fixed	2025-03-03 21:55:53 +01:00
joesharratt1229	49db4ed761	small change to word sequence reversal prompt (#252 ) corrected answer format	2025-03-02 17:34:35 +01:00
vncntt	3149edf2c4	fixed problems in knights_knaves (#251 ) * remove unnecessary variables * added depth logic * add depth tests	2025-03-02 08:47:54 +01:00
Andreas Köpf	24828e1889	Remove strip from ProceduralDataset::core score_answer() (#250 ) * remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True) * test: Move test_extract_answer() from test_dataset.py to test_utils.py * refactor: Improve decimal reward computation with more flexible comparison * fix: Implement rounding for format_number when round_if_needed is True * test: Add test case for compute_decimal_reward with sign and zeros	2025-03-02 08:46:36 +01:00
Andreas Köpf	a66a7e7965	Revert "log error message on bad api response (#243 )" (#249 ) This reverts commit `8e2089b6c0`.	2025-03-01 23:56:42 +01:00
Andreas Köpf	e71d2a96b6	feat: Add `category` property to `ProceduralDataset` to extract category name (#248 )	2025-03-01 23:11:40 +01:00
Zafir Stojanovski	f549909c3d	fix manipulate matrix (#247 )	2025-03-01 23:00:29 +01:00
Rich Jones	39f151ad14	more dynamic scoring for jumble (#246 )	2025-03-01 18:50:59 +01:00
Zafir Stojanovski	9c581f1be1	Mahjong Puzzle (#241 ) * mahjong	2025-03-01 16:27:26 +01:00
Andreas Köpf	4ad9d22fa3	Add base_url and api_key command line args for eval.py script (#244 ) * feat: Add base URL command line parameter to eval.py script * feat: Add API key parameter and CLI option to AsyncModelEvaluator	2025-02-28 18:32:58 +01:00
Rich Jones	8e2089b6c0	log error message on bad api response (#243 )	2025-02-28 15:32:27 +01:00
Andreas Köpf	b4207162ff	Eval sampling settings for generation (temperature, top-p, max_tokens) (#242 ) * feat: Add sampling parameters to eval configuration and API call * feat: Add support for system_prompt_id and optional system_prompt configuration	2025-02-28 11:48:37 +01:00
Andreas Koepf	b1c8840129	fix prompt for arc_1d	2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)	24a4b7a4c8	feat: Add system prompt to dataset results and summary output	2025-02-28 00:26:06 +01:00
Andreas Köpf	5b8d1b5175	Generate eval config tool (#240 ) * feat: Add generate_config.py script to create eval configurations	2025-02-27 21:40:53 +01:00
Andreas Köpf	850c1cf6f4	Eval script consolidation (#238 ) The script now supports: - YAML and JSON configurations - Dataset-specific parameters - Overriding configuration via command line - Detailed logging and error handling	2025-02-27 17:39:14 +01:00
Andreas Köpf	8a66d2a216	Merge pull request #237 from open-thought/rich/richmorevalfixes2 Fix graph color example template	2025-02-27 16:08:23 +01:00
Rich Jones	a6c90f40a1	rm typo	2025-02-27 13:44:33 +01:00
Rich Jones	1b95cd3206	fix graph color example template	2025-02-27 13:43:01 +01:00
Andreas Köpf	a56b3b6c5c	Merge pull request #186 from zafstojano/feat/codeio feat(env): CodeIO	2025-02-27 12:18:13 +01:00
Andreas Köpf	c98cc5fcd6	Merge pull request #220 from open-thought/rich/cubeinstructions Make Rubiks Cube Output Format More Explicit	2025-02-27 12:16:09 +01:00
Andreas Köpf	7f64a1bb7c	Merge pull request #236 from open-thought/rich/moreevalfixes Trivial Fixes	2025-02-27 12:14:43 +01:00
Rich Jones	253e49aecf	sm fixes	2025-02-27 11:54:04 +01:00
Rich Jones	52d6b2efd2	seed test config	2025-02-27 10:44:28 +01:00
Rich Jones	633a1aa1ba	expand more	2025-02-27 10:41:30 +01:00
Zafir Stojanovski	4c637c3b13	final tweaks	2025-02-27 08:38:34 +01:00

1323 commits