reasoning-gym

mirror of https://github.com/open-thought/reasoning-gym.git synced 2026-04-28 17:29:39 +00:00

Author	SHA1	Message	Date
Rich Jones	11c9790a25	[Env] Game of Life Halting Prediction (#272 ) This is a variant of the Game of Life task which, rather than trying to test the algorithmic simulation, tests the ability of the model to do explanatory reasoning of the board. The idea is that a model with good explanatory reasoning will be able to see that a game will not halt without simulating it into the future. The task presents a GoL board, and the model is asked to predict if the board will halt (die, all cells zero) after n steps. Sometimes, the board will be made up of 'oscillators', isolated structures which never die. Other times, it is filled with non-oscillators, structures which will always die after a few steps. The model should deduce which case the presented board is.	2025-03-07 10:05:12 +01:00
Andreas Koepf	fa1bf7910a	update gallery, pypi release, bump version	2025-03-05 23:45:45 +01:00
joesharratt1229	1893691c57	updated algorithmics dataset (#269 ) * updated algorithmic datasets * added changes to symbolic and power * updated power function test	2025-03-05 23:32:53 +01:00
Zafir Stojanovski	f843ac1b82	shortest path curriculum (#271 )	2025-03-05 22:46:10 +01:00
Zafir Stojanovski	a048084009	largest island curriculum (#270 )	2025-03-05 22:45:35 +01:00
Zafir Stojanovski	3d9bb382aa	feat(env): Count Bits Curriculum (#267 ) * add min n * count bits	2025-03-05 22:44:04 +01:00
Zafir Stojanovski	84158df1c7	feat(env): Course Schedule Curriculum (#266 ) * course schedule curriculum * update levels * update comments * lint	2025-03-05 22:42:46 +01:00
joesharratt1229	2c524c0c6f	Added puzzle24 closes #208 (#268 ) * added puzzle24	2025-03-05 22:36:37 +01:00
Oliver Stanley	3286a68361	First version of CodeI/O reasoning data (#264 ) * notebook for prepping first set of raw code files * updated codeio processing notebook for repo-level processing * fix for edge case in codeio scoring * Add reformat notebook * filtering pass * add non-determinism filtering * Tweak CodeIODataset & include first real data * add basic codeio test, metadata	2025-03-05 22:34:11 +01:00
joesharratt1229	7458dbc95d	Fixed `countdown` `score_answer` (#265 ) * fixed countdown score ans * checked solution uses all numbers	2025-03-05 22:30:12 +01:00
Zafir Stojanovski	3c544aba20	feat(env): Mahjong Puzzle Curriculum (#263 ) * mahjong curriculum * typo * update levels	2025-03-05 22:28:02 +01:00
Zafir Stojanovski	19ca54da72	feat(env): NQueens Curriculum (#262 ) * curriculum & tests	2025-03-05 15:05:17 +01:00
Andreas Köpf	b2904ccab9	Minor question template & score_answer improvements (#261 ) * math prompt improvements * ignore brackets in complex_arithmetic results * improve additional instruction in prompt of polynomial_equations * more strict tests for score_answer in polynomial_equations * simplify special reward handling * fix test_intermediate_integration * fix sokoban dataset * add common dataset score_answer consistency test	2025-03-04 21:55:09 +01:00
joesharratt1229	bf24999bb0	implemented family_relationships score ans (#260 )	2025-03-04 21:37:57 +01:00
vncntt	478646622e	should exit if API key isn't defined (#259 ) * should exit if open-router and no api key	2025-03-04 09:45:36 +01:00
Rich Jones	e3b7365f50	Game of Life partial scoring and rule-clarification (#258 ) * partial scoring and rule clarification * better ql scoring * word seq reverse typos	2025-03-03 22:22:39 +01:00
joesharratt1229	340d6a7ab9	updated for config by dataset (#257 ) * updated for config by dataset * updated read me	2025-03-03 21:58:32 +01:00
Andreas Köpf	07388767a2	Reduce precision from 28 to 6 in DecimalArithmeticDataset (#256 )	2025-03-03 21:57:08 +01:00
Andreas Köpf	17f87476a3	add Chain of Draft and direct system prompt styles (#255 )	2025-03-03 21:56:31 +01:00
Zafir Stojanovski	2f9d94c1e7	fix: Unify Prompts (#254 ) * remove cot * fix prompt template * fix pool matrix * spiral matrix fixed	2025-03-03 21:55:53 +01:00
joesharratt1229	976e1710a6	small change to word sequence reversal prompt (#252 ) corrected answer format	2025-03-02 17:34:35 +01:00
vncntt	8992037ecc	fixed problems in knights_knaves (#251 ) * remove unnecessary variables * added depth logic * add depth tests	2025-03-02 08:47:54 +01:00
Andreas Köpf	ece6990709	Remove strip from ProceduralDataset::core score_answer() (#250 ) * remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True) * test: Move test_extract_answer() from test_dataset.py to test_utils.py * refactor: Improve decimal reward computation with more flexible comparison * fix: Implement rounding for format_number when round_if_needed is True * test: Add test case for compute_decimal_reward with sign and zeros	2025-03-02 08:46:36 +01:00
Andreas Köpf	16a4ea1193	Revert "log error message on bad api response (#243 )" (#249 ) This reverts commit `27e66ba6dd`.	2025-03-01 23:56:42 +01:00
Andreas Köpf	1b1c04bb70	feat: Add `category` property to `ProceduralDataset` to extract category name (#248 )	2025-03-01 23:11:40 +01:00
Zafir Stojanovski	1bc9f6f09f	fix manipulate matrix (#247 )	2025-03-01 23:00:29 +01:00
Rich Jones	80aafda8e5	more dynamic scoring for jumble (#246 )	2025-03-01 18:50:59 +01:00
Zafir Stojanovski	78c92d7056	Mahjong Puzzle (#241 ) * mahjong	2025-03-01 16:27:26 +01:00
Andreas Köpf	dbd2ac723e	Add base_url and api_key command line args for eval.py script (#244 ) * feat: Add base URL command line parameter to eval.py script * feat: Add API key parameter and CLI option to AsyncModelEvaluator	2025-02-28 18:32:58 +01:00
Rich Jones	27e66ba6dd	log error message on bad api response (#243 )	2025-02-28 15:32:27 +01:00
Andreas Köpf	59922486c6	Eval sampling settings for generation (temperature, top-p, max_tokens) (#242 ) * feat: Add sampling parameters to eval configuration and API call * feat: Add support for system_prompt_id and optional system_prompt configuration	2025-02-28 11:48:37 +01:00
Andreas Koepf	d83e53115a	fix prompt for arc_1d	2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)	82e79d672e	feat: Add system prompt to dataset results and summary output	2025-02-28 00:26:06 +01:00
Andreas Köpf	0b108efac1	Generate eval config tool (#240 ) * feat: Add generate_config.py script to create eval configurations	2025-02-27 21:40:53 +01:00
Andreas Köpf	1ea9a657a7	Eval script consolidation (#238 ) The script now supports: - YAML and JSON configurations - Dataset-specific parameters - Overriding configuration via command line - Detailed logging and error handling	2025-02-27 17:39:14 +01:00
Andreas Köpf	bd745ae959	Merge pull request #237 from open-thought/rich/richmorevalfixes2 Fix graph color example template	2025-02-27 16:08:23 +01:00
Rich Jones	ca5372dcc1	rm typo	2025-02-27 13:44:33 +01:00
Rich Jones	9a8e398f22	fix graph color example template	2025-02-27 13:43:01 +01:00
Andreas Köpf	ba9d625ef4	Merge pull request #186 from zafstojano/feat/codeio feat(env): CodeIO	2025-02-27 12:18:13 +01:00
Andreas Köpf	ed90fff3fa	Merge pull request #220 from open-thought/rich/cubeinstructions Make Rubiks Cube Output Format More Explicit	2025-02-27 12:16:09 +01:00
Andreas Köpf	1cc6eded6a	Merge pull request #236 from open-thought/rich/moreevalfixes Trivial Fixes	2025-02-27 12:14:43 +01:00
Rich Jones	a1b1272e8d	sm fixes	2025-02-27 11:54:04 +01:00
Rich Jones	b2b2311329	seed test config	2025-02-27 10:44:28 +01:00
Rich Jones	9daaccc208	expand more	2025-02-27 10:41:30 +01:00
Zafir Stojanovski	2c566f76ea	final tweaks	2025-02-27 08:38:34 +01:00
Andreas Köpf	6ceb03f224	Merge pull request #233 from open-thought/llama-3.3-70_eval_config Llama 3.3 70 eval config	2025-02-26 22:56:33 +01:00
Andreas Koepf	4cd5bd42c3	verify that OPENROUTER_API_KEY env var is set	2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)	a92dcd4a75	feat: Add comprehensive unit tests for parse_string_to_complex() method	2025-02-26 21:44:32 +01:00
Andreas Koepf	726ba114dc	add llama-3.3-70b-instruct eval yaml files	2025-02-26 20:54:07 +01:00
Zafir Stojanovski	4a59d13100	update timeout	2025-02-26 20:27:43 +01:00
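
The halting question described in commit 11c9790a25 (does the board die, with all cells zero, within n steps?) can be illustrated with a brute-force check. This is only a sketch and not the repository's implementation: the names `step` and `halts_within` are hypothetical, and the board is assumed to have dead, non-wrapping edges, which may differ from the actual dataset. The task itself asks the model to reason about the structures on the board rather than simulate them.

```python
from typing import List

Board = List[List[int]]


def step(board: Board) -> Board:
    """Advance a Game of Life board by one generation (assumed non-wrapping edges)."""
    rows, cols = len(board), len(board[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours, treating cells outside the board as dead.
            n = sum(
                board[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            # Standard rules: birth on 3 neighbours, survival on 2 or 3.
            if n == 3 or (board[r][c] == 1 and n == 2):
                nxt[r][c] = 1
    return nxt


def halts_within(board: Board, n: int) -> bool:
    """Return True if every cell is zero at some generation 0..n."""
    for _ in range(n):
        if not any(any(row) for row in board):
            return True
        board = step(board)
    return not any(any(row) for row in board)


# A blinker is an oscillator: it flips between horizontal and vertical forever.
BLINKER = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Two adjacent cells are a non-oscillator: each has one neighbour and dies.
PAIR = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(halts_within(BLINKER, 20))
print(halts_within(PAIR, 1))
```

Simulation settles the answer for any fixed n, but it is exactly the work the task hopes a model can avoid by recognising an oscillator on sight.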


1118 commits