reasoning-gym

mirror of https://github.com/open-thought/reasoning-gym.git synced 2026-04-24 17:05:03 +00:00

Author	SHA1	Message	Date
Andreas Koepf	fa1bf7910a	update gallery, pypi release, bump version	2025-03-05 23:45:45 +01:00
joesharratt1229	1893691c57	updated algorithmics dataset (#269 ) * updated algorithmic datasets * added changes to symbolic and power * updated power function test	2025-03-05 23:32:53 +01:00
Zafir Stojanovski	f843ac1b82	shortest path curriculum (#271 )	2025-03-05 22:46:10 +01:00
Zafir Stojanovski	a048084009	largest island curriculum (#270 )	2025-03-05 22:45:35 +01:00
Zafir Stojanovski	3d9bb382aa	feat(env): Count Bits Curriculum (#267 ) * add min n * count bits	2025-03-05 22:44:04 +01:00
Zafir Stojanovski	84158df1c7	feat(env): Course Schedule Curriculum (#266 ) * course schedule curriculum * update levels * update comments * lint	2025-03-05 22:42:46 +01:00
joesharratt1229	2c524c0c6f	Added puzzle24 closes #208 (#268 ) * added puzzle24	2025-03-05 22:36:37 +01:00
Oliver Stanley	3286a68361	First version of CodeI/O reasoning data (#264 ) * notebook for prepping first set of raw code files * updated codeio processing notebook for repo-level processing * fix for edge case in codeio scoring * Add reformat notebook * filtering pass * add non-determinism filtering * Tweak CodeIODataset & include first real data * add basic codeio test, metadata	2025-03-05 22:34:11 +01:00
joesharratt1229	7458dbc95d	Fixed `countdown` `score_answer` (#265 ) * fixed countdown score ans * checked solution uses all numbers	2025-03-05 22:30:12 +01:00
Zafir Stojanovski	3c544aba20	feat(env): Mahjong Puzzle Curriculum (#263 ) * mahjong curriculum * typo * update levels	2025-03-05 22:28:02 +01:00
Zafir Stojanovski	19ca54da72	feat(env): NQueens Curriculum (#262 ) * curriculum & tests	2025-03-05 15:05:17 +01:00
Andreas Köpf	b2904ccab9	Minor question template & score_answer improvements (#261 ) * math prompt improvements * ignore brackets in complex_arithmetic results * improve additional instruction in prompt of polynomial_equations * more strict tests for score_answer in polynomial_equations * simplify special reward handling * fix test_intermediate_integration * fix sokoban dataset * add common dataset score_answer consistency test	2025-03-04 21:55:09 +01:00
joesharratt1229	bf24999bb0	implemented family_relationships score ans (#260 )	2025-03-04 21:37:57 +01:00
vncntt	478646622e	should exit if API key isn't defined (#259 ) * should exit if open-router and no api key	2025-03-04 09:45:36 +01:00
Rich Jones	e3b7365f50	Game of Life partial scoring and rule-clarification (#258 ) * partial scoring and rule clarification * better ql scoring * word seq reverse typos	2025-03-03 22:22:39 +01:00
joesharratt1229	340d6a7ab9	updated for config by dataset (#257 ) * updated for config by dataset * updated read me	2025-03-03 21:58:32 +01:00
Andreas Köpf	07388767a2	Reduce precision from 28 to 6 in DecimalArithmeticDataset (#256 )	2025-03-03 21:57:08 +01:00
Andreas Köpf	17f87476a3	add Chain of Draft and direct system prompt styles (#255 )	2025-03-03 21:56:31 +01:00
Zafir Stojanovski	2f9d94c1e7	fix: Unify Prompts (#254 ) * remove cot * fix prompt template * fix pool matrix * spiral matrix fixed	2025-03-03 21:55:53 +01:00
joesharratt1229	976e1710a6	small change to word sequence reversal prompt (#252 ) corrected answer format	2025-03-02 17:34:35 +01:00
vncntt	8992037ecc	fixed problems in knights_knaves (#251 ) * remove unnecessary variables * added depth logic * add depth tests	2025-03-02 08:47:54 +01:00
Andreas Köpf	ece6990709	Remove strip from ProceduralDataset::core score_answer() (#250 ) * remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True) * test: Move test_extract_answer() from test_dataset.py to test_utils.py * refactor: Improve decimal reward computation with more flexible comparison * fix: Implement rounding for format_number when round_if_needed is True * test: Add test case for compute_decimal_reward with sign and zeros	2025-03-02 08:46:36 +01:00
Andreas Köpf	16a4ea1193	Revert "log error message on bad api response (#243 )" (#249 ) This reverts commit `27e66ba6dd`.	2025-03-01 23:56:42 +01:00
Andreas Köpf	1b1c04bb70	feat: Add `category` property to `ProceduralDataset` to extract category name (#248 )	2025-03-01 23:11:40 +01:00
Zafir Stojanovski	1bc9f6f09f	fix manipulate matrix (#247 )	2025-03-01 23:00:29 +01:00
Rich Jones	80aafda8e5	more dynamic scoring for jumble (#246 )	2025-03-01 18:50:59 +01:00
Zafir Stojanovski	78c92d7056	Mahjong Puzzle (#241 ) * mahjong	2025-03-01 16:27:26 +01:00
Andreas Köpf	dbd2ac723e	Add base_url and api_key command line args for eval.py script (#244 ) * feat: Add base URL command line parameter to eval.py script * feat: Add API key parameter and CLI option to AsyncModelEvaluator	2025-02-28 18:32:58 +01:00
Rich Jones	27e66ba6dd	log error message on bad api response (#243 )	2025-02-28 15:32:27 +01:00
Andreas Köpf	59922486c6	Eval sampling settings for generation (temperature, top-p, max_tokens) (#242 ) * feat: Add sampling parameters to eval configuration and API call * feat: Add support for system_prompt_id and optional system_prompt configuration	2025-02-28 11:48:37 +01:00
Andreas Koepf	d83e53115a	fix prompt for arc_1d	2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)	82e79d672e	feat: Add system prompt to dataset results and summary output	2025-02-28 00:26:06 +01:00
Andreas Köpf	0b108efac1	Generate eval config tool (#240 ) * feat: Add generate_config.py script to create eval configurations	2025-02-27 21:40:53 +01:00
Andreas Köpf	1ea9a657a7	Eval script consolidation (#238 ) The script now supports: - YAML and JSON configurations - Dataset-specific parameters - Overriding configuration via command line - Detailed logging and error handling	2025-02-27 17:39:14 +01:00
Andreas Köpf	bd745ae959	Merge pull request #237 from open-thought/rich/richmorevalfixes2 Fix graph color example template	2025-02-27 16:08:23 +01:00
Rich Jones	ca5372dcc1	rm typo	2025-02-27 13:44:33 +01:00
Rich Jones	9a8e398f22	fix graph color example template	2025-02-27 13:43:01 +01:00
Andreas Köpf	ba9d625ef4	Merge pull request #186 from zafstojano/feat/codeio feat(env): CodeIO	2025-02-27 12:18:13 +01:00
Andreas Köpf	ed90fff3fa	Merge pull request #220 from open-thought/rich/cubeinstructions Make Rubiks Cube Output Format More Explicit	2025-02-27 12:16:09 +01:00
Andreas Köpf	1cc6eded6a	Merge pull request #236 from open-thought/rich/moreevalfixes Trivial Fixes	2025-02-27 12:14:43 +01:00
Rich Jones	a1b1272e8d	sm fixes	2025-02-27 11:54:04 +01:00
Rich Jones	b2b2311329	seed test config	2025-02-27 10:44:28 +01:00
Rich Jones	9daaccc208	expand more	2025-02-27 10:41:30 +01:00
Zafir Stojanovski	2c566f76ea	final tweaks	2025-02-27 08:38:34 +01:00
Andreas Köpf	6ceb03f224	Merge pull request #233 from open-thought/llama-3.3-70_eval_config Llama 3.3 70 eval config	2025-02-26 22:56:33 +01:00
Andreas Koepf	4cd5bd42c3	verify that OPENROUTER_API_KEY env var is set	2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)	a92dcd4a75	feat: Add comprehensive unit tests for parse_string_to_complex() method	2025-02-26 21:44:32 +01:00
Andreas Koepf	726ba114dc	add llama-3.3-70b-instruct eval yaml files	2025-02-26 20:54:07 +01:00
Zafir Stojanovski	4a59d13100	update timeout	2025-02-26 20:27:43 +01:00
Zafir Stojanovski	20c8392417	e2b testing	2025-02-26 20:19:52 +01:00

1117 commits