Andreas Koepf
fa1bf7910a
update gallery, pypi release, bump version
2025-03-05 23:45:45 +01:00
joesharratt1229
1893691c57
updated algorithmic dataset ( #269 )
...
* updated algorithmic datasets
* added changes to symbolic and power
* updated power function test
2025-03-05 23:32:53 +01:00
Zafir Stojanovski
f843ac1b82
shortest path curriculum ( #271 )
2025-03-05 22:46:10 +01:00
Zafir Stojanovski
a048084009
largest island curriculum ( #270 )
2025-03-05 22:45:35 +01:00
Zafir Stojanovski
3d9bb382aa
feat(env): Count Bits Curriculum ( #267 )
...
* add min n
* count bits
2025-03-05 22:44:04 +01:00
Zafir Stojanovski
84158df1c7
feat(env): Course Schedule Curriculum ( #266 )
...
* course schedule curriculum
* update levels
* update comments
* lint
2025-03-05 22:42:46 +01:00
joesharratt1229
2c524c0c6f
Added puzzle24 closes #208 ( #268 )
...
* added puzzle24
2025-03-05 22:36:37 +01:00
Oliver Stanley
3286a68361
First version of CodeI/O reasoning data ( #264 )
...
* notebook for prepping first set of raw code files
* updated codeio processing notebook for repo-level processing
* fix for edge case in codeio scoring
* Add reformat notebook
* filtering pass
* add non-determinism filtering
* Tweak CodeIODataset & include first real data
* add basic codeio test, metadata
2025-03-05 22:34:11 +01:00
joesharratt1229
7458dbc95d
Fixed countdown score_answer ( #265 )
...
* fixed countdown score ans
* checked solution uses all numbers
2025-03-05 22:30:12 +01:00
Zafir Stojanovski
3c544aba20
feat(env): Mahjong Puzzle Curriculum ( #263 )
...
* mahjong curriculum
* typo
* update levels
2025-03-05 22:28:02 +01:00
Zafir Stojanovski
19ca54da72
feat(env): NQueens Curriculum ( #262 )
...
* curriculum & tests
2025-03-05 15:05:17 +01:00
Andreas Köpf
b2904ccab9
Minor question template & score_answer improvements ( #261 )
...
* math prompt improvements
* ignore brackets in complex_arithmetic results
* improve additional instruction in prompt of polynomial_equations
* more strict tests for score_answer in polynomial_equations
* simplify special reward handling
* fix test_intermediate_integration
* fix sokoban dataset
* add common dataset score_answer consistency test
2025-03-04 21:55:09 +01:00
joesharratt1229
bf24999bb0
implemented family_relationships score ans ( #260 )
2025-03-04 21:37:57 +01:00
vncntt
478646622e
should exit if API key isn't defined ( #259 )
...
* should exit if open-router and no api key
2025-03-04 09:45:36 +01:00
Rich Jones
e3b7365f50
Game of Life partial scoring and rule-clarification ( #258 )
...
* partial scoring and rule clarification
* better ql scoring
* word seq reverse typos
2025-03-03 22:22:39 +01:00
joesharratt1229
340d6a7ab9
updated for config by dataset ( #257 )
...
* updated for config by dataset
* updated README
2025-03-03 21:58:32 +01:00
Andreas Köpf
07388767a2
Reduce precision from 28 to 6 in DecimalArithmeticDataset ( #256 )
2025-03-03 21:57:08 +01:00
Andreas Köpf
17f87476a3
add Chain of Draft and direct system prompt styles ( #255 )
2025-03-03 21:56:31 +01:00
Zafir Stojanovski
2f9d94c1e7
fix: Unify Prompts ( #254 )
...
* remove cot
* fix prompt template
* fix pool matrix
* spiral matrix fixed
2025-03-03 21:55:53 +01:00
joesharratt1229
976e1710a6
small change to word sequence reversal prompt ( #252 )
...
corrected answer format
2025-03-02 17:34:35 +01:00
vncntt
8992037ecc
fixed problems in knights_knaves ( #251 )
...
* remove unnecessary variables
* added depth logic
* add depth tests
2025-03-02 08:47:54 +01:00
Andreas Köpf
ece6990709
Remove strip from ProceduralDataset::core score_answer() ( #250 )
...
* remove strip from ProceduralDataset::core score_answer(), strip in extract answer (optional, default=True)
* test: Move test_extract_answer() from test_dataset.py to test_utils.py
* refactor: Improve decimal reward computation with more flexible comparison
* fix: Implement rounding for format_number when round_if_needed is True
* test: Add test case for compute_decimal_reward with sign and zeros
2025-03-02 08:46:36 +01:00
Andreas Köpf
16a4ea1193
Revert "log error message on bad api response ( #243 )" ( #249 )
...
This reverts commit 27e66ba6dd.
2025-03-01 23:56:42 +01:00
Andreas Köpf
1b1c04bb70
feat: Add category property to ProceduralDataset to extract category name ( #248 )
2025-03-01 23:11:40 +01:00
Zafir Stojanovski
1bc9f6f09f
fix manipulate matrix ( #247 )
2025-03-01 23:00:29 +01:00
Rich Jones
80aafda8e5
more dynamic scoring for jumble ( #246 )
2025-03-01 18:50:59 +01:00
Zafir Stojanovski
78c92d7056
Mahjong Puzzle ( #241 )
...
* mahjong
2025-03-01 16:27:26 +01:00
Andreas Köpf
dbd2ac723e
Add base_url and api_key command line args for eval.py script ( #244 )
...
* feat: Add base URL command line parameter to eval.py script
* feat: Add API key parameter and CLI option to AsyncModelEvaluator
2025-02-28 18:32:58 +01:00
Rich Jones
27e66ba6dd
log error message on bad api response ( #243 )
2025-02-28 15:32:27 +01:00
Andreas Köpf
59922486c6
Eval sampling settings for generation (temperature, top-p, max_tokens) ( #242 )
...
* feat: Add sampling parameters to eval configuration and API call
* feat: Add support for system_prompt_id and optional system_prompt configuration
2025-02-28 11:48:37 +01:00
Andreas Koepf
d83e53115a
fix prompt for arc_1d
2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)
82e79d672e
feat: Add system prompt to dataset results and summary output
2025-02-28 00:26:06 +01:00
Andreas Köpf
0b108efac1
Generate eval config tool ( #240 )
...
* feat: Add generate_config.py script to create eval configurations
2025-02-27 21:40:53 +01:00
Andreas Köpf
1ea9a657a7
Eval script consolidation ( #238 )
...
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
2025-02-27 17:39:14 +01:00
Andreas Köpf
bd745ae959
Merge pull request #237 from open-thought/rich/richmorevalfixes2
...
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
ca5372dcc1
rm typo
2025-02-27 13:44:33 +01:00
Rich Jones
9a8e398f22
fix graph color example template
2025-02-27 13:43:01 +01:00
Andreas Köpf
ba9d625ef4
Merge pull request #186 from zafstojano/feat/codeio
...
feat(env): CodeIO
2025-02-27 12:18:13 +01:00
Andreas Köpf
ed90fff3fa
Merge pull request #220 from open-thought/rich/cubeinstructions
...
Make Rubiks Cube Output Format More Explicit
2025-02-27 12:16:09 +01:00
Andreas Köpf
1cc6eded6a
Merge pull request #236 from open-thought/rich/moreevalfixes
...
Trivial Fixes
2025-02-27 12:14:43 +01:00
Rich Jones
a1b1272e8d
sm fixes
2025-02-27 11:54:04 +01:00
Rich Jones
b2b2311329
seed test config
2025-02-27 10:44:28 +01:00
Rich Jones
9daaccc208
expand more
2025-02-27 10:41:30 +01:00
Zafir Stojanovski
2c566f76ea
final tweaks
2025-02-27 08:38:34 +01:00
Andreas Köpf
6ceb03f224
Merge pull request #233 from open-thought/llama-3.3-70_eval_config
...
Llama 3.3 70 eval config
2025-02-26 22:56:33 +01:00
Andreas Koepf
4cd5bd42c3
verify that OPENROUTER_API_KEY env var is set
2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)
a92dcd4a75
feat: Add comprehensive unit tests for parse_string_to_complex() method
2025-02-26 21:44:32 +01:00
Andreas Koepf
726ba114dc
add llama-3.3-70b-instruct eval yaml files
2025-02-26 20:54:07 +01:00
Zafir Stojanovski
4a59d13100
update timeout
2025-02-26 20:27:43 +01:00
Zafir Stojanovski
20c8392417
e2b testing
2025-02-26 20:19:52 +01:00