Andreas Köpf
1b1c04bb70
feat: Add category property to ProceduralDataset to extract category name ( #248 )
2025-03-01 23:11:40 +01:00
Zafir Stojanovski
1bc9f6f09f
fix manipulate matrix ( #247 )
2025-03-01 23:00:29 +01:00
Rich Jones
80aafda8e5
more dynamic scoring for jumble ( #246 )
2025-03-01 18:50:59 +01:00
Zafir Stojanovski
78c92d7056
Mahjong Puzzle ( #241 )
...
* mahjong
2025-03-01 16:27:26 +01:00
Andreas Köpf
dbd2ac723e
Add base_url and api_key command line args for eval.py script ( #244 )
...
* feat: Add base URL command line parameter to eval.py script
* feat: Add API key parameter and CLI option to AsyncModelEvaluator
2025-02-28 18:32:58 +01:00
Rich Jones
27e66ba6dd
log error message on bad api response ( #243 )
2025-02-28 15:32:27 +01:00
Andreas Köpf
59922486c6
Eval sampling settings for generation (temperature, top-p, max_tokens) ( #242 )
...
* feat: Add sampling parameters to eval configuration and API call
* feat: Add support for system_prompt_id and optional system_prompt configuration
2025-02-28 11:48:37 +01:00
Andreas Koepf
d83e53115a
fix prompt for arc_1d
2025-02-28 08:07:59 +01:00
Andreas Koepf (aider)
82e79d672e
feat: Add system prompt to dataset results and summary output
2025-02-28 00:26:06 +01:00
Andreas Köpf
0b108efac1
Generate eval config tool ( #240 )
...
* feat: Add generate_config.py script to create eval configurations
2025-02-27 21:40:53 +01:00
Andreas Köpf
1ea9a657a7
Eval script consolidation ( #238 )
...
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
2025-02-27 17:39:14 +01:00
Andreas Köpf
bd745ae959
Merge pull request #237 from open-thought/rich/richmorevalfixes2
...
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
ca5372dcc1
rm typo
2025-02-27 13:44:33 +01:00
Rich Jones
9a8e398f22
fix graph color example template
2025-02-27 13:43:01 +01:00
Andreas Köpf
ba9d625ef4
Merge pull request #186 from zafstojano/feat/codeio
...
feat(env): CodeIO
2025-02-27 12:18:13 +01:00
Andreas Köpf
ed90fff3fa
Merge pull request #220 from open-thought/rich/cubeinstructions
...
Make Rubiks Cube Output Format More Explicit
2025-02-27 12:16:09 +01:00
Andreas Köpf
1cc6eded6a
Merge pull request #236 from open-thought/rich/moreevalfixes
...
Trivial Fixes
2025-02-27 12:14:43 +01:00
Rich Jones
a1b1272e8d
sm fixes
2025-02-27 11:54:04 +01:00
Rich Jones
b2b2311329
seed test config
2025-02-27 10:44:28 +01:00
Rich Jones
9daaccc208
expand more
2025-02-27 10:41:30 +01:00
Zafir Stojanovski
2c566f76ea
final tweaks
2025-02-27 08:38:34 +01:00
Andreas Köpf
6ceb03f224
Merge pull request #233 from open-thought/llama-3.3-70_eval_config
...
Llama 3.3 70 eval config
2025-02-26 22:56:33 +01:00
Andreas Koepf
4cd5bd42c3
verify that OPENROUTER_API_KEY env var is set
2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)
a92dcd4a75
feat: Add comprehensive unit tests for parse_string_to_complex() method
2025-02-26 21:44:32 +01:00
Andreas Koepf
726ba114dc
add llama-3.3-70b-instruct eval yaml files
2025-02-26 20:54:07 +01:00
Zafir Stojanovski
4a59d13100
update timeout
2025-02-26 20:27:43 +01:00
Zafir Stojanovski
20c8392417
e2b testing
2025-02-26 20:19:52 +01:00
Andreas Koepf
2362b52d24
add markdown triple backticks around tsumego board
2025-02-26 19:39:05 +01:00
Andreas Köpf
95821a72bc
Merge pull request #232 from open-thought/211_fix_tsumego_score_answer
...
Fix & simplify score_answer() of TsumegoDataset
2025-02-26 19:07:32 +01:00
Andreas Koepf
2ddcb7c3c7
fix & simplify score_answer() of TsumegoDataset
2025-02-26 19:04:30 +01:00
Andreas Koepf
3bdf531122
bump version, pypi release of 0.1.12
2025-02-26 18:25:16 +01:00
Andreas Koepf
3c16f1c195
update gallery
2025-02-26 18:23:06 +01:00
Oliver Stanley
a0d466765a
Merge pull request #188 from olliestanley/codeio-sampler
...
Procedural dataset for generating reasoning problems from CodeI/O-style data
2025-02-26 16:51:45 +00:00
Andreas Köpf
7c4ab296fd
Merge pull request #231 from AhmedSaif2/count-primes
...
Fix primes representation in count_primes dataset metadata
2025-02-26 17:49:50 +01:00
Andreas Köpf
42d42aae89
Merge pull request #219 from open-thought/rich/fix_ccc
...
Fix Cube Rotation Scoring
2025-02-26 17:41:18 +01:00
AhmedSaif2
e9e36f3a23
Fix primes representation in count_primes dataset metadata
2025-02-26 14:58:21 +02:00
Rich Jones
214e9d4957
support expanded notation anyway
2025-02-26 13:17:03 +01:00
Rich Jones
b252937f99
rubiks cube instructions
2025-02-26 13:07:17 +01:00
Rich Jones
f2479fcacc
fix CCC scoring
2025-02-26 12:54:40 +01:00
Oliver
8f05e6108c
Fix
2025-02-26 11:17:23 +00:00
Andreas Köpf
c0e5941fe5
Merge pull request #217 from open-thought/feat/o3-mini-eun
...
added o3 mini yaml configuration
2025-02-26 09:38:11 +01:00
vncntt
98af865309
fix sonnet eval_dir ( #216 )
...
* fix eval_dir
* add logging
2025-02-26 09:37:09 +01:00
joesharratt1229
8eaece6f05
added o3 mini yaml
2025-02-26 08:09:12 +00:00
Andreas Köpf
6b923d5ea0
Fix PoolMatrixConfigs::score_answer(), add unit tests ( #215 )
2025-02-26 00:43:18 +01:00
Andreas Köpf
7317c6f0b4
Merge pull request #212 from open-thought/eval_consolidation_2
...
Add llama-3.3-70b-instruct algebra, algorithmic eval configs
2025-02-25 23:46:08 +01:00
Andreas Koepf
ba6bdb7d6b
fix score_answer of pool_matrix (if -> elif), remove print
2025-02-25 23:43:29 +01:00
Andreas Koepf
969ec6a208
add try-except to GraphColorDataset.score_answer()
2025-02-25 23:43:29 +01:00
Andreas Koepf
d1f2f30d8a
add None/empty check to score_answer of cryptarithm
2025-02-25 23:43:29 +01:00
Andreas Koepf
9b7eec2d64
add llama-3.3-70b-instruct algebra, algorithmic eval configs
2025-02-25 23:43:29 +01:00
Andreas Koepf
70b9cc813e
fix formatting of NOTICE.txt
2025-02-25 23:43:12 +01:00