Andreas Köpf
850c1cf6f4
Eval script consolidation (#238)
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
2025-02-27 17:39:14 +01:00
Andreas Köpf
8a66d2a216
Merge pull request #237 from open-thought/rich/richmorevalfixes2
Fix graph color example template
2025-02-27 16:08:23 +01:00
Rich Jones
a6c90f40a1
rm typo
2025-02-27 13:44:33 +01:00
Rich Jones
1b95cd3206
fix graph color example template
2025-02-27 13:43:01 +01:00
Andreas Köpf
a56b3b6c5c
Merge pull request #186 from zafstojano/feat/codeio
feat(env): CodeIO
2025-02-27 12:18:13 +01:00
Andreas Köpf
c98cc5fcd6
Merge pull request #220 from open-thought/rich/cubeinstructions
Make Rubiks Cube Output Format More Explicit
2025-02-27 12:16:09 +01:00
Andreas Köpf
7f64a1bb7c
Merge pull request #236 from open-thought/rich/moreevalfixes
Trivial Fixes
2025-02-27 12:14:43 +01:00
Rich Jones
253e49aecf
sm fixes
2025-02-27 11:54:04 +01:00
Rich Jones
52d6b2efd2
seed test config
2025-02-27 10:44:28 +01:00
Rich Jones
633a1aa1ba
expand more
2025-02-27 10:41:30 +01:00
Zafir Stojanovski
4c637c3b13
final tweaks
2025-02-27 08:38:34 +01:00
Andreas Köpf
359feb47c3
Merge pull request #233 from open-thought/llama-3.3-70_eval_config
Llama 3.3 70 eval config
2025-02-26 22:56:33 +01:00
Andreas Koepf
477e1f85cc
verify that OPENROUTER_API_KEY env var is set
2025-02-26 22:15:30 +01:00
Andreas Koepf (aider)
941da618d8
feat: Add comprehensive unit tests for parse_string_to_complex() method
2025-02-26 21:44:32 +01:00
Andreas Koepf
acb2d7eb53
add llama-3.3-70b-instruct eval yaml files
2025-02-26 20:54:07 +01:00
Zafir Stojanovski
1ec625cbd9
update timeout
2025-02-26 20:27:43 +01:00
Zafir Stojanovski
2ce450486d
e2b testing
2025-02-26 20:19:52 +01:00
Andreas Koepf
6511725711
add markdown triple backticks around tsumego board
2025-02-26 19:39:05 +01:00
Andreas Köpf
fb556c39ca
Merge pull request #232 from open-thought/211_fix_tsumego_score_answer
Fix & simplify score_answer() of TsumegoDataset
2025-02-26 19:07:32 +01:00
Andreas Koepf
f97bf94caa
fix & simplify score_answer() of TsumegoDataset
2025-02-26 19:04:30 +01:00
Andreas Koepf
72233fc2ea
bump version, pypi release of 0.1.12
2025-02-26 18:25:16 +01:00
Andreas Koepf
b5742de5e5
update gallery
2025-02-26 18:23:06 +01:00
Oliver Stanley
ac4ce13369
Merge pull request #188 from olliestanley/codeio-sampler
Procedural dataset for generating reasoning problems from CodeI/O-style data
2025-02-26 16:51:45 +00:00
Andreas Köpf
5e1594da16
Merge pull request #231 from AhmedSaif2/count-primes
Fix primes representation in count_primes dataset metadata
2025-02-26 17:49:50 +01:00
Andreas Köpf
e351d302a3
Merge pull request #219 from open-thought/rich/fix_ccc
Fix Cube Rotation Scoring
2025-02-26 17:41:18 +01:00
AhmedSaif2
dcdc38b15d
Fix primes representation in count_primes dataset metadata
2025-02-26 14:58:21 +02:00
Rich Jones
f0ca949aaf
support expanded notation anyway
2025-02-26 13:17:03 +01:00
Rich Jones
285e2b20cc
rubiks cube instructions
2025-02-26 13:07:17 +01:00
Rich Jones
229086131a
fix CCC scoring
2025-02-26 12:54:40 +01:00
Oliver
5fa06c961f
Fix
2025-02-26 11:17:23 +00:00
Andreas Köpf
5b89a3a2d0
Merge pull request #217 from open-thought/feat/o3-mini-eun
added o3 mini yaml configuration
2025-02-26 09:38:11 +01:00
vncntt
29179f783e
fix sonnet eval_dir (#216)
* fix eval_dir
* add logging
2025-02-26 09:37:09 +01:00
joesharratt1229
7d7e44d1af
added o3 mini yaml
2025-02-26 08:09:12 +00:00
Andreas Köpf
48f082663a
Fix PoolMatrixConfigs::score_answer(), add unit tests (#215)
2025-02-26 00:43:18 +01:00
Andreas Köpf
99fec3425d
Merge pull request #212 from open-thought/eval_consolidation_2
Add llama-3.3-70b-instruct algebra, algorithmic eval configs
2025-02-25 23:46:08 +01:00
Andreas Koepf
bba128ffd0
fix score_answer of pool_matrix (if -> elif), remove print
2025-02-25 23:43:29 +01:00
Andreas Koepf
f9e8f8b064
add try-except to GraphColorDataset.score_answer()
2025-02-25 23:43:29 +01:00
Andreas Koepf
65d17b9850
add None/empty check to score_answer of cryptarithm
2025-02-25 23:43:29 +01:00
Andreas Koepf
6d5168d1e5
add llama-3.3-70b-instruct algebra, algorithmic eval configs
2025-02-25 23:43:29 +01:00
Andreas Koepf
92c8be1699
fix formatting of NOTICE.txt
2025-02-25 23:43:12 +01:00
Oliver
3028492933
Fix pre-commit
2025-02-25 22:42:06 +00:00
Oliver
aa6759c160
Merge branch 'main' into codeio-sampler
2025-02-25 22:41:47 +00:00
Oliver
81c77a495d
Add note on code execution to CodeIODataset
2025-02-25 22:39:06 +00:00
Oliver
0252dd905f
Move data file & load into memory on first object creation
2025-02-25 22:36:38 +00:00
Zafir Stojanovski
b47bf882ce
filtering
2025-02-25 22:21:26 +01:00
Andreas Koepf (aider)
8ccf077faf
docs: Add BibTeX citation for Re-ARC dataset in NOTICE.txt
2025-02-25 20:19:11 +01:00
vncntt
5f01049607
Add KnightsKnavesDataset (knights_knaves)
Adapted code from https://github.com/AlphaPav/mem-kk-logic/blob/main/data_prep/lib_kk.py
---------
Co-authored-by: Andreas Koepf (aider) <andreas.koepf@provisio.com>
2025-02-25 20:15:38 +01:00
Andreas Köpf
ed9292a7f4
Merge pull request #205 from open-thought/consolidate_eval_script
Consolidate eval scripts to have single eval.py
2025-02-25 19:45:05 +01:00
Andreas Koepf
791f16ec0f
use results folder name for eval results
2025-02-25 19:41:21 +01:00
joesharratt1229
ffe60ef112
finalised readme
2025-02-25 18:14:39 +00:00