* feat: add scoring cascade for reducing false negatives in answer verification
* style: fix black and isort formatting
Run black and isort to satisfy pre-commit checks.
Made-with: Cursor
* docs: add scoring cascade example to Quickstart section
Mention the experimental scoring cascade feature at the end of the
Quickstart section with a disclaimer and complete usage examples
showing both the dataset method and standalone function.
Made-with: Cursor
* docs: shorten scoring cascade section in README
Trim to a concise standalone example per review feedback.
Made-with: Cursor
* docs: simplify scoring cascade description in README
Made-with: Cursor
* update readme
---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Move the make_impossible decision before the retry loop so it is fixed
per item. Previously it was re-rolled on every attempt, which skewed the
actual ratio of impossible puzzles well above the configured value.
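
A minimal sketch of the pattern this fix describes, not the project's
actual code. Only make_impossible comes from the commit message; the
names generate_item, p_impossible, try_build and max_attempts are
hypothetical and used for illustration.

```python
import random


def generate_item(rng, p_impossible, try_build, max_attempts=10):
    """Hypothetical generator: decide solvability once, then retry the build."""
    # Decide once per item, before the retry loop, so that a failed
    # attempt cannot re-roll the decision and bias the final ratio.
    make_impossible = rng.random() < p_impossible
    for _ in range(max_attempts):
        # try_build is an assumed callback returning a puzzle or None.
        puzzle = try_build(rng, make_impossible)
        if puzzle is not None:
            return puzzle, make_impossible
    raise RuntimeError("could not build a puzzle within max_attempts")
```

With the draw outside the loop, every retry for an item sees the same
make_impossible value, so the share of impossible items is set by
p_impossible alone and not by how often each kind of build fails.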
The prompt asked to "find the length of the shortest path" but the expected
answer is a sequence of directions. This caused models to answer with a number
instead of directions, degrading evaluation results.
Closes #522
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Added assertion and infinite loop fix for maze environment
* Fixed maze grid size validation formula
* Removed assertion check because it did not work for all maze configurations
* Refactor expression generation and substitution logic
Updated symbol naming and added safe replacement for expressions.
* Add expr_str to return values in countdown.py
Modified return statement to include the modified expression string.
* Implement test for min_numbers exceeding 10
Add test for CountdownDataset with more than 10 numbers
* Remove trailing whitespace
* Improve readability of CountdownDataset initialization
Refactor CountdownDataset initialization for readability.
* v0
* 2 gpu setup
* improve parsing from yaml
* update yaml dataset example
* remove restriction on flash attn
* more comments
* first version of the readme
* pin torch
* simplify requirements
* just flash attn
* use set env instead
* simpler set env
* readme
* add wandb project to setup
* update template
* update model id
* post init to capture the config and weight
* extract metadata
* update config
* update dataset config
* move env for wandb project
* pre-commit
* remove qwen-math from training
* more instructions
* unused import
* remove trl old
* warmup ratio
* warmup ratio
* change model id
* change model_id
* add info about CUDA_VISIBLE_DEVICES