* feat: add scoring cascade for reducing false negatives in answer verification
* style: fix black and isort formatting
Run black and isort to satisfy pre-commit checks.
Made-with: Cursor
* docs: add scoring cascade example to Quickstart section
Mention the experimental scoring cascade feature at the end of the
Quickstart section with a disclaimer and complete usage examples
showing both the dataset method and standalone function.
Made-with: Cursor
* docs: shorten scoring cascade section in README
Trim to a concise standalone example per review feedback.
Made-with: Cursor
* docs: simplify scoring cascade description in README
Made-with: Cursor
* update README
---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
* math prompt improvements
* ignore brackets in complex_arithmetic results
* improve the additional instruction in the polynomial_equations prompt
* stricter tests for score_answer in polynomial_equations
* simplify special reward handling
* fix test_intermediate_integration
* fix sokoban dataset
* add common dataset score_answer consistency test
* remove strip from ProceduralDataset::core score_answer(); strip in extract answer instead (optional, default=True)
* test: Move test_extract_answer() from test_dataset.py to test_utils.py
* refactor: Improve decimal reward computation with more flexible comparison
* fix: Implement rounding for format_number when round_if_needed is True
* test: Add test case for compute_decimal_reward with sign and zeros
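The decimal-reward entries above describe comparing answers more flexibly, including signs and zeros. A minimal sketch of that idea follows. The helper name `decimal_match` is hypothetical and is not the repository's `compute_decimal_reward`; it only assumes that two decimal strings should be compared by numeric value rather than by text.

```python
from decimal import Decimal, InvalidOperation


def decimal_match(answer: str, expected: str) -> bool:
    """Hypothetical helper: compare two decimal strings by value, not by text.

    An explicit sign ("+1.5") and trailing or leading zeros ("1.50", "01.5")
    do not affect the result. Unparseable input never matches.
    """
    try:
        return Decimal(answer.strip()) == Decimal(expected.strip())
    except InvalidOperation:
        # Not a valid decimal literal, e.g. "abc" or "1,5".
        return False
```

Under this assumption, `decimal_match("+1.50", "1.5")` and `decimal_match("-0.0", "0")` both hold, while a malformed answer is simply treated as a mismatch.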
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration via command line
- Detailed logging and error handling
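One way the configuration features listed above could fit together is sketched below. Everything here is an assumption for illustration: the function names, the `--config` and `--set KEY=VALUE` flags, and the flat key layout are hypothetical and are not taken from the actual script. YAML loading relies on the third-party PyYAML package.

```python
import argparse
import json
from pathlib import Path


def load_config(path: str) -> dict:
    """Load a config file, choosing the parser from the file extension."""
    text = Path(path).read_text()
    if Path(path).suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML, third-party; assumed to be installed

        return yaml.safe_load(text)
    return json.loads(text)


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with KEY=VALUE overrides applied on top."""
    merged = dict(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            # Parse numbers, booleans, and lists as JSON where possible.
            merged[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Otherwise keep the raw string.
            merged[key] = raw
    return merged


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical CLI: --config PATH plus repeatable --set KEY=VALUE."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument(
        "--set", dest="overrides", action="append", default=[], metavar="KEY=VALUE"
    )
    return parser.parse_args(argv)
```

With these assumed flags, a run such as `--config eval.yaml --set size=50` would load the file first and then let the command-line value win over the one in the file.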