* feat: Add support for generating multiple completions per prompt
* feat: Track best and mean scores for multiple completions per prompt
* feat: Add checkpoint and resume functionality to evaluation script
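
As a rough illustration of the best/mean tracking described above, the sketch below aggregates scores when each prompt has several completions. The function name `summarize_scores` and the result keys are hypothetical and are not taken from the script itself.

```python
from statistics import mean


def summarize_scores(scores_per_prompt):
    """Summarize scores where each inner list holds one prompt's completions.

    Hypothetical helper: the real script may structure its results differently.
    """
    # Best and mean score over the completions of each individual prompt.
    per_prompt = [
        {"best": max(scores), "mean": mean(scores)}
        for scores in scores_per_prompt
    ]
    # Averages of those per-prompt values across the whole dataset.
    return {
        "per_prompt": per_prompt,
        "mean_best": mean(p["best"] for p in per_prompt),
        "mean_of_means": mean(p["mean"] for p in per_prompt),
    }


# Two prompts with three completions each.
summary = summarize_scores([[0.0, 1.0, 0.5], [1.0, 1.0, 1.0]])
```

Reporting both numbers separates "at least one completion was good" (best) from "completions are good on average" (mean).
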
The script now supports:
- YAML and JSON configurations
- Dataset-specific parameters
- Overriding configuration values from the command line
- Detailed logging and error handling
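
One plausible way to layer the configuration sources listed above is defaults first, then dataset-specific parameters, then command-line overrides. The sketch below assumes that precedence. The `defaults` and `datasets` keys, the `key.path=value` override syntax, and all function names are assumptions for illustration, not the script's actual interface. It uses JSON only so that it runs with the standard library.

```python
import json


def deep_merge(base, override):
    """Return a copy of base with override applied, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def parse_override(item):
    """Turn 'a.b=3' into {'a': {'b': 3}}; values are parsed as JSON if possible."""
    dotted_key, raw = item.split("=", 1)
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw  # fall back to the raw string
    nested = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


def resolve_config(config, dataset, cli_overrides):
    """Apply defaults, then dataset-specific values, then CLI overrides."""
    dataset_params = config.get("datasets", {}).get(dataset, {})
    resolved = deep_merge(config.get("defaults", {}), dataset_params)
    for item in cli_overrides:
        resolved = deep_merge(resolved, parse_override(item))
    return resolved


# Hypothetical configuration content and dataset name.
config = json.loads(
    '{"defaults": {"num_completions": 1, "sampling": {"temperature": 0.0}},'
    ' "datasets": {"example_dataset": {"num_completions": 4}}}'
)
resolved = resolve_config(config, "example_dataset", ["sampling.temperature=0.7"])
```

Here the dataset entry raises `num_completions` to 4 and the command-line override changes only `sampling.temperature`, leaving the other values untouched.
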