mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00
TRL Examples
This directory contains examples using the TRL (Transformer Reinforcement Learning) library to fine-tune language models with reinforcement learning techniques.
GRPO Example
The main example demonstrates using GRPO (Group Relative Policy Optimization) to fine-tune a language model on reasoning tasks from reasoning-gym. It includes:
- Custom reward functions for answer accuracy and format compliance
- Integration with reasoning-gym datasets
- Configurable training parameters via YAML config
- Weights & Biases (wandb) logging and model checkpointing
- Evaluation on held-out test sets
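As an illustration of the first item, here is a minimal sketch of what accuracy and format reward functions could look like. It follows the TRL convention that a reward function receives the generated completions plus dataset columns as keyword arguments and returns one float per completion. The `<think>`/`<answer>` tag format, the `answer` column name, and the function names are assumptions made for this sketch; they are not taken from `main_grpo_reward.py`.

```python
import re

# Assumed (hypothetical) output format: <think>...</think><answer>...</answer>
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that matches the assumed tag format."""
    return [1.0 if FORMAT_RE.match(c) else 0.0 for c in completions]


def accuracy_reward(completions, answer, **kwargs):
    """Return 1.0 when the extracted answer equals the reference exactly.

    `answer` is assumed to be a dataset column holding reference answers.
    """
    rewards = []
    for completion, expected in zip(completions, answer):
        match = ANSWER_RE.search(completion)
        predicted = match.group(1).strip() if match else None
        rewards.append(1.0 if predicted == str(expected).strip() else 0.0)
    return rewards
```

Exact string matching is the simplest possible check. reasoning-gym datasets provide their own task-specific scoring, which is likely a better fit for tasks where several answer spellings are acceptable.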
Setup
- Install the required dependencies:
pip install -r requirements.txt
Usage
- Configure the training parameters in config/grpo.yaml
- Run the training script:
python main_grpo_reward.py
The model will be trained using GRPO on the specified reasoning-gym dataset, and evaluation metrics will be logged to Weights & Biases.
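For intuition about the "group relative" part of GRPO, the sketch below normalizes the rewards of several completions sampled for the same prompt into advantages. It is a simplified illustration, not TRL's implementation: the function name, the epsilon value, and the use of the population standard deviation are assumptions, and TRL may differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions are scored relative to their siblings rather than on an
    absolute scale. `eps` (assumed value) avoids division by zero when
    all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two of which earned the reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below it receive a negative one. When every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.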