mirror of https://github.com/open-thought/reasoning-gym.git synced 2026-04-19 12:58:07 +00:00

Andreas Köpf 7b72c3470b docs: Update TRL README with GRPO example details and usage instructions (#76)	2025-02-07 07:56:22 +01:00

config	Test training with trl (#70)	2025-02-07 07:42:32 +01:00
grpo_config.py	Test training with trl (#70)	2025-02-07 07:42:32 +01:00
main_grpo_reward.py	docs: Update TRL README with GRPO example details and usage instructions (#76)	2025-02-07 07:56:22 +01:00
README.md	docs: Update TRL README with GRPO example details and usage instructions (#76)	2025-02-07 07:56:22 +01:00
requirements.txt	docs: Update TRL README with GRPO example details and usage instructions (#76)	2025-02-07 07:56:22 +01:00

README.md

TRL Examples

This directory contains examples using the TRL (Transformer Reinforcement Learning) library to fine-tune language models with reinforcement learning techniques.

GRPO Example

The main example demonstrates using GRPO (Group Relative Policy Optimization) to fine-tune a language model on reasoning tasks from reasoning-gym. It includes:

Custom reward functions for answer accuracy and format compliance
Integration with reasoning-gym datasets
Configurable training parameters via YAML config
Weights & Biases (wandb) logging and model checkpointing
Evaluation on held-out test sets
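
As a rough illustration of the first item, the sketch below shows what a pair of format and accuracy reward functions might look like. It is not the code in main_grpo_reward.py. It assumes the convention used by TRL's GRPOTrainer, where a reward function is a callable that receives the completions plus dataset columns as keyword arguments and returns one float per completion. The `<think>`/`<answer>` tag layout, the `answer` column name, and the exact-match scoring are all assumptions made for the example.

```python
import re

# Assumed output format (not specified in this README): reasoning inside
# <think>...</think>, followed by the final answer inside <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that matches the assumed tag layout, else 0.0."""
    return [1.0 if FORMAT_RE.match(c.strip()) else 0.0 for c in completions]


def accuracy_reward(completions, answer, **kwargs):
    """Return 1.0 when the extracted answer equals the reference string, else 0.0.

    `answer` is a hypothetical dataset column with one reference per completion.
    """
    rewards = []
    for completion, reference in zip(completions, answer):
        match = FORMAT_RE.match(completion.strip())
        predicted = match.group(1).strip() if match else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


# Plain-string completions; conversational datasets would need the text
# extracted from the message dicts first.
completions = [
    "<think>2 dogs have 8 legs.</think> <answer>8</answer>",
    "The answer is 8",
]
format_scores = format_reward(completions)
accuracy_scores = accuracy_reward(completions, answer=["8", "8"])
```

Keeping the two rewards as separate functions lets each be logged and weighted on its own. Exact string matching is the simplest possible scorer; a task-specific scorer could be swapped in without changing the function signature.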

Setup

Install the required dependencies:

pip install -r requirements.txt

Usage

1. Configure the training parameters in config/grpo.yaml
2. Run the training script:

python main_grpo_reward.py

The script trains the model with GRPO on the specified reasoning-gym dataset and logs evaluation metrics to Weights & Biases.
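
For intuition about the "group relative" part of GRPO, the following minimal sketch normalizes the rewards of several completions sampled for the same prompt, so that each completion is scored relative to its own group. It is a simplified illustration, not TRL's implementation: the function name is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions for a single prompt.

    Each reward is centered on the group mean and divided by the group
    standard deviation; `eps` (an assumed value) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Total rewards for four completions sampled from the same prompt.
group_rewards = [2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Completions that score above the group average get a positive advantage and are reinforced; those below average get a negative one. A group in which every completion earns the same reward yields all-zero advantages and therefore contributes no learning signal.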