mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-23 16:55:05 +00:00
docs: Update TRL README with GRPO example details and usage instructions (#76)
This commit is contained in:
Parent: d61db3772a
Commit: a8f9eafd43
3 changed files with 37 additions and 12 deletions
@@ -1,5 +1,32 @@
# TRL Examples
This directory contains examples using the [TRL (Transformer Reinforcement Learning) library](https://github.com/huggingface/trl) to fine-tune language models with reinforcement learning techniques.
## GRPO Example
The main example demonstrates using GRPO (Group Relative Policy Optimization) to fine-tune a language model on reasoning tasks from reasoning-gym. It includes:
- Custom reward functions for answer accuracy and format compliance
- Integration with reasoning-gym datasets
- Configurable training parameters via YAML config
- Wandb logging and model checkpointing
- Evaluation on held-out test sets
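
A custom reward function in the shape TRL's `GRPOTrainer` expects (a callable that receives the batch of completions and returns one float per completion) might look like the following minimal sketch. The `<think>`/`<answer>` tag layout and the 0/1 scoring here are illustrative assumptions, not necessarily the format the repo's own reward functions enforce:

```python
import re

# Assumed completion layout for this sketch: reasoning wrapped in
# <think>...</think>, followed by a final <answer>...</answer> block.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions, **kwargs):
    """Score format compliance: 1.0 if a completion matches the expected
    tag layout, else 0.0. Returns one float per completion, which is the
    contract TRL reward functions follow."""
    return [1.0 if FORMAT_RE.match(c.strip()) else 0.0 for c in completions]
```

For example, `format_reward(["<think>2+2=4</think><answer>4</answer>", "just 4"])` returns `[1.0, 0.0]`. An accuracy reward would follow the same signature but compare the extracted answer against the dataset entry's ground truth.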
## Setup
1. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Configure the training parameters in `config/grpo.yaml`
2. Run the training script:
```bash
python main_grpo_reward.py
```
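
The YAML file referenced in step 1 might contain entries along these lines. Every key and value below is an illustrative guess at common GRPO training settings, not the contents of the repo's actual `config/grpo.yaml`:

```yaml
# Illustrative sketch only; consult config/grpo.yaml in the repo for real keys.
model_name: Qwen/Qwen2.5-1.5B-Instruct   # assumed base model
dataset_name: chain_sum                  # a reasoning-gym task name
learning_rate: 1.0e-6
num_generations: 8         # completions sampled per prompt (the GRPO "group")
max_completion_length: 512
report_to: wandb
```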
The model will be trained with GRPO on the specified reasoning-gym dataset, and evaluation metrics will be logged to Weights & Biases.
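
The "group relative" part of GRPO can be sketched in plain Python: each prompt gets a group of sampled completions, and each completion's reward is normalized against its own group's mean and standard deviation, so no learned value function is needed. This is a conceptual sketch of the advantage computation, not code from the repo:

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, group_size, eps=1e-4):
    """Compute GRPO-style advantages: (reward - group mean) / group std.

    `rewards` is a flat list where each consecutive `group_size` slice
    holds the rewards for completions of one prompt. `eps` guards
    against division by zero when a group has zero variance.
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu, sigma = fmean(group), pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

For instance, `group_relative_advantages([1.0, 0.0, 1.0, 0.0], group_size=4)` yields approximately `[1.0, -1.0, 1.0, -1.0]`: correct completions are pushed up relative to their group, incorrect ones down.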