mirror of https://github.com/open-thought/reasoning-gym.git synced 2026-04-19 12:58:07 +00:00

Zafir Stojanovski 56ce2e79a7 tutorial(training): Add a minimal example with `trl` (#473 ) * v0 * 2 gpu setup * improve parsing from yaml * update yaml dataset example * remove restriction on flash attn * more comments * first version of the readme * pin torch * simplify requirements * just flash attn * use set env instead * simpler set env * readme * add wandb project to setup * update template * update model id * post init to capture the config and weight * extract metadata * update config * update dataset config * move env for wandb project * pre-commit * remove qwen-math from training * more instructions * unused import * remove trl old * warmup ratio * warmup ratio * change model id * change model_id * add info about CUDA_VISIBLE_DEVICES		2025-06-21 00:01:31 +02:00

Training with TRL

Training stack:

TRL for reinforcement learning training
Accelerate (with DeepSpeed) for distributed training
vLLM for rollouts

Setup

This tutorial uses CUDA 11.8, Python 3.10, and PyTorch 2.5.1.

We assume that your machine has 2 GPUs, the last of which is reserved for vLLM rollouts.

If you have more than 2 GPUs, edit the ./config/grpo.yaml file so that vllm_device is set to the index of your last GPU. For example, if you have 4 GPUs, set it to 3:

vllm_device: 3  # If you have 4 GPUs, set this to 3

You also need to update the CUDA_VISIBLE_DEVICES environment variable in the train.sh script so that it lists all of your available GPUs. For example, if you have 4 GPUs, set it to:

# ./train.sh

# ... beginning of the script
export CUDA_VISIBLE_DEVICES=0,1,2,3
# ... rest of the script
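
The two settings above follow from the GPU count alone, so they can be derived rather than typed by hand. The sketch below is not part of the repository: `gpu_list` and `last_gpu_index` are hypothetical helper names, and it assumes the GPUs are numbered 0 to N-1. It only prints the values; you would still copy them into train.sh and ./config/grpo.yaml yourself.

```shell
# Hypothetical helpers (not in the repository), assuming GPUs are numbered 0..N-1.

# Print the comma-separated device list for N GPUs, e.g. "0,1,2,3" for N=4.
gpu_list() {
    n="$1"
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        # Prepend "existing," only when the list is already non-empty.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Print the index of the last GPU, the value this README uses for vllm_device.
last_gpu_index() {
    printf '%s\n' "$(( $1 - 1 ))"
}

num_gpus=4  # assumption: replace with the number of GPUs on your machine
echo "export CUDA_VISIBLE_DEVICES=$(gpu_list "$num_gpus")"
echo "vllm_device: $(last_gpu_index "$num_gpus")"
```

With `num_gpus=4`, the two printed lines match the 4-GPU examples given above.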

Install the required packages:

# First, make the script executable if it is not already
# chmod +x ./set_env.sh

# Then, run the setup script
./set_env.sh

(Optional) Log in to Weights & Biases for experiment tracking:

# First, set your WANDB_API_KEY as an environment variable
export WANDB_API_KEY=your_wandb_api_key

# Set the project name
export WANDB_PROJECT=your_wandb_project_name
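
Since the W&B step is optional, it is easy to start a run with one of the two variables missing. As a minimal sketch, the check below reports which of them is unset or empty before you launch training. `check_wandb_env` is a hypothetical helper, not something the repository provides, and it only inspects the two variables named above.

```shell
# Hypothetical helper (not in the repository): report missing W&B variables.
check_wandb_env() {
    missing=""
    [ -n "${WANDB_API_KEY:-}" ] || missing="$missing WANDB_API_KEY"
    [ -n "${WANDB_PROJECT:-}" ] || missing="$missing WANDB_PROJECT"
    if [ -n "$missing" ]; then
        # Names go to stderr; the key itself is never printed.
        echo "W&B variables not set:$missing" >&2
        return 1
    fi
    echo "W&B environment looks complete"
}

# Warn, but do not abort, because experiment tracking is optional.
check_wandb_env || echo "Continuing without a complete W&B configuration"
```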

Run the training script:

# First, make the script executable if it is not already
# chmod +x ./train.sh

# Then, run the training script
./train.sh