reasoning-gym/examples/trl
Zafir Stojanovski 56ce2e79a7
tutorial(training): Add a minimal example with trl (#473)
* v0

* 2 gpu setup

* improve parsing from yaml

* update yaml dataset example

* remove restriction on flash attn

* more comments

* first version of the readme

* pin torch

* simplify requirements

* just flash attn

* use set env instead

* simpler set env

* readme

* add wandb project to setup

* update template

* update model id

* post init to capture the config and weight

* extract metadata

* update config

* update dataset config

* move env for wandb project

* pre-commit

* remove qwen-math from training

* more instructions

* unused import

* remove trl old

* warmup ratio

* warmup ratio

* change model id

* change model_id

* add info about CUDA_VISIBLE_DEVICES
2025-06-21 00:01:31 +02:00
config tutorial(training): Add a minimal example with trl (#473) 2025-06-21 00:01:31 +02:00
grpo.py tutorial(training): Add a minimal example with trl (#473) 2025-06-21 00:01:31 +02:00
README.md tutorial(training): Add a minimal example with trl (#473) 2025-06-21 00:01:31 +02:00
set_env.sh tutorial(training): Add a minimal example with trl (#473) 2025-06-21 00:01:31 +02:00
train.sh tutorial(training): Add a minimal example with trl (#473) 2025-06-21 00:01:31 +02:00

Training with TRL

Training stack:

  • TRL for reinforcement learning training
  • Accelerate (with DeepSpeed) for distributed training
  • vLLM for rollouts

Setup

This tutorial uses CUDA 11.8, Python 3.10, and PyTorch 2.5.1.

We assume that your machine has 2 GPUs, the last of which is reserved for vLLM rollouts.

If you have more than 2 GPUs, adjust ./config/grpo.yaml so that vllm_device is set to the index of your last GPU. For example, if you have 4 GPUs, set it to 3:

vllm_device: 3  # If you have 4 GPUs, set this to 3

You also need to update the CUDA_VISIBLE_DEVICES environment variable in the train.sh script so that it lists all of your available GPUs. For example, if you have 4 GPUs, set it to:

# ./train.sh

# ... beginning of the script
export CUDA_VISIBLE_DEVICES=0,1,2,3
# ... rest of the script
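
The two GPU settings above follow the same rule: GPUs are indexed from 0, vllm_device is the last index, and CUDA_VISIBLE_DEVICES lists every index. A minimal sketch of that arithmetic follows. The helper name gpu_layout is hypothetical and not part of this repository, and the check for at least 2 GPUs is an assumption based on the setup described here.

```python
def gpu_layout(num_gpus: int) -> dict:
    """Hypothetical helper: derive vllm_device and CUDA_VISIBLE_DEVICES.

    Assumes GPUs are indexed 0..num_gpus-1 and that the last one is
    reserved for vLLM rollouts, as described in this README.
    """
    if num_gpus < 2:
        # Assumption: one GPU for rollouts plus at least one for training.
        raise ValueError("this setup expects at least 2 GPUs")
    return {
        "vllm_device": num_gpus - 1,
        "cuda_visible_devices": ",".join(str(i) for i in range(num_gpus)),
    }


if __name__ == "__main__":
    layout = gpu_layout(4)
    print(f"vllm_device: {layout['vllm_device']}")
    print(f"export CUDA_VISIBLE_DEVICES={layout['cuda_visible_devices']}")
```

The printed values are meant to be copied by hand into ./config/grpo.yaml and train.sh; the sketch does not edit either file.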
  1. Install the required packages:
# If the script is not executable yet, first run:
# chmod +x ./set_env.sh

# Then, run the setup script
./set_env.sh
  2. (Optional) Log in to Weights & Biases for experiment tracking:
# First, set your WANDB_API_KEY as an environment variable
export WANDB_API_KEY=your_wandb_api_key

# Set the project name
export WANDB_PROJECT=your_wandb_project_name
  3. Run the training script:
# If the script is not executable yet, first run:
# chmod +x ./train.sh

# Then, run the training script
./train.sh
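
Because the Weights & Biases step is optional, it can help to check which of the two variables are set in the current shell before launching a run. The following is a minimal sketch of such a check. The helper name missing_wandb_vars is hypothetical and not part of this repository, and treating an empty value as missing is an assumption.

```python
import os
from typing import Mapping, Optional

# The two variables exported in the optional W&B step of this README.
WANDB_VARS = ("WANDB_API_KEY", "WANDB_PROJECT")


def missing_wandb_vars(env: Optional[Mapping[str, str]] = None) -> list:
    """Hypothetical helper: return the W&B variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in WANDB_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_wandb_vars()
    if missing:
        print("W&B variables not set:", ", ".join(missing))
    else:
        print("W&B variables are set.")
```

Passing a plain dictionary instead of os.environ makes the check easy to try without touching the real environment.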