reasoning-gym/examples/open-instruct/README.md
joesharratt1229 9234aa77bf
Feat/open instruct example (#381)
Co-authored-by: Andreas Koepf <andreas.koepf@provisio.com>
2025-03-17 23:20:11 +01:00

Reasoning Gym Open-Instruct Example

This example demonstrates how to use open-instruct with reasoning-gym for training language models through reinforcement learning.

Environment Setup

Before running the training script, you may want to set the following environment variables:

export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • CUDA_VISIBLE_DEVICES=0 restricts the process to the listed GPUs (here, only GPU 0); use a comma-separated list such as 0,1 to expose several
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets PyTorch's CUDA caching allocator grow existing memory segments on demand, which can reduce fragmentation when allocation sizes vary
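
If the training script is launched from Python rather than from a shell, the same two variables can be passed through the child process environment. The following is a minimal sketch; the helper name `training_env` is hypothetical and not part of open-instruct or reasoning-gym.

```python
import os


def training_env(gpu_ids, base_env=None):
    """Return a copy of the environment with the two variables above set.

    `training_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated device indices; [0] exposes only the first GPU.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    # Allocator setting quoted from the export line above.
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


# Possible usage (not executed here):
#   import subprocess
#   subprocess.run(["bash", "grpo_config.sh"], env=training_env([0]), check=True)
```

Both variables are read when CUDA is initialized, so they need to be in place before the training process starts.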

Running the Training Script

To start training, run:

bash grpo_config.sh

This will train a DeepSeek-R1-Distill-Qwen-1.5B model using GRPO (Group Relative Policy Optimization) on reasoning tasks.
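
The "group relative" part of GRPO can be illustrated with a small sketch: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. This is a simplified illustration, not the open-instruct implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    Illustrative only: real implementations may use a sample (rather than
    population) standard deviation or a different epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one. When all responses in a group score the same, every advantage is zero and the group contributes no learning signal.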

Key Features

  • Uses vLLM for efficient inference
  • Supports multi-GPU training
  • Implements reward functions for both answer correctness and response formatting
  • Includes evaluation during training
  • Supports model checkpointing and pushing to Hugging Face Hub
  • Integrates with Weights & Biases for experiment tracking
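
The two reward signals listed above (answer correctness and response formatting) might look roughly like the sketch below. It assumes an R1-style `<think>...</think><answer>...</answer>` response layout and exact string matching. The tag names, function names, and scoring values are assumptions for illustration, not the rewards defined in this example's training code.

```python
import re

# Assumed layout: a reasoning block followed by a final answer block.
RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response):
    """Return 1.0 if the whole response follows the assumed tag layout."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


def correctness_reward(response, expected):
    """Return 1.0 if the first <answer> block matches the expected string."""
    found = ANSWER_PATTERN.search(response)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip() == expected.strip() else 0.0


sample = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
total = format_reward(sample) + correctness_reward(sample, "4")
```

Keeping the two rewards separate lets a well-formatted but wrong answer still earn partial credit. In practice, exact string matching would likely be replaced by a task-specific scorer.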