simplify training setup instructions (#454)

* simplify training setup instructions

* tweaks

* update cfgs

* readme update

* readme update
Oliver Stanley 2025-06-06 09:51:29 +01:00 committed by GitHub
parent 0ebabf709b
commit 1232a7d1e5
13 changed files with 128 additions and 40 deletions

View file

@ -73,7 +73,7 @@ Evaluation results of different reasoning models will be tracked in the [reasoni
## 🤓 Training
-The `training/` directory has full details of the training runs we carried out with RG for the paper. In our experiments, we utilise custom Dataset code to dynamically create RG samples at runtime, and to access the RG scoring function for use as a training reward.
+The `training/` directory has full details of the training runs we carried out with RG for the paper. In our experiments, we utilise custom Dataset code to dynamically create RG samples at runtime, and to access the RG scoring function for use as a training reward. See `training/README.md` to reproduce our runs.
For a more plug-and-play experience, it may be easier to build a dataset ahead of time. See `scripts/hf_dataset/` for a simple script allowing generation of RG data and conversion to a HuggingFace dataset. To use the script, build your dataset configurations in the YAML. You can find a list of tasks and configurable parameters in [the dataset gallery](GALLERY.md). Then run `save_hf_dataset.py` with desired arguments.
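As a sketch of that plug-and-play workflow (an editor's illustration, not part of the commit; we avoid guessing the script's flags, so check its help output for the real arguments):

```bash
# Plug-and-play route: configure tasks in the YAML under scripts/hf_dataset/
# (see GALLERY.md for task names and parameters), then run the export script.
cd scripts/hf_dataset/
python save_hf_dataset.py --help   # inspect the available arguments first
```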

View file

@ -13,38 +13,29 @@ This readme documents:
### Requirements
-1. Prepare and activate a Python 3.11 virtual environment however you prefer.
-2. Install Reasoning Gym:
+We note that we used Python 3.11 and CUDA 11.8 for our experiments. If you are using different versions, you may need to tweak some of the setup.
+1. Prepare and activate a Python virtual environment however you prefer.
+2. Clone and install Reasoning Gym, and RG-specific training dependencies:
```bash
+pip install wheel fire
git clone https://github.com/open-thought/reasoning-gym.git
cd reasoning-gym/
pip install -e .
+cd training/
```
-3. Install training-specific Python package dependencies:
+3. Follow setup steps for verl with vLLM from the verl docs, ensuring the `verl` repo is directly within this `training/` directory:
-```bash
-pip install ray wandb
-pip install torch==2.6.0
-```
+We used verl at commit hash `c34206925e2a50fd452e474db857b4d488f8602d` with vLLM 0.7.3: [Instructions to install verl & vLLM 0.7.3](https://verl.readthedocs.io/en/latest/README_vllm0.7.html).
-4. Install veRL (tested with HEAD c34206925e2a50fd452e474db857b4d488f8602d):
+To use the exact same verl version, simply `git checkout c34206925e2a50fd452e474db857b4d488f8602d` before installing verl.
-```bash
-pip install git+https://github.com/volcengine/verl.git@c6dc8b73cf011aa75b8c6a47b0322f50aed800ad#egg=verl
-```
+You may alternatively wish to try newer verl versions, which support vLLM 0.8: [Instructions to install verl & vLLM 0.8](https://verl.readthedocs.io/en/latest/README_vllm0.8.html). However, our code does override some verl code, so there may be incompatibilities with newer versions.
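To make step 3's pinned-version route concrete, the following sketch (an editor's illustration, not from the commit) clones verl into `training/` and checks out the commit named above; take the actual install commands from the linked verl docs:

```bash
# Sketch of pinning verl to the paper's commit. The verl docs linked above are
# the authoritative install guide; this shows only clone placement + checkout.
cd training/                    # assumes you start at the reasoning-gym repo root
git clone https://github.com/volcengine/verl.git
cd verl/
git checkout c34206925e2a50fd452e474db857b4d488f8602d
pip install -e .   # assumption: editable install, per typical verl setup
cd ..
```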
-5. Install vLLM:
-```bash
-pip install vllm==0.6.3 transformers==4.50.3 fire==0.7.0
-```
-6. Install flash attention
-```
-pip install flash-attn --no-build-isolation
-```
-6. Log in to HF and W&B:
+4. Log in to HF and W&B:
```bash
huggingface-cli login
@ -53,27 +44,33 @@ wandb login
### Usage
-First, activate the virtual environment you prepared.
+Activate the virtual environment you prepared.
-Example GRPO training usage:
+Example GRPO training usage, using the config for our inter-domain generalisation experiment trained on Algorithmic problems:
```bash
-python3 -u train_grpo.py --config-name llama3.1_1b_grpo \
-actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
-trainer.project_name=rg-test \
-trainer.experiment_name=verl_grpo_llama3.1_1b \
-trainer.n_gpus_per_node=2 $@ 2>&1 | tee verl_output.log
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b
```
-Then, having saved this as a bash script such as `train.sh`, run it:
+Set `project_name` and `experiment_name` if logging your runs to W&B. This config assumes a 4-GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:
```bash
-CUDA_VISIBLE_DEVICES=0,1 bash train.sh
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
+actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+trainer.n_gpus_per_node=2 \
+trainer.project_name=rg-grpo \
+trainer.experiment_name=algorithmic_qwen2.5_3b
```
-CUDA_VISIBLE_DEVICES is set to 0,1 to use the first two GPUs on the machine (see `nvidia-smi` output). This can be adjusted as needed. `tensor_model_parallel_size` and `n_gpus_per_node` should also be set to the number of GPUs you are using.
+If you need to use only a subset of the GPUs on the machine, set the `CUDA_VISIBLE_DEVICES` environment variable, for example:
-You can change all configuration options by either modifying the config YAML (in this case, `config/llama3.1_1b_grpo.yaml`) or providing them as arguments to the Python script. Note that the batch sizes set in the Llama 1B and Qwen 1.5B configs are as high as it was possible for me to set them for the puzzles dataset mix on 2xA6000 GPUs without OOMs. Depending on the hardware you use and the datasets you train on, you may need to adjust these.
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+```
+See `nvidia-smi` output for your system GPU IDs. `n_gpus_per_node` should be set to the total number of GPUs you are using. `tensor_model_parallel_size` should be set to the number you wish to use for vLLM rollouts.
+You can change all configuration options by either modifying the config YAML (in this case, `configs/inter_generalisation/algorithmic_qwen_3b.yaml`) or providing them as args to the Python script.
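To illustrate the CLI-override route (an editor's sketch, not text from the commit; the `reasoning_gym.dataset_size` key appears in the configs shown further down):

```bash
# Hydra-style override sketch: any YAML key can be set on the command line,
# e.g. shrinking the RG dataset size without editing the config file.
python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
    reasoning_gym.dataset_size=20000
```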
# Exporting from FSDP checkpoint to HF model checkpoint
@ -82,13 +79,13 @@ After training your model, the weights are saved as sharded checkpoints
To run this script, navigate to the `training/` directory and run the following:
-```python
+```bash
python utils/load_fsdp_to_hf.py /path/to/fsdp/checkpoint/global_step_num/actor /path/to/huggingface/checkpoint/global_step_num/actor/huggingface saved_model_name
```
For example:
-```python
+```bash
python utils/load_fsdp_to_hf.py checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/ checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface qwen3b
```
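As a quick post-conversion check (an editor's addition, assuming the standard HuggingFace layout written by the script and that `transformers` is installed):

```bash
# Optional sanity check: load the converted checkpoint with transformers.
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface'
)
print(type(model).__name__)
"
```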
@ -98,16 +95,19 @@ From here you may want to run evaluations of your trained model. In the `training/eva
## Run the script
-```
+```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
```
Navigate to the evaluations directory:
-```
+```bash
python evaluate_model.py --config path-to-yaml
```
-For example
-```
+For example:
+```bash
python evaluate_model.py --config eval_algorithmic_composite.yaml
```
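The evaluation config is a YAML file passed via `--config`. The keys in this sketch are hypothetical (the script's schema is not documented here; treat the bundled `eval_algorithmic_composite.yaml` as the authoritative reference):

```bash
# Hypothetical eval config sketch -- key names are illustrative only; copy the
# real structure from eval_algorithmic_composite.yaml.
cat > my_eval.yaml <<'EOF'
model_path: checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface  # hypothetical key
datasets:            # hypothetical key
  - spiral_matrix    # hypothetical entry; see GALLERY.md for task names
EOF
python evaluate_model.py --config my_eval.yaml
```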

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 40000
  developer_prompt: DeepSeekZero
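A note on what these added lines do (an editor's reading of the hydra mechanics, not text from the commit): the `searchpath` entry lets hydra find the base `ppo_trainer` config inside the verl checkout, which is why setup step 3 requires the `verl` repo to live directly within `training/`. Launched from `training/`, hydra expects:

```bash
# The relative searchpath file://verl/verl/trainer/config resolves against the
# launch directory, so the base config must exist at this path:
ls verl/verl/trainer/config/ppo_trainer.yaml
```

The same eight lines are added to each of the config files below.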

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero