simplify training setup instructions (#454)

* simplify training setup instructions

* tweaks

* update cfgs

* readme update

* readme update
Oliver Stanley 2025-06-06 09:51:29 +01:00 committed by GitHub
parent 0ebabf709b
commit 1232a7d1e5
13 changed files with 128 additions and 40 deletions

View file

@ -73,7 +73,7 @@ Evaluation results of different reasoning models will be tracked in the [reasoni
## 🤓 Training
-The `training/` directory has full details of the training runs we carried out with RG for the paper. In our experiments, we utilise custom Dataset code to dynamically create RG samples at runtime, and to access the RG scoring function for use as a training reward.
+The `training/` directory has full details of the training runs we carried out with RG for the paper. In our experiments, we utilise custom Dataset code to dynamically create RG samples at runtime, and to access the RG scoring function for use as a training reward. See `training/README.md` to reproduce our runs.
For a more plug-and-play experience, it may be easier to build a dataset ahead of time. See `scripts/hf_dataset/` for a simple script allowing generation of RG data and conversion to a HuggingFace dataset. To use the script, build your dataset configurations in the YAML. You can find a list of tasks and configurable parameters in [the dataset gallery](GALLERY.md). Then run `save_hf_dataset.py` with desired arguments.
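As a sketch of that plug-and-play workflow (an editor's illustration, not part of the commit; we avoid guessing the script's flags, so check its help output for the real arguments):

```bash
# Plug-and-play route: configure tasks in the YAML under scripts/hf_dataset/
# (see GALLERY.md for task names and parameters), then run the export script.
cd scripts/hf_dataset/
python save_hf_dataset.py --help   # inspect the available arguments first
```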

View file

@ -13,38 +13,29 @@ This readme documents:
### Requirements
-1. Prepare and activate a Python 3.11 virtual environment however you prefer.
-2. Install Reasoning Gym:
+We note that we used Python 3.11 and CUDA 11.8 for our experiments. If you are using different versions, you may need to tweak some of the setup.
+1. Prepare and activate a Python virtual environment however you prefer.
+2. Clone and install Reasoning Gym, and RG-specific training dependencies:
```bash
+pip install wheel fire
git clone https://github.com/open-thought/reasoning-gym.git
cd reasoning-gym/
pip install -e .
+cd training/
```
-3. Install training-specific Python package dependencies:
+3. Follow setup steps for verl with vLLM from the verl docs, ensuring the `verl` repo is directly within this `training/` directory:
-```bash
-pip install ray wandb
-pip install torch==2.6.0
-```
+We used verl at commit hash `c34206925e2a50fd452e474db857b4d488f8602d` with vLLM 0.7.3: [Instructions to install verl & vLLM 0.7.3](https://verl.readthedocs.io/en/latest/README_vllm0.7.html).
-4. Install veRL (tested with HEAD c34206925e2a50fd452e474db857b4d488f8602d):
+To use the exact same verl version, simply `git checkout c34206925e2a50fd452e474db857b4d488f8602d` before installing verl.
-```bash
-pip install git+https://github.com/volcengine/verl.git@c6dc8b73cf011aa75b8c6a47b0322f50aed800ad#egg=verl
-```
+You may alternatively wish to try newer verl versions, which support vLLM 0.8: [Instructions to install verl & vLLM 0.8](https://verl.readthedocs.io/en/latest/README_vllm0.8.html). However, our code does override some verl code, so there may be incompatibilities with newer versions.
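To make step 3's pinned-version route concrete, the following sketch (an editor's illustration, not from the commit) clones verl into `training/` and checks out the commit named above; take the actual install commands from the linked verl docs:

```bash
# Sketch of pinning verl to the paper's commit. The verl docs linked above are
# the authoritative install guide; this shows only clone placement + checkout.
cd training/                    # assumes you start at the reasoning-gym repo root
git clone https://github.com/volcengine/verl.git
cd verl/
git checkout c34206925e2a50fd452e474db857b4d488f8602d
pip install -e .   # assumption: editable install, per typical verl setup
cd ..
```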
-5. Install vLLM:
-```bash
-pip install vllm==0.6.3 transformers==4.50.3 fire==0.7.0
-```
-6. Install flash attention
-```
-pip install flash-attn --no-build-isolation
-```
-6. Log in to HF and W&B:
+4. Log in to HF and W&B:
```bash
huggingface-cli login
@ -53,27 +44,33 @@ wandb login
### Usage
-First, activate the virtual environment you prepared.
+Activate the virtual environment you prepared.
-Example GRPO training usage:
+Example GRPO training usage, using the config for our inter-domain generalisation experiment trained on Algorithmic problems:
```bash
-python3 -u train_grpo.py --config-name llama3.1_1b_grpo \
-actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
-trainer.project_name=rg-test \
-trainer.experiment_name=verl_grpo_llama3.1_1b \
-trainer.n_gpus_per_node=2 $@ 2>&1 | tee verl_output.log
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b
```
-Then, having saved this as a bash script such as `train.sh`, run it:
+Set `project_name` and `experiment_name` if logging your runs to W&B. This config assumes a 4-GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:
```bash
-CUDA_VISIBLE_DEVICES=0,1 bash train.sh
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
+actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+trainer.n_gpus_per_node=2 \
+trainer.project_name=rg-grpo \
+trainer.experiment_name=algorithmic_qwen2.5_3b
```
-CUDA_VISIBLE_DEVICES is set to 0,1 to use the first two GPUs on the machine (see `nvidia-smi` output). This can be adjusted as needed. `tensor_model_parallel_size` and `n_gpus_per_node` should also be set to the number of GPUs you are using.
+If you need to use only a subset of the GPUs on the machine, set the `CUDA_VISIBLE_DEVICES` environment variable, for example:
-You can change all configuration options by either modifying the config YAML (in this case, `config/llama3.1_1b_grpo.yaml`) or providing them as arguments to the Python script. Note that the batch sizes set in the Llama 1B and Qwen 1.5B configs are as high as it was possible for me to set them for the puzzles dataset mix on 2xA6000 GPUs without OOMs. Depending on the hardware you use and the datasets you train on, you may need to adjust these.
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+```
+See `nvidia-smi` output for your system GPU IDs. `n_gpus_per_node` should be set to the total number of GPUs you are using. `tensor_model_parallel_size` should be set to the number you wish to use for vLLM rollouts.
+You can change all configuration options by either modifying the config YAML (in this case, `configs/inter_generalisation/algorithmic_qwen_3b.yaml`) or providing them as args to the Python script.
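To illustrate the CLI-override route (an editor's sketch, not text from the commit; the `reasoning_gym.dataset_size` key appears in the configs shown further down):

```bash
# Hydra-style override sketch: any YAML key can be set on the command line,
# e.g. shrinking the RG dataset size without editing the config file.
python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
    reasoning_gym.dataset_size=20000
```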
# Exporting from FSDP checkpoint to HF model checkpoint
@ -82,13 +79,13 @@ After training your model, the weights are saved as sharded checkpoints
To run this script, navigate to the `training/` directory and run the following:
-```python
+```bash
python utils/load_fsdp_to_hf.py /path/to/fsdp/checkpoint/global_step_num/actor /path/to/huggingface/checkpoint/global_step_num/actor/huggingface saved_model_name
```
For example:
-```python
+```bash
python utils/load_fsdp_to_hf.py checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/ checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface qwen3b
```
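As a quick post-conversion check (an editor's addition, assuming the standard HuggingFace layout written by the script and that `transformers` is installed):

```bash
# Optional sanity check: load the converted checkpoint with transformers.
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface'
)
print(type(model).__name__)
"
```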
@ -98,16 +95,19 @@ From here you may want to run evaluations of your trained model. In the `training/eva
## Run the script
-```
+```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
```
Navigate to the evaluations directory:
-```
+```bash
python evaluate_model.py --config path-to-yaml
```
-For example
-```
+For example:
+```bash
python evaluate_model.py --config eval_algorithmic_composite.yaml
```
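The evaluation config is a YAML file passed via `--config`. The keys in this sketch are hypothetical (the script's schema is not documented here; treat the bundled `eval_algorithmic_composite.yaml` as the authoritative reference):

```bash
# Hypothetical eval config sketch -- key names are illustrative only; copy the
# real structure from eval_algorithmic_composite.yaml.
cat > my_eval.yaml <<'EOF'
model_path: checkpoints/rg-test/intra_reasoning_algorithmic_qwen_3b_composite/global_step_400/actor/huggingface  # hypothetical key
datasets:            # hypothetical key
  - spiral_matrix    # hypothetical entry; see GALLERY.md for task names
EOF
python evaluate_model.py --config my_eval.yaml
```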

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 40000
  developer_prompt: DeepSeekZero
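A note on what these added lines do (an editor's reading of the hydra mechanics, not text from the commit): the `searchpath` entry lets hydra find the base `ppo_trainer` config inside the verl checkout, which is why setup step 3 requires the `verl` repo to live directly within `training/`. Launched from `training/`, hydra expects:

```bash
# The relative searchpath file://verl/verl/trainer/config resolves against the
# launch directory, so the base config must exist at this path:
ls verl/verl/trainer/config/ppo_trainer.yaml
```

The same eight lines are added to each of the config files below.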

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero

View file

@ -1,3 +1,11 @@
+hydra:
+  searchpath:
+    - file://verl/verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero