cleaned up examples

joesharratt1229 2025-06-27 07:58:46 +00:00
parent d9cd20c174
commit 799eb51800
33 changed files with 117 additions and 2954 deletions


@@ -1,3 +0,0 @@
-outputs/
-wandb/
-verl_output.log


@@ -1,32 +1,72 @@
-### env setup
-
-```
-conda create --name verl python=3.11 -y
-conda activate verl
-
-pip install flash-attn --no-build-isolation
-pip install ray wandb
-# pip3 install vllm==0.7.0
-pip3 install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
-
-Regarding vllm>0.7 see: [docs](https://verl.readthedocs.io/en/latest/README_vllm0.7.html)
-
-### clone and install veRL
-tested with verl HEAD c34206925e2a50fd452e474db857b4d488f8602d
-```
-git clone https://github.com/volcengine/verl.git
-cd verl
-pip install -e .
-```
-
-Optionally log in to huggingface hub and wandb with your keys:
-```
-huggingface-cli login
-wandb login
-```
+# Chain Sum Training with veRL
+
+This example demonstrates how to train a language model using veRL (Volcano Engine Reinforcement Learning) with the reasoning-gym environment for chain sum problems.
+
+Requirements: Python >= 3.10
+
+## Installation
+
+1. **Install veRL**: follow the installation instructions at the [veRL repository](https://github.com/volcengine/verl)
+2. **Install reasoning-gym**:
+   ```bash
+   pip install reasoning-gym
+   ```
+
+## Training
+
+To start training the model on chain sum problems:
+
+```bash
+python grpo_train.py --config-path config --config-name grpo_trainer
+```
+
+### Configuration
+
+You can modify training by editing the configuration file or by overriding arguments in the shell scripts directly.
+
+#### Changing the dataset
+
+It is easiest to modify the `config/grpo_trainer.yaml` file with a custom training composite. Here is an example experiment which uses a composite of algorithmic training tasks:
+
+```yaml
+reasoning_gym:
+  dataset_size: 20000
+  developer_prompt: DeepSeekZero
+  datasets:
+    ab:
+      weight: 1
+    base_conversion:
+      weight: 1
+    binary_alternation:
+      weight: 1
+      config:
+        p_solvable: 0.9
+    binary_matrix:
+      weight: 1
+      config:
+        min_n: 2
+        max_n: 6
+    caesar_cipher:
+      weight: 1
+      config:
+        max_words: 10
+    cryptarithm:
+      weight: 1
+    isomorphic_strings:
+      weight: 1
+      config:
+        max_string_length: 8
+```
+
+#### Changing the run configuration
+
+Set `project_name` and `experiment_name` if logging your runs to W&B. This config assumes a single GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:
+
+```bash
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    trainer.n_gpus_per_node=2 \
+    trainer.project_name=rg-grpo \
+    trainer.experiment_name=algorithmic_qwen2.5_3b
+```
+
+Or you could similarly define this in a config file directly.
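To give a feel for what a reasoning-gym environment supplies to the trainer (a question, a verifiable answer, and a binary reward), here is a minimal illustrative stand-in for the chain_sum task. The function names, fields, and parameters below are assumptions for illustration only, not reasoning-gym's actual API.

```python
import random


def make_chain_sum(rng: random.Random, min_terms: int = 2, max_terms: int = 4,
                   max_value: int = 100) -> dict:
    """Generate one chain-sum item, e.g. '4 + 3 - 7 = ?' with answer '0'.

    Illustrative sketch only; reasoning-gym's real chain_sum generator
    has its own configuration and output schema.
    """
    n = rng.randint(min_terms, max_terms)
    terms = [rng.randint(0, max_value) for _ in range(n)]
    ops = [rng.choice(["+", "-"]) for _ in range(n - 1)]
    expr = str(terms[0])
    total = terms[0]
    for op, t in zip(ops, terms[1:]):
        expr += f" {op} {t}"
        total = total + t if op == "+" else total - t
    return {"question": f"{expr} = ?", "answer": str(total)}


def reward(model_answer: str, entry: dict) -> float:
    """Binary reward, as typically used for verifiable arithmetic tasks."""
    return 1.0 if model_answer.strip() == entry["answer"] else 0.0


rng = random.Random(42)
entry = make_chain_sum(rng)
print(entry["question"], "->", entry["answer"])
```

Because the answer is exactly checkable, the reward needs no learned model, which is what makes tasks like this convenient for GRPO-style RL.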


@@ -3,10 +3,10 @@ defaults:
   - _self_
 reasoning_gym:
-  dataset_size: 20000
+  dataset_size: 10000
   developer_prompt: DeepSeekZero
   datasets:
-    ab:
+    chain_sum:
       weight: 1
 data:
@@ -16,7 +16,7 @@ data:
   prompt_key: prompt
   max_prompt_length: 512
   max_response_length: 512
-  train_batch_size: 16
+  train_batch_size: 64
   val_batch_size: 1
 actor_rollout_ref:
@@ -32,7 +32,7 @@ actor_rollout_ref:
     strategy: fsdp  # This is for backward-compatibility
     ppo_mini_batch_size: 16
     ppo_micro_batch_size: null  # will be deprecated, use ppo_micro_batch_size_per_gpu
-    ppo_micro_batch_size_per_gpu: 4
+    ppo_micro_batch_size_per_gpu: 8
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384  # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
@@ -70,7 +70,7 @@ actor_rollout_ref:
       # transformer_layer_cls_to_wrap: None
       min_num_params: 0
     log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
-    log_prob_micro_batch_size_per_gpu: 4
+    log_prob_micro_batch_size_per_gpu: 8
     log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size}  # sp size
@@ -90,7 +90,7 @@ actor_rollout_ref:
     enforce_eager: True
     free_cache_engine: True
     load_format: dummy_dtensor
-    tensor_model_parallel_size: 1
+    tensor_model_parallel_size: 2
     max_num_batched_tokens: 8192
     max_num_seqs: 1024
     log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
@@ -184,14 +184,14 @@ algorithm:
 trainer:
   balance_batch: True
-  total_epochs: 30
+  total_epochs: 1
   total_training_steps: null
   project_name: verl_examples
   experiment_name: chain_sum
   logger: [ 'console', 'wandb' ]
   val_generations_to_log_to_wandb: 0
   nnodes: 1
-  n_gpus_per_node: 1
+  n_gpus_per_node: 2
   save_freq: -1
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto  # or auto or resume_path if
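The batch-size values changed in this config form a hierarchy. As a quick sanity check of how they relate, here is a hedged arithmetic sketch; it assumes verl splits each train batch into PPO mini-batches, and each mini-batch into per-GPU micro-batches via gradient accumulation (the exact semantics may differ between verl versions).

```python
# Values from the config diff above (2 GPUs per node).
train_batch_size = 64
ppo_mini_batch_size = 16
ppo_micro_batch_size_per_gpu = 8
n_gpus = 2

# How many PPO mini-batches one train batch yields.
mini_batches_per_step = train_batch_size // ppo_mini_batch_size
# Gradient-accumulation steps needed to cover one mini-batch.
grad_accum = ppo_mini_batch_size // (ppo_micro_batch_size_per_gpu * n_gpus)
print(mini_batches_per_step, grad_accum)  # 4 1
```

With these values each train batch yields 4 PPO mini-batches and no gradient accumulation is needed, since one micro-batch per GPU already covers a mini-batch.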
