mirror of
https://github.com/open-thought/reasoning-gym.git
synced 2026-04-19 12:58:07 +00:00
Fix/verl example (#465)
* updated verl ex
* updated script
* removed curriculum verl and updated
* updatied linting errors
* renamed
* updated config
parent
5726034a26
commit
51c2afc1fc
14 changed files with 229 additions and 1229 deletions
72
examples/veRL/multi_env/README.md
Normal file
@@ -0,0 +1,72 @@
# Chain Sum Training with veRL

This example demonstrates how to train a language model using veRL (Volcano Engine Reinforcement Learning) with the reasoning-gym environment for chain sum problems.

Requirements:

- Python >= 3.10

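The version requirement can be checked before installing anything. The snippet below is a minimal sketch using only the standard library; the helper name `python_is_supported` is hypothetical and not part of veRL or reasoning-gym.

```python
import sys

# Minimum version taken from the Requirements line of this README.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Report on the interpreter that is running this snippet.
status = "supported" if python_is_supported() else "unsupported"
print(f"Python {sys.version.split()[0]} is {status}")
```
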
## Installation

1. **Install veRL**: Follow the installation instructions in the [veRL repository](https://github.com/volcengine/verl).

2. **Install reasoning-gym**:

   ```bash
   pip install reasoning-gym
   ```

## Training

To start training the model on chain sum problems:

```bash
python grpo_train.py --config-path config --config-name grpo_trainer
```

### Configuration

You can modify the training run by editing the configuration file or by overriding arguments in the shell scripts directly.

#### Change dataset

The easiest way to change the dataset is to edit the `config/grpo_trainer.yaml` file and define a custom training composite. The following example experiment uses a composite of algorithmic training tasks:

```yaml
reasoning_gym:
  dataset_size: 20000
  developer_prompt: DeepSeekZero
  datasets:
    ab:
      weight: 1
    base_conversion:
      weight: 1
    binary_alternation:
      weight: 1
      config:
        p_solvable: 0.9
    binary_matrix:
      weight: 1
      config:
        min_n: 2
        max_n: 6
    caesar_cipher:
      weight: 1
      config:
        max_words: 10
    cryptarithm:
      weight: 1
    isomorphic_strings:
      weight: 1
      config:
        max_string_length: 8
```

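To get a feel for what the `weight` entries imply, the sketch below draws task names in proportion to their weights. It assumes that weights are normalised into sampling probabilities; this is an illustration only and not the reasoning-gym composite implementation, which may behave differently. The helper names are hypothetical.

```python
import random
from collections import Counter

# Task weights copied from the YAML example above.
WEIGHTS = {
    "ab": 1,
    "base_conversion": 1,
    "binary_alternation": 1,
    "binary_matrix": 1,
    "caesar_cipher": 1,
    "cryptarithm": 1,
    "isomorphic_strings": 1,
}


def sampling_probabilities(weights):
    """Normalise raw weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


def sample_task_names(weights, size, seed=0):
    """Draw `size` task names, each chosen in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=size)


probs = sampling_probabilities(WEIGHTS)
counts = Counter(sample_task_names(WEIGHTS, size=20000))
for name in WEIGHTS:
    print(f"{name:20s} p={probs[name]:.3f} drawn={counts[name]}")
```

With equal weights, each of the seven tasks is expected to contribute roughly one seventh of the 20000 samples. Raising a single weight to 2 would roughly double that task's share relative to the others under this assumption.
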
#### Change configuration

Set `project_name` and `experiment_name` if you are logging your runs to W&B. This config assumes a single GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:

```bash
python3 -u train_grpo.py --config-path configs/inter_generalisation --config-name algorithmic_qwen_3b \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    trainer.n_gpus_per_node=2 \
    trainer.project_name=rg-grpo \
    trainer.experiment_name=algorithmic_qwen2.5_3b
```

Alternatively, you can define these settings directly in a config file.

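Each dotted override in the command above corresponds to a nested key in the config file. The sketch below shows that mapping with a deliberately simplified parser; it is not Hydra's own override parser, and the helper names are hypothetical.

```python
import json


def parse_value(raw):
    """Interpret a raw override value as int or float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def overrides_to_nested(overrides):
    """Turn 'a.b.c=value' strings into a nested dictionary."""
    config = {}
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Overrides copied from the command above.
OVERRIDES = [
    "actor_rollout_ref.rollout.tensor_model_parallel_size=1",
    "trainer.n_gpus_per_node=2",
    "trainer.project_name=rg-grpo",
    "trainer.experiment_name=algorithmic_qwen2.5_3b",
]

nested = overrides_to_nested(OVERRIDES)
print(json.dumps(nested, indent=2))
```

The printed structure mirrors the sections you would edit in the YAML config: `tensor_model_parallel_size` under `actor_rollout_ref.rollout`, and the remaining three keys under `trainer`.
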