cleaned up examples

joesharratt1229 2025-06-27 07:58:46 +00:00
parent d9cd20c174
commit 799eb51800
33 changed files with 117 additions and 2954 deletions


@@ -1,3 +0,0 @@
-outputs/
-wandb/
-verl_output.log


@@ -1,32 +1,72 @@
-### env setup
-
-```
-conda create --name verl python=3.11 -y
-conda activate verl
-
-pip install flash-attn --no-build-isolation
-pip install ray wandb
-# pip3 install vllm==0.7.0
-pip3 install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
-
-Regarding vllm>0.7 see: [docs](https://verl.readthedocs.io/en/latest/README_vllm0.7.html)
-
-### clone and install veRL
-tested with verl HEAD c34206925e2a50fd452e474db857b4d488f8602d
-```
-git clone https://github.com/volcengine/verl.git
-cd verl
-pip install -e .
-```
-
-Optionally log in to huggingface hub and wandb with your keys:
-```
-huggingface-cli login
-wandb login
-```
+# Chain Sum Training with veRL
+
+This example demonstrates how to train a language model using veRL (Volcano Engine Reinforcement Learning) with the reasoning-gym environment for chain sum problems.
+
+Requirements: Python >= 3.10
+
+## Installation
+
+1. **Install veRL**: follow the installation instructions at the [veRL repository](https://github.com/volcengine/verl)
+2. **Install reasoning-gym**:
+   ```bash
+   pip install reasoning-gym
+   ```
+
+## Training
+
+To start training the model on chain sum problems:
+
+```bash
+python grpo_train.py --config-path config --config-name grpo_trainer
+```
+
+### Configuration
+
+You can modify training by editing the configuration file or by overriding arguments in the shell scripts directly.
+
+#### Changing the dataset
+
+It is easiest to modify the `config/grpo_trainer.yaml` file with a custom training composite. Here is an example experiment which uses a composite of algorithmic training tasks:
+
+```yaml
+reasoning_gym:
+  dataset_size: 20000
+  developer_prompt: DeepSeekZero
+  datasets:
+    ab:
+      weight: 1
+    base_conversion:
+      weight: 1
+    binary_alternation:
+      weight: 1
+      config:
+        p_solvable: 0.9
+    binary_matrix:
+      weight: 1
+      config:
+        min_n: 2
+        max_n: 6
+    caesar_cipher:
+      weight: 1
+      config:
+        max_words: 10
+    cryptarithm:
+      weight: 1
+    isomorphic_strings:
+      weight: 1
+      config:
+        max_string_length: 8
+```
+
+#### Changing the run configuration
+
+Set `project_name` and `experiment_name` if logging your runs to W&B. This config assumes a single GPU node, but you can configure this too. The following command would be for 2 GPUs, with 1 used for vLLM rollouts:
+
+```bash
+python3 -u train_grpo.py --config-paths configs/inter_generalisation --config-name algorithmic_qwen_3b \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    trainer.n_gpus_per_node=2 \
+    trainer.project_name=rg-grpo \
+    trainer.experiment_name=algorithmic_qwen2.5_3b
+```
+
+Or you could similarly define this in a config file directly.
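To give a feel for what a reasoning-gym environment supplies to the trainer (a question, a verifiable answer, and a binary reward), here is a minimal illustrative stand-in for the chain_sum task. The function names, fields, and parameters below are assumptions for illustration only, not reasoning-gym's actual API.

```python
import random


def make_chain_sum(rng: random.Random, min_terms: int = 2, max_terms: int = 4,
                   max_value: int = 100) -> dict:
    """Generate one chain-sum item, e.g. '4 + 3 - 7 = ?' with answer '0'.

    Illustrative sketch only; reasoning-gym's real chain_sum generator
    has its own configuration and output schema.
    """
    n = rng.randint(min_terms, max_terms)
    terms = [rng.randint(0, max_value) for _ in range(n)]
    ops = [rng.choice(["+", "-"]) for _ in range(n - 1)]
    expr = str(terms[0])
    total = terms[0]
    for op, t in zip(ops, terms[1:]):
        expr += f" {op} {t}"
        total = total + t if op == "+" else total - t
    return {"question": f"{expr} = ?", "answer": str(total)}


def reward(model_answer: str, entry: dict) -> float:
    """Binary reward, as typically used for verifiable arithmetic tasks."""
    return 1.0 if model_answer.strip() == entry["answer"] else 0.0


rng = random.Random(42)
entry = make_chain_sum(rng)
print(entry["question"], "->", entry["answer"])
```

Because the answer is exactly checkable, the reward needs no learned model, which is what makes tasks like this convenient for GRPO-style RL.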


@@ -3,10 +3,10 @@ defaults:
   - _self_
 reasoning_gym:
-  dataset_size: 20000
+  dataset_size: 10000
   developer_prompt: DeepSeekZero
   datasets:
-    ab:
+    chain_sum:
       weight: 1
 data:
@@ -16,7 +16,7 @@ data:
   prompt_key: prompt
   max_prompt_length: 512
   max_response_length: 512
-  train_batch_size: 16
+  train_batch_size: 64
   val_batch_size: 1
 actor_rollout_ref:
@@ -32,7 +32,7 @@ actor_rollout_ref:
     strategy: fsdp  # This is for backward-compatibility
     ppo_mini_batch_size: 16
     ppo_micro_batch_size: null  # will be deprecated, use ppo_micro_batch_size_per_gpu
-    ppo_micro_batch_size_per_gpu: 4
+    ppo_micro_batch_size_per_gpu: 8
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384  # n * ${data.max_prompt_length} + ${data.max_response_length}
     grad_clip: 1.0
@@ -70,7 +70,7 @@ actor_rollout_ref:
       # transformer_layer_cls_to_wrap: None
       min_num_params: 0
     log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
-    log_prob_micro_batch_size_per_gpu: 4
+    log_prob_micro_batch_size_per_gpu: 8
     log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
     log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
     ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size}  # sp size
@@ -90,7 +90,7 @@ actor_rollout_ref:
     enforce_eager: True
     free_cache_engine: True
     load_format: dummy_dtensor
-    tensor_model_parallel_size: 1
+    tensor_model_parallel_size: 2
     max_num_batched_tokens: 8192
     max_num_seqs: 1024
     log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
@@ -184,14 +184,14 @@ algorithm:
 trainer:
   balance_batch: True
-  total_epochs: 30
+  total_epochs: 1
   total_training_steps: null
   project_name: verl_examples
   experiment_name: chain_sum
   logger: [ 'console', 'wandb' ]
   val_generations_to_log_to_wandb: 0
   nnodes: 1
-  n_gpus_per_node: 1
+  n_gpus_per_node: 2
   save_freq: -1
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto  # or auto or resume_path if
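The batch-size values changed in this config form a hierarchy. As a quick sanity check of how they relate, here is a hedged arithmetic sketch; it assumes verl splits each train batch into PPO mini-batches, and each mini-batch into per-GPU micro-batches via gradient accumulation (the exact semantics may differ between verl versions).

```python
# Values from the config diff above (2 GPUs per node).
train_batch_size = 64
ppo_mini_batch_size = 16
ppo_micro_batch_size_per_gpu = 8
n_gpus = 2

# How many PPO mini-batches one train batch yields.
mini_batches_per_step = train_batch_size // ppo_mini_batch_size
# Gradient-accumulation steps needed to cover one mini-batch.
grad_accum = ppo_mini_batch_size // (ppo_micro_batch_size_per_gpu * n_gpus)
print(mini_batches_per_step, grad_accum)  # 4 1
```

With these values each train batch yields 4 PPO mini-batches and no gradient accumulation is needed, since one micro-batch per GPU already covers a mini-batch.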
