Feat/unsloth example (#482)

* cleaned up examples
* updated failing hooks
* updated readme
* corrected linting checks
2026-04-19 12:58:07 +00:00 · 2025-06-28 17:04:38 +01:00 · 2025-06-28 17:04:38 +01:00 · 1c98584f28
commit 1c98584f28
parent d9cd20c174
29 changed files with 122 additions and 2857 deletions
--- a/examples/veRL/README.md
+++ b/examples/veRL/README.md
@@ -1,32 +1,74 @@
-### env setup
+# Chain Sum Training with veRL

-```
-conda create --name verl python=3.11 -y
-conda activate verl
+This example demonstrates how to train a language model using veRL (Volcano Engine Reinforcement Learning) with the reasoning-gym environment for chain sum problems.

-pip install flash-attn --no-build-isolation
-pip install ray wandb
-# pip3 install vllm==0.7.0
-pip3 install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+Requirements:
+
+Python >= 3.10
+
+## Installation
+
+1. **Install veRL**: Follow the installation instructions in the [veRL repository](https://github.com/volcengine/verl).
+
+2. **Install reasoning-gym**:
+   ```bash
+   pip install reasoning-gym
+   ```
+
+## Training
+
+To start training the model on chain sum problems:
+
+```bash
+python grpo_train.py --config-path config --config-name grpo_trainer
 ```

-Regarding vllm>0.7 see: [docs](https://verl.readthedocs.io/en/latest/README_vllm0.7.html)
+### Configuration

+You can customize training by editing the configuration file or by overriding arguments directly in the shell command.

-### clone and install veRL
-
-tested with verl HEAD c34206925e2a50fd452e474db857b4d488f8602d
-
-```
-git clone https://github.com/volcengine/verl.git
-cd verl
-pip install -e .
+#### Change dataset
+
+The easiest way to change the dataset is to edit `config/grpo_trainer.yaml` and define a custom training composite. The following example experiment uses a composite of algorithmic training tasks:
+```yaml
+reasoning_gym:
+  dataset_size: 20000
+  developer_prompt: DeepSeekZero
+  datasets:
+    ab:
+      weight: 1
+    base_conversion:
+      weight: 1
+    binary_alternation:
+      weight: 1
+      config:
+        p_solvable: 0.9
+    binary_matrix:
+      weight: 1
+      config:
+        min_n: 2
+        max_n: 6
+    caesar_cipher:
+      weight: 1
+      config:
+        max_words: 10
+    cryptarithm:
+      weight: 1
+    isomorphic_strings:
+      weight: 1
+      config:
+        max_string_length: 8
 ```

+**Note**: In `config/grpo_trainer.yaml` we specify default arguments to be read from `file:///workspace/verl/verl/trainer/config`. Modify this accordingly if you have a different local folder setup.

-Optionally log in to huggingface hub and wandb with your keys:
+#### Change configuration
+Set `project_name` and `experiment_name` if you log your runs to W&B. This config assumes a single GPU node, which you can also change. The following command is for 2 GPUs, with 1 used for vLLM rollouts:

-```
-huggingface-cli login
-wandb login
-```
+python3 -u train_grpo.py --config-path configs/inter_generalisation --config-name algorithmic_qwen_3b \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    trainer.n_gpus_per_node=2 \
+    trainer.project_name=rg-grpo \
+    trainer.experiment_name=algorithmic_qwen2.5_3b
+
+Alternatively, you can define these settings directly in a config file.
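
Note on the command-line overrides above: each dotted key corresponds to a nested key in the YAML config. The sketch below illustrates that correspondence using only the Python standard library. It is not Hydra's own parser, and the helper names `overrides_to_config` and `to_yaml` are hypothetical, introduced here purely for illustration. It assumes plain `key=value` overrides with scalar values.

```python
from typing import Any


def overrides_to_config(overrides: list[str]) -> dict[str, Any]:
    """Turn `a.b.c=value` style overrides into a nested dict (illustrative helper)."""
    config: dict[str, Any] = {}
    for item in overrides:
        # Split on the first "=" only, so values may themselves contain "=" or ".".
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        # Treat purely numeric values as integers; keep everything else as text.
        node[leaf] = int(raw_value) if raw_value.isdigit() else raw_value
    return config


def to_yaml(node: dict[str, Any], indent: int = 0) -> str:
    """Render nested mappings and scalars as simple block-style YAML."""
    lines = []
    for key, value in node.items():
        pad = "  " * indent
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


# The overrides from the 2-GPU command in the README.
overrides = [
    "actor_rollout_ref.rollout.tensor_model_parallel_size=1",
    "trainer.n_gpus_per_node=2",
    "trainer.project_name=rg-grpo",
    "trainer.experiment_name=algorithmic_qwen2.5_3b",
]
config = overrides_to_config(overrides)
print(to_yaml(config))
```

The printed structure groups the settings under `actor_rollout_ref.rollout` and `trainer`, which is where they would sit in a config file, assuming the config follows the same key hierarchy as the overrides.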