# bootcamp Training with XTuner
## 🚄 Training Tutorial
### 1. Install Dependencies
We utilize [XTuner](https://github.com/InternLM/xtuner/tree/main) as the training engine.
First, make sure that InternBootcamp itself is installed:
```bash
pip install -e $InternBootcamp_path
```
Then install XTuner and its dependencies:
```bash
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn --no-build-isolation
pip install xtuner[all]==0.2.0rc0
```
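
As a quick sanity check before moving on (a minimal sketch; it only assumes the packages above installed cleanly), verify that the toolchain imports and CUDA is visible:

```bash
# Check that PyTorch sees the GPUs and that XTuner imports
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import xtuner"
# The import name for InternBootcamp is assumed to be `internbootcamp`
python -c "import internbootcamp"
```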
### 2. Prepare Data
The bootcamp data can be converted into the training format with `examples/xpuyu_usage/xpuyu_data_preprocess.py`.
**Example usage:**
```bash
python examples/xpuyu_usage/xpuyu_data_preprocess.py --src examples/bootcamp_generator_outputs/{%Y-%m-%d-%H:%M:%S}
```
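
To confirm the conversion worked, it can help to eyeball one converted record. A minimal sketch, assuming the script writes JSON-lines output (the path below is a placeholder for whatever file the script actually produces):

```bash
# Pretty-print the first converted record; substitute the real output path
head -n 1 path/to/converted_data.jsonl | python -m json.tool
```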
### 3. Prepare Your Training Config
Prepare a training config to launch GRPO training. An example config is provided at:

```
examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py
```
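
The config is a plain Python file, so one convenient workflow (just a suggestion; the copy's name below is hypothetical) is to duplicate the example and edit the copy, e.g. model path, data path, and GRPO hyperparameters:

```bash
cd examples/xpuyu_usage
# my_training_config.py is a hypothetical name; point train_grpo.py at whichever copy you edit
cp bootcamp_rl/configs/example_training_config.py bootcamp_rl/configs/my_training_config.py
```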
### 4. Start Training
```bash
cd examples/xpuyu_usage

GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Number of worker nodes; for single-node training, set to 1
NNODES=${WORLD_SIZE:-1} # adjust to match your cluster

# Rank of this worker, in {0, ..., NNODES-1}; for single-node training, set to 0
NODE_RANK=${RANK:-0} # adjust to match your cluster

# IP address of the rank-0 worker; for single-node training, use localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

echo $DISTRIBUTED_ARGS

torchrun $DISTRIBUTED_ARGS train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir examples/xpuyu_usage/ckpts/experiment_name
```
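
With the defaults above, the same script already works on a single node (`WORLD_SIZE`, `RANK`, etc. simply fall back to single-node values). For a quick single-GPU smoke test, a stripped-down launch might look like this (the work dir is just an example name):

```bash
# Hypothetical single-GPU debug run, executed from examples/xpuyu_usage
torchrun --nproc_per_node 1 train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir ./ckpts/debug_run
```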
### 5. Training Curve Visualization
You can use `examples/xpuyu_usage/report_to_wandb.py` to visualize your training curves:
```bash
python examples/xpuyu_usage/report_to_wandb.py examples/xpuyu_usage/ckpts/{experiment_name}/{timestamp}/rank0.log.jsonl {wandb_project_name}
```
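
If you want to check what the log contains before uploading to Weights & Biases, each line of `rank0.log.jsonl` should be one JSON record, so a quick peek works (a sketch; the exact metric keys depend on the training run):

```bash
# Pretty-print the first logged record to see which metrics are tracked;
# substitute the placeholders with your actual experiment directory
head -n 1 examples/xpuyu_usage/ckpts/{experiment_name}/{timestamp}/rank0.log.jsonl | python -m json.tool
```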