
Bootcamp Training with Xtuner

🚄 Training Tutorial

1. Install Dependencies

We use XTuner as the training engine.

First, make sure that InternBootcamp is installed successfully.

pip install -e $InternBootcamp_path

Then install XTuner and its dependencies.

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn --no-build-isolation
pip install xtuner[all]==0.2.0rc0
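
Optionally, verify the environment before launching a job; a one-line sanity check (import names taken from the packages installed above):

python -c "import torch, xtuner, flash_attn; print(torch.__version__, 'CUDA:', torch.cuda.is_available())"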

2. Prepare Data

The bootcamp data can be converted into the training format with examples/xpuyu_usage/xpuyu_data_preprocess.py.

Example usage:

python examples/xpuyu_usage/xpuyu_data_preprocess.py --src examples/bootcamp_generator_outputs/{%Y-%m-%d-%H:%M:%S}
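
To spot-check the converted data, here is a small sketch; it assumes the preprocessor writes JSONL (one training sample per line), and the output path below is only a placeholder:

import json

# Placeholder path; point this at the actual file produced by xpuyu_data_preprocess.py.
path = "examples/bootcamp_generator_outputs/processed.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        sample = json.loads(line)   # assumes one JSON object per line
        print(sample.keys())        # inspect which fields each sample carries
        if i >= 2:                  # peek at the first few samples only
            break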

3. Prepare Your Training Config

Prepare a training config for launching GRPO training.

An example config is located at

examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py
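
XTuner configs are plain Python files managed with mmengine (installed as part of xtuner[all]). One way to inspect the example and derive your own copy; the override shown is purely illustrative, so substitute fields that actually exist in the config:

from mmengine.config import Config

# Load the shipped example config.
cfg = Config.fromfile("examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py")
print(list(cfg.keys()))      # top-level settings defined by the example
# cfg.max_steps = 100        # hypothetical override for illustration only
cfg.dump("my_training_config.py")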

4. Start Training

cd examples/xpuyu_usage

GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Number of nodes (workers); for single-node training, set this to 1
NNODES=${WORLD_SIZE:-1} # modified to adapt cluster

# The rank of this node, in {0, ..., NNODES-1}; for single-node training, set this to 0
NODE_RANK=${RANK:-0} # modified to adapt cluster

# The IP address of the rank-0 node; for single-node training, set this to localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

echo $DISTRIBUTED_ARGS

torchrun $DISTRIBUTED_ARGS train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir examples/xpuyu_usage/ckpts/experiment_name
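
For a quick single-GPU smoke test, the same entry point can be launched without the cluster variables (the work_dir name here is just an example):

torchrun --nproc_per_node 1 train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir ./ckpts/smoke_test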

5. Training Curve Visualization

You can use examples/xpuyu_usage/report_to_wandb.py to visualize your training curves.

python examples/xpuyu_usage/report_to_wandb.py examples/xpuyu_usage/ckpts/{experiment_name}/{timestamp}/rank0.log.jsonl {wandb_project_name}
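
For reference, here is a minimal sketch of the same reporting idea, not the actual script: it assumes each line of rank0.log.jsonl is a JSON object of scalar metrics, possibly carrying a step field:

import json
import sys

import wandb

log_path, project = sys.argv[1], sys.argv[2]
wandb.init(project=project)
with open(log_path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Keep only numeric fields; real logs may nest their metrics differently.
        scalars = {k: v for k, v in record.items() if isinstance(v, (int, float))}
        wandb.log(scalars, step=int(record.get("step", i)))
wandb.finish()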