# Bootcamp Training with XTuner
## 🚄 Training Tutorial
### 1. Install Dependencies
We use [XTuner](https://github.com/InternLM/xtuner/tree/main) as the training engine.

Make sure that InternBootcamp is successfully installed:
```bash
pip install -e $InternBootcamp_path
```
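To confirm the editable install worked, a quick import check helps (a minimal sketch; it assumes the package is importable as internbootcamp, matching the repository name):

```bash
# Should print the message without an ImportError
python -c "import internbootcamp; print('InternBootcamp OK')"
```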
Then install XTuner and its dependencies.
```bash
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn --no-build-isolation
pip install xtuner[all]==0.2.0rc0
```
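Before moving on, it can be worth sanity-checking the GPU stack (a minimal sketch; the exact version strings will vary with your environment):

```bash
# Verify torch sees CUDA, flash-attn imports cleanly, and xtuner is installed
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
pip show xtuner
```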
### 2. Prepare Data
The bootcamp data can be converted into the training format with examples/xpuyu_usage/xpuyu_data_preprocess.py.

**Example usage:**
```bash
python examples/xpuyu_usage/xpuyu_data_preprocess.py --src examples/bootcamp_generator_outputs/{%Y-%m-%d-%H:%M:%S}
```
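The {%Y-%m-%d-%H:%M:%S} part stands for the timestamped directory created by the bootcamp generator. If you simply want the most recent output, something like the following works (a sketch; it assumes the generator writes one directory per run under examples/bootcamp_generator_outputs/):

```bash
# Pick the newest generator output directory and preprocess it
LATEST=$(ls -td examples/bootcamp_generator_outputs/*/ | head -n 1)
python examples/xpuyu_usage/xpuyu_data_preprocess.py --src "$LATEST"
```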
### 3. Prepare Your Training Config
Prepare your training config to start GRPO training.

An example config is provided at:
```
examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py
```
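A common workflow is to copy the example config and edit the copy, keeping the original as a reference (just a suggestion, not a requirement of the repo):

```bash
# Start your own config from the provided example
cp examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py \
   examples/xpuyu_usage/bootcamp_rl/configs/my_training_config.py
```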
### 4. Start Training
```bash
cd examples/xpuyu_usage

GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Number of workers (nodes); for single-worker training, set to 1
NNODES=${WORLD_SIZE:-1} # adjust to match your cluster

# The rank of this worker, in {0, ..., NNODES-1}; for single-worker training, set to 0
NODE_RANK=${RANK:-0} # adjust to match your cluster

# The IP address of the rank-0 worker; for single-worker training, set to localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

echo $DISTRIBUTED_ARGS

torchrun $DISTRIBUTED_ARGS train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir examples/xpuyu_usage/ckpts/experiment_name
```
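For a quick single-node run, the launch above collapses to one torchrun call (the same command with the single-worker defaults inlined; it assumes all visible GPUs should be used):

```bash
cd examples/xpuyu_usage
torchrun --nproc_per_node "$(python -c 'import torch; print(torch.cuda.device_count())')" \
    train_grpo.py ./bootcamp_rl/configs/example_training_config.py \
    --work_dir examples/xpuyu_usage/ckpts/experiment_name
```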
### 5. Training Curve Visualization
You can use examples/xpuyu_usage/report_to_wandb.py to visualize your training curves.
```bash
python examples/xpuyu_usage/report_to_wandb.py examples/xpuyu_usage/ckpts/{experiment_name}/{timestamp}/rank0.log.jsonl {wandb_project_name}
```
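Since report_to_wandb.py uploads metrics to Weights & Biases, you need to authenticate once beforehand; this is the standard wandb setup rather than anything specific to this repo:

```bash
pip install wandb
wandb login  # prompts for the API key from https://wandb.ai/authorize
```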