mirror of
https://github.com/InternLM/InternBootcamp.git
synced 2026-04-19 12:58:04 +00:00
init-commit
This commit is contained in:
commit
18a552597a
3461 changed files with 1150579 additions and 0 deletions
155
examples/verl_usage/README.md
Normal file
# VERL USAGE

This directory contains examples and data-processing scripts for training with bootcamp datasets in the VERL framework. Below is a brief overview of each file and its functions:

---

## Directory structure

```
/InternBootcamp/examples/verl_usage
├── run_bootcamp.sh
├── bootcamp_reward_for_verl.py
└── verl_data_preprocess.py
```

---

### 1. `run_bootcamp.sh`

This is a shell script for setting up and running VERL training experiments. Its main functions include:

- Setting basic parameters such as the experiment name and data paths.
- Installing the necessary dependencies and configuring the environment.
- Starting the VERL trainer and configuring the experiment parameters (such as model path, batch size, learning rate, etc.).

#### How to use

- Before starting, ensure that important parameters such as `internbootcamp_path`, `model.path`, and `trainer.default_local_dir` are configured correctly.

```bash
./run_bootcamp.sh
```
---

### 2. `bootcamp_reward_for_verl.py`

This Python script dynamically computes the bootcamp reward score during training. It must be embedded in the VERL framework to be used.

The script provides a function `_default_compute_score` that calculates a score based on `data_source`, `model_output`, and `ground_truth` (here, a dictionary of the parameters needed to validate `model_output`). Multiple data sources are supported:

- **Non-bootcamp datasets**: Scores are calculated using modules such as `gsm8k`, `math`, `prime_math`, `prime_code`, etc.
- **Bootcamp datasets**: External bootcamp modules are imported dynamically to calculate scores.
- A `NotImplementedError` is raised if the data source is not supported.
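For bootcamp data sources, the class name is derived from the `data_source` string. The rule can be sketched as follows (the bootcamp names here are hypothetical examples, not names guaranteed to exist in `internbootcamp`):

```python
def bootcamp_class_name(data_source: str) -> str:
    # "bootcamp/<name>" -> "<Name>bootcamp", e.g. "bootcamp/game24" -> "Game24bootcamp"
    bootcamp_name = data_source.split('/')[1]
    return bootcamp_name[0].upper() + bootcamp_name[1:] + "bootcamp"

print(bootcamp_class_name("bootcamp/game24"))  # Game24bootcamp
```

The derived name is then looked up on the `internbootcamp` module with `getattr`, so each bootcamp class must follow this capitalization convention.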
### 3. `verl_data_preprocess.py`

This Python script converts raw data into a VERL-compatible format. Its main functions include:

- **Recursively processing input directories or files**: converts `.jsonl` raw data files to `.parquet` format while preserving the directory structure.
- **Data partitioning**: assigns each file to the `train` or `test` split according to its path.
- **Random shuffling and merging**: combines multiple `.parquet` files into one file and randomly shuffles the data.
- **Metadata addition**: adds `split` and other necessary information to each record.
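As a sketch of the conversion, each raw `.jsonl` record with `data_source`, `prompt`, and `ground_truth` fields becomes one VERL-format row (field names follow `verl_data_preprocess.py`; the record contents below are hypothetical):

```python
import json

# A hypothetical raw record as it might appear in a bootcamp .jsonl file.
raw = {"data_source": "game24", "prompt": "Solve ...", "ground_truth": {"nums": [1, 2, 3, 4]}}

# The VERL-format row built from it.
formatted = {
    "data_source": "bootcamp/" + raw["data_source"],
    "prompt": [{"role": "user", "content": raw["prompt"]}],
    "reward_model": {
        "style": "rule",
        # ground_truth is serialized to a JSON string for storage in Parquet.
        "ground_truth": json.dumps(raw["ground_truth"], ensure_ascii=False),
    },
    "extra_info": {"split": "train", "index": 0},
}
print(formatted["data_source"])  # bootcamp/game24
```

Note the `bootcamp/` prefix added to `data_source`: it is what routes the record to the bootcamp branch of the reward function at training time.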
#### Directory Structure of Outputs

**merged:**

```
examples/bootcamp_generator_outputs/<time_stamp>_for_verl_merged/
├── train/
│   └── bootcamps.parquet
└── test/
    └── bootcamps.parquet
```

**not merged:**

```
examples/bootcamp_generator_outputs/<time_stamp>_for_verl/
├── train/
│   ├── bootcamp1.parquet
│   ├── bootcamp2.parquet
│   └── ...
└── test/
    ├── bootcamp1.parquet
    ├── bootcamp2.parquet
    └── ...
```

#### Example Command

```bash
python examples/verl_usage/verl_data_preprocess.py --src examples/bootcamp_generator_outputs/2025-03-07-16:48:28
```

This command converts all `.jsonl` files in `examples/bootcamp_generator_outputs/2025-03-07-16:48:28` to VERL-compatible `.parquet` files and writes the output to the default directories:

```
examples/bootcamp_generator_outputs/2025-03-07-16:48:28_for_verl_merged/
examples/bootcamp_generator_outputs/2025-03-07-16:48:28_for_verl/
```
---

### 4. Summarizing the Workflow

1. **Data Preprocessing**:

   - Convert raw `.jsonl` data to VERL-compatible `.parquet` format using `verl_data_preprocess.py`.
   - The merged output directory is `<src>_for_verl_merged`, containing the `train/bootcamps.parquet` and `test/bootcamps.parquet` files.
   - The not-merged output directory is `<src>_for_verl`, containing multiple `.parquet` files, each corresponding to one `.jsonl` file.

2. **Embed bootcamp_reward_for_verl.py into the verl framework**

   The relevant code is the following snippet:

```python
|
||||
elif data_source.startswith("bootcamp/"):
|
||||
try:
|
||||
import importlib
|
||||
import json
|
||||
bootcamp_name = data_source.split('/')[1]
|
||||
class_name = bootcamp_name[0].upper() + bootcamp_name[1:] + "bootcamp"
|
||||
module = importlib.import_module(f"internbootcamp")
|
||||
ground_truth = json.loads(ground_truth)
|
||||
return getattr(module, class_name).verify_score(solution_str, ground_truth, format_score=0)
|
||||
except Exception as e:
|
||||
print("Something woring with bootcamp reward because of ",e)
|
||||
return 0
|
||||
```

   Embed this code into the VERL framework at `/verl/utils/reward_score/__init__.py`:

```
/verl
└── utils
    └── reward_score
        └── __init__.py
```

3. **Start the training experiment**:

   - In `run_bootcamp.sh`, set `experiment_name`, `internbootcamp_path`, `train_files`, `test_files`, `actor_rollout_ref.model.path`, `trainer.default_local_dir`, and other important experiment parameters.
   - Run `run_bootcamp.sh` to start VERL training.

---

With the above tools and processes, you can efficiently prepare data and run VERL experiments.

125
examples/verl_usage/README_zh.md
Normal file
# VERL USAGE

This directory contains examples and data-processing scripts for training with bootcamp datasets in the VERL framework. Below is a brief overview of each file and its functions:

---

## Directory structure

```
/InternBootcamp/examples/verl_usage
├── run_bootcamp.sh
├── bootcamp_reward_for_verl.py
└── verl_data_preprocess.py
```

---

### 1. `run_bootcamp.sh`

This is a shell script for setting up and running VERL training experiments. Its main functions include:

- Setting basic parameters such as the experiment name and data paths.
- Installing the necessary dependencies and configuring the environment.
- Starting the VERL trainer and configuring the experiment parameters (such as model path, batch size, learning rate, etc.).

#### How to use

- Before starting, ensure that important parameters such as `internbootcamp_path`, `model.path`, and `trainer.default_local_dir` are configured correctly.

```bash
./run_bootcamp.sh
```
---

### 2. `bootcamp_reward_for_verl.py`

This Python script dynamically computes the bootcamp reward score during training. It must be embedded in the VERL framework to be used.

The script provides a function `_default_compute_score` that calculates a score based on `data_source`, `model_output`, and `ground_truth` (here, a dictionary of the parameters needed to validate `model_output`). Multiple data sources are supported:

- **Non-bootcamp datasets**: Scores are calculated using modules such as `gsm8k`, `math`, `prime_math`, `prime_code`, etc.
- **Bootcamp datasets**: External bootcamp modules are imported dynamically to calculate scores.
- A `NotImplementedError` is raised if the data source is not supported.
### 3. `verl_data_preprocess.py`

This Python script converts raw data into a VERL-compatible format. Its main functions include:

- **Recursively processing input directories or files**: converts `.jsonl` raw data files to `.parquet` format while preserving the directory structure.
- **Data partitioning**: assigns each file to the `train` or `test` split according to its path.
- **Random shuffling and merging**: combines multiple `.parquet` files into one file and randomly shuffles the data.
- **Metadata addition**: adds `split` and other necessary information to each record.

#### Output directory structure

**merged:**

```
examples/bootcamp_generator_outputs/<time_stamp>_for_verl_merged/
├── train/
│   └── bootcamps.parquet
└── test/
    └── bootcamps.parquet
```

**not merged:**

```
examples/bootcamp_generator_outputs/<time_stamp>_for_verl/
├── train/
│   ├── bootcamp1.parquet
│   ├── bootcamp2.parquet
│   └── ...
└── test/
    ├── bootcamp1.parquet
    ├── bootcamp2.parquet
    └── ...
```

#### Example command

```bash
python examples/verl_usage/verl_data_preprocess.py --src examples/bootcamp_generator_outputs/2025-03-07-16:48:28
```

This command converts all `.jsonl` files in `examples/bootcamp_generator_outputs/2025-03-07-16:48:28` to VERL-compatible `.parquet` files and writes the output to the default directories:

```
examples/bootcamp_generator_outputs/2025-03-07-16:48:28_for_verl_merged/
examples/bootcamp_generator_outputs/2025-03-07-16:48:28_for_verl/
```
---

### 4. Summarizing the workflow

1. **Data preprocessing**:
   - Use `verl_data_preprocess.py` to convert raw `.jsonl` data into VERL-compatible `.parquet` format.
   - The merged output directory is `<src>_for_verl_merged`, containing the `train/bootcamps.parquet` and `test/bootcamps.parquet` files.
   - The not-merged output directory is `<src>_for_verl`, containing multiple `.parquet` files, each corresponding to one `.jsonl` file.

2. **Embed bootcamp_reward_for_verl.py into the VERL framework**

   The relevant code is the following snippet:
```python
|
||||
elif data_source.startswith("bootcamp/"):
|
||||
try:
|
||||
import importlib
|
||||
import json
|
||||
bootcamp_name = data_source.split('/')[1]
|
||||
class_name = bootcamp_name[0].upper() + bootcamp_name[1:] + "bootcamp"
|
||||
module = importlib.import_module(f"internbootcamp")
|
||||
ground_truth = json.loads(ground_truth)
|
||||
return getattr(module, class_name).verify_score(solution_str, ground_truth, format_score=0)
|
||||
except Exception as e:
|
||||
print("Something woring with bootcamp reward because of ",e)
|
||||
return 0
|
||||
```

   Embed this code into the VERL framework at `/verl/utils/reward_score/__init__.py`:

```
/verl
└── utils
    └── reward_score
        └── __init__.py
```

3. **Start the training experiment**:
   - In `run_bootcamp.sh`, set `experiment_name`, `internbootcamp_path`, `train_files`, `test_files`, `actor_rollout_ref.model.path`, `trainer.default_local_dir`, and other important experiment parameters.
   - Run `run_bootcamp.sh` to start VERL training.

---

With the above tools and processes, you can efficiently prepare data and run VERL experiments.

54
examples/verl_usage/bootcamp_reward_for_verl.py
Executable file
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# from . import gsm8k, math, prime_math, prime_code


def _default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
    if data_source == 'openai/gsm8k':
        from . import gsm8k
        res = gsm8k.compute_score(solution_str, ground_truth)
    elif data_source in ['lighteval/MATH', 'DigitalLearningGmbH/MATH-lighteval']:
        from . import math_data
        res = math_data.compute_score(solution_str, ground_truth)
    elif data_source in [
            'numina_aops_forum', 'numina_synthetic_math', 'numina_amc_aime', 'numina_synthetic_amc', 'numina_cn_k12',
            'numina_olympiads'
    ]:
        from . import prime_math
        res = prime_math.compute_score(solution_str, ground_truth)
    elif data_source in ['codecontests', 'apps', 'codeforces', 'taco']:
        from . import prime_code
        res = prime_code.compute_score(solution_str, ground_truth, continuous=True)
    elif data_source.startswith("bootcamp/"):
        try:
            import importlib
            import json

            # Derive the bootcamp class name from the data source,
            # e.g. "bootcamp/game24" -> "Game24bootcamp".
            bootcamp_name = data_source.split('/')[1]
            class_name = bootcamp_name[0].upper() + bootcamp_name[1:] + "bootcamp"
            module = importlib.import_module("internbootcamp")
            ground_truth = json.loads(ground_truth)
            return getattr(module, class_name).verify_score(solution_str, ground_truth, format_score=0)
        except Exception as e:
            print("Something went wrong with the bootcamp reward because of", e)
            return 0
    else:
        raise NotImplementedError

    if isinstance(res, (int, float, bool)):
        return float(res)
    else:
        return float(res[0])


# print('solution string (generated raw output)', solution_str.replace('\n', ''))
97
examples/verl_usage/run_bootcamp.sh
Executable file
#!/bin/bash

# Define variables
project_name="bootcamp_example_project"
experiment_name="default_experiment_name"
internbootcamp_path="/path/to/bootcamp"
actor_model="/path/to/actor_model"
verl_path="/path/to/verl"
export WANDB_API_KEY="your_wandb_api_key"

# Install dependencies
pip install -e "$internbootcamp_path"

# Set environment variables
export VERL_PPO_LOGGING_LEVEL=DEBUG
# Do not use these in new versions of VERL
# export HYDRA_FULL_ERROR=1
# export VLLM_ATTENTION_BACKEND=XFORMERS

# Define the file path arrays
train_files=(
    "examples/bootcamp_generator_outputs/<time_stamp>_for_verl_merged/train/bootcamps.parquet"
)
test_files=(
    "examples/bootcamp_generator_outputs/<time_stamp>_for_verl_merged/test/bootcamps.parquet"
    "examples/bootcamp_generator_outputs/<time_stamp>_for_verl/test/aime.parquet"
    "examples/bootcamp_generator_outputs/<time_stamp>_for_verl/test/math.parquet"
)

# Build the train_files and test_files strings as JSON-style arrays
build_json_array() {
    local array=("$@")
    if [[ ${#array[@]} -eq 0 ]]; then
        echo "[]"
    else
        local result="["
        for item in "${array[@]}"; do
            result+="\"$item\","
        done
        result="${result%,}]"
        echo "$result"
    fi
}

train_files_str=$(build_json_array "${train_files[@]}")
test_files_str=$(build_json_array "${test_files[@]}")

# Debug mode: echo commands as they run
set -x

# Run the training script
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files="$train_files_str" \
    data.val_files="$test_files_str" \
    +data.no_chat_template=False \
    data.train_batch_size=64 \
    data.val_batch_size=64 \
    data.truncation=right \
    data.max_prompt_length=4096 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$actor_model \
    trainer.default_hdfs_dir=null \
    trainer.default_local_dir=$verl_path/ckpts/$experiment_name \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=20000 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.disable_log_stats=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.max_num_batched_tokens=32768 \
    actor_rollout_ref.rollout.max_num_seqs=32 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.65 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name=$project_name \
    trainer.experiment_name=$experiment_name \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=20 \
    trainer.total_epochs=1 "$@"
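For reference, the `build_json_array` helper above renders a bash array as a JSON-style list string, which is the form the trainer's `data.train_files`/`data.val_files` options expect. A standalone sketch:

```shell
# Turn positional arguments into a JSON-style array string.
build_json_array() {
    local array=("$@")
    if [[ ${#array[@]} -eq 0 ]]; then
        echo "[]"
    else
        local result="["
        for item in "${array[@]}"; do
            result+="\"$item\","
        done
        result="${result%,}]"   # drop the trailing comma, close the bracket
        echo "$result"
    fi
}

build_json_array "a.parquet" "b.parquet"   # prints ["a.parquet","b.parquet"]
```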
186
examples/verl_usage/verl_data_preprocess.py
Normal file
import os
import json
import random
import subprocess

import pandas as pd
import fire


def get_split_from_path(path):
    """
    Determine the split value from a file path's ancestor directories.

    :param path: full path of the file
    :return: split value (train/test)
    """
    # Walk up the directory tree, up to max_depth levels
    max_depth = 10
    depth = 0
    while path and depth < max_depth:
        parent_dir = os.path.basename(path)
        if parent_dir == "train":
            return "train"
        elif parent_dir == "test":
            return "test"
        path = os.path.dirname(path)  # move up one directory
        depth += 1
    return "test"  # default to "test" if neither train nor test is found


def shuffle_and_merge_parquet_files(input_dir, output_file):
    """
    Shuffle all Parquet files in the given directory and merge them into one file.

    Args:
        input_dir (str): path of the input directory containing Parquet files.
        output_file (str): path of the merged output Parquet file.
    """
    # Check that the input directory exists
    if not os.path.isdir(input_dir):
        raise ValueError(f"Input directory does not exist: {input_dir}")

    # Collect all Parquet files in the directory
    parquet_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith('.parquet')]
    if not parquet_files:
        raise ValueError(f"No Parquet files found in the input directory: {input_dir}")

    # Read all Parquet files and merge them into one DataFrame
    dfs = []
    for file in parquet_files:
        df = pd.read_parquet(file)
        dfs.append(df)

    combined_df = pd.concat(dfs, ignore_index=True)

    # Shuffle the merged DataFrame
    shuffled_df = combined_df.sample(frac=1, random_state=42)

    # Make sure the output directory exists
    if not os.path.exists(os.path.dirname(output_file)):
        os.makedirs(os.path.dirname(output_file))
    # Write the shuffled data to the output file
    shuffled_df.to_parquet(output_file, engine='pyarrow', index=False)

    print(f"Merged Parquet file saved to: {output_file}")


def convert_to_parquet(src_jsonl, tgt_parquet, split, shuffle=True):
    """
    Convert a JSONL file to Parquet format, attaching extra info based on the split.

    :param src_jsonl: path of the source JSONL file
    :param tgt_parquet: path of the target Parquet file
    :param split: dataset split (train/test)
    """
    # Storage for the converted records
    data_list = []

    # Read the JSONL file line by line
    with open(src_jsonl, 'r', encoding='utf-8') as f:
        # Shuffle the data
        lines = list(f.readlines())
        if shuffle:
            random.shuffle(lines)
        for idx, line in enumerate(lines):
            try:
                # Parse the JSON record on this line
                record = json.loads(line.strip())

                # Extract the required fields
                data_source = record.get("data_source", "")
                prompt = record.get("prompt", "")
                ground_truth = record.get("ground_truth", "")

                # Build the record in the target format
                formatted_data = {
                    "data_source": 'bootcamp/' + data_source,
                    "prompt": [{
                        "role": "user",
                        "content": prompt
                    }],
                    "reward_model": {
                        "style": "rule",
                        "ground_truth": json.dumps(ground_truth, ensure_ascii=False)
                    },
                    "extra_info": {
                        'split': split,  # use the split value passed in
                        'index': idx
                    }
                }

                # Append the formatted record
                data_list.append(formatted_data)

            except json.JSONDecodeError as e:
                print(f"Error decoding JSON at line {idx + 1}: {e}")
                continue

    # Convert the list of records to a pandas DataFrame
    df = pd.DataFrame(data_list)

    # Save as a Parquet file
    df.to_parquet(tgt_parquet, index=False)


def _main(src, tgt):
    """Recursively process a directory or file."""
    if os.path.isdir(src):
        # If src is a directory, create the corresponding target directory
        os.makedirs(tgt, exist_ok=True)

        for sub in os.listdir(src):
            src_path = os.path.join(src, sub)
            tgt_path = os.path.join(tgt, sub)
            _main(src_path, tgt_path)  # recurse
    elif src.endswith(".jsonl"):
        # If src is a .jsonl file, add the "verl_" prefix and convert it
        base_name = os.path.basename(src)
        tgt_file_name = f"verl_{base_name}"  # add the "verl_" prefix

        # Change the target suffix to .parquet
        if ".jsonl" in tgt_file_name:
            tgt_file_name = tgt_file_name.replace(".jsonl", ".parquet")
        tgt_path = os.path.join(os.path.dirname(tgt), tgt_file_name)
        tmp_tgt = tgt_path + ".tmp"

        # Determine which split this file belongs to
        split = get_split_from_path(src)

        try:
            convert_to_parquet(src, tmp_tgt, split)  # pass the split value along
            subprocess.run(f"mv {tmp_tgt} {tgt_path}", shell=True, check=True)
        except Exception as e:
            print(f"Error processing {src}: {e}")
            subprocess.run(f"rm -f {tmp_tgt}", shell=True, check=True)


def main(src, tgt=None):
    """
    Entry point; accepts a directory or a file as input.

    :param src: source file or directory path
    :param tgt: target file or directory path
    """
    if not tgt and os.path.isdir(src):
        tgt = src + '_for_verl'

    if not os.path.exists(src):
        raise ValueError(f"Source path does not exist: {src}")

    if os.path.isfile(src) and not src.endswith(".jsonl"):
        raise ValueError(f"Source file is not a .jsonl file: {src}")

    _main(src, tgt)

    # merge and shuffle
    shuffle_and_merge_parquet_files(input_dir=os.path.join(tgt, 'train'), output_file=os.path.join(tgt + '_merged', 'bootcamps', 'train.parquet'))
    shuffle_and_merge_parquet_files(input_dir=os.path.join(tgt, 'test'), output_file=os.path.join(tgt + '_merged', 'bootcamps', 'test.parquet'))
    # shuffle_and_merge_parquet_files(input_dir=os.path.join(tgt, 'verified_bench'), output_file=os.path.join(tgt + '_merged', 'bootcamps', 'verified_bench.parquet'))


if __name__ == '__main__':
    """
    Example usage:
        python examples/verl_usage/verl_data_preprocess.py --src examples/bootcamp_generator_outputs/2025-03-07-16:48:28
    Converts all .jsonl files under the source directory to the VERL format,
    preserving the directory structure in the default output directory.
    Output files carry the "verl_" prefix.
    """
    fire.Fire(main)
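The split-detection rule in `get_split_from_path` walks up the directory tree looking for a `train` or `test` ancestor directory, defaulting to `test`. A minimal restatement for illustration (the example paths are hypothetical):

```python
import os

# Re-statement of the split-detection rule used by verl_data_preprocess.py.
def get_split_from_path(path, max_depth=10):
    depth = 0
    while path and depth < max_depth:
        if os.path.basename(path) == "train":
            return "train"
        if os.path.basename(path) == "test":
            return "test"
        path = os.path.dirname(path)  # move up one directory
        depth += 1
    return "test"  # default when neither ancestor is found

print(get_split_from_path("data/train/math.jsonl"))  # train
print(get_split_from_path("data/eval/math.jsonl"))   # test
```

Note the asymmetric default: files outside any `train`/`test` directory land in the `test` split, so stray `.jsonl` files will never contaminate the training set.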