# GRPO and SFT training
## Creating training data
To create training data for a specific run based on the [50K data pool](https://huggingface.co/datasets/yapeichang/BLEUBERI-Tulu3-50k), we use [`create_training_data.py`](create_training_data.py).
This script supports two main commands: `grpo` and `sft`.
### `grpo` command
The `grpo` command prepares data for GRPO (Group Relative Policy Optimization) training. It can:
- Aggregate reference outputs from the models specified by `--ref_models`.
  - Available options: `gold` (default), `claude-3-7-sonnet@20250219`, `deepseek-chat-v3`, `gemini-2.5-pro-exp-03-25`, `o4-mini-2025-04-16`, `Llama-3.1-8B-Instruct`.
- Score model outputs using a specified `--metric` (e.g., `bleu`, `rm`, `rouge`); if outputs are not cached, they are generated with the model given by `--model`.
- Select a subset of examples using `--selection_mode` (`random`, `easy`, `medium`, `hard`) and `--num_examples`.
  - The default is `hard`, which selects the `num_examples` lowest-scoring examples from the 50K data pool.
- Specify the source HuggingFace dataset with `--hf_dataset_path` (default: `yapeichang/BLEUBERI-Tulu3-50k`).
- Control the output directory with `--output_dir` and the output dataset name with `--output_dataset_name`.
Example usage:
```bash
python training/create_training_data.py grpo \
    --hf_dataset_path yapeichang/BLEUBERI-Tulu3-50k \
    --ref_models gold \
    --selection_mode hard \
    --num_examples 1000 \
    --metric bleu \
    --model Qwen/Qwen2.5-7B \
    --output_dir ../data \
    --output_dataset_name my_grpo_dataset
```
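As a sketch of a different configuration (a hypothetical run, assuming `--ref_models` accepts a space-separated list as suggested by its plural name), the same command can aggregate references from multiple models and score with the `rm` metric instead of BLEU:

```shell
# Hypothetical variant: reward-model scoring with multiple reference sources.
# Flag names come from the option list above; the multi-model usage is an assumption.
python training/create_training_data.py grpo \
    --hf_dataset_path yapeichang/BLEUBERI-Tulu3-50k \
    --ref_models gold claude-3-7-sonnet@20250219 deepseek-chat-v3 \
    --selection_mode hard \
    --num_examples 1000 \
    --metric rm \
    --model Qwen/Qwen2.5-7B \
    --output_dir ../data \
    --output_dataset_name my_grpo_rm_dataset
```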
### `sft` command
The `sft` command prepares data for SFT training. It converts a dataset (often produced by the `grpo` command) into a series of prompt-response pairs.
- Specify the input data path (a directory containing a HuggingFace dataset) with `--input_data_path`.
- Control output directory with `--output_dir`.
Example usage:
```bash
python training/create_training_data.py sft \
    --input_data_path ../data/data_grpo/my_grpo_dataset \
    --output_dir ../data
```
If model outputs needed for scoring are missing during the `grpo` command, the script runs on-the-fly inference with vLLM. Arguments such as `--inference_max_new_tokens` and `--seed` control this process.
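For example (a sketch combining the documented flags; the specific values are illustrative assumptions), a run that triggers on-the-fly inference might pin the generation length and seed like this:

```shell
# Illustrative only: --inference_max_new_tokens and --seed are the inference
# controls named above; the values 1024 and 42 are arbitrary choices.
python training/create_training_data.py grpo \
    --metric bleu \
    --model Qwen/Qwen2.5-7B \
    --inference_max_new_tokens 1024 \
    --seed 42
```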
## GRPO training
To run GRPO training, see [`grpo.sh`](grpo.sh) for an example job script that covers training data creation, specifying DeepSpeed config, and launching the training job.
## SFT training
To run SFT training, see [`sft.sh`](sft.sh) for an example job script that covers training data creation, specifying DeepSpeed config, and launching the training job.