# GRPO and SFT training

## Creating training data
To create training data for a specific run based on the 50K data pool, we use `create_training_data.py`. This script supports two main commands: `grpo` and `sft`.
### `grpo` command
The `grpo` command prepares data for GRPO (Group Relative Policy Optimization) training. It can:

- Aggregate reference outputs from different models specified by `--ref_models`. Available options include: `gold` (default), `claude-3-7-sonnet@20250219`, `deepseek-chat-v3`, `gemini-2.5-pro-exp-03-25`, `o4-mini-2025-04-16`, `Llama-3.1-8B-Instruct`.
- Score model outputs using a specified `--metric` (e.g., `bleu`, `rm`, `rouge`) and a `--model` for generation if outputs are not cached.
- Select a subset of examples using `--selection_mode` (`random`, `easy`, `medium`, `hard`) and `--num_examples`. The default setting is `hard`, which selects the `num_examples` lowest-scoring examples from the 50K data pool.
- Specify the source HuggingFace dataset with `--hf_dataset_path`. The default is `yapeichang/BLEUBERI-Tulu3-50k`.
- Control the output directory with `--output_dir` and the output dataset name with `--output_dataset_name`.
Example usage:

```bash
python training/create_training_data.py grpo \
    --hf_dataset_path yapeichang/BLEUBERI-Tulu3-50k \
    --ref_models gold \
    --selection_mode hard \
    --num_examples 1000 \
    --metric bleu \
    --model Qwen/Qwen2.5-7B \
    --output_dir ../data \
    --output_dataset_name my_grpo_dataset
```
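The selection modes above can be sketched as a sort over per-example metric scores. The following is a minimal illustration only; the actual logic in `create_training_data.py` may differ (in particular, the `medium` behavior and tie-breaking shown here are assumptions):

```python
import random


def select_examples(examples, scores, selection_mode="hard", num_examples=1000):
    """Pick a subset of the data pool based on per-example metric scores.

    `examples` and `scores` are parallel lists. The "medium" strategy
    (a window around the median score) is an illustrative guess, not
    taken from the real script.
    """
    ranked = sorted(zip(examples, scores), key=lambda pair: pair[1])
    if selection_mode == "hard":      # lowest-scoring examples
        chosen = ranked[:num_examples]
    elif selection_mode == "easy":    # highest-scoring examples
        chosen = ranked[-num_examples:]
    elif selection_mode == "medium":  # a window around the median score
        start = len(ranked) // 2 - num_examples // 2
        chosen = ranked[start:start + num_examples]
    else:                             # "random"
        chosen = random.sample(ranked, num_examples)
    return [ex for ex, _ in chosen]
```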
### `sft` command
The `sft` command prepares data for SFT training. It converts a dataset (often produced by the `grpo` command) into a series of prompt-response pairs.

- Specify the input data path (a directory containing a HuggingFace dataset) with `--input_data_path`.
- Control the output directory with `--output_dir`.
Example usage:

```bash
python training/create_training_data.py sft \
    --input_data_path ../data/data_grpo/my_grpo_dataset \
    --output_dir ../data
```
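Conceptually, this conversion flattens each record into a prompt paired with one reference response. The sketch below uses hypothetical field names (`prompt`, `references`), which may not match the real dataset columns:

```python
def to_sft_pairs(records, ref_model="gold"):
    """Flatten GRPO-style records into prompt/response pairs for SFT.

    Assumes each record holds a prompt and a dict of reference outputs
    keyed by model name; these field names are illustrative only.
    """
    pairs = []
    for rec in records:
        response = rec["references"].get(ref_model)
        if response:  # skip records without a usable reference for this model
            pairs.append({"prompt": rec["prompt"], "response": response})
    return pairs
```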
The script also handles on-the-fly inference with vLLM if model outputs are missing when scoring in the `grpo` command. Arguments like `--inference_max_new_tokens` and `--seed` can be used to control this process.
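The on-the-fly inference step can be thought of as a gate on missing outputs: only prompts without a cached completion are generated. The sketch below stubs out the generation call (in the real script this would be a vLLM engine) and assumes a simple dict-like cache, which is an illustration rather than the script's actual caching scheme:

```python
def fill_missing_outputs(prompts, cache, generate_fn, max_new_tokens=512):
    """Return an output for every prompt, generating only the missing ones.

    `generate_fn` stands in for a vLLM generation call; the cache layout
    and parameter names here are assumptions for illustration.
    """
    missing = [p for p in prompts if p not in cache]
    if missing:
        for prompt, output in zip(missing, generate_fn(missing, max_new_tokens)):
            cache[prompt] = output  # store so later runs skip regeneration
    return [cache[p] for p in prompts]
```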
## GRPO training

To run GRPO training, see `grpo.sh` for an example job script that covers training data creation, specifying the DeepSpeed config, and launching the training job.
## SFT training

To run SFT training, see `sft.sh` for an example job script that covers training data creation, specifying the DeepSpeed config, and launching the training job.