
# GRPO and SFT training

## Creating training data

To create training data for a specific run based on the 50K data pool, we use `create_training_data.py`.

This script supports two main commands: `grpo` and `sft`.

### `grpo` command

The `grpo` command prepares data for GRPO (Group Relative Policy Optimization) training. It can:

- Aggregate reference outputs from the models specified by `--ref_models`.
  - Available options: `gold` (default), `claude-3-7-sonnet@20250219`, `deepseek-chat-v3`, `gemini-2.5-pro-exp-03-25`, `o4-mini-2025-04-16`, `Llama-3.1-8B-Instruct`.
- Score model outputs using a specified `--metric` (e.g., `bleu`, `rm`, `rouge`) and a `--model` for generation if outputs are not cached.
- Select a subset of examples using `--selection_mode` (`random`, `easy`, `medium`, `hard`) and `--num_examples`.
  - The default is `hard`, which selects the `--num_examples` lowest-scoring examples from the 50K data pool.
- Specify the source Hugging Face dataset with `--hf_dataset_path` (default: `yapeichang/BLEUBERI-Tulu3-50k`).
- Control the output directory with `--output_dir` and the output dataset name with `--output_dataset_name`.
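As a rough sketch of the selection logic (the function name and signature here are illustrative assumptions, not the script's actual interface), `hard` selection amounts to sorting examples by score and keeping the lowest-scoring ones:

```python
import random

def select_examples(examples, scores, selection_mode="hard", num_examples=1000, seed=0):
    """Illustrative sketch: pick a subset of examples by per-example score."""
    paired = list(zip(examples, scores))
    if selection_mode == "random":
        return [ex for ex, _ in random.Random(seed).sample(paired, num_examples)]
    # "easy" keeps the highest-scoring examples, "hard" the lowest-scoring.
    paired.sort(key=lambda p: p[1], reverse=(selection_mode == "easy"))
    if selection_mode == "medium":
        # take a window around the middle of the sorted scores
        start = max(0, len(paired) // 2 - num_examples // 2)
        return [ex for ex, _ in paired[start:start + num_examples]]
    return [ex for ex, _ in paired[:num_examples]]
```

For example, with BLEU as the metric, `hard` would keep the prompts on which the base model's outputs score worst against the references.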

Example usage:

```bash
python training/create_training_data.py grpo \
    --hf_dataset_path yapeichang/BLEUBERI-Tulu3-50k \
    --ref_models gold \
    --selection_mode hard \
    --num_examples 1000 \
    --metric bleu \
    --model Qwen/Qwen2.5-7B \
    --output_dir ../data \
    --output_dataset_name my_grpo_dataset
```

### `sft` command

The `sft` command prepares data for SFT training. It converts a dataset (often produced by the `grpo` command) into prompt-response pairs.

- Specify the input data path (a directory containing a Hugging Face dataset) with `--input_data_path`.
- Control the output directory with `--output_dir`.

Example usage:

```bash
python training/create_training_data.py sft \
    --input_data_path ../data/data_grpo/my_grpo_dataset \
    --output_dir ../data
```
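The grpo-to-sft conversion can be sketched as follows. Note that the field names (`prompt`, `references`) and the chat-messages output format are assumptions for illustration, not the script's actual schema:

```python
def to_sft_pairs(dataset):
    """Illustrative sketch: expand each example into one prompt-response
    pair per reference output, in a chat-messages format."""
    pairs = []
    for example in dataset:
        refs = example["references"]
        if isinstance(refs, str):  # tolerate a single reference string
            refs = [refs]
        for ref in refs:
            pairs.append({
                "messages": [
                    {"role": "user", "content": example["prompt"]},
                    {"role": "assistant", "content": ref},
                ]
            })
    return pairs
```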

For the `grpo` command, the script also handles on-the-fly inference with vLLM when model outputs needed for scoring are missing. Arguments such as `--inference_max_new_tokens` and `--seed` can be used to control this process.
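The caching behavior described above can be sketched as below; `generate_fn` stands in for a vLLM generation call, and the function as a whole is an assumption for illustration, not the script's actual interface:

```python
def fill_outputs(prompts, cache, generate_fn):
    """Illustrative sketch: run inference only for prompts whose outputs
    are not already cached, then read all outputs back from the cache."""
    missing = [p for p in prompts if p not in cache]
    if missing:
        for prompt, output in zip(missing, generate_fn(missing)):
            cache[prompt] = output
    return [cache[p] for p in prompts]
```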

## GRPO training

To run GRPO training, see `grpo.sh` for an example job script that covers training data creation, specifying the DeepSpeed config, and launching the training job.

## SFT training

To run SFT training, see `sft.sh` for an example job script that covers training data creation, specifying the DeepSpeed config, and launching the training job.