# BLEUBERI: BLEU is a surprisingly effective reward for instruction following
This repository hosts the official code and data release for our preprint, BLEUBERI: BLEU is a surprisingly effective reward for instruction following.
> **TL;DR:** We extend RLVR beyond easily verifiable domains like math and code to the more open-ended setting of general instruction following. Surprisingly, we find that BLEU—a simple n-gram matching metric—when paired with high-quality references from strong LLMs, achieves agreement with human preferences comparable to that of 8B and 27B reward models on Chatbot Arena outputs. Based on this insight, we introduce BLEUBERI, which uses BLEU directly as the reward in GRPO training. BLEUBERI matches the performance of RM-guided GRPO across four instruction-following benchmarks and produces more factually grounded outputs, which human raters judge on par with those from reward-model-trained systems.
If you find our work useful, please cite:

```bibtex
@misc{chang2025bleuberibleusurprisinglyeffective,
      title={BLEUBERI: BLEU is a surprisingly effective reward for instruction following},
      author={Yapei Chang and Yekyung Kim and Michael Krumdick and Amir Zadeh and Chuan Li and Chris Tanner and Mohit Iyyer},
      year={2025},
      eprint={2505.11080},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.11080},
}
```
## BLEU's surprisingly high agreement with human preferences
The Chatbot Arena conversations dataset contains examples with an instruction, a Model A response, a Model B response, and a human preference label. In our analysis of 900 samples, we find that BLEU—when using responses from strong LLMs as references—achieves surprisingly strong alignment with human preferences. Its agreement rates are comparable to those of top-performing reward models like Skywork-RM-27B and Skywork-RM-8B, which rank 8th and 15th on the newly updated RewardBench v2 (as of June 2025).
Check `arena_analysis` for how to replicate our results.
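The agreement measurement above can be sketched as follows: for each Arena example, score both responses with sentence-level BLEU against a strong-LLM reference, predict the preference as whichever response scores higher, and compare that prediction to the human label. This is a minimal, self-contained sketch using a toy whitespace-tokenized BLEU (the repo itself computes BLEU with a standard library implementation, and the example strings are invented for illustration):

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Toy whitespace-tokenized BLEU with add-1 smoothing on orders n > 1."""
    h, r = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hc, rc = _ngrams(h, n), _ngrams(r, n)
        match = sum(min(c, rc[g]) for g, c in hc.items())  # clipped n-gram matches
        total = max(sum(hc.values()), 1)
        if n == 1:
            p = max(match, 1e-9) / total   # avoid log(0) on zero overlap
        else:
            p = (match + 1) / (total + 1)  # add-1 smoothing for sparse orders
        log_prec += math.log(p) / max_n
    bp = min(1.0, math.exp(1 - len(r) / max(len(h), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

def bleu_judge(response_a: str, response_b: str, reference: str) -> str:
    """Predict the human preference label: higher BLEU vs. the reference wins."""
    a = sentence_bleu(response_a, reference)
    b = sentence_bleu(response_b, reference)
    return "A" if a > b else "B" if b > a else "tie"

ref = "The capital of France is Paris."
print(bleu_judge("Paris is the capital of France.",
                 "I think it might be Lyon.", ref))  # A
```

Agreement with humans is then just the fraction of examples where `bleu_judge` matches the human label.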
## BLEUBERI: GRPO training with BLEU as the reward
Building on this analysis, we train models directly with BLEU as the reward and find that they are surprisingly competitive with RM-trained models across four general instruction-following benchmarks.
Check `training` for details on creating training data and running GRPO / SFT training.

To replicate benchmark evaluation results, see `eval`.
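Plugging BLEU into GRPO amounts to supplying a reward function that scores each sampled completion against the gold reference(s) for its prompt. A hedged sketch of that function's shape (batch of completions in, one float per completion out, taking the max over multiple references); the `references` column name is an assumption, and the unigram-overlap scorer is a toy stand-in for a real BLEU implementation:

```python
from collections import Counter

def unigram_overlap(hyp: str, ref: str) -> float:
    # Toy stand-in for BLEU: clipped unigram precision with a rough
    # brevity penalty so very short completions are not rewarded.
    h, r = hyp.split(), ref.split()
    if not h:
        return 0.0
    rc = Counter(r)
    overlap = sum(min(c, rc[w]) for w, c in Counter(h).items())
    bp = min(1.0, len(h) / max(len(r), 1))
    return bp * overlap / len(h)

def bleu_reward(completions, references, **kwargs):
    # One scalar reward per sampled completion; when several references
    # are available, reward the best match among them.
    return [max(unigram_overlap(c, r) for r in refs)
            for c, refs in zip(completions, references)]

print(bleu_reward(["a b c"], [["a b c d"]]))  # [0.75]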
## Contact
If you have any questions, feel free to open a GitHub issue or email Yapei (yapeic@umd.edu).