Add BLEUBERI environment for reference-based RL

This commit is contained in:
Allan Niemerg 2025-06-08 18:02:33 -05:00
parent 3f6015e622
commit 5bb5bd2c3d
7 changed files with 948 additions and 0 deletions

View file

@ -0,0 +1,41 @@
# BLEUBERI Environment for Atropos
This environment implements the BLEUBERI approach for instruction-following using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.
## Overview
BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:
1. Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
2. Computes BLEU scores by comparing model outputs to these references
3. Uses these scores as rewards to train models through GRPO
## Features
- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks
## Usage
```bash
# Run the BLEUBERI environment
python -m atroposlib.cli.dpo --env-module environments.bleuberi.bleuberi_env
# Generate data with pre-collected references
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml
```
## Configuration
See the `configs/` directory for example configurations. The environment supports:
- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)
## References
This implementation is based on the paper [BLEUBERI: BLEU is a surprisingly effective reward for instruction following](https://arxiv.org/abs/2505.11080) and its [original implementation](https://github.com/lilakk/BLEUBERI).