mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-28 17:29:30 +00:00
Add BLEUBERI environment for reference-based RL
This commit is contained in:
parent
3f6015e622
commit
5bb5bd2c3d
7 changed files with 948 additions and 0 deletions
41 environments/bleuberi/README.md Normal file

@@ -0,0 +1,41 @@
# BLEUBERI Environment for Atropos

This environment implements the BLEUBERI approach to instruction following, using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.

## Overview

BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:

1. Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
2. Computes BLEU scores by comparing model outputs to these references
3. Uses these scores as rewards to train models through GRPO
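The reward computation in the steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not the environment's actual code: a real implementation would typically use a standard BLEU library, and the rule used here for combining multiple references (taking the best match) is an assumption for the sketch.

```python
import math
from collections import Counter


def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram
    precisions, times a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((_ngrams(cand, n) & _ngrams(ref, n)).values())
        total = max(sum(_ngrams(cand, n).values()), 1)
        # Floor zero overlaps so one empty n-gram order does not zero the score
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return brevity * math.exp(sum(log_precisions) / max_n)


def reward(output, references):
    # Score against every collected reference and keep the best
    # match (this combination rule is assumed, not from the paper)
    return max(bleu(output, r) for r in references)


def group_advantages(rewards):
    """GRPO-style group-relative advantage: each rollout's reward
    is centered on the mean reward of its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because GRPO only needs rewards to rank rollouts within a group, even a coarse surface metric like BLEU can provide a useful training signal when the references are strong.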

## Features

- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks

## Usage

```bash
# Run the BLEUBERI environment
python -m atroposlib.cli.dpo --env-module environments.bleuberi.bleuberi_env

# Generate data with pre-collected references
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml
```

## Configuration

See the `configs/` directory for example configurations. The environment supports:

- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)
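As an illustration only, a configuration covering these options might look like the following. Every key name here is hypothetical; check the actual files in `configs/` rather than copying this.

```yaml
# Hypothetical sketch; key names are illustrative, not the real schema.
dataset: allenai/tulu-3-sft-mixture  # dataset source (Tulu3 mixture is the stated default)
references:
  precollected: true                 # false would generate references on-the-fly
  models:                            # multiple reference models for more robust scoring
    - claude-3-5-sonnet
    - gemini-1.5-pro
bleu:
  max_ngram_order: 4                 # BLEU calculation parameters
  smoothing: exp
```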

## References

This implementation is based on the paper [BLEUBERI: BLEU is a surprisingly effective reward for instruction following](https://arxiv.org/abs/2505.11080) and its [original implementation](https://github.com/lilakk/BLEUBERI).