# BLEUBERI Environment for Atropos
This environment implements the BLEUBERI approach for instruction-following using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.

## Overview

BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:

1. Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
2. Computes BLEU scores by comparing model outputs to these references
3. Uses these scores as rewards to train models through GRPO

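The reward in step 2 can be sketched in a few lines. The function below is a simplified, self-contained stand-in for the `sacrebleu`-based scorer the environment actually uses: it computes sentence-level BLEU with effective-order handling (empty n-gram levels are skipped, so short texts are not zeroed out), returning a reward in [0, 1].

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        if not cand_ngrams:
            continue  # effective order: candidate too short for this n
        overlap = sum((cand_ngrams & ngram_counts(ref, n)).values())
        precisions.append(overlap / sum(cand_ngrams.values()))
    if not precisions or min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * geo_mean


# A perfect match earns the maximum reward:
# bleu_reward("the cat sat", "the cat sat")  -> 1.0
```

In the real environment this score is produced per rollout and handed to the trainer as the scalar reward; the implementation here is only meant to show the shape of the computation.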
## Installation

Before using the BLEUBERI environment, you need to install its dependencies:

```bash
# Install the required dependencies
pip install -r environments/bleuberi/requirements.txt
```

The key dependencies include:

- `model2vec`: For embedding-based similarity metrics
- `bert-score`: For semantic similarity evaluation
- `sacrebleu`: For BLEU score calculation
- `evaluate`: For evaluation metrics
- `datasets`: For dataset handling

## Features

- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks

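On the GRPO side, the trainer scores a group of sampled responses per prompt and converts raw rewards into group-relative advantages. A minimal sketch of that normalization step (a common GRPO formulation; the exact Atropos implementation may differ):

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Turn one prompt's group of rewards into zero-mean advantages,
    scaled by the group's standard deviation (common GRPO recipe)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that out-score their group get positive advantages and are reinforced; below-average responses get negative ones, regardless of the absolute BLEU scale.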
## Usage

```bash
# Run the BLEUBERI environment as a service
python -m environments.bleuberi.bleuberi_env serve --config environments/bleuberi/configs/default.yaml

# Generate data with pre-collected references (for testing and debugging)
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml --env.data_path_to_save_groups bleuberi_rollouts.jsonl
```

## Testing with OpenAI API

The BLEUBERI environment can be tested with the OpenAI API or any compatible API server. The API key is loaded from environment variables:

1. Set your OpenAI API key as an environment variable:

   ```bash
   export OPENAI_API_KEY=your-api-key
   ```

2. Create or modify a configuration file for OpenAI (e.g., `environments/bleuberi/configs/openai.yaml`):

   ```yaml
   env:
     # Standard environment configuration
     wandb_name: bleuberi
     dataset_name: "allenai/tulu-3-sft-mixture"
     reward_funcs:
       - "bleu"
     ref_models:
       - "gold"

   openai:
     base_url: "https://api.openai.com/v1"  # Or your custom server URL
     model: "gpt-4o"  # Or your preferred model
     temperature: 0.7
     max_tokens: 1024
     top_p: 0.95
   ```

3. Run the environment in process mode to test with OpenAI:

   ```bash
   python -m environments.bleuberi.bleuberi_env process \
     --config environments/bleuberi/configs/openai.yaml \
     --env.data_path_to_save_groups bleuberi_openai_test.jsonl
   ```

   This will create two files:

   - `bleuberi_openai_test.jsonl`: Raw data containing prompts, responses, and scores
   - `bleuberi_openai_test.html`: A visual representation of the interactions for easy review

4. For local inference server testing:

   - Set `base_url` to your local server (e.g., "http://localhost:8000/v1")
   - Specify the model name as expected by your server

5. For custom reference models:

   - Configure `ref_models` in the YAML to use specific models
   - Available options include: `gold` (default), `claude-3-7-sonnet@20250219`, `deepseek-chat-v3`, `gemini-2.5-pro-exp-03-25`, `o4-mini-2025-04-16`, `Llama-3.1-8B-Instruct`

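When several reference models are configured, a natural scoring rule is to credit the response against its best-matching reference, which mirrors how multi-reference BLEU credits n-grams appearing in any reference. A minimal sketch, using clipped unigram precision as a stand-in for the full BLEU metric:

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens also found in the reference
    (clipped counts) -- a stand-in for a full BLEU score."""
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / max(len(cand), 1)


def multi_reference_reward(candidate, references):
    """Score the response against its best-matching reference."""
    return max(unigram_precision(candidate, r) for r in references)
```

With references from several strong models, a response only needs to match one of them well to earn a high reward, which makes the signal more robust to stylistic variation across references.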
## Configuration

See the `configs/` directory for example configurations. The environment supports:

- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)

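For example, a hypothetical configuration combining several of the reference models listed above might look like this (the exact schema is defined by the environment's config parser):

```yaml
env:
  dataset_name: "allenai/tulu-3-sft-mixture"
  reward_funcs:
    - "bleu"
  ref_models:
    - "gold"
    - "claude-3-7-sonnet@20250219"
    - "Llama-3.1-8B-Instruct"
```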
## References

This implementation is based on the paper [BLEUBERI: BLEU is a surprisingly effective reward for instruction following](https://arxiv.org/abs/2505.11080) and its [original implementation](https://github.com/lilakk/BLEUBERI).