BLEUBERI Environment for Atropos
This environment implements the BLEUBERI approach for instruction-following using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.
Overview
BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:
- Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
- Computes BLEU scores by comparing model outputs to these references
- Uses these scores as rewards to train models through GRPO
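The three steps above can be sketched in a few lines. This is a simplified stand-in, using a toy clipped n-gram precision in place of the real sacrebleu BLEU, with illustrative names rather than the environment's actual API:

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_like(completion: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions (no brevity penalty).

    A toy stand-in for the real BLEU computed with sacrebleu.
    """
    hyp, ref = completion.split(), reference.split()
    score = 1.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        score *= overlap / max(sum(h.values()), 1)
    return score ** (1.0 / max_n)

# A "group" of sampled completions for one prompt, scored against a
# high-quality reference response.
group = ["the capital of France is Paris", "maybe it is Lyon"]
reference = "Paris is the capital of France"
rewards = [bleu_like(c, reference) for c in group]

# GRPO uses group-relative advantages: each reward minus the group mean.
mean_r = sum(rewards) / len(rewards)
advantages = [r - mean_r for r in rewards]
```

The completion that overlaps the reference more closely receives the higher reward, and the group-relative advantages always sum to zero within a group.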
Installation
Before using the BLEUBERI environment, you need to install its dependencies:
# Install the required dependencies
pip install -r environments/bleuberi/requirements.txt
The key dependencies include:
- model2vec: For embedding-based similarity metrics
- bert-score: For semantic similarity evaluation
- sacrebleu: For BLEU score calculation
- evaluate: For evaluation metrics
- datasets: For dataset handling
Features
- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks
Usage
# Run the BLEUBERI environment as a service
python -m environments.bleuberi.bleuberi_env serve --config environments/bleuberi/configs/default.yaml
# Generate data with pre-collected references (for testing and debugging)
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml --env.data_path_to_save_groups bleuberi_rollouts.jsonl
Testing with OpenAI API
The BLEUBERI environment can be tested with the OpenAI API or any compatible API server. The API key is read from an environment variable rather than stored in the configuration file:
- Set your OpenAI API key as an environment variable:

  export OPENAI_API_KEY=your-api-key

- Create or modify a configuration file for OpenAI (e.g., environments/bleuberi/configs/openai.yaml):

  env:
    # Standard environment configuration
    wandb_name: bleuberi
    dataset_name: "allenai/tulu-3-sft-mixture"
    reward_funcs:
      - "bleu"
    ref_models:
      - "gold"
  openai:
    base_url: "https://api.openai.com/v1"  # Or your custom server URL
    model: "gpt-4o"  # Or your preferred model
    temperature: 0.7
    max_tokens: 1024
    top_p: 0.95
- Run the environment in process mode to test with OpenAI:

  python -m environments.bleuberi.bleuberi_env process \
    --config environments/bleuberi/configs/openai.yaml \
    --env.data_path_to_save_groups bleuberi_openai_test.jsonl
This will create two files:
- bleuberi_openai_test.jsonl: Raw data containing prompts, responses, and scores
- bleuberi_openai_test.html: A visual representation of the interactions for easy review
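Each line of the .jsonl output is a standalone JSON object, so it can be streamed without loading the whole file. A minimal sketch of reading such a file; the field names here ("prompt", "response", "score") are illustrative, not the environment's actual schema:

```python
import json

# Write a tiny stand-in rollouts file; real files come from `process` mode,
# and the field names below are illustrative only.
sample = [
    {"prompt": "Explain BLEU.", "response": "BLEU counts n-gram overlap.", "score": 0.42},
    {"prompt": "What is GRPO?", "response": "A policy-gradient method.", "score": 0.37},
]
with open("rollouts_sample.jsonl", "w") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

# Stream the file back, one JSON object per line.
with open("rollouts_sample.jsonl") as f:
    rows = [json.loads(line) for line in f]

best = max(rows, key=lambda r: r["score"])
```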
- For local inference server testing:
  - Set base_url to your local server (e.g., "http://localhost:8000/v1")
  - Specify the model name as expected by your server
- For custom reference models:
  - Configure ref_models in the YAML to use specific models
  - Available options include: gold (default), claude-3-7-sonnet@20250219, deepseek-chat-v3, gemini-2.5-pro-exp-03-25, o4-mini-2025-04-16, Llama-3.1-8B-Instruct
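When several reference models are configured, each completion can be scored against all of the references at once; multi-reference BLEU clips each n-gram count by the best count seen in any reference. A toy unigram-only sketch of that clipping (illustrative names, not the environment's API):

```python
from collections import Counter

def multi_ref_precision(completion: str, references: list[str]) -> float:
    """Unigram precision with counts clipped by the best reference count."""
    hyp = Counter(completion.split())
    # For each word, keep the highest count seen in any reference.
    best = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            best[word] = max(best[word], count)
    overlap = sum(min(count, best[word]) for word, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

refs = [
    "Paris is the capital of France",       # e.g. the "gold" reference
    "The capital city of France is Paris",  # e.g. from a second model
]
score = multi_ref_precision("The capital of France is Paris", refs)
```

A completion only needs to match some reference for each word, so adding references from diverse strong models makes the reward less sensitive to any one reference's phrasing.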
Configuration
See the configs/ directory for example configurations. The environment supports:
- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)
References
This implementation is based on the paper "BLEUBERI: BLEU is a surprisingly effective reward for instruction following" and its original implementation.