# BLEUBERI Environment for Atropos

This environment implements the BLEUBERI approach for instruction-following using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.

## Overview

BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:

1. Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
2. Computes BLEU scores by comparing model outputs to these references
3. Uses these scores as rewards to train models through GRPO

## Installation

Before using the BLEUBERI environment, you need to install its dependencies:

```bash
# Install the required dependencies
pip install -r environments/bleuberi/requirements.txt
```

The key dependencies include:

- `model2vec`: For embedding-based similarity metrics
- `bert-score`: For semantic similarity evaluation
- `sacrebleu`: For BLEU score calculation
- `evaluate`: For evaluation metrics
- `datasets`: For dataset handling

## Features

- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks

## Usage

```bash
# Run the BLEUBERI environment as a service
python -m environments.bleuberi.bleuberi_env serve --config environments/bleuberi/configs/default.yaml

# Generate data with pre-collected references (for testing and debugging)
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml --env.data_path_to_save_groups bleuberi_rollouts.jsonl
```

## Testing with OpenAI API

The BLEUBERI environment can be tested with OpenAI API or any compatible API server.
The API key is loaded securely from environment variables:

1. Set your OpenAI API key as an environment variable:

   ```bash
   export OPENAI_API_KEY=your-api-key
   ```

2. Create or modify a configuration file for OpenAI (e.g., `environments/bleuberi/configs/openai.yaml`):

   ```yaml
   env:
     # Standard environment configuration
     wandb_name: bleuberi
     dataset_name: "allenai/tulu-3-sft-mixture"
     reward_funcs:
       - "bleu"
     ref_models:
       - "gold"

   openai:
     base_url: "https://api.openai.com/v1"  # Or your custom server URL
     model: "gpt-4o"  # Or your preferred model
     temperature: 0.7
     max_tokens: 1024
     top_p: 0.95
   ```

3. Run the environment in process mode to test with OpenAI:

   ```bash
   python -m environments.bleuberi.bleuberi_env process \
     --config environments/bleuberi/configs/openai.yaml \
     --env.data_path_to_save_groups bleuberi_openai_test.jsonl
   ```

   This will create two files:

   - `bleuberi_openai_test.jsonl`: Raw data containing prompts, responses, and scores
   - `bleuberi_openai_test.html`: A visual representation of the interactions for easy review

4. For local inference server testing:

   - Set `base_url` to your local server (e.g., "http://localhost:8000/v1")
   - Specify the model name as expected by your server

5. For custom reference models:

   - Configure `ref_models` in the YAML to use specific models
   - Available options include: `gold` (default), `claude-3-7-sonnet@20250219`, `deepseek-chat-v3`, `gemini-2.5-pro-exp-03-25`, `o4-mini-2025-04-16`, `Llama-3.1-8B-Instruct`

## Configuration

See the `configs/` directory for example configurations. The environment supports:

- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)

## References

This implementation is based on the paper [BLEUBERI: BLEU is a surprisingly effective reward for instruction following](https://arxiv.org/abs/2505.11080) and its [original implementation](https://github.com/lilakk/BLEUBERI).
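
## Example: BLEU rewards and group-relative advantages

The three steps in the Overview (score completions against references, then use the scores as GRPO rewards) can be illustrated with a small standalone sketch. This is not the environment's actual code: `simple_bleu` and `group_advantages` are hypothetical helper names, the BLEU here is a simplified pure-Python stand-in with add-one smoothing (the environment relies on `sacrebleu`, whose tokenization and smoothing differ), and the mean/std normalization is the group-relative form commonly associated with GRPO, which the trainer you pair with Atropos may implement differently.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, references, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] with add-one smoothing.

    Illustrative stand-in only (assumption): whitespace tokenization and
    smoothing on every n-gram order, unlike sacrebleu's defaults.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the log defined when nothing matches.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty uses the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, one reference, and a group of three sampled completions.
references = ["the cat sat on the mat"]
completions = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "dogs bark loudly",
]
rewards = [simple_bleu(c, references) for c in completions]
advantages = group_advantages(rewards)

for text, reward, adv in zip(completions, rewards, advantages):
    print(f"{reward:.3f}  {adv:+.3f}  {text}")
```

Completions closer to the reference receive higher rewards and therefore positive advantages within the group, while the weakest completion receives a negative one. Passing several strings in `references` mirrors the multi-reference setup that `ref_models` configures.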