BLEUBERI Environment for Atropos
This environment implements the BLEUBERI approach for instruction-following using BLEU scores as rewards. BLEUBERI (BLEU-Based Enhanced Utility for Better Evaluating Reward in Instruction-following) demonstrates that BLEU scores, when paired with high-quality references from strong LLMs, can be highly effective rewards for training models to follow instructions.
Overview
BLEUBERI uses BLEU scores (a simple n-gram matching metric) directly as rewards in a Group Relative Policy Optimization (GRPO) training framework. The approach:
- Collects high-quality reference responses from top LLMs (Claude, Gemini, etc.)
- Computes BLEU scores by comparing model outputs to these references
- Uses these scores as rewards to train models through GRPO
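The three steps above can be sketched in a few lines. This is a simplified stand-in, using a toy clipped n-gram precision in place of the real sacrebleu BLEU, with illustrative names rather than the environment's actual API:

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_like(completion: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions (no brevity penalty).

    A toy stand-in for the real BLEU computed with sacrebleu.
    """
    hyp, ref = completion.split(), reference.split()
    score = 1.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        score *= overlap / max(sum(h.values()), 1)
    return score ** (1.0 / max_n)

# A "group" of sampled completions for one prompt, scored against a
# high-quality reference response.
group = ["the capital of France is Paris", "maybe it is Lyon"]
reference = "Paris is the capital of France"
rewards = [bleu_like(c, reference) for c in group]

# GRPO uses group-relative advantages: each reward minus the group mean.
mean_r = sum(rewards) / len(rewards)
advantages = [r - mean_r for r in rewards]
```

The completion that overlaps the reference more closely receives the higher reward, and the group-relative advantages always sum to zero within a group.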
Installation
Before using the BLEUBERI environment, you need to install its dependencies:
# Install the required dependencies
pip install -r environments/bleuberi/requirements.txt
The key dependencies include:
- model2vec: For embedding-based similarity metrics
- bert-score: For semantic similarity evaluation
- sacrebleu: For BLEU score calculation
- evaluate: For evaluation metrics
- datasets: For dataset handling
Features
- BLEU-based reward functions (with support for multiple reference models)
- Compatible with the Atropos asynchronous environment framework
- Support for both SFT and GRPO training approaches
- Evaluation on instruction-following benchmarks
Usage
# Run the BLEUBERI environment as a service
python -m environments.bleuberi.bleuberi_env serve --config environments/bleuberi/configs/default.yaml
# Generate data with pre-collected references (for testing and debugging)
python -m environments.bleuberi.bleuberi_env process --config environments/bleuberi/configs/default.yaml --env.data_path_to_save_groups bleuberi_rollouts.jsonl
Testing with OpenAI API
The BLEUBERI environment can be tested with the OpenAI API or any compatible API server. The API key is read from an environment variable rather than stored in the configuration file:
- Set your OpenAI API key as an environment variable:

  export OPENAI_API_KEY=your-api-key

- Create or modify a configuration file for OpenAI (e.g., environments/bleuberi/configs/openai.yaml):

  env:
    # Standard environment configuration
    wandb_name: bleuberi
    dataset_name: "allenai/tulu-3-sft-mixture"
    reward_funcs:
      - "bleu"
    ref_models:
      - "gold"
  openai:
    base_url: "https://api.openai.com/v1"  # Or your custom server URL
    model: "gpt-4o"  # Or your preferred model
    temperature: 0.7
    max_tokens: 1024
    top_p: 0.95
- Run the environment in process mode to test with OpenAI:

  python -m environments.bleuberi.bleuberi_env process \
    --config environments/bleuberi/configs/openai.yaml \
    --env.data_path_to_save_groups bleuberi_openai_test.jsonl
This will create two files:
- bleuberi_openai_test.jsonl: Raw data containing prompts, responses, and scores
- bleuberi_openai_test.html: A visual representation of the interactions for easy review
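Each line of the .jsonl output is a standalone JSON object, so it can be streamed without loading the whole file. A minimal sketch of reading such a file; the field names here ("prompt", "response", "score") are illustrative, not the environment's actual schema:

```python
import json

# Write a tiny stand-in rollouts file; real files come from `process` mode,
# and the field names below are illustrative only.
sample = [
    {"prompt": "Explain BLEU.", "response": "BLEU counts n-gram overlap.", "score": 0.42},
    {"prompt": "What is GRPO?", "response": "A policy-gradient method.", "score": 0.37},
]
with open("rollouts_sample.jsonl", "w") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

# Stream the file back, one JSON object per line.
with open("rollouts_sample.jsonl") as f:
    rows = [json.loads(line) for line in f]

best = max(rows, key=lambda r: r["score"])
```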
- For local inference server testing:
  - Set base_url to your local server (e.g., "http://localhost:8000/v1")
  - Specify the model name as expected by your server
- For custom reference models:
  - Configure ref_models in the YAML to use specific models
  - Available options include: gold (default), claude-3-7-sonnet@20250219, deepseek-chat-v3, gemini-2.5-pro-exp-03-25, o4-mini-2025-04-16, Llama-3.1-8B-Instruct
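When several reference models are configured, each completion can be scored against all of the references at once; multi-reference BLEU clips each n-gram count by the best count seen in any reference. A toy unigram-only sketch of that clipping (illustrative names, not the environment's API):

```python
from collections import Counter

def multi_ref_precision(completion: str, references: list[str]) -> float:
    """Unigram precision with counts clipped by the best reference count."""
    hyp = Counter(completion.split())
    # For each word, keep the highest count seen in any reference.
    best = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            best[word] = max(best[word], count)
    overlap = sum(min(count, best[word]) for word, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

refs = [
    "Paris is the capital of France",       # e.g. the "gold" reference
    "The capital city of France is Paris",  # e.g. from a second model
]
score = multi_ref_precision("The capital of France is Paris", refs)
```

A completion only needs to match some reference for each word, so adding references from diverse strong models makes the reward less sensitive to any one reference's phrasing.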
Configuration
See the configs/ directory for example configurations. The environment supports:
- Using pre-collected references or generating references on-the-fly
- Multiple reference models for more robust BLEU scoring
- Various BLEU calculation parameters
- Different dataset sources (default: Tulu3 mixture)
References
This implementation is based on the paper "BLEUBERI: BLEU is a surprisingly effective reward for instruction following" and its original implementation.