atropos/environments/community/metric_card_generator/README.md

# Metric Card Generator Environment

## Design and Motivation

This environment generates structured JSON configurations for Metric Card UI components for AI model evaluation dashboards. It demonstrates a closed-loop generation, evaluation, and visualization pipeline using Atropos.

The environment challenges language models to produce well-structured, valid JSON metric card configurations that can be directly used in front-end applications. This tests the model's ability to:
- Follow specific schema requirements
- Generate complex nested structures
- Maintain consistent JSON formatting
- Create semantically meaningful metric descriptions

## Quickstart

```bash
# Install dependencies
pip install -r requirements.txt

# Run the environment with process command to generate rollouts
python metric_card_generator.py process --env.data_path_to_save_groups artifacts/metric_rollouts.jsonl

# View the generated HTML visualization
# Open artifacts/metric_rollouts.html in a browser
```

## Environment Components

- **metric_card_generator.py**: Main environment implementation with prompting and evaluation logic
- **extract_metric_training.py**: Utility to extract high-quality examples for training
- **trainingDataScript.py**: Creates training datasets from collected examples
- **show_score_distribution.py**: Visualization tool for analyzing model performance

## Artifacts

The artifacts folder includes:
- **metric_rollouts.jsonl**: Raw model outputs with scores
- **metric_rollouts.html**: Visualization of model outputs and scores
- **metric_training.jsonl**: Processed examples suitable for fine-tuning
- **metric_training_high_quality.jsonl**: Filtered high-quality examples

## Evaluation Metrics

The environment evaluates model outputs on several dimensions:

- **JSON Validity**: Whether the output is valid, parseable JSON
- **Schema Compliance**: Whether the output follows the required structure
- **Semantic Quality**: Whether the metrics described make sense for the given context
- **Formatting**: Proper nesting, field types, and attribute consistency

## WandB Integration

Performance metrics are logged to Weights & Biases, including:
- Percent of valid JSON responses
- Average scores across evaluation criteria
- Token usage efficiency
- Examples of best and worst performing generations

## Use with Training

This environment can be integrated into the Atropos training loop to improve a model's ability to generate structured JSON output:

```bash
# Example training command
python example_trainer/trainer.py --environment metric_card_generator --model your_model --iterations 1000
```