atropos/environments/community/metric_card_generator/README.md

65 lines
2.6 KiB
Markdown

# Metric Card Generator Environment
## Design and Motivation
This environment generates structured JSON configurations for Metric Card UI components for AI model evaluation dashboards. It demonstrates a closed-loop generation, evaluation, and visualization pipeline using Atropos.
The environment challenges language models to produce well-structured, valid JSON metric card configurations that can be directly used in front-end applications. This tests the model's ability to:
- Follow specific schema requirements
- Generate complex nested structures
- Maintain consistent JSON formatting
- Create semantically meaningful metric descriptions
## Quickstart
```bash
# Install dependencies
pip install -r requirements.txt
# Run the environment with process command to generate rollouts
python metric_card_generator.py process --env.data_path_to_save_groups artifacts/metric_rollouts.jsonl
# View the generated HTML visualization
# Open artifacts/metric_rollouts.html in a browser
```
## Environment Components
- **metric_card_generator.py**: Main environment implementation with prompting and evaluation logic
- **extract_metric_training.py**: Utility to extract high-quality examples for training
- **trainingDataScript.py**: Creates training datasets from collected examples
- **show_score_distribution.py**: Visualization tool for analyzing model performance
## Artifacts
The artifacts folder includes:
- **metric_rollouts.jsonl**: Raw model outputs with scores
- **metric_rollouts.html**: Visualization of model outputs and scores
- **metric_training.jsonl**: Processed examples suitable for fine-tuning
- **metric_training_high_quality.jsonl**: Filtered high-quality examples
## Evaluation Metrics
The environment evaluates model outputs on several dimensions:
- **JSON Validity**: Whether the output is valid, parseable JSON
- **Schema Compliance**: Whether the output follows the required structure
- **Semantic Quality**: Whether the metrics described make sense for the given context
- **Formatting**: Proper nesting, field types, and attribute consistency
## WandB Integration
Performance metrics are logged to Weights & Biases, including:
- Percent of valid JSON responses
- Average scores across evaluation criteria
- Token usage efficiency
- Examples of best and worst performing generations
## Use with Training
This environment can be integrated into the Atropos training loop to improve a model's ability to generate structured JSON output:
```bash
# Example training command
python example_trainer/trainer.py --environment metric_card_generator --model your_model --iterations 1000
```