mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
Metric Card Generator Environment
Design and Motivation
This environment generates structured JSON configurations for Metric Card UI components used in AI model evaluation dashboards. It demonstrates a closed-loop pipeline of generation, evaluation, and visualization built on Atropos.
The environment challenges language models to produce well-structured, valid JSON metric card configurations that front-end applications can use directly. This tests the model's ability to:
- Follow specific schema requirements
- Generate complex nested structures
- Maintain consistent JSON formatting
- Create semantically meaningful metric descriptions
Quickstart
# Install dependencies
pip install -r requirements.txt
# Run the environment with the process command to generate rollouts
python metric_card_generator.py process --env.data_path_to_save_groups artifacts/metric_rollouts.jsonl
# View the generated HTML visualization
# Open artifacts/metric_rollouts.html in a browser
Environment Components
- metric_card_generator.py: Main environment implementation with prompting and evaluation logic
- extract_metric_training.py: Utility to extract high-quality examples for training
- trainingDataScript.py: Creates training datasets from collected examples
- show_score_distribution.py: Visualization tool for analyzing model performance
Artifacts
The artifacts folder includes:
- metric_rollouts.jsonl: Raw model outputs with scores
- metric_rollouts.html: Visualization of model outputs and scores
- metric_training.jsonl: Processed examples suitable for fine-tuning
- metric_training_high_quality.jsonl: Filtered high-quality examples
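To illustrate how a filtered file such as metric_training_high_quality.jsonl might be derived from raw rollouts, here is a minimal sketch. It assumes each JSONL line is a JSON object with a numeric `score` field; that field name, the `min_score` threshold, and the `filter_high_quality` helper are hypothetical, and the actual records written by the environment may be structured differently.

```python
import json
from typing import Iterable, Iterator


def filter_high_quality(lines: Iterable[str], min_score: float = 0.8) -> Iterator[dict]:
    """Yield records whose score meets the threshold.

    Assumes each line is a JSON object with a numeric "score" field
    (a hypothetical layout, not confirmed by the environment's code).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get("score")
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score >= min_score:
            yield record
```

Because the function accepts any iterable of strings, an open file handle for artifacts/metric_rollouts.jsonl can be passed to it directly.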
Evaluation Metrics
The environment evaluates model outputs on several dimensions:
- JSON Validity: Whether the output is valid, parseable JSON
- Schema Compliance: Whether the output follows the required structure
- Semantic Quality: Whether the metrics described make sense for the given context
- Formatting: Proper nesting, field types, and attribute consistency
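The structural dimensions above (validity, schema compliance, field types) could be checked roughly as follows. This is a minimal sketch, not the scoring logic in metric_card_generator.py: the field names (`componentType`, `props`, `title`, `value`), the `score_metric_card` helper, and the equal weighting of checks are assumptions made for illustration. Semantic quality is not covered, since it cannot be verified by structure alone.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the schema enforced by metric_card_generator.py.
REQUIRED_TOP_LEVEL = {"componentType": str, "props": dict}
REQUIRED_PROPS = {"title": str, "value": str}


def score_metric_card(raw_output: str) -> float:
    """Return a score in [0, 1] for a single model output."""
    # JSON validity: unparseable output scores zero.
    try:
        card = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(card, dict):
        return 0.0

    checks = []
    # Schema compliance: required top-level keys with the expected types.
    for key, expected_type in REQUIRED_TOP_LEVEL.items():
        checks.append(isinstance(card.get(key), expected_type))
    # Nested fields are only inspected when "props" is a dict.
    props = card.get("props") if isinstance(card.get("props"), dict) else {}
    for key, expected_type in REQUIRED_PROPS.items():
        checks.append(isinstance(props.get(key), expected_type))

    # The score is the fraction of checks that passed.
    return sum(checks) / len(checks)
```

Returning a fraction rather than a pass/fail flag gives partial credit to outputs that are valid JSON but miss some fields, which tends to produce a more informative training signal.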
WandB Integration
Performance metrics are logged to Weights & Biases, including:
- Percent of valid JSON responses
- Average scores across evaluation criteria
- Token usage efficiency
- Examples of best and worst performing generations
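As a rough illustration of how the first two metrics might be aggregated before logging, consider the sketch below. The `summarize` helper and the metric key names are hypothetical and do not reflect the keys the environment actually logs.

```python
import json
from statistics import mean
from typing import Dict, List


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def summarize(outputs: List[str], scores: List[float]) -> Dict[str, float]:
    """Aggregate a batch of outputs and scores into a flat metrics dict."""
    valid = [is_valid_json(o) for o in outputs]
    return {
        # Key names are illustrative assumptions.
        "eval/percent_valid_json": 100.0 * sum(valid) / len(valid) if valid else 0.0,
        "eval/avg_score": mean(scores) if scores else 0.0,
    }
```

A dict of this shape could then be passed to `wandb.log(...)` inside an active run.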
Use with Training
This environment can be integrated into the Atropos training loop to improve a model's ability to generate structured JSON output:
# Example training command
python example_trainer/trainer.py --environment metric_card_generator --model your_model --iterations 1000