atropos/environments/community/metric_card_generator
artifacts Add vivek100's Metric Card Generator Environment to community - Added comprehensive environment for JSON UI component generation with schema validation and multi-dimensional evaluation - Fixed all linting issues and updated community README with proper attribution 2025-05-23 15:11:03 +10:00

Metric Card Generator Environment

Design and Motivation

This environment asks language models to generate structured JSON configurations for Metric Card UI components, intended for AI model evaluation dashboards. It demonstrates a closed-loop generation, evaluation, and visualization pipeline using Atropos.

The environment challenges language models to produce well-structured, valid JSON metric card configurations that front-end applications can use directly. This tests the model's ability to:

  • Follow specific schema requirements
  • Generate complex nested structures
  • Maintain consistent JSON formatting
  • Create semantically meaningful metric descriptions

Quickstart

# Install dependencies
pip install -r requirements.txt

# Run the environment with the process command to generate rollouts
python metric_card_generator.py process --env.data_path_to_save_groups artifacts/metric_rollouts.jsonl

# View the generated HTML visualization
# Open artifacts/metric_rollouts.html in a browser

Environment Components

  • metric_card_generator.py: Main environment implementation with prompting and evaluation logic
  • extract_metric_training.py: Utility to extract high-quality examples for training
  • trainingDataScript.py: Creates training datasets from collected examples
  • show_score_distribution.py: Visualization tool for analyzing model performance

Artifacts

The artifacts folder includes:

  • metric_rollouts.jsonl: Raw model outputs with scores
  • metric_rollouts.html: Visualization of model outputs and scores
  • metric_training.jsonl: Processed examples suitable for fine-tuning
  • metric_training_high_quality.jsonl: Filtered high-quality examples
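
The step from processed examples to the high-quality subset can be pictured as a score threshold over JSONL records. The sketch below is an illustration only, not the logic in extract_metric_training.py: the `score` field name and the 0.8 threshold are assumptions, and the actual record layout may differ.

```python
import json
from typing import Iterable


def filter_high_quality(lines: Iterable[str], threshold: float = 0.8) -> list[dict]:
    """Keep JSONL records whose score meets the threshold.

    Assumes each record is a JSON object with a numeric "score" field
    (a hypothetical name chosen for this sketch).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("score", 0.0) >= threshold:
            kept.append(record)
    return kept
```

To apply it to a file, pass an open file handle, for example `filter_high_quality(open("artifacts/metric_training.jsonl"))`, and write the returned records back out one JSON object per line.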

Evaluation Metrics

The environment evaluates model outputs on several dimensions:

  • JSON Validity: Whether the output is valid, parseable JSON
  • Schema Compliance: Whether the output follows the required structure
  • Semantic Quality: Whether the metrics described make sense for the given context
  • Formatting: Proper nesting, field types, and attribute consistency
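
As a rough illustration of how the first, second, and fourth dimensions could be checked, here is a minimal sketch. It is not the scoring code in metric_card_generator.py: the schema (a `componentType` of `"metricCard"` plus a `props` object with string `title` and `value` fields) and the score names are assumptions made for this example. Semantic quality is left out because it cannot be reduced to a structural check.

```python
import json

# Hypothetical schema for this sketch: required props and their expected types.
REQUIRED_PROPS = {"title": str, "value": str}


def score_metric_card(text: str) -> dict:
    """Return per-dimension scores in [0, 1] for one model output."""
    scores = {"json_valid": 0.0, "schema": 0.0, "formatting": 0.0}

    # JSON validity: the output must parse.
    try:
        card = json.loads(text)
    except json.JSONDecodeError:
        return scores
    scores["json_valid"] = 1.0

    # Schema compliance: top-level object with the expected structure.
    if not isinstance(card, dict):
        return scores
    props = card.get("props")
    if card.get("componentType") == "metricCard" and isinstance(props, dict):
        scores["schema"] = 1.0
        # Formatting: fraction of required props present with the right type.
        typed = [isinstance(props.get(k), t) for k, t in REQUIRED_PROPS.items()]
        scores["formatting"] = sum(typed) / len(typed)
    return scores
```

Returning early at each stage means a later dimension is only scored once the earlier ones pass, so an unparseable output scores zero on every dimension.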

WandB Integration

Performance metrics are logged to Weights & Biases, including:

  • Percentage of valid JSON responses
  • Average scores across evaluation criteria
  • Token usage efficiency
  • Examples of best and worst performing generations
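
The first two items are simple aggregates over per-output scores. The sketch below shows one way to compute them, assuming each output has a dict of scores that includes a `json_valid` entry of 0.0 or 1.0. The metric key names are hypothetical, and the environment's actual logging code may organize them differently.

```python
from statistics import mean


def summarize(scores: list[dict]) -> dict:
    """Aggregate per-output score dicts into summary metrics.

    Assumes every dict has the same keys and a "json_valid" entry
    (names chosen for this sketch, not taken from the environment).
    """
    if not scores:
        return {}
    summary = {
        "eval/percent_valid_json": 100.0 * mean(s["json_valid"] for s in scores)
    }
    # Average each scoring dimension across all outputs.
    for key in scores[0]:
        summary[f"eval/avg_{key}"] = mean(s[key] for s in scores)
    return summary
```

The resulting flat dict of numbers is the shape that `wandb.log` accepts, so it could be passed to that call inside an active run.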

Use with Training

This environment can be integrated into the Atropos training loop to improve a model's ability to generate structured JSON output:

# Example training command
python example_trainer/trainer.py --environment metric_card_generator --model your_model --iterations 1000