# Pay-to-Play Environment with Mixture of Judges
A reinforcement learning environment in which an AI agent must strategically select and pay specialized agent cards before each evaluation, introducing economic constraints and strategic decision-making into AI training.
This environment builds upon recent advances in RLHF with AI feedback (Lee et al., 2023) and mixture of judges approaches (Xu et al., 2024) to create a training paradigm that combines economic incentives with multi-agent evaluation.
## 🎯 Overview
This environment transforms the traditional AI evaluation process by introducing:
- Economic Constraints: Real USDC payments on Base blockchain (or simulated)
- Strategic Agent Card Selection: Agent chooses from multiple specialized agent cards with different expertise and prices
- Budget Management: Agent must balance cost vs. quality across training iterations
- Performance Tracking: Historical data informs future agent card selection decisions
## 🎯 Architecture

### Separated Configuration Design

The system uses a clean separation of concerns:
- `agent_cards_config.py`: Agent card definitions, pricing, specialties, and system prompts
- `secrets.json`: Wallet addresses and private keys
- `pay_to_play_env.py`: Main environment logic and orchestration
This design allows:
- Easy addition of new agent cards without touching wallet credentials
- Secure management of private keys separate from code
- Version control safety (secrets.json should never be committed)
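A minimal sketch of how the environment might load `secrets.json` at startup while keeping credentials out of versioned code; the function name and error handling here are illustrative assumptions, not the actual implementation:

```python
# Sketch: loading per-card wallet credentials separately from card
# definitions. The secrets.json layout follows the template described
# in this README.
import json
from pathlib import Path


def load_wallets(secrets_path: str = "secrets.json") -> dict:
    """Load per-card wallet credentials; fail loudly if the file is missing."""
    path = Path(secrets_path)
    if not path.exists():
        raise FileNotFoundError(
            f"{secrets_path} not found - copy secrets.json.template and fill it in"
        )
    return json.loads(path.read_text())
```

Because the wallet file is loaded at runtime rather than imported, new agent cards can be added to `agent_cards_config.py` without touching any credentials.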
### Agent Card System

The environment includes three specialized agent cards defined in `agent_cards_config.py`:
| Agent Card | Price | Specialties | Description |
|---|---|---|---|
| Technical Expert | $0.03 | Technical Accuracy, Reasoning Logic | Premium agent card for STEM questions, complex reasoning, factual correctness |
| Communication Specialist | $0.02 | Clarity Communication | Mid-tier agent card focusing on readability, structure, and clear explanations |
| Creative Thinker | $0.01 | Creative Thinking | Budget agent card for creativity, originality, and innovative solutions |
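A hypothetical reconstruction of the structures behind `agent_cards_config.py`, inferred from this README and the example in the next section; the real enum members and field names may differ:

```python
# Assumed shapes of AgentCardSpecialty and AgentCardConfig, based only
# on the specialties and fields mentioned in this README.
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class AgentCardSpecialty(Enum):
    TECHNICAL_ACCURACY = "technical_accuracy"
    REASONING_LOGIC = "reasoning_logic"
    CLARITY_COMMUNICATION = "clarity_communication"
    CREATIVE_THINKING = "creative_thinking"
    FACTUAL_CORRECTNESS = "factual_correctness"


@dataclass
class AgentCardConfig:
    name: str
    price_usd: Decimal          # USDC price per evaluation
    specialties: list           # list[AgentCardSpecialty]
    description: str
    system_prompt: str          # prompt given to the judging model
```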
### Adding New Agent Cards

To add a new agent card:

1. Add to `agent_cards_config.py`:

```python
"new_agent_card": AgentCardConfig(
    name="New Agent Card Name",
    price_usd=Decimal("0.025"),
    specialties=[AgentCardSpecialty.FACTUAL_CORRECTNESS],
    description="Agent card description here",
    system_prompt="Your system prompt here...",
),
```

2. Add a wallet to `secrets.json`:

```json
"new_agent_card": {
    "address": "0x...",
    "private_key": "0x..."
}
```
The environment will automatically load and initialize the new agent card.
### Agent Decision Process

1. Question Analysis: The agent directly analyzes the question and determines what expertise is needed
2. Agent Card Evaluation: Assess available agent cards based on:
   - Specialty matching to the question requirements
   - Historical performance
   - Budget constraints
   - Cost-effectiveness
3. Strategic Selection: Choose 1-3 agent cards based on configuration
4. Payment Execution: Make USDC payments to selected agent cards
5. Evaluation: Receive scores from multiple agent cards and aggregate results
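The final aggregation step might look like the sketch below; the equal-weight mean is an assumption, since the environment could also weight cards by price, specialty match, or past reliability:

```python
# Sketch: collapsing per-card scores into one training signal.
# Equal weighting is an illustrative assumption.
def aggregate_scores(card_scores: dict) -> float:
    """Average the 0-1 scores returned by the selected agent cards."""
    if not card_scores:
        raise ValueError("at least one agent card score is required")
    return sum(card_scores.values()) / len(card_scores)
```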
### Budget Tracking

The `BudgetTracker` class monitors:
- Current balance and total spending
- Per-agent-card spending breakdown
- Average cost per evaluation
- Affordability checks
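A minimal sketch of what such a tracker might look like, covering the four responsibilities above; the method names are illustrative assumptions rather than the environment's actual interface:

```python
# Assumed BudgetTracker sketch: balance, per-card spending, average
# cost, and affordability checks, as described in this README.
from collections import defaultdict


class BudgetTracker:
    def __init__(self, initial_budget_usd: float):
        self.balance = initial_budget_usd
        self.total_spent = 0.0
        self.spent_per_card = defaultdict(float)
        self.num_evaluations = 0

    def can_afford(self, cost: float) -> bool:
        return self.balance >= cost

    def record_payment(self, card_name: str, cost: float) -> None:
        if not self.can_afford(cost):
            raise RuntimeError(f"insufficient budget for {card_name}")
        self.balance -= cost
        self.total_spent += cost
        self.spent_per_card[card_name] += cost
        self.num_evaluations += 1

    def average_cost(self) -> float:
        return self.total_spent / max(self.num_evaluations, 1)
```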
## 📊 Configuration

```python
class PayToPlayConfig(BaseEnvConfig):
    testing_mode: bool = False       # True for simulated payments
    initial_budget_usd: float = 1.0  # Starting budget
    min_judges_per_eval: int = 1     # Minimum agent cards required
    max_judges_per_eval: int = 3     # Maximum agent cards allowed
```
## 🚀 Usage

### Initial Setup

1. Copy the wallet template:

```bash
cd src/environments/pay_to_play/
cp secrets.json.template secrets.json
```

2. Configure your wallets by editing `secrets.json` with real addresses and private keys

3. Add it to `.gitignore` (it should already be there):

```bash
echo "src/environments/pay_to_play/secrets.json" >> .gitignore
```
### Basic Setup

```python
from environments.pay_to_play.pay_to_play_env import PayToPlayEnv, PayToPlayConfig
from atroposlib.envs.base import APIServerConfig

# Configuration
config = PayToPlayConfig(
    testing_mode=True,  # Set to False for real blockchain payments
    initial_budget_usd=0.50,
    min_judges_per_eval=1,
    max_judges_per_eval=2,
    tokenizer_name="gpt2",
    use_wandb=True,
)

server_configs = [
    APIServerConfig(
        model_name="your-model",
        base_url="http://localhost:9001/v1",
        api_key="your-key",
        num_requests_for_eval=64,
    ),
]

# Initialize environment
env = PayToPlayEnv(config, server_configs, testing=True)
```
### Training Loop

```python
async def training_loop():
    await env.setup()
    for step in range(config.total_steps):
        # Get next question
        question = await env.get_next_item()
        # Agent selects agent cards, makes payments, evaluates the
        # response, and returns the training signal
        scored_data, _ = await env.collect_trajectories(question)
        # Log metrics
        await env.wandb_log()
```
## 💰 Economic Model

### Pricing Strategy
- Technical Expert ($0.03): Premium pricing reflects high accuracy and specialized knowledge
- Communication Specialist ($0.02): Mid-tier pricing for clarity and accessibility focus
- Creative Thinker ($0.01): Budget option encouraging creativity and innovation
### Budget Scenarios
| Budget | Strategy | Evaluations Possible |
|---|---|---|
| $0.02 | Use only Creative Thinker | 2 evaluations |
| $0.05 | Mixed strategy possible | 3-5 evaluations |
| $0.10+ | Full flexibility | 10+ evaluations |
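The scenarios above can be sanity-checked with a quick calculation of how many evaluations a budget supports under a single-card strategy; the helper function is illustrative, but the prices match the table:

```python
# Evaluations affordable under a single-card strategy, using Decimal
# to avoid float rounding on dollar amounts.
from decimal import Decimal

CARD_PRICES = {
    "technical_expert": Decimal("0.03"),
    "communication_specialist": Decimal("0.02"),
    "creative_thinker": Decimal("0.01"),
}


def max_evaluations(budget_usd: str, card: str) -> int:
    """Whole number of evaluations the budget covers for one card."""
    return int(Decimal(budget_usd) // CARD_PRICES[card])
```

For example, a $0.02 budget covers exactly two Creative Thinker evaluations, matching the first row of the table.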
## 🧠 Agent Intelligence

The agent makes strategic decisions considering:
- Dynamic Question Analysis: Agent analyzes each question to understand what type of expertise is required
- Agent Card Specialty Matching: Matches question requirements to available agent card specialties without rigid categorization
- Budget Conservation: Early training uses cheaper agent cards, later phases invest in quality
- Performance History: Agent cards with better past performance get preference
- Cost-Effectiveness: Balance between agent card quality and budget constraints for each specific question
### Intelligent Selection Algorithm

```python
async def _agent_select_judges(self, question: str) -> JudgeSelection:
    # 1. Agent analyzes question directly (no pre-categorization)
    # 2. Get agent card performance stats
    judge_stats = self._get_judge_performance_stats()
    # 3. AI agent makes strategic decision with full context
    selection_response = await self.server.chat_completion(
        messages=[
            {"role": "system", "content": "Analyze each question and match to agent card specialties..."},
            {"role": "user", "content": selection_prompt},
        ]
    )
    # 4. Validate and execute selection
    return validated_selection
```
The agent receives detailed information about each agent card's specialties, past performance, and pricing, and uses it to decide which combination of agent cards will provide the best evaluation for each specific question.
## 📈 Metrics and Monitoring

### Weights & Biases Integration

The environment logs comprehensive metrics:

**Budget Metrics:**
- Current balance and total spent
- Budget utilization percentage
- Average cost per evaluation
- Per-agent-card spending breakdown
**Agent Card Performance:**
- Individual agent card scoring history
- Agent satisfaction ratings
- Selection frequency analysis
- Performance consistency tracking
**Agent Decisions:**
- Agent card selection reasoning
- Question analysis by the agent
- Strategic decision patterns
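The budget portion of the payload might be assembled as in this sketch; the tracker attribute names and metric keys follow this README's examples and are assumptions about the actual logging code:

```python
# Sketch: building the budget slice of the wandb metrics dict.
# Attribute names (balance, total_spent) are assumed.
def build_wandb_metrics(tracker, initial_budget_usd: float) -> dict:
    return {
        "budget/current_balance": tracker.balance,
        "budget/total_spent": tracker.total_spent,
        "budget/utilization_percent": 100.0 * tracker.total_spent / initial_budget_usd,
    }
```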
### Example Wandb Dashboard

```text
budget/current_balance: 0.85
budget/total_spent: 0.15
budget/utilization_percent: 15.0
payments/success_rate: 1.0
judge_performance/technical_expert_avg_score: 0.85
selection_frequency/creative_thinker_percent: 60.0
```
## 🔧 Testing

The system includes comprehensive testing for:
- Agent card initialization and metadata
- Agent question analysis (no rigid categorization)
- Budget tracking functionality
- Strategic selection logic
- Pricing strategy analysis
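An illustrative pytest-style check for the budget-tracking behavior listed above; the minimal tracker here is a stand-in (using integer cents to avoid float drift), not the environment's actual test suite:

```python
# Stand-in tracker for demonstrating the overspend check; the real
# BudgetTracker in pay_to_play_env.py may differ.
class MiniTracker:
    def __init__(self, budget_cents: int):
        self.balance = budget_cents

    def can_afford(self, cost_cents: int) -> bool:
        return self.balance >= cost_cents

    def pay(self, cost_cents: int) -> None:
        assert self.can_afford(cost_cents), "insufficient budget"
        self.balance -= cost_cents


def test_budget_blocks_overspend():
    tracker = MiniTracker(budget_cents=2)  # the $0.02 scenario
    tracker.pay(1)                         # two Creative Thinker evaluations
    tracker.pay(1)
    assert not tracker.can_afford(1)       # a third evaluation is rejected
```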
## 🌐 Blockchain Integration

### Wallet Configuration

⚠️ SECURITY IMPORTANT: Never commit `secrets.json` with real private keys to version control!

1. Copy the template:

```bash
cp secrets.json.template secrets.json
```

2. Update `secrets.json` with your wallet credentials

3. Add it to `.gitignore`:

```bash
echo "src/environments/pay_to_play/secrets.json" >> .gitignore
```
### Security Best Practices
- Use separate wallets for each agent card to maintain clear accounting
- Test with small amounts first in testing mode
- Monitor wallet balances regularly
- Use hardware wallets for production deployments
- Never share private keys or commit them to version control
- Consider using environment variables for production deployments
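The environment-variable option in the last bullet might look like this sketch: prefer a key from the process environment and fall back to `secrets.json`. The variable naming scheme is an illustrative assumption:

```python
# Sketch: resolving a card's private key from an environment variable
# first, then from secrets.json. The PAY_TO_PLAY_* naming is assumed.
import json
import os


def resolve_private_key(card: str, secrets_path: str = "secrets.json") -> str:
    env_key = os.environ.get(f"PAY_TO_PLAY_{card.upper()}_PRIVATE_KEY")
    if env_key:
        return env_key
    with open(secrets_path) as f:
        return json.load(f)[card]["private_key"]
```

In production this keeps private keys out of files entirely; in development the file fallback keeps setup simple.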
### Real Payments

Set `testing_mode=False` for real USDC payments on the Base blockchain:
- Requires funded agent wallet with USDC
- Payments are permanent and irreversible
- Monitor gas costs for small transactions
- Ensure all agent card wallets are properly configured
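Two details worth getting right when wiring up real payments: USDC uses 6 decimals (so $0.03 is 30,000 base units), and an ERC-20 transfer is a call to `transfer(address,uint256)`. The sketch below covers only these pure-computation pieces; the signing and sending via web3.py are omitted, and the addresses in the test are placeholders:

```python
# Sketch: USD -> USDC base units and standard ERC-20 transfer calldata.
# Selector a9059cbb is keccak("transfer(address,uint256)")[:4].
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimals, not 18


def usd_to_usdc_units(amount_usd: str) -> int:
    """Convert a dollar amount to USDC base units."""
    return int(Decimal(amount_usd) * 10**USDC_DECIMALS)


def encode_transfer(to_address: str, units: int) -> bytes:
    """ABI-encode transfer(address,uint256): 4-byte selector + two 32-byte words."""
    selector = bytes.fromhex("a9059cbb")
    addr = bytes.fromhex(to_address.removeprefix("0x")).rjust(32, b"\x00")
    amount = units.to_bytes(32, "big")
    return selector + addr + amount
```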
## 🎖️ Key Features

- ✅ Multiple Specialized Agent Cards: Different expertise areas and pricing tiers
- ✅ Intelligent Agent Selection: AI-driven agent card selection with dynamic question analysis
- ✅ Budget Awareness: Real economic constraints drive efficient learning
- ✅ Performance Tracking: Historical data informs future decisions
- ✅ Blockchain Integration: Real USDC payments on the Base network
- ✅ Comprehensive Monitoring: Detailed metrics and decision analysis
- ✅ Fallback Mechanisms: Robust handling of budget constraints
- ✅ Testing Framework: Simulation mode for development and testing
## 🚧 Future Enhancements
- Dynamic Pricing: Agent card prices adjust based on demand and performance
- Agent Card Reputation System: Community-driven agent card quality ratings
- Multi-Round Evaluation: Iterative feedback and improvement cycles
- Agent Card Specialization: More granular specialty categories
- Economic Incentives: Reward mechanisms for high-performing agent cards
## 📋 Requirements
- Python 3.11+
- Web3.py for blockchain interaction
- AtroposLib for base environment
- USDC on Base blockchain (for real payments)
- Weights & Biases account (optional, for monitoring)
## 🤝 Contributing

When adding new agent cards or features:

1. Update the `AgentCardSpecialty` enum for new specialties
2. Add the agent card configuration in `agent_cards_config.py`
3. Update the wallet configuration in `secrets.json`
4. Add appropriate test cases
5. Update the documentation
## 📚 References

This environment builds upon recent advances in reinforcement learning from AI feedback:

- RLAIF vs. RLHF: Lee, H., et al. (2023). "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv preprint arXiv:2309.00267. https://arxiv.org/abs/2309.00267
- Mixture of Judges: Xu, T., et al. (2024). "The Perfect Blend: Redefining RLHF with Mixture of Judges." arXiv preprint arXiv:2409.20370. https://arxiv.org/abs/2409.20370