# Pay-to-Play Environment with Mixture of Judges
A reinforcement learning environment in which an AI agent must strategically select and pay specialized agent cards before each evaluation, introducing economic constraints and strategic decision-making into AI training.
This environment builds upon recent advances in RLHF with AI feedback (Lee et al., 2023) and mixture of judges approaches (Xu et al., 2024) to create a training paradigm that combines economic incentives with multi-agent evaluation.
## 🎯 Overview
This environment transforms the traditional AI evaluation process by introducing:
- Economic Constraints: Real USDC payments on Base blockchain (or simulated)
- Strategic Agent Card Selection: Agent chooses from multiple specialized agent cards with different expertise and prices
- Budget Management: Agent must balance cost vs. quality across training iterations
- Performance Tracking: Historical data informs future agent card selection decisions
## 🎯 Architecture

### Separated Configuration Design

The system uses a clean separation of concerns:
- `agent_cards_config.py`: Agent card definitions, pricing, specialties, and system prompts
- `secrets.json`: Wallet addresses and private keys
- `pay_to_play_env.py`: Main environment logic and orchestration
This design allows:
- Easy addition of new agent cards without touching wallet credentials
- Secure management of private keys separate from code
- Version control safety (secrets.json should never be committed)
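A minimal sketch of how the environment might load `secrets.json` at startup while keeping credentials out of versioned code; the function name and error handling here are illustrative assumptions, not the actual implementation:

```python
# Sketch: loading per-card wallet credentials separately from card
# definitions. The secrets.json layout follows the template described
# in this README.
import json
from pathlib import Path


def load_wallets(secrets_path: str = "secrets.json") -> dict:
    """Load per-card wallet credentials; fail loudly if the file is missing."""
    path = Path(secrets_path)
    if not path.exists():
        raise FileNotFoundError(
            f"{secrets_path} not found - copy secrets.json.template and fill it in"
        )
    return json.loads(path.read_text())
```

Because the wallet file is loaded at runtime rather than imported, new agent cards can be added to `agent_cards_config.py` without touching any credentials.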
### Agent Card System

The environment includes three specialized agent cards defined in `agent_cards_config.py`:
| Agent Card | Price | Specialties | Description |
|---|---|---|---|
| Technical Expert | $0.03 | Technical Accuracy, Reasoning Logic | Premium agent card for STEM questions, complex reasoning, factual correctness |
| Communication Specialist | $0.02 | Clarity Communication | Mid-tier agent card focusing on readability, structure, and clear explanations |
| Creative Thinker | $0.01 | Creative Thinking | Budget agent card for creativity, originality, and innovative solutions |
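A hypothetical reconstruction of the structures behind `agent_cards_config.py`, inferred from this README and the example in the next section; the real enum members and field names may differ:

```python
# Assumed shapes of AgentCardSpecialty and AgentCardConfig, based only
# on the specialties and fields mentioned in this README.
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class AgentCardSpecialty(Enum):
    TECHNICAL_ACCURACY = "technical_accuracy"
    REASONING_LOGIC = "reasoning_logic"
    CLARITY_COMMUNICATION = "clarity_communication"
    CREATIVE_THINKING = "creative_thinking"
    FACTUAL_CORRECTNESS = "factual_correctness"


@dataclass
class AgentCardConfig:
    name: str
    price_usd: Decimal          # USDC price per evaluation
    specialties: list           # list[AgentCardSpecialty]
    description: str
    system_prompt: str          # prompt given to the judging model
```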
### Adding New Agent Cards

To add a new agent card:

1. Add to `agent_cards_config.py`:

```python
"new_agent_card": AgentCardConfig(
    name="New Agent Card Name",
    price_usd=Decimal("0.025"),
    specialties=[AgentCardSpecialty.FACTUAL_CORRECTNESS],
    description="Agent card description here",
    system_prompt="Your system prompt here...",
),
```

2. Add a wallet to `secrets.json`:

```json
"new_agent_card": {
    "address": "0x...",
    "private_key": "0x..."
}
```
The environment will automatically load and initialize the new agent card.
### Agent Decision Process

1. Question Analysis: The agent directly analyzes the question and determines what expertise is needed
2. Agent Card Evaluation: Assess available agent cards based on:
   - Specialty matching to the question requirements
   - Historical performance
   - Budget constraints
   - Cost-effectiveness
3. Strategic Selection: Choose 1-3 agent cards based on configuration
4. Payment Execution: Make USDC payments to selected agent cards
5. Evaluation: Receive scores from multiple agent cards and aggregate results
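The final aggregation step might look like the sketch below; the equal-weight mean is an assumption, since the environment could also weight cards by price, specialty match, or past reliability:

```python
# Sketch: collapsing per-card scores into one training signal.
# Equal weighting is an illustrative assumption.
def aggregate_scores(card_scores: dict) -> float:
    """Average the 0-1 scores returned by the selected agent cards."""
    if not card_scores:
        raise ValueError("at least one agent card score is required")
    return sum(card_scores.values()) / len(card_scores)
```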
### Budget Tracking

The `BudgetTracker` class monitors:
- Current balance and total spending
- Per-agent-card spending breakdown
- Average cost per evaluation
- Affordability checks
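A minimal sketch of what such a tracker might look like, covering the four responsibilities above; the method names are illustrative assumptions rather than the environment's actual interface:

```python
# Assumed BudgetTracker sketch: balance, per-card spending, average
# cost, and affordability checks, as described in this README.
from collections import defaultdict


class BudgetTracker:
    def __init__(self, initial_budget_usd: float):
        self.balance = initial_budget_usd
        self.total_spent = 0.0
        self.spent_per_card = defaultdict(float)
        self.num_evaluations = 0

    def can_afford(self, cost: float) -> bool:
        return self.balance >= cost

    def record_payment(self, card_name: str, cost: float) -> None:
        if not self.can_afford(cost):
            raise RuntimeError(f"insufficient budget for {card_name}")
        self.balance -= cost
        self.total_spent += cost
        self.spent_per_card[card_name] += cost
        self.num_evaluations += 1

    def average_cost(self) -> float:
        return self.total_spent / max(self.num_evaluations, 1)
```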
## 📊 Configuration

```python
class PayToPlayConfig(BaseEnvConfig):
    testing_mode: bool = False       # True for simulated payments
    initial_budget_usd: float = 1.0  # Starting budget
    min_judges_per_eval: int = 1     # Minimum agent cards required
    max_judges_per_eval: int = 3     # Maximum agent cards allowed
```
## 🚀 Usage

### Initial Setup

1. Copy the wallet template:

```bash
cd src/environments/pay_to_play/
cp secrets.json.template secrets.json
```

2. Configure your wallets by editing `secrets.json` with real addresses and private keys

3. Add it to `.gitignore` (it should already be there):

```bash
echo "src/environments/pay_to_play/secrets.json" >> .gitignore
```
### Basic Setup

```python
from environments.pay_to_play.pay_to_play_env import PayToPlayEnv, PayToPlayConfig
from atroposlib.envs.base import APIServerConfig

# Configuration
config = PayToPlayConfig(
    testing_mode=True,  # Set to False for real blockchain payments
    initial_budget_usd=0.50,
    min_judges_per_eval=1,
    max_judges_per_eval=2,
    tokenizer_name="gpt2",
    use_wandb=True,
)

server_configs = [
    APIServerConfig(
        model_name="your-model",
        base_url="http://localhost:9001/v1",
        api_key="your-key",
        num_requests_for_eval=64,
    ),
]

# Initialize environment
env = PayToPlayEnv(config, server_configs, testing=True)
```
### Training Loop

```python
async def training_loop():
    await env.setup()
    for step in range(config.total_steps):
        # Get next question
        question = await env.get_next_item()
        # Agent selects agent cards, makes payments, evaluates the
        # response, and returns the training signal
        scored_data, _ = await env.collect_trajectories(question)
        # Log metrics
        await env.wandb_log()
```
## 💰 Economic Model

### Pricing Strategy
- Technical Expert ($0.03): Premium pricing reflects high accuracy and specialized knowledge
- Communication Specialist ($0.02): Mid-tier pricing for clarity and accessibility focus
- Creative Thinker ($0.01): Budget option encouraging creativity and innovation
### Budget Scenarios
| Budget | Strategy | Evaluations Possible |
|---|---|---|
| $0.02 | Use only Creative Thinker | 2 evaluations |
| $0.05 | Mixed strategy possible | 3-5 evaluations |
| $0.10+ | Full flexibility | 10+ evaluations |
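The scenarios above can be sanity-checked with a quick calculation of how many evaluations a budget supports under a single-card strategy; the helper function is illustrative, but the prices match the table:

```python
# Evaluations affordable under a single-card strategy, using Decimal
# to avoid float rounding on dollar amounts.
from decimal import Decimal

CARD_PRICES = {
    "technical_expert": Decimal("0.03"),
    "communication_specialist": Decimal("0.02"),
    "creative_thinker": Decimal("0.01"),
}


def max_evaluations(budget_usd: str, card: str) -> int:
    """Whole number of evaluations the budget covers for one card."""
    return int(Decimal(budget_usd) // CARD_PRICES[card])
```

For example, a $0.02 budget covers exactly two Creative Thinker evaluations, matching the first row of the table.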
## 🧠 Agent Intelligence

The agent makes strategic decisions considering:
- Dynamic Question Analysis: Agent analyzes each question to understand what type of expertise is required
- Agent Card Specialty Matching: Matches question requirements to available agent card specialties without rigid categorization
- Budget Conservation: Early training uses cheaper agent cards, later phases invest in quality
- Performance History: Agent cards with better past performance get preference
- Cost-Effectiveness: Balance between agent card quality and budget constraints for each specific question
### Intelligent Selection Algorithm

```python
async def _agent_select_judges(self, question: str) -> JudgeSelection:
    # 1. Agent analyzes question directly (no pre-categorization)
    # 2. Get agent card performance stats
    judge_stats = self._get_judge_performance_stats()
    # 3. AI agent makes strategic decision with full context
    selection_response = await self.server.chat_completion(
        messages=[
            {"role": "system", "content": "Analyze each question and match to agent card specialties..."},
            {"role": "user", "content": selection_prompt},
        ]
    )
    # 4. Validate and execute selection
    return validated_selection
```
The agent receives detailed information about each agent card's specialties, past performance, and pricing, and uses it to decide which combination of agent cards will provide the best evaluation for each specific question.
## 📈 Metrics and Monitoring

### Weights & Biases Integration

The environment logs comprehensive metrics:

**Budget Metrics:**
- Current balance and total spent
- Budget utilization percentage
- Average cost per evaluation
- Per-agent-card spending breakdown
**Agent Card Performance:**
- Individual agent card scoring history
- Agent satisfaction ratings
- Selection frequency analysis
- Performance consistency tracking
**Agent Decisions:**
- Agent card selection reasoning
- Question analysis by the agent
- Strategic decision patterns
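The budget portion of the payload might be assembled as in this sketch; the tracker attribute names and metric keys follow this README's examples and are assumptions about the actual logging code:

```python
# Sketch: building the budget slice of the wandb metrics dict.
# Attribute names (balance, total_spent) are assumed.
def build_wandb_metrics(tracker, initial_budget_usd: float) -> dict:
    return {
        "budget/current_balance": tracker.balance,
        "budget/total_spent": tracker.total_spent,
        "budget/utilization_percent": 100.0 * tracker.total_spent / initial_budget_usd,
    }
```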
### Example Wandb Dashboard

```text
budget/current_balance: 0.85
budget/total_spent: 0.15
budget/utilization_percent: 15.0
payments/success_rate: 1.0
judge_performance/technical_expert_avg_score: 0.85
selection_frequency/creative_thinker_percent: 60.0
```
## 🔧 Testing

The system includes comprehensive testing for:
- Agent card initialization and metadata
- Agent question analysis (no rigid categorization)
- Budget tracking functionality
- Strategic selection logic
- Pricing strategy analysis
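An illustrative pytest-style check for the budget-tracking behavior listed above; the minimal tracker here is a stand-in (using integer cents to avoid float drift), not the environment's actual test suite:

```python
# Stand-in tracker for demonstrating the overspend check; the real
# BudgetTracker in pay_to_play_env.py may differ.
class MiniTracker:
    def __init__(self, budget_cents: int):
        self.balance = budget_cents

    def can_afford(self, cost_cents: int) -> bool:
        return self.balance >= cost_cents

    def pay(self, cost_cents: int) -> None:
        assert self.can_afford(cost_cents), "insufficient budget"
        self.balance -= cost_cents


def test_budget_blocks_overspend():
    tracker = MiniTracker(budget_cents=2)  # the $0.02 scenario
    tracker.pay(1)                         # two Creative Thinker evaluations
    tracker.pay(1)
    assert not tracker.can_afford(1)       # a third evaluation is rejected
```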
## 🌐 Blockchain Integration

### Wallet Configuration

⚠️ SECURITY IMPORTANT: Never commit `secrets.json` with real private keys to version control!

1. Copy the template:

```bash
cp secrets.json.template secrets.json
```

2. Update `secrets.json` with your wallet credentials

3. Add it to `.gitignore`:

```bash
echo "src/environments/pay_to_play/secrets.json" >> .gitignore
```
### Security Best Practices
- Use separate wallets for each agent card to maintain clear accounting
- Test with small amounts first in testing mode
- Monitor wallet balances regularly
- Use hardware wallets for production deployments
- Never share private keys or commit them to version control
- Consider using environment variables for production deployments
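The environment-variable option in the last bullet might look like this sketch: prefer a key from the process environment and fall back to `secrets.json`. The variable naming scheme is an illustrative assumption:

```python
# Sketch: resolving a card's private key from an environment variable
# first, then from secrets.json. The PAY_TO_PLAY_* naming is assumed.
import json
import os


def resolve_private_key(card: str, secrets_path: str = "secrets.json") -> str:
    env_key = os.environ.get(f"PAY_TO_PLAY_{card.upper()}_PRIVATE_KEY")
    if env_key:
        return env_key
    with open(secrets_path) as f:
        return json.load(f)[card]["private_key"]
```

In production this keeps private keys out of files entirely; in development the file fallback keeps setup simple.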
### Real Payments

Set `testing_mode=False` for real USDC payments on the Base blockchain:
- Requires funded agent wallet with USDC
- Payments are permanent and irreversible
- Monitor gas costs for small transactions
- Ensure all agent card wallets are properly configured
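Two details worth getting right when wiring up real payments: USDC uses 6 decimals (so $0.03 is 30,000 base units), and an ERC-20 transfer is a call to `transfer(address,uint256)`. The sketch below covers only these pure-computation pieces; the signing and sending via web3.py are omitted, and the addresses in the test are placeholders:

```python
# Sketch: USD -> USDC base units and standard ERC-20 transfer calldata.
# Selector a9059cbb is keccak("transfer(address,uint256)")[:4].
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimals, not 18


def usd_to_usdc_units(amount_usd: str) -> int:
    """Convert a dollar amount to USDC base units."""
    return int(Decimal(amount_usd) * 10**USDC_DECIMALS)


def encode_transfer(to_address: str, units: int) -> bytes:
    """ABI-encode transfer(address,uint256): 4-byte selector + two 32-byte words."""
    selector = bytes.fromhex("a9059cbb")
    addr = bytes.fromhex(to_address.removeprefix("0x")).rjust(32, b"\x00")
    amount = units.to_bytes(32, "big")
    return selector + addr + amount
```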
## 🎖️ Key Features

- ✅ Multiple Specialized Agent Cards: Different expertise areas and pricing tiers
- ✅ Intelligent Agent Selection: AI-driven agent card selection with dynamic question analysis
- ✅ Budget Awareness: Real economic constraints drive efficient learning
- ✅ Performance Tracking: Historical data informs future decisions
- ✅ Blockchain Integration: Real USDC payments on the Base network
- ✅ Comprehensive Monitoring: Detailed metrics and decision analysis
- ✅ Fallback Mechanisms: Robust handling of budget constraints
- ✅ Testing Framework: Simulation mode for development and testing
## 🚧 Future Enhancements
- Dynamic Pricing: Agent card prices adjust based on demand and performance
- Agent Card Reputation System: Community-driven agent card quality ratings
- Multi-Round Evaluation: Iterative feedback and improvement cycles
- Agent Card Specialization: More granular specialty categories
- Economic Incentives: Reward mechanisms for high-performing agent cards
## 📋 Requirements
- Python 3.11+
- Web3.py for blockchain interaction
- AtroposLib for base environment
- USDC on Base blockchain (for real payments)
- Weights & Biases account (optional, for monitoring)
## 🤝 Contributing

When adding new agent cards or features:

1. Update the `AgentCardSpecialty` enum for new specialties
2. Add the agent card configuration in `agent_cards_config.py`
3. Update the wallet configuration in `secrets.json`
4. Add appropriate test cases
5. Update the documentation
## 📚 References

This environment builds upon recent advances in reinforcement learning from AI feedback:

- RLAIF vs. RLHF: Lee, H., et al. (2023). "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv preprint arXiv:2309.00267. https://arxiv.org/abs/2309.00267
- Mixture of Judges: Xu, T., et al. (2024). "The Perfect Blend: Redefining RLHF with Mixture of Judges." arXiv preprint arXiv:2409.20370. https://arxiv.org/abs/2409.20370