mirror of
https://github.com/NousResearch/atropos.git
synced 2026-05-02 17:45:50 +00:00
ExamCraft: Working adaptive teacher environment with demo
This commit is contained in:
parent
c189fc3351
commit
0b2baffab5
3 changed files with 798 additions and 0 deletions
116
environments/hack0/examcraft/README.md
Normal file
@@ -0,0 +1,116 @@
# 🎓 ExamCraft: Adaptive LLM Teacher Training Environment

**Hackathon Submission**: Train language models to become adaptive teachers through reinforcement learning.

## 🌟 Overview

ExamCraft trains LLMs to be better teachers by generating adaptive questions, providing explanations, and creating personalized lesson plans. The environment rewards effective teaching strategies and penalizes poor ones.

### Key Features
- **Adaptive Question Generation**: Targets student weak areas automatically
- **Real-time Difficulty Adjustment**: Matches challenge level to student ability
- **Comprehensive Teaching Actions**: Questions, explanations, and lesson plans
- **Sophisticated Reward System**: Multi-factor scoring for teaching effectiveness
- **Student Learning Simulation**: Realistic proficiency progression

## 🚀 Quick Start

### Prerequisites
```bash
pip install -r requirements.txt
```

### Running the Environment

1. **Start the Atropos trajectory API**:
```bash
run-api
```

2. **Run the environment in serve mode**:
```bash
python examcraft_server.py serve --slurm false
```

3. **Generate inference-only rollouts**:
```bash
python examcraft_server.py process --env.data_path_to_save_groups demo_output.jsonl
```

### Training with Atropos
```bash
# Generate SFT data
atropos-sft-gen examcraft_sft.jsonl --tokenizer NousResearch/Hermes-3-Llama-3.1-8B

# Generate DPO data
atropos-dpo-gen examcraft_dpo.jsonl --tokenizer NousResearch/Hermes-3-Llama-3.1-8B
```

## 🎯 Environment Design

### Teaching Actions
- **QUESTION**: Generate adaptive multiple-choice questions
- **EXPLANATION**: Provide detailed concept explanations
- **LESSON_PLAN**: Create personalized study plans

### Reward Components
1. **Correctness Reward**: Base reward for the student answering questions correctly
2. **Targeting Bonus**: Extra points for focusing on weak topics
3. **Difficulty Appropriateness**: Rewards for matching difficulty to ability
4. **Quality Bonus**: Higher scores for detailed, well-structured content
5. **Learning Impact**: Bonuses for explanations that boost understanding
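
The components above sum into a single scalar per question. A condensed sketch of the combination, mirroring `_calculate_question_reward` in `examcraft_server.py` (the standalone function and its argument names are illustrative):

```python
# Condensed sketch of the multi-factor question reward
# (mirrors _calculate_question_reward in examcraft_server.py).
def question_reward(is_correct: bool, topic_proficiency: float,
                    difficulty: str, question_words: int,
                    explanation_words: int, has_objective: bool) -> float:
    base = 1.5 if is_correct else -0.3                      # correctness reward
    targeting = (1.0 - topic_proficiency) * 0.8             # bonus for weak topics
    difficulty_values = {"easy": 0.3, "medium": 0.6, "hard": 0.9}
    target = difficulty_values.get(difficulty, 0.6)
    # Full bonus when the chosen difficulty matches the student's ability.
    difficulty_bonus = max(0.0, 1.0 - abs(target - topic_proficiency) * 2) * 0.5
    quality = min(0.3, question_words / 50 + explanation_words / 100)
    objective = 0.2 if has_objective else 0.0
    return round(base + targeting + difficulty_bonus + quality + objective, 3)


# A correct answer on a weak topic at matched difficulty scores highest:
print(question_reward(True, 0.3, "easy", 25, 80, True))  # → 3.06
```

Note that an incorrect answer on a well-targeted, well-written question can still earn a small positive reward, so the teacher is not pushed toward only trivially easy questions.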

### Student Simulation
- Probabilistic responses based on topic proficiency
- Dynamic learning from good teaching
- Realistic difficulty sensitivity
- Session momentum effects
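
The simulated student answers probabilistically. A condensed sketch, mirroring `_simulate_student_response` in `examcraft_server.py` (the `momentum` argument is a simplification; the server derives it from the last two answers):

```python
import random

# Sketch of the probabilistic student model
# (mirrors _simulate_student_response in examcraft_server.py).
def simulate_answer(proficiency: float, difficulty: str,
                    correct_answer: str, momentum: float = 0.0):
    # Easier questions raise the success probability, harder ones lower it.
    modifier = {"easy": 0.25, "medium": 0.0, "hard": -0.25}.get(difficulty, 0.0)
    # Clamp so the student is never certain to succeed or fail.
    success_prob = max(0.1, min(0.9, proficiency + modifier + momentum))
    if random.random() < success_prob:
        return correct_answer, True
    # Otherwise pick a random wrong option.
    wrong = [opt for opt in "ABCD" if opt != correct_answer]
    return random.choice(wrong), False
```

The 0.1/0.9 clamp keeps every question informative: even a struggling student sometimes succeeds, and even a strong one sometimes fails.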

## 📊 Example Metrics

The environment tracks:
- Student accuracy improvement across topics
- Teaching effectiveness scores
- Adaptive difficulty selection
- Content quality metrics
- Learning progression over time

## 🏆 Why This Matters

Adaptive AI tutoring can revolutionize education by:
- Personalizing learning experiences at scale
- Identifying knowledge gaps automatically
- Providing instant, detailed feedback
- Making quality education globally accessible

## 🔧 Configuration

### Student Profile Format
```json
{
  "student_id": "student001",
  "target_grade": "11th grade",
  "learning_goal": "Master linear algebra basics",
  "current_avg_score": 73,
  "topics": [
    {"name": "vectors", "proficiency": 0.65},
    {"name": "matrices", "proficiency": 0.50}
  ],
  "preferred_learning_style": "visual"
}
```

### Environment Parameters
- `max_questions_per_episode`: 8
- `student_learning_rate`: 0.03
- `enable_lesson_plans`: true
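
These parameters map onto fields of the environment's config object. A minimal sketch of overriding them, using a simplified stand-in for `ExamCraftConfig` (the real class in `examcraft_server.py` also inherits Atropos's `BaseEnvConfig`):

```python
from dataclasses import dataclass

# Simplified stand-in for ExamCraftConfig; the real class also
# inherits fields from Atropos's BaseEnvConfig.
@dataclass
class TeachingConfig:
    max_questions_per_episode: int = 8
    student_learning_rate: float = 0.03
    enable_lesson_plans: bool = True

# Shorter episodes for a quick smoke test:
config = TeachingConfig(max_questions_per_episode=4)
```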

## 📈 Results Preview

After training, teachers learn to:
- Prioritize topics where students struggle most
- Adapt question difficulty based on recent performance
- Generate detailed explanations that boost understanding
- Create comprehensive lesson plans targeting weak areas

Built for the **Nous Research RL Environments Hackathon** 🚀
681
environments/hack0/examcraft/examcraft_server.py
Normal file
@@ -0,0 +1,681 @@
#!/usr/bin/env python3
"""
ExamCraft: Adaptive LLM Teacher Training Environment for Atropos

This environment trains language models to become better teachers by generating
adaptive questions, providing explanations, and creating personalized lesson plans.
"""

import os
import sys
import json
import asyncio
import argparse
from typing import Dict, List, Any, Tuple, Optional
from dataclasses import dataclass

# Try different Atropos import patterns
try:
    from atroposlib.envs import BaseLanguageEnv
    from atroposlib.api import BaseEnvConfig, APIServerConfig
    from atroposlib.utils import parse_args_and_command
    ATROPOS_AVAILABLE = True
    print("✅ Successfully imported Atropos classes")
except ImportError as e1:
    try:
        from atroposlib.envs.base import BaseLanguageEnv
        from atroposlib.api.config import BaseEnvConfig, APIServerConfig
        from atroposlib.utils import parse_args_and_command
        ATROPOS_AVAILABLE = True
        print("✅ Successfully imported Atropos classes (alternative path)")
    except ImportError as e2:
        try:
            # Check what's actually available
            import atroposlib.envs
            import atroposlib.api
            print("Available in atroposlib.envs:", dir(atroposlib.envs))
            print("Available in atroposlib.api:", dir(atroposlib.api))
            raise ImportError("Could not find correct imports")
        except ImportError as e3:
            print(f"Warning: Could not import Atropos classes. Errors: {e1}, {e2}, {e3}")
            print("Running in standalone mode...")
            ATROPOS_AVAILABLE = False

            # Create mock classes for testing
            class BaseLanguageEnv:
                def __init__(self, config):
                    self.config = config

            class BaseEnvConfig:
                def __init__(self, **kwargs):
                    for k, v in kwargs.items():
                        setattr(self, k, v)

            class APIServerConfig:
                def __init__(self, **kwargs):
                    for k, v in kwargs.items():
                        setattr(self, k, v)


@dataclass
class ExamCraftConfig(BaseEnvConfig):
    """Configuration for ExamCraft environment."""
    profile_path: str = "example_profile.json"
    max_questions_per_episode: int = 8
    student_learning_rate: float = 0.03
    enable_lesson_plans: bool = True
    difficulty_adaptation_rate: float = 0.1


class ExamCraftEnv(BaseLanguageEnv):
    """
    ExamCraft: Trains LLMs to be adaptive teachers.

    Features:
    - Adaptive question generation based on student proficiency
    - Multi-topic curriculum support
    - Real-time difficulty adjustment
    - Comprehensive reward system for teaching effectiveness
    - Lesson plan generation and evaluation
    """

    def __init__(self, config: ExamCraftConfig):
        super().__init__(config)
        self.config = config

        # Load student profile
        if os.path.exists(config.profile_path):
            with open(config.profile_path, 'r') as file:
                self.profile = json.load(file)
        else:
            # Default profile if file not found
            self.profile = self._create_default_profile()

        # Initialize metrics
        self.reset_metrics()

        # Track teaching session
        self.session_history = []
        self.current_episode = 0

    def _create_default_profile(self) -> Dict[str, Any]:
        """Create a default student profile if none exists."""
        return {
            "student_id": "default_001",
            "target_grade": "11th grade",
            "learning_goal": "Master foundational mathematics",
            "current_avg_score": 65,
            "topics": [
                {"name": "algebra", "proficiency": 0.4},
                {"name": "geometry", "proficiency": 0.6},
                {"name": "statistics", "proficiency": 0.3},
                {"name": "calculus", "proficiency": 0.2}
            ],
            "preferred_learning_style": "visual"
        }

    def reset_metrics(self):
        """Reset student metrics for a new episode."""
        self.student_metrics = {
            "overall_accuracy": self.profile.get("current_avg_score", 65) / 100,
            "topic_accuracies": {},
            "difficulty_preferences": {}
        }

        # Initialize from profile
        for topic in self.profile.get("topics", []):
            topic_name = topic.get("name")
            proficiency = topic.get("proficiency", 0.5)
            self.student_metrics["topic_accuracies"][topic_name] = proficiency
            self.student_metrics["difficulty_preferences"][topic_name] = "medium"

        # Reset session tracking
        self.question_count = 0
        self.correct_count = 0
        self.session_history = []

    def get_system_message(self) -> str:
        """System message defining the teacher's role and capabilities."""
        topics_str = "\n".join([
            f"- {topic['name']}: {topic['proficiency']:.1%} proficiency"
            for topic in self.profile.get('topics', [])
        ])

        return f"""You are ExamCraft, an adaptive AI teacher specializing in {self.profile.get('target_grade', 'high school')} education.

STUDENT PROFILE:
{topics_str}
Learning Goal: {self.profile.get('learning_goal', 'Academic improvement')}
Preferred Style: {self.profile.get('preferred_learning_style', 'mixed')}
Current Average: {self.profile.get('current_avg_score', 65)}%

TEACHING CAPABILITIES:
1. QUESTION: Generate adaptive multiple-choice questions
2. EXPLANATION: Provide detailed explanations for concepts
3. LESSON_PLAN: Create personalized study plans

RESPONSE FORMAT - Return valid JSON only:
{{
    "action_type": "QUESTION|EXPLANATION|LESSON_PLAN",
    "topic": "topic_name",
    "difficulty": "easy|medium|hard",
    "content": {{
        "question": "Clear, engaging question text",
        "options": {{
            "A": "First option",
            "B": "Second option",
            "C": "Third option",
            "D": "Fourth option"
        }},
        "correct_answer": "A|B|C|D",
        "explanation": "Detailed explanation for the correct answer and why others are wrong",
        "learning_objective": "What this question teaches"
    }}
}}

TEACHING STRATEGY:
- Prioritize topics with low proficiency scores
- Adapt difficulty based on recent performance
- Provide detailed explanations that build understanding
- Focus on the student's learning goal and preferred style"""

    def get_user_message(self) -> str:
        """Generate context-aware prompt for the teacher."""
        if not self.session_history:
            return f"""NEW TEACHING SESSION STARTED

Student needs help with: {self.profile.get('learning_goal', 'general concepts')}

Current topic proficiencies:
{json.dumps(self.student_metrics['topic_accuracies'], indent=2)}

Please begin with an appropriate question for this student. Target their weakest areas while maintaining appropriate challenge level."""

        # Analyze recent performance
        recent_correct = sum(1 for item in self.session_history[-3:]
                             if item.get('was_correct', False))
        recent_total = len(self.session_history[-3:])

        last_interaction = self.session_history[-1]
        performance_trend = "improving" if recent_correct > recent_total / 2 else "struggling"

        # Identify struggling topics
        weak_topics = [topic for topic, acc in self.student_metrics['topic_accuracies'].items()
                       if acc < 0.5]

        return f"""TEACHING SESSION UPDATE

Recent Performance: {recent_correct}/{recent_total} correct - student is {performance_trend}
Last Question: {last_interaction.get('topic', 'unknown')} ({last_interaction.get('difficulty', 'unknown')}) - {'✓' if last_interaction.get('was_correct') else '✗'}

Current Status:
- Questions asked: {self.question_count}
- Overall accuracy: {self.student_metrics['overall_accuracy']:.1%}
- Topics needing attention: {', '.join(weak_topics) if weak_topics else 'None'}

Updated proficiencies:
{json.dumps(self.student_metrics['topic_accuracies'], indent=2)}

Please select your next teaching action based on this student's progress."""

    async def step(self, response: str) -> Tuple[str, Dict[str, Any]]:
        """Process teacher's response and simulate student interaction."""
        try:
            # Clean and parse JSON response
            response = response.strip()
            if response.startswith("```json"):
                response = response[7:]
            if response.endswith("```"):
                response = response[:-3]

            teacher_action = json.loads(response.strip())

            # Validate required fields
            if not all(key in teacher_action for key in ["action_type", "content"]):
                return await self._handle_error("Missing required fields in response")

            action_type = teacher_action["action_type"].upper()

            if action_type == "QUESTION":
                return await self._handle_question(teacher_action)
            elif action_type == "EXPLANATION":
                return await self._handle_explanation(teacher_action)
            elif action_type == "LESSON_PLAN":
                return await self._handle_lesson_plan(teacher_action)
            else:
                return await self._handle_error(f"Invalid action_type: {action_type}")

        except json.JSONDecodeError as e:
            return await self._handle_error(f"JSON parsing failed: {str(e)}")
        except Exception as e:
            return await self._handle_error(f"Unexpected error: {str(e)}")

    async def _handle_question(self, action: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
        """Process a question from the teacher."""
        content = action["content"]
        topic = action.get("topic", "general")
        difficulty = action.get("difficulty", "medium")

        # Validate question format
        required_fields = ["question", "options", "correct_answer"]
        if not all(field in content for field in required_fields):
            return await self._handle_error("Question missing required fields")

        if len(content["options"]) != 4:
            return await self._handle_error("Question must have exactly 4 options")

        # Simulate student response
        student_answer, is_correct = self._simulate_student_response(topic, difficulty, content["correct_answer"])

        # Update metrics
        self.question_count += 1
        if is_correct:
            self.correct_count += 1

        # Apply learning effect
        self._update_student_learning(topic, difficulty, is_correct)

        # Calculate reward
        score = self._calculate_question_reward(topic, difficulty, is_correct, content)

        # Record interaction
        interaction = {
            "type": "question",
            "topic": topic,
            "difficulty": difficulty,
            "question": content["question"],
            "student_answer": student_answer,
            "correct_answer": content["correct_answer"],
            "was_correct": is_correct,
            "score": score,
            "timestamp": self.question_count
        }
        self.session_history.append(interaction)

        # Check episode completion
        episode_done = self.question_count >= self.config.max_questions_per_episode

        info = {
            "score": score,
            "episode_done": episode_done,
            "info": interaction,
            "student_metrics": dict(self.student_metrics)
        }

        return self.get_user_message(), info

    async def _handle_explanation(self, action: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
        """Process an explanation from the teacher."""
        content = action["content"]
        explanation = content.get("explanation", "")
        topic = action.get("topic", "general")

        # Score explanation quality
        score = self._score_explanation(explanation, topic)

        # Apply learning boost from good explanations
        if len(self.session_history) > 0:
            last_topic = self.session_history[-1].get("topic")
            if last_topic and score > 0.7:
                current_prof = self.student_metrics["topic_accuracies"].get(last_topic, 0.5)
                boost = min(0.1, score * 0.05)  # Max 0.1 boost from excellent explanation
                self.student_metrics["topic_accuracies"][last_topic] = min(1.0, current_prof + boost)

        # Record interaction
        interaction = {
            "type": "explanation",
            "topic": topic,
            "explanation_length": len(explanation.split()),
            "score": score
        }
        self.session_history.append(interaction)

        info = {
            "score": score,
            "episode_done": False,
            "info": interaction
        }

        return self.get_user_message(), info

    async def _handle_lesson_plan(self, action: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
        """Process a lesson plan from the teacher."""
        content = action["content"]

        # Score lesson plan quality
        score = self._score_lesson_plan(content)

        # Record final interaction
        interaction = {
            "type": "lesson_plan",
            "score": score,
            "final_performance": dict(self.student_metrics),
            "total_questions": self.question_count,
            "overall_accuracy": self.correct_count / max(1, self.question_count)
        }
        self.session_history.append(interaction)

        info = {
            "score": score,
            "episode_done": True,  # Always end after lesson plan
            "info": interaction,
            "session_summary": self._generate_session_summary()
        }

        return "", info  # Empty string indicates episode completion

    async def _handle_error(self, error_msg: str) -> Tuple[str, Dict[str, Any]]:
        """Handle errors in teacher responses."""
        score = -1.0
        info = {
            "score": score,
            "episode_done": False,
            "info": {"error": error_msg, "type": "error"}
        }
        return self.get_user_message(), info

    def _simulate_student_response(self, topic: str, difficulty: str, correct_answer: str) -> Tuple[str, bool]:
        """Simulate student's answer based on proficiency and difficulty."""
        import random

        # Get base proficiency
        base_proficiency = self.student_metrics["topic_accuracies"].get(topic, 0.5)

        # Adjust for difficulty
        difficulty_modifiers = {
            "easy": 0.25,
            "medium": 0.0,
            "hard": -0.25
        }
        modifier = difficulty_modifiers.get(difficulty, 0.0)

        # Calculate success probability
        success_prob = max(0.1, min(0.9, base_proficiency + modifier))

        # Add some session momentum (recent success increases confidence)
        if len(self.session_history) >= 2:
            recent_correct = sum(1 for item in self.session_history[-2:]
                                 if item.get('was_correct', False))
            momentum = (recent_correct - 1) * 0.1  # ±0.1 based on recent performance
            success_prob = max(0.1, min(0.9, success_prob + momentum))

        # Determine correct/incorrect
        is_correct = random.random() < success_prob

        if is_correct:
            return correct_answer, True
        else:
            # Choose random incorrect answer
            options = ["A", "B", "C", "D"]
            if correct_answer in options:
                options.remove(correct_answer)
            return random.choice(options), False

    def _update_student_learning(self, topic: str, difficulty: str, is_correct: bool):
        """Update student metrics based on teaching interaction."""
        # Update overall accuracy
        self.student_metrics["overall_accuracy"] = self.correct_count / self.question_count

        # Update topic-specific learning
        current_prof = self.student_metrics["topic_accuracies"].get(topic, 0.5)

        if is_correct:
            # Small improvement from correct answers
            improvement = self.config.student_learning_rate * 0.5
        else:
            # Larger improvement from learning from mistakes (assumes good explanations follow)
            improvement = self.config.student_learning_rate

        # Difficulty affects learning rate
        difficulty_multipliers = {"easy": 0.5, "medium": 1.0, "hard": 1.5}
        multiplier = difficulty_multipliers.get(difficulty, 1.0)

        new_prof = current_prof + (improvement * multiplier)
        self.student_metrics["topic_accuracies"][topic] = max(0.0, min(1.0, new_prof))

    def _calculate_question_reward(self, topic: str, difficulty: str, is_correct: bool, content: Dict[str, Any]) -> float:
        """Calculate reward for a question based on multiple factors."""
        # Base reward for correctness
        base_reward = 1.5 if is_correct else -0.3

        # Topic targeting bonus (higher reward for focusing on weak areas)
        topic_prof = self.student_metrics["topic_accuracies"].get(topic, 0.5)
        targeting_bonus = (1.0 - topic_prof) * 0.8  # Up to 0.8 bonus for weakest topics

        # Difficulty appropriateness (reward for matching difficulty to ability)
        difficulty_values = {"easy": 0.3, "medium": 0.6, "hard": 0.9}
        target_difficulty = difficulty_values.get(difficulty, 0.6)
        difficulty_appropriateness = 1.0 - abs(target_difficulty - topic_prof) * 2
        difficulty_bonus = max(0.0, difficulty_appropriateness) * 0.5

        # Question quality bonus
        question_text = content.get("question", "")
        explanation = content.get("explanation", "")
        quality_bonus = min(0.3, len(question_text.split()) / 50 + len(explanation.split()) / 100)

        # Learning objective bonus
        if "learning_objective" in content and len(content["learning_objective"]) > 10:
            objective_bonus = 0.2
        else:
            objective_bonus = 0.0

        total_reward = base_reward + targeting_bonus + difficulty_bonus + quality_bonus + objective_bonus
        return round(total_reward, 3)

    def _score_explanation(self, explanation: str, topic: str) -> float:
        """Score the quality of an explanation."""
        if not explanation:
            return 0.0

        words = explanation.split()

        # Length scoring (optimal around 50-150 words)
        word_count = len(words)
        if word_count < 20:
            length_score = word_count / 20 * 0.5
        elif word_count <= 150:
            length_score = 0.5 + (word_count - 20) / 130 * 0.5
        else:
            length_score = max(0.3, 1.0 - (word_count - 150) / 200)

        # Content quality indicators
        quality_indicators = [
            "because", "therefore", "however", "for example", "this means",
            "in other words", "specifically", "notice that", "remember"
        ]
        quality_score = min(0.5, sum(0.1 for indicator in quality_indicators
                                     if indicator in explanation.lower()))

        return min(1.0, length_score + quality_score)

    def _score_lesson_plan(self, content: Dict[str, Any]) -> float:
        """Score the quality of a lesson plan."""
        base_score = 1.0

        # Check for key lesson plan elements
        if "prioritized_topics" in content:
            priorities = content["prioritized_topics"]
            if isinstance(priorities, list) and len(priorities) > 0:
                # Bonus for prioritizing weak topics
                weak_topics = [topic for topic, acc in self.student_metrics["topic_accuracies"].items()
                               if acc < 0.5]
                weak_coverage = sum(1 for topic in priorities if topic in weak_topics)
                base_score += weak_coverage * 0.3

        if "study_activities" in content:
            activities = content["study_activities"]
            if isinstance(activities, (list, dict)) and len(activities) > 0:
                base_score += 0.4

        if "timeline" in content:
            base_score += 0.3

        if "learning_objectives" in content:
            objectives = content["learning_objectives"]
            if isinstance(objectives, list) and len(objectives) > 0:
                base_score += 0.5

        return min(2.5, base_score)

    def _generate_session_summary(self) -> Dict[str, Any]:
        """Generate a summary of the teaching session."""
        topics_covered = {}
        for interaction in self.session_history:
            if interaction.get("type") == "question":
                topic = interaction.get("topic")
                if topic:
                    if topic not in topics_covered:
                        topics_covered[topic] = {"total": 0, "correct": 0}
                    topics_covered[topic]["total"] += 1
                    if interaction.get("was_correct"):
                        topics_covered[topic]["correct"] += 1

        # Calculate topic accuracies for this session
        topic_accuracies = {}
        for topic, stats in topics_covered.items():
            topic_accuracies[topic] = stats["correct"] / stats["total"] if stats["total"] > 0 else 0

        return {
            "total_questions": self.question_count,
            "overall_accuracy": self.correct_count / max(1, self.question_count),
            "topics_covered": list(topics_covered.keys()),
            "topic_accuracies": topic_accuracies,
            "improvement": {
                topic: self.student_metrics["topic_accuracies"][topic] -
                       next((t["proficiency"] for t in self.profile["topics"] if t["name"] == topic), 0.5)
                for topic in topics_covered.keys()
            },
            "final_proficiencies": dict(self.student_metrics["topic_accuracies"])
        }

    def run_standalone_demo(self, output_file: str = "demo_output.jsonl"):
        """Run a standalone demo without full Atropos integration."""
        print("🎓 ExamCraft: Standalone Demo Mode")
        print("Simulating teacher-student interactions...")

        # Reset environment
        self.reset_metrics()

        # Simulate a teaching session
        outputs = []

        for episode in range(3):  # Run 3 episodes
            print(f"\n--- Episode {episode + 1} ---")
            episode_data = {
                "episode": episode + 1,
                "initial_state": dict(self.student_metrics),
                "interactions": []
            }

            # Simulate questions
            for question_num in range(self.config.max_questions_per_episode):
                # Create a mock teacher response (since we don't have an LLM)
                weakest_topic = min(self.student_metrics["topic_accuracies"].items(),
                                    key=lambda x: x[1])[0]

                mock_response = {
                    "action_type": "QUESTION",
                    "topic": weakest_topic,
                    "difficulty": "medium",
                    "content": {
                        "question": f"Sample question about {weakest_topic}",
                        "options": {
                            "A": "Option A",
                            "B": "Option B",
                            "C": "Option C",
                            "D": "Option D"
                        },
                        "correct_answer": "B",
                        "explanation": f"This is a detailed explanation about {weakest_topic}.",
                        "learning_objective": f"Understand key concepts in {weakest_topic}"
                    }
                }

                # Process the mock response
                try:
                    result = asyncio.run(self.step(json.dumps(mock_response)))
                    next_prompt, info = result
                    episode_data["interactions"].append({
                        "question_num": question_num + 1,
                        "teacher_action": mock_response,
                        "result": info
                    })
                    print(f"  Q{question_num + 1}: {info['info'].get('topic')} - {'✓' if info['info'].get('was_correct') else '✗'} (Score: {info['score']:.2f})")
                except Exception as e:
                    print(f"  Error processing question {question_num + 1}: {e}")
                    break

                if info.get("episode_done", False):
                    break

            episode_data["final_state"] = dict(self.student_metrics)
            outputs.append(episode_data)

        # Save outputs
        with open(output_file, 'w') as f:
            for output in outputs:
                f.write(json.dumps(output) + '\n')

        print(f"\n✅ Demo completed! Results saved to {output_file}")
        print("Final student proficiencies:")
        for topic, prof in self.student_metrics["topic_accuracies"].items():
            print(f"  {topic}: {prof:.1%}")

    @classmethod
    def config_init(cls) -> Tuple[ExamCraftConfig, List[APIServerConfig]]:
        """Initialize configuration for ExamCraft environment."""
        env_config = ExamCraftConfig(
            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
            group_size=6,
            use_wandb=True,
            rollout_server_url="http://localhost:8000",
            total_steps=150,
            batch_size=12,
            steps_per_eval=30,
            max_token_length=3072,
            wandb_name="examcraft-adaptive-teacher",
            profile_path="example_profile.json",
            max_questions_per_episode=8,
            student_learning_rate=0.03,
            enable_lesson_plans=True,
        )

        # API server configuration
        server_configs = [
            APIServerConfig(
                model_name="hermes-3-405b-instruct",
                base_url="https://api.nousresearch.com/v1",
                api_key=os.environ.get("NOUS_API_KEY"),
                num_requests_for_eval=64,
            ),
        ]

        return env_config, server_configs


def main():
    """Main entry point for ExamCraft environment."""
    parser = argparse.ArgumentParser(description="ExamCraft: Adaptive LLM Teacher Training Environment")
    parser.add_argument('command', choices=['process', 'serve', 'demo'],
                        help='Command to run (demo for standalone mode)')
    parser.add_argument('--env.data_path_to_save_groups', type=str, default='demo_output.jsonl',
                        help='Path to save demo output')

    args = parser.parse_args()

    if args.command == 'demo':
        # Run standalone demo
        config = ExamCraftConfig()
        env = ExamCraftEnv(config)
        output_file = getattr(args, 'env.data_path_to_save_groups', 'demo_output.jsonl')
        env.run_standalone_demo(output_file)
    elif ATROPOS_AVAILABLE:
        # Run with Atropos
        parse_args_and_command(ExamCraftEnv)
    else:
        print("Atropos not available. Use 'python examcraft_server.py demo' for standalone mode.")
        sys.exit(1)


if __name__ == "__main__":
    main()
1
environments/hack0/examcraft/requirements.txt
Normal file
@@ -0,0 +1 @@
atroposlib>=0.2.1