mirror of https://github.com/NousResearch/atropos.git
synced 2026-04-22 16:48:57 +00:00

Integrate aniemerg wikipedia (#143)

* initial commit
* initial draft of wikipedia article creation environment
* add openai for rollouts, update requirements, create script to run, etc.
* add configuration, add debugging, fix tool calls, prevent wikipedia access
* now creates html file
* fix output for html page
* check in Claude plan
* fixed formatting and other issues
* add zip file
* update README
* linting, moved to community folder
* linting

Co-authored-by: Allan Niemerg <niemerg@gmail.com>

This commit is contained in: parent b774e97215, commit f21154ff49
14 changed files with 4480 additions and 0 deletions
# Wikipedia Article Creator: OpenAI Integration and Environment Setup Plan

## 1. Overview

This document outlines the plan for integrating OpenAI models and proper environment configuration with the existing WikipediaArticleCreatorEnv implementation. The goal is a robust setup that can leverage OpenAI's models (such as GPT-4 and GPT-3.5-turbo) while keeping API credentials secure through environment variables.

## 2. Key Components and Changes

### 2.1 Environment Variables (.env) Setup

We need to properly load and manage API credentials from a `.env` file:

```python
# At the top of wikipedia_article_creator.py
import os

from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Later, access environment variables
openai_api_key = os.environ.get("OPENAI_API_KEY")
tavily_api_key = os.environ.get("TAVILY_API_KEY")
```
### 2.2 OpenAI API Integration

Update the `config_init` method in the `WikipediaArticleCreatorEnv` class to support OpenAI models:

```python
@classmethod
def config_init(cls) -> Tuple[WikipediaArticleCreatorConfig, List[APIServerConfig]]:
    env_config = WikipediaArticleCreatorConfig(
        tokenizer_name="gpt-4-turbo",  # Use any OpenAI model name or a different tokenizer for local processing
        group_size=4,
        use_wandb=True,
        rollout_server_url="http://localhost:8000",
        total_steps=1000,
        batch_size=128,
        steps_per_eval=20,
        max_token_length=1024 * 16,
        inference_weight=1.0,
        wandb_name="wikipedia_article_creator",
        eval_handling=EvalHandlingEnum.LIMIT_TRAIN,
        eval_limit_ratio=0.1,
        max_steps=10,
        temperature=0.7,
        thinking_active=True,
        eval_topics=5,
        tool_timeout=15.0,
        tavily_api_key=None,  # Falls back to the environment variable
        min_article_sections=3,
        max_article_tokens=2048,
        topics_file="topics.json",
        logging_active=True,
    )

    # Configure the OpenAI server
    server_configs = [
        APIServerConfig(
            model_name="gpt-4o",  # or any other OpenAI model
            base_url=None,  # Use the default OpenAI base URL
            api_key=os.environ.get("OPENAI_API_KEY"),
            num_max_requests_at_once=4,
            num_requests_for_eval=16,
        ),
    ]

    return env_config, server_configs
```
### 2.3 OpenAI Token Length Handling

Since OpenAI models handle token length limits differently, we need to make some adjustments to ensure proper tokenization and context management:

```python
async def _get_model_response(self, messages: List[Dict]) -> str:
    """Get a response from the model for the current conversation state"""
    try:
        # For OpenAI models, we pass the messages directly
        if self.server.config.base_url is None or "openai" in (self.server.config.base_url or ""):
            completion = await self.server.chat_completion(
                messages=messages,
                temperature=self.config.temperature,
                max_tokens=min(4096, self.config.max_token_length),  # Ensure within OpenAI limits
            )
            # Extract the text from the response
            return completion.choices[0].message.content
        else:
            # For non-OpenAI models (e.g., local models via vLLM)
            prompt = self.tokenizer.apply_chat_template(messages, tokenize=False)
            completion = await self.server.completion(
                prompt=prompt,
                n=1,
                max_tokens=self.config.max_token_length,
                temperature=self.config.temperature,
            )
            return completion.choices[0].text
    except Exception as e:
        logger.error(f"Error getting model response: {e}")
        return ""
```
### 2.4 Testing Runner Script

Create a simple runner script to test the environment with OpenAI models:

```python
#!/usr/bin/env python3
"""
Test runner for Wikipedia Article Creator with OpenAI models
"""
import asyncio
import logging
import os

from dotenv import load_dotenv

from environments.hack0.wikipedia.wikipedia_article_creator import WikipediaArticleCreatorEnv

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()


async def main():
    """Run the environment with a specified model and topic"""
    # Create and initialize the environment
    env_config, server_configs = WikipediaArticleCreatorEnv.config_init()

    # Override with command-line arguments if needed
    # env_config.max_steps = 5
    # server_configs[0].model_name = "gpt-3.5-turbo"

    # Initialize the environment
    env = WikipediaArticleCreatorEnv(env_config, server_configs, slurm=False, testing=True)
    await env.setup()

    # Run a single episode with a specific topic
    topic = "Climate change in Antarctica"
    episode_id = 1

    logger.info(f"Starting research on topic: {topic}")
    episode = env._get_or_create_episode(episode_id, topic)

    # Run until terminal state
    while not episode.is_terminal:
        is_terminal, step_data = await env._next_step(episode)
        response = step_data.get("response", "")
        tool_calls = step_data.get("tool_calls", [])

        logger.info(f"Step {episode.steps_taken} completed")
        logger.info(f"Tool calls: {len(tool_calls)}")

        if is_terminal and episode.final_article:
            logger.info("Article generated successfully")
            logger.info(f"Article length: {len(episode.final_article)} characters")

            # Save the article to a file
            with open(f"article_{topic.replace(' ', '_')}.md", "w") as f:
                f.write(episode.final_article)

            # Evaluate article quality
            quality_metrics = env._assess_article_quality(
                episode.final_article, episode.research_facts
            )
            logger.info(f"Article quality metrics: {quality_metrics}")
        elif is_terminal:
            logger.info("Episode terminated without article")


if __name__ == "__main__":
    asyncio.run(main())
```
## 3. Required Environment Setup

### 3.1 .env File Structure

Create a `.env` file in the project root with the necessary API keys:

```
# OpenAI API Key for model access
OPENAI_API_KEY=sk-...your-openai-key...

# Tavily API Key for web search and content extraction
TAVILY_API_KEY=tvly-...your-tavily-key...

# Optional WandB configuration
WANDB_API_KEY=...your-wandb-key...
WANDB_PROJECT=wikipedia_article_creator
```
### 3.2 Dependencies

Ensure the following packages are installed:

```
python-dotenv>=1.0.0
openai>=1.10.0
tavily-python>=0.2.2
wandb>=0.16.0
```
## 4. Implementation Plan

### 4.1 Phase 1: Environment Setup and .env Integration

1. Add the python-dotenv package to the project dependencies
2. Create a `.env.template` file with placeholder values
3. Update the README to explain the required API keys and environment setup
4. Modify `wikipedia_article_creator.py` to load environment variables
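A startup check can make a missing key fail fast instead of surfacing as an opaque API error later. Below is a minimal sketch; the `require_env` helper is hypothetical (not part of the existing codebase), and it assumes `load_dotenv()` has already run at import time:

```python
import os

# Required keys for this environment (from the plan above)
REQUIRED_KEYS = ["OPENAI_API_KEY", "TAVILY_API_KEY"]


def require_env(keys=REQUIRED_KEYS):
    """Return the named environment variables, raising if any are unset.

    Hypothetical helper: assumes load_dotenv() has already been called.
    """
    missing = [k for k in keys if not os.environ.get(k)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {k: os.environ[k] for k in keys}
```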
### 4.2 Phase 2: OpenAI Integration

1. Update the `config_init` method to use OpenAI models
2. Modify the `_get_model_response` method to handle OpenAI models properly
3. Adjust tokenization and context length handling for OpenAI models
4. Test with smaller models (GPT-3.5-turbo) before moving to GPT-4

### 4.3 Phase 3: Testing and Validation

1. Create a test runner script to validate the environment with OpenAI models
2. Test with a variety of topics to ensure the environment works correctly
3. Monitor API usage and optimize requests to minimize token usage
4. Validate article quality with different models and parameters
### 4.4 Phase 4: Performance Optimization

1. Implement caching for API responses to reduce duplicate requests
2. Optimize API call batching for evaluation runs
3. Add error handling and retry logic for API rate limits
4. Document best practices for working with OpenAI models in this environment
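The response caching in step 1 could look roughly like the sketch below. This is an illustrative in-memory version (a real implementation might want an on-disk store with TTLs); `cached_chat_completion` and its parameters are names invented for this sketch, not existing project API:

```python
import hashlib
import json

# In-memory cache: request fingerprint -> completion
_response_cache = {}


def cache_key(model, messages, temperature):
    """Build a stable fingerprint from the request parameters."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


async def cached_chat_completion(server, model, messages, temperature):
    """Return the cached response when an identical request was seen before."""
    key = cache_key(model, messages, temperature)
    if key not in _response_cache:
        _response_cache[key] = await server.chat_completion(
            messages=messages, temperature=temperature
        )
    return _response_cache[key]
```

During group rollouts with `group_size=4`, identical prompts at the same temperature would otherwise trigger four separate billable calls.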
## 5. Potential Challenges and Solutions

### 5.1 API Rate Limits

**Challenge:** The OpenAI API enforces rate limits that could slow research throughput.

**Solution:** Implement exponential backoff retry logic and request batching.
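The backoff logic can be sketched as a small async wrapper. The `retryable` default below is a placeholder: in practice it would be narrowed to the client's rate-limit error type (e.g. `openai.RateLimitError`), and the delay parameters are illustrative:

```python
import asyncio
import random


async def with_backoff(call, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Retry an async API call with exponential backoff and jitter.

    `call` is a zero-argument coroutine factory; `retryable` should be
    narrowed to the client's rate-limit exception in real use.
    """
    for attempt in range(max_retries):
        try:
            return await call()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # Sleep base, 2x base, 4x base, ... with jitter to avoid
            # many workers retrying in lockstep
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```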
### 5.2 Token Context Length

**Challenge:** Different OpenAI models have different context length limits.

**Solution:** Implement model-specific context management to stay within limits.
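One shape this context management could take is trimming the oldest turns until the conversation fits the model's window. Everything here is a sketch: the limit values are illustrative (check the provider's documentation for current figures), and `count_tokens` is an injected callable you would build on a real tokenizer such as tiktoken:

```python
def trim_messages(messages, model, count_tokens, reserve=4096, limits=None):
    """Drop the oldest non-system messages until the conversation fits.

    `count_tokens` maps a message dict to a token count; `reserve`
    leaves room for the completion. Limit values are illustrative.
    """
    limits = limits or {"gpt-4o": 128_000, "gpt-3.5-turbo": 16_385}
    budget = limits.get(model, 8_192) - reserve
    trimmed = list(messages)
    # Keep the system message (index 0); drop the oldest turns first
    while len(trimmed) > 1 and sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(1)
    return trimmed
```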
### 5.3 Cost Management

**Challenge:** API usage costs can accumulate quickly with multiple research steps.

**Solution:** Implement cost tracking, efficient caching, and configurable limits.
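A minimal cost tracker could accumulate per-call token usage against a price table. The prices below are purely illustrative (real prices change; consult the provider's pricing page), and `CostTracker` is a name invented for this sketch:

```python
# Illustrative per-token prices in USD; not authoritative
PRICES = {
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


class CostTracker:
    """Accumulate token usage and estimated spend across an episode."""

    def __init__(self, prices=PRICES):
        self.prices = prices
        self.total_cost = 0.0

    def record(self, model, prompt_tokens, completion_tokens):
        """Add one API call's usage; returns the estimated call cost."""
        p = self.prices.get(model, {"input": 0.0, "output": 0.0})
        cost = prompt_tokens * p["input"] + completion_tokens * p["output"]
        self.total_cost += cost
        return cost
```

A configurable per-episode budget check against `total_cost` would then give the "configurable limits" part of the solution.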
### 5.4 Model Differences

**Challenge:** Different models may have different capabilities and response formats.

**Solution:** Add model-specific parsing and prompt adjustments.
## 6. Testing Metrics

1. **Research Efficiency:**
   - Average tool calls per completed article
   - Percentage of relevant vs. irrelevant searches
   - Time to completion

2. **Article Quality:**
   - Structure compliance (section count, references)
   - Content relevance to topic
   - Factual accuracy
   - Information density

3. **API Usage:**
   - Total tokens per article
   - Cost per article
   - Rate limit errors encountered
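The metrics above could be collected in a single per-episode record so runs are comparable across models and parameters. This dataclass is a sketch, not existing project code:

```python
from dataclasses import dataclass


@dataclass
class EpisodeMetrics:
    """Per-episode record of the testing metrics listed above (sketch)."""

    tool_calls: int = 0
    relevant_searches: int = 0
    total_searches: int = 0
    seconds_to_completion: float = 0.0
    section_count: int = 0
    total_tokens: int = 0
    cost_usd: float = 0.0
    rate_limit_errors: int = 0

    @property
    def search_relevance(self) -> float:
        """Fraction of searches judged relevant (0.0 when none were made)."""
        return self.relevant_searches / self.total_searches if self.total_searches else 0.0
```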
## 7. Next Steps

After successful implementation of OpenAI integration and environment setup:

1. Optimize prompts specifically for OpenAI models
2. Implement comparative evaluation across different models
3. Add support for multi-model fallback (e.g., start with GPT-3.5, escalate to GPT-4 for complex topics)
4. Create a comprehensive benchmark suite for Wikipedia article creation
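The multi-model fallback in step 3 could be sketched as below. This assumes the server accepts a per-call model override, which may not match the actual APIServerConfig interface, and `is_good_enough` stands in for a quality check (e.g. one built on `_assess_article_quality`):

```python
async def generate_with_fallback(server, messages,
                                 models=("gpt-3.5-turbo", "gpt-4o"),
                                 is_good_enough=None):
    """Try cheaper models first, escalating when the draft falls short.

    Sketch only: assumes a per-call `model` override on chat_completion.
    By default any non-empty response passes the quality check.
    """
    is_good_enough = is_good_enough or (lambda text: bool(text.strip()))
    response = ""
    for model in models:
        completion = await server.chat_completion(messages=messages, model=model)
        response = completion.choices[0].message.content
        if is_good_enough(response):
            return model, response
    # Fall through with the last (strongest) model's answer
    return models[-1], response
```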