atropos/environments/community/wikipedia_research/multi_step_rollout_plan_openai.md
shannonsands f21154ff49
Integrate aniemerg wikipedia (#143)
* initial commit

* initial draft of wikipedia article creation environment

* add openai for rollouts, update requirements, create script to run, etc.

* add configuration, add debugging, fix tool calls, prevent wikipedia access

* now creates html file

* fix output for html page

* check in Claude plan

* fixed formatting and other issues

* add zip file

* update README

* linting, moved to community folder

* linting

* linting

* linting

* linting

---------

Co-authored-by: Allan Niemerg <niemerg@gmail.com>
2025-05-28 10:22:11 +10:00

9.3 KiB

Wikipedia Article Creator: OpenAI Integration and Environment Setup Plan

1. Overview

This document outlines the plan for integrating OpenAI models and proper environment configuration with the existing WikipediaArticleCreatorEnv implementation. The goal is to create a robust setup that can leverage OpenAI's models (like GPT-4, GPT-3.5-turbo) while maintaining a secure configuration for API credentials through environment variables.

2. Key Components and Changes

2.1 Environment Variables (.env) Setup

We need to properly load and manage API credentials from a .env file:

# At the top of wikipedia_article_creator.py
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Later, access environment variables
openai_api_key = os.environ.get("OPENAI_API_KEY")
tavily_api_key = os.environ.get("TAVILY_API_KEY")

2.2 OpenAI API Integration

Update the config_init method in the WikipediaArticleCreatorEnv class to support OpenAI models:

@classmethod
def config_init(cls) -> Tuple[WikipediaArticleCreatorConfig, List[APIServerConfig]]:
    env_config = WikipediaArticleCreatorConfig(
        tokenizer_name="gpt-4-turbo",  # Use any OpenAI model name or a different tokenizer for local processing
        group_size=4,
        use_wandb=True,
        rollout_server_url="http://localhost:8000",
        total_steps=1000,
        batch_size=128,
        steps_per_eval=20,
        max_token_length=1024 * 16,
        inference_weight=1.0,
        wandb_name="wikipedia_article_creator",
        eval_handling=EvalHandlingEnum.LIMIT_TRAIN,
        eval_limit_ratio=0.1,
        max_steps=10,
        temperature=0.7,
        thinking_active=True,
        eval_topics=5,
        tool_timeout=15.0,
        tavily_api_key=None,  # Falls back to environment variable
        min_article_sections=3,
        max_article_tokens=2048,
        topics_file="topics.json",
        logging_active=True,
    )

    # Configure OpenAI server
    server_configs = [
        APIServerConfig(
            model_name="gpt-4o",  # or any other OpenAI model
            base_url=None,  # Use default OpenAI base URL
            api_key=os.environ.get("OPENAI_API_KEY"),
            num_max_requests_at_once=4,
            num_requests_for_eval=16,
        ),
    ]

    return env_config, server_configs

2.3 OpenAI Token Length Handling

Since OpenAI models handle token length limits differently, we need to make some adjustments to ensure proper tokenization and context management:

async def _get_model_response(self, messages: List[Dict]) -> str:
    """Get a response from the model for the current conversation state"""
    try:
        # For OpenAI models, we pass the messages directly
        if self.server.config.base_url is None or 'openai' in (self.server.config.base_url or ''):
            completion = await self.server.chat_completion(
                messages=messages,
                temperature=self.config.temperature,
                max_tokens=min(4096, self.config.max_token_length),  # Ensure within OpenAI limits
            )
            # Extract the text from response
            return completion.choices[0].message.content
        else:
            # For non-OpenAI models (e.g., local models via vLLM)
            prompt = self.tokenizer.apply_chat_template(messages, tokenize=False)
            completion = await self.server.completion(
                prompt=prompt,
                n=1,
                max_tokens=self.config.max_token_length,
                temperature=self.config.temperature,
            )
            return completion.choices[0].text
    except Exception as e:
        logger.error(f"Error getting model response: {e}")
        return ""

2.4 Testing Runner Script

Create a simple runner script to test the environment with OpenAI models:

#!/usr/bin/env python3
"""
Test runner for Wikipedia Article Creator with OpenAI models
"""
import asyncio
import logging
import os
from dotenv import load_dotenv

from environments.hack0.wikipedia.wikipedia_article_creator import WikipediaArticleCreatorEnv

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

async def main():
    """Run the environment with a specified model and topic"""
    # Create and initialize the environment
    env_config, server_configs = WikipediaArticleCreatorEnv.config_init()

    # Override with command-line arguments if needed
    # env_config.max_steps = 5
    # server_configs[0].model_name = "gpt-3.5-turbo"

    # Initialize the environment
    env = WikipediaArticleCreatorEnv(env_config, server_configs, slurm=False, testing=True)
    await env.setup()

    # Run a single episode with a specific topic
    topic = "Climate change in Antarctica"
    episode_id = 1

    logger.info(f"Starting research on topic: {topic}")
    episode = env._get_or_create_episode(episode_id, topic)

    # Run until terminal state
    while not episode.is_terminal:
        is_terminal, step_data = await env._next_step(episode)
        response = step_data.get("response", "")
        tool_calls = step_data.get("tool_calls", [])

        logger.info(f"Step {episode.steps_taken} completed")
        logger.info(f"Tool calls: {len(tool_calls)}")

        if is_terminal and episode.final_article:
            logger.info("Article generated successfully")
            logger.info(f"Article length: {len(episode.final_article)} characters")

            # Save the article to a file
            with open(f"article_{topic.replace(' ', '_')}.md", "w") as f:
                f.write(episode.final_article)

            # Evaluate article quality
            quality_metrics = env._assess_article_quality(
                episode.final_article, episode.research_facts
            )
            logger.info(f"Article quality metrics: {quality_metrics}")
        elif is_terminal:
            logger.info("Episode terminated without article")

if __name__ == "__main__":
    asyncio.run(main())

3. Required Environment Setup

3.1 .env File Structure

Create a .env file in the project root with the necessary API keys:

# OpenAI API Key for model access
OPENAI_API_KEY=sk-...your-openai-key...

# Tavily API Key for web search and content extraction
TAVILY_API_KEY=tvly-...your-tavily-key...

# Optional WandB configuration
WANDB_API_KEY=...your-wandb-key...
WANDB_PROJECT=wikipedia_article_creator

3.2 Dependencies

Ensure the following packages are installed:

python-dotenv>=1.0.0
openai>=1.10.0
tavily-python>=0.2.2
wandb>=0.16.0

4. Implementation Plan

4.1 Phase 1: Environment Setup and .env Integration

  1. Add the python-dotenv package to project dependencies
  2. Create a .env.template file with placeholder values
  3. Update README to explain the required API keys and environment setup
  4. Modify wikipedia_article_creator.py to load environment variables

4.2 Phase 2: OpenAI Integration

  1. Update the config_init method to use OpenAI models
  2. Modify the _get_model_response method to handle OpenAI models properly
  3. Adjust tokenization and context length handling for OpenAI models
  4. Test with smaller models (GPT-3.5-turbo) before moving to GPT-4

4.3 Phase 3: Testing and Validation

  1. Create a test runner script to validate the environment with OpenAI models
  2. Test with a variety of topics to ensure the environment works correctly
  3. Monitor API usage and optimize requests to minimize token usage
  4. Validate article quality with different models and parameters

4.4 Phase 4: Performance Optimization

  1. Implement caching for API responses to reduce duplicate requests
  2. Optimize API call batching for evaluation runs
  3. Add error handling and retry logic for API rate limits
  4. Document best practices for working with OpenAI models in this environment

5. Potential Challenges and Solutions

5.1 API Rate Limits

Challenge: OpenAI API has rate limits that could affect research speed. Solution: Implement exponential backoff retry logic and request batching.

5.2 Token Context Length

Challenge: Different OpenAI models have different context length limits. Solution: Implement model-specific context management to stay within limits.

5.3 Cost Management

Challenge: API usage costs can accumulate quickly with multiple research steps. Solution: Implement cost tracking, efficient caching, and configurable limits.

5.4 Model Differences

Challenge: Different models may have different capabilities and response formats. Solution: Add model-specific parsing and prompt adjustments.

6. Testing Metrics

  1. Research Efficiency:

    • Average tool calls per completed article
    • Percentage of relevant vs. irrelevant searches
    • Time to completion
  2. Article Quality:

    • Structure compliance (section count, references)
    • Content relevance to topic
    • Factual accuracy
    • Information density
  3. API Usage:

    • Total tokens per article
    • Cost per article
    • Rate limit errors encountered

7. Next Steps

After successful implementation of OpenAI integration and environment setup:

  1. Optimize prompts specifically for OpenAI models
  2. Implement comparative evaluation across different models
  3. Add support for multi-model fallback (e.g., start with GPT-3.5, escalate to GPT-4 for complex topics)
  4. Create a comprehensive benchmark suite for Wikipedia article creation