mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
* initial commit * initial draft of wikipedia article creation environment * add openai for rollouts, update requirements, create script to run, etc. * add configuration, add debugging, fix tool calls, prevent wikipedia access * now creates html file * fix output for html page * check in Claude plan * fixed formatting and other issues * add zip file * update README * linting, moved to community folder * linting * linting * linting * linting --------- Co-authored-by: Allan Niemerg <niemerg@gmail.com> |
||
|---|---|---|
| .. | ||
| tools | ||
| article_evaluator.py | ||
| evaluation_plan.md | ||
| get_examples.py | ||
| multi_step_rollout_plan.md | ||
| multi_step_rollout_plan_openai.md | ||
| README.md | ||
| requirements.txt | ||
| run_with_openai.py | ||
| tool_calling_server.py | ||
| wikipedia-output.zip | ||
| wikipedia_article_creator.py | ||
| wikipedia_config.yaml | ||
Wikipedia Article Research Environment
This environment trains LLMs to research and create Wikipedia-style articles on arbitrary topics using web search and content extraction tools.
Overview
The Wikipedia Article Research Environment provides a comprehensive framework for training language models to conduct multi-step research and generate high-quality, factually accurate Wikipedia-style articles. The environment combines web search capabilities with content extraction tools to enable thorough research processes.
Core Features
Multi-Step Research Process
- Web Search Integration: Uses Tavily API for comprehensive web search capabilities
- Content Extraction: Extracts full content from specific webpages for detailed analysis
- Research Fact Tracking: Automatically tracks and stores important facts discovered during research
- Wikipedia Blocking: Prevents direct access to Wikipedia to encourage diverse source usage
Article Quality Assessment
- Structure Scoring: Evaluates article organization, section structure, and references
- Comprehensiveness Scoring: Assesses coverage of important topic aspects
- Fact Usage Scoring: Measures effective incorporation of researched facts
- Factual Accuracy Evaluation: Optional OpenAI-powered line-by-line fact-checking against reference articles
Advanced Evaluation System
- Dual Scoring Mechanisms: Combines structural quality with factual accuracy
- Line-by-Line Analysis: Categorizes statements as CORRECT, INCORRECT, or UNKNOWN
- Reference Comparison: Compares generated articles against real Wikipedia content
- Comprehensive Metrics: Provides detailed accuracy statistics and combined scores
Technical Implementation
Environment Configuration
- Environment Name:
WikipediaArticleCreator - Base Class:
BaseEnvfrom atroposlib - Tool Integration: Tavily search and extraction tools
- Evaluation: OpenAI-powered factual accuracy assessment
Key Components
- Episode Management: Tracks research sessions with conversation history
- Tool Execution: Handles web search and content extraction with error handling
- Quality Metrics: Multi-dimensional article assessment framework
- W&B Integration: Comprehensive logging and visualization support
Research Tools
- Web Search (
web_search): Searches the web with configurable result limits and year filtering - Page Extraction (
visit_page): Extracts content from specific URLs with error handling
Setup and Configuration
Environment Variables
# Required for web research
export TAVILY_API_KEY="your_tavily_api_key"
# Required for LLM access
export OPENAI_API_KEY="your_openai_api_key"
# Optional configuration
export MODEL_NAME="gpt-4o"
export MAX_STEPS="10"
export TEMPERATURE="0.7"
Dependencies
pip install openai tavily-python python-dotenv smolagents pandas pyyaml
Usage Examples
Training Mode
python -m atroposlib.cli.dpo \
--env-module "environments.community.wikipedia_research.wikipedia_article_creator" \
--wandb-mode online
Evaluation Mode
python -m atroposlib.cli.sft \
--eval-only \
--env-module "environments.community.wikipedia_research.wikipedia_article_creator"
Direct Usage
cd environments/community/wikipedia_research
python run_with_openai.py --topic "Climate change in Antarctica" --model "gpt-4o" --max-steps 10
Evaluation Metrics
Quality Metrics (0-1 scale)
- Structure Score: Article organization and section quality
- Comprehensiveness: Coverage of important topic aspects
- Fact Usage: Effective incorporation of researched information
- Overall Quality: Combined structural and content quality
Factual Accuracy Metrics
- Correct Statements: Percentage verified as factually accurate
- Incorrect Statements: Percentage contradicting reference sources
- Unknown Statements: Percentage that cannot be verified
- Accuracy Score: Net accuracy in [-1, 1] range
Combined Metrics
- Overall Article Score: Comprehensive quality + accuracy metric in [-1, 1] range
- Research Efficiency: Steps taken vs. article quality achieved
- Tool Usage Effectiveness: Success rate of research tool calls
Configuration Parameters
Core Settings
max_steps: Maximum research steps per article (default: 10)temperature: Sampling temperature for generation (default: 0.7)eval_topics: Number of topics for evaluation (default: 30)tool_timeout: Timeout for tool execution in seconds (default: 15.0)
Quality Thresholds
min_article_sections: Minimum sections required (default: 3)max_article_tokens: Maximum article length (default: 2048)
Advanced Options
thinking_active: Enable reasoning tags (default: True)logging_active: Enable detailed logging (default: True)include_messages: Include conversation history in outputs (default: True)
Research Workflow
- Topic Assignment: Model receives a research topic
- Research Planning: Model develops research strategy using
<think>tags - Information Gathering: Uses
web_searchandvisit_pagetools iteratively - Fact Extraction: Environment tracks important facts from tool results
- Article Generation: Model synthesizes research into Wikipedia-style article
- Quality Assessment: Environment evaluates structure, comprehensiveness, and accuracy
- Factual Verification: Optional comparison against reference Wikipedia articles
Output Format
Article Structure
Articles must be formatted as:
Final Step: ```markdown
# Article Title
## Introduction
[Content...]
## Section 1
[Content...]
## References
[Sources...]
Tool Call Format
<tool_call>
{"name": "web_search", "arguments": {"query": "search terms", "num_results": 5}}
</tool_call>
<tool_call>
{"name": "visit_page", "arguments": {"url": "https://example.com"}}
</tool_call>
Performance Characteristics
Computational Requirements
- Memory: ~1-2 GB RAM for typical usage
- API Calls: 10-50 tool calls per article depending on complexity
- Processing Time: 2-10 minutes per article with OpenAI models
- Storage: Minimal local storage requirements
Scalability
- Concurrent Episodes: Supports multiple parallel research sessions
- Batch Processing: Configurable batch sizes for training
- Tool Rate Limiting: Built-in respect for API rate limits
- Error Recovery: Robust error handling for network issues
Integration Features
W&B Logging
- Conversation Tracking: Complete research session histories
- Quality Metrics: Detailed article assessment data
- Tool Usage Analytics: Search and extraction success rates
- Accuracy Statistics: Factual verification results
HTML Rendering
- Research Visualization: Complete conversation flows with tool results
- Article Presentation: Formatted final articles with metadata
- Quality Dashboards: Interactive metric displays
This environment provides a comprehensive framework for training LLMs to conduct thorough research and generate high-quality, factually accurate articles while maintaining transparency in the research process.