# Wikipedia Article Evaluation System Plan

## Overview

This document outlines the plan to implement an evaluation system that uses OpenAI models to assess AI-generated Wikipedia articles against reference articles from the existing JSON data. This system will be integrated directly into the `score()` function of the `WikipediaArticleCreatorEnv` class.

## Core Components

### 1. Data Access Module

**Purpose**: Access reference Wikipedia articles from the existing JSON data.

**Implementation Details**:
- Utilize the existing JSON loading functionality in `wikipedia_article_creator.py`
- Access reference article content via the "plain_text" key already available in the JSON
- Match generated articles to reference articles by title

```python
def get_reference_article(self, topic: str) -> str:
    """
    Retrieves the reference article text for a given topic from the loaded JSON data.
    """
    # Access the article JSON that's already loaded by _load_topics()
    # Return the "plain_text" content for the matching article
    pass
```
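A minimal sketch of how this lookup might work, written as a standalone function over a hypothetical `{title: {"plain_text": ...}}` dict (the real structure loaded by `_load_topics()` may differ):

```python
def get_reference_article(articles: dict, topic: str) -> str:
    """Return the reference article text for a topic, matching by title.

    Assumes a hypothetical data shape: {title: {"plain_text": ...}}.
    """
    entry = articles.get(topic)
    if entry is None:
        # Fall back to a case-insensitive title match
        for title, data in articles.items():
            if title.lower() == topic.lower():
                entry = data
                break
    if entry is None:
        raise KeyError(f"No reference article found for topic {topic!r}")
    return entry["plain_text"]
```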

### 2. Content Preparation Module

**Purpose**: Prepare AI-generated articles for evaluation against reference content.

**Implementation Details**:
- Split AI-generated article into numbered lines for granular assessment
- No need to normalize reference text - the OpenAI model can work with raw text

```python
def prepare_article_for_evaluation(self, article_content: str) -> Tuple[str, List[str]]:
    """
    Prepares an AI-generated article for evaluation.
    Returns both the numbered version (for the prompt) and the original lines (for scoring).
    """
    # Split article into lines
    # Add line numbers
    # Return both formatted text and original lines
    pass
```
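One possible standalone sketch of the numbering step; skipping blank lines is an assumption here (they carry nothing to fact-check), not a stated requirement:

```python
from typing import List, Tuple

def prepare_article_for_evaluation(article_content: str) -> Tuple[str, List[str]]:
    """Split an article into lines and prefix each with its line number."""
    # Assumption: drop blank lines so every numbered line is a checkable claim
    lines = [line for line in article_content.splitlines() if line.strip()]
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))
    return numbered, lines
```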

### 3. Evaluation Engine

**Purpose**: Compare the AI-generated article against the reference using OpenAI models.

**Implementation Details**:
- Create a focused prompt for the OpenAI model
- Generate a YAML-formatted assessment of each line
- Categorize statements as CORRECT, INCORRECT, or UNKNOWN
- Include a brief justification for each classification

```python
async def evaluate_article_accuracy(
    self,
    reference_content: str,
    generated_article: str,
) -> Dict:
    """
    Evaluates the factual accuracy of a generated article against a reference.
    Returns structured accuracy data.
    """
    # Format the prompt with reference and generated content
    # Call the OpenAI API
    # Parse the YAML response
    # Return structured accuracy data
    pass
```
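Parsing the model's YAML reply is the fiddly step; a standalone sketch using PyYAML, with a tolerance for models that wrap output in a markdown fence (the fence-stripping heuristic and the `parse_evaluation_yaml` name are assumptions):

```python
import yaml  # PyYAML

def parse_evaluation_yaml(raw: str) -> dict:
    """Parse the model's YAML verdicts into {line_number: {analysis, accuracy}}."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("Expected a YAML mapping of line numbers to verdicts")
    return data
```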

### 4. Scoring Integration

**Purpose**: Calculate accuracy score and integrate with existing scoring mechanism.

**Implementation Details**:
- Convert evaluation results into a normalized score
- Integrate with existing article quality metrics
- Add accuracy metrics to wandb logging

```python
def calculate_accuracy_score(self, evaluation_data: Dict) -> float:
    """
    Calculates a normalized accuracy score from evaluation data.
    Returns a score between -1 and 1 for compatibility with existing scoring.
    """
    # Calculate percentage of CORRECT, INCORRECT, and UNKNOWN statements
    # Convert to a normalized score in the range [-1, 1]
    # More CORRECT = higher score, more INCORRECT = lower score
    pass
```

## Integration with Existing Environment

### Updating the `score()` Function

```python
async def score(self, rollout_group_data: List[ScoredDataGroup]) -> List[ScoredDataGroup]:
    """
    Enhanced scoring function that incorporates factual accuracy evaluation.
    """
    # For each terminal step with a final article:
    # 1. Get the corresponding topic
    # 2. Retrieve the reference article from the JSON data
    # 3. Evaluate article accuracy
    # 4. Calculate the accuracy score
    # 5. Combine with existing quality metrics
    # 6. Update the score in the ScoredDataGroup

    # Add accuracy metrics to article_quality_metrics for wandb logging

    return rollout_group_data
```
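Step 5 ("combine with existing quality metrics") is left open above; one plausible choice is a simple weighted average, sketched here with a hypothetical `accuracy_weight` parameter whose value would need tuning:

```python
def combine_scores(quality_score: float, accuracy_score: float,
                   accuracy_weight: float = 0.5) -> float:
    """Blend the existing quality score with the new accuracy score.

    The 50/50 default weighting is an assumption, not a decided design.
    """
    combined = (1 - accuracy_weight) * quality_score + accuracy_weight * accuracy_score
    # Keep the result in the same [-1, 1] range as the component scores
    return max(-1.0, min(1.0, combined))
```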

## OpenAI Prompt Design

```
You are an expert fact-checker comparing an AI-generated article with a reference Wikipedia article.

# Classification Criteria
- CORRECT: The statement is accurate and verifiable in the reference article
- INCORRECT: The statement contradicts information in the reference article
- UNKNOWN: The reference doesn't mention this information or provides insufficient details to verify

# Output Format
You must produce valid YAML with this exact structure for each numbered line:
1:
  analysis: "Brief analysis of line 1"
  accuracy: "CORRECT|INCORRECT|UNKNOWN"
2:
  analysis: "Brief analysis of line 2"
  accuracy: "CORRECT|INCORRECT|UNKNOWN"
...

# REFERENCE ARTICLE:
{wiki_content}

# AI-GENERATED ARTICLE (NUMBERED LINES):
{numbered_ai_content}
```

## Implementation Steps

1. Implement the `get_reference_article()` function to extract reference text from JSON
2. Create the `prepare_article_for_evaluation()` function to number article lines
3. Develop the `evaluate_article_accuracy()` function with OpenAI integration
4. Implement the `calculate_accuracy_score()` function
5. Update the `score()` method to incorporate accuracy evaluation
6. Extend `_assess_article_quality()` to include the new accuracy metrics
7. Update wandb logging to include accuracy statistics

## Accuracy Scoring Formula

The accuracy score will be calculated as follows:

```python
# Example scoring formula
def calculate_accuracy_score(evaluation_data):
    total_lines = len(evaluation_data)
    correct_count = sum(1 for item in evaluation_data.values() if item['accuracy'] == 'CORRECT')
    incorrect_count = sum(1 for item in evaluation_data.values() if item['accuracy'] == 'INCORRECT')

    # Calculate percentages
    pct_correct = correct_count / total_lines if total_lines > 0 else 0
    pct_incorrect = incorrect_count / total_lines if total_lines > 0 else 0

    # Convert to score between -1 and 1
    # Formula: correct% * 2 - 1, with a penalty for incorrect%
    score = pct_correct * 2 - 1 - (pct_incorrect * 0.5)

    # Ensure score is within [-1, 1] range
    return max(-1, min(1, score))
```
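As a sanity check, a worked example of the formula above on four verdicts (two CORRECT, one INCORRECT, one UNKNOWN):

```python
# Worked example of the scoring formula: 2 CORRECT, 1 INCORRECT, 1 UNKNOWN
verdicts = ["CORRECT", "CORRECT", "INCORRECT", "UNKNOWN"]
pct_correct = verdicts.count("CORRECT") / len(verdicts)      # 0.5
pct_incorrect = verdicts.count("INCORRECT") / len(verdicts)  # 0.25
score = max(-1, min(1, pct_correct * 2 - 1 - pct_incorrect * 0.5))
# 0.5 * 2 - 1 - 0.125 = -0.125: a half-correct article still scores below zero,
# since UNKNOWN lines count against pct_correct
```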

## Updates to wandb Logging

The existing wandb logging will be extended to include:

```python
# Add to article_quality_metrics
accuracy_metrics = {
    "pct_correct": percentage_correct,
    "pct_incorrect": percentage_incorrect,
    "pct_unknown": percentage_unknown,
    "accuracy_score": accuracy_score,
}
self.article_quality_metrics[-1].update(accuracy_metrics)

# Add to the wandb metrics table (add_column mutates the table in place
# and returns None, so assign the table itself to the metrics dict)
table.add_column("factual_accuracy", [
    m["accuracy_score"] for m in self.article_quality_metrics
])
wandb_metrics["train/article_quality"] = table
```
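The percentages logged above could be derived from the parsed evaluation data with a small helper; a sketch (the `summarize_verdicts` name is hypothetical):

```python
def summarize_verdicts(evaluation_data: dict) -> dict:
    """Aggregate per-line verdicts into the percentages used for logging."""
    total = max(len(evaluation_data), 1)  # avoid division by zero on empty input
    counts = {"CORRECT": 0, "INCORRECT": 0, "UNKNOWN": 0}
    for item in evaluation_data.values():
        counts[item["accuracy"]] = counts.get(item["accuracy"], 0) + 1
    return {
        "pct_correct": counts["CORRECT"] / total,
        "pct_incorrect": counts["INCORRECT"] / total,
        "pct_unknown": counts["UNKNOWN"] / total,
    }
```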

## Expected Benefits

1. More comprehensive evaluation of generated articles
2. Better feedback for the model on factual accuracy
3. Improved ability to detect hallucinations or fabricated information
4. Enhanced scoring mechanism that values factual correctness