# Wikipedia Article Evaluation System Plan

## Overview

This document outlines the plan to implement an evaluation system that uses OpenAI models to assess AI-generated Wikipedia articles against reference articles from the existing JSON data. This system will be integrated directly into the `score()` function of the `WikipediaArticleCreatorEnv` class.

## Core Components

### 1. Data Access Module

**Purpose**: Access reference Wikipedia articles from the existing JSON data.

**Implementation Details**:
- Utilize the existing JSON loading functionality in `wikipedia_article_creator.py`
- Access reference article content via the "plain_text" key already available in the JSON
- Match generated articles to reference articles by title

```python
def get_reference_article(self, topic: str) -> str:
    """
    Retrieves the reference article text for a given topic from the loaded JSON data.
    """
    # Access the article JSON that's already loaded by _load_topics()
    # Return the "plain_text" content for the matching article
    pass
```
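A minimal sketch of how this lookup might work, written as a standalone function over a hypothetical `{title: {"plain_text": ...}}` dict (the real structure loaded by `_load_topics()` may differ):

```python
def get_reference_article(articles: dict, topic: str) -> str:
    """Return the reference article text for a topic, matching by title.

    Assumes a hypothetical data shape: {title: {"plain_text": ...}}.
    """
    entry = articles.get(topic)
    if entry is None:
        # Fall back to a case-insensitive title match
        for title, data in articles.items():
            if title.lower() == topic.lower():
                entry = data
                break
    if entry is None:
        raise KeyError(f"No reference article found for topic {topic!r}")
    return entry["plain_text"]
```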

### 2. Content Preparation Module

**Purpose**: Prepare AI-generated articles for evaluation against reference content.

**Implementation Details**:
- Split AI-generated article into numbered lines for granular assessment
- No need to normalize reference text - the OpenAI model can work with raw text

```python
def prepare_article_for_evaluation(self, article_content: str) -> Tuple[str, List[str]]:
    """
    Prepares an AI-generated article for evaluation.
    Returns both the numbered version (for the prompt) and the original lines (for scoring).
    """
    # Split article into lines
    # Add line numbers
    # Return both formatted text and original lines
    pass
```
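One possible standalone sketch of the numbering step; skipping blank lines is an assumption here (they carry nothing to fact-check), not a stated requirement:

```python
from typing import List, Tuple

def prepare_article_for_evaluation(article_content: str) -> Tuple[str, List[str]]:
    """Split an article into lines and prefix each with its line number."""
    # Assumption: drop blank lines so every numbered line is a checkable claim
    lines = [line for line in article_content.splitlines() if line.strip()]
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))
    return numbered, lines
```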

### 3. Evaluation Engine

**Purpose**: Compare the AI-generated article against the reference using OpenAI models.

**Implementation Details**:
- Create a focused prompt for the OpenAI model
- Generate a YAML-formatted assessment of each line
- Categorize statements as CORRECT, INCORRECT, or UNKNOWN
- Include a brief justification for each classification

```python
async def evaluate_article_accuracy(
    self,
    reference_content: str,
    generated_article: str,
) -> Dict:
    """
    Evaluates the factual accuracy of a generated article against a reference.
    Returns structured accuracy data.
    """
    # Format the prompt with reference and generated content
    # Call the OpenAI API
    # Parse the YAML response
    # Return structured accuracy data
    pass
```
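Parsing the model's YAML reply is the fiddly step; a standalone sketch using PyYAML, with a tolerance for models that wrap output in a markdown fence (the fence-stripping heuristic and the `parse_evaluation_yaml` name are assumptions):

```python
import yaml  # PyYAML

def parse_evaluation_yaml(raw: str) -> dict:
    """Parse the model's YAML verdicts into {line_number: {analysis, accuracy}}."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("Expected a YAML mapping of line numbers to verdicts")
    return data
```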

### 4. Scoring Integration

**Purpose**: Calculate accuracy score and integrate with existing scoring mechanism.

**Implementation Details**:
- Convert evaluation results into a normalized score
- Integrate with existing article quality metrics
- Add accuracy metrics to wandb logging

```python
def calculate_accuracy_score(self, evaluation_data: Dict) -> float:
    """
    Calculates a normalized accuracy score from evaluation data.
    Returns a score between -1 and 1 for compatibility with existing scoring.
    """
    # Calculate percentage of CORRECT, INCORRECT, and UNKNOWN statements
    # Convert to a normalized score in the range [-1, 1]
    # More CORRECT = higher score, more INCORRECT = lower score
    pass
```

## Integration with Existing Environment

### Updating the `score()` Function

```python
async def score(self, rollout_group_data: List[ScoredDataGroup]) -> List[ScoredDataGroup]:
    """
    Enhanced scoring function that incorporates factual accuracy evaluation.
    """
    # For each terminal step with a final article:
    # 1. Get the corresponding topic
    # 2. Retrieve the reference article from the JSON data
    # 3. Evaluate article accuracy
    # 4. Calculate the accuracy score
    # 5. Combine with existing quality metrics
    # 6. Update the score in the ScoredDataGroup

    # Add accuracy metrics to article_quality_metrics for wandb logging

    return rollout_group_data
```
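Step 5 ("combine with existing quality metrics") is left open above; one plausible choice is a simple weighted average, sketched here with a hypothetical `accuracy_weight` parameter whose value would need tuning:

```python
def combine_scores(quality_score: float, accuracy_score: float,
                   accuracy_weight: float = 0.5) -> float:
    """Blend the existing quality score with the new accuracy score.

    The 50/50 default weighting is an assumption, not a decided design.
    """
    combined = (1 - accuracy_weight) * quality_score + accuracy_weight * accuracy_score
    # Keep the result in the same [-1, 1] range as the component scores
    return max(-1.0, min(1.0, combined))
```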

## OpenAI Prompt Design

```
You are an expert fact-checker comparing an AI-generated article with a reference Wikipedia article.

# Classification Criteria
- CORRECT: The statement is accurate and verifiable in the reference article
- INCORRECT: The statement contradicts information in the reference article
- UNKNOWN: The reference doesn't mention this information or provides insufficient details to verify

# Output Format
You must produce valid YAML with this exact structure for each numbered line:
1:
  analysis: "Brief analysis of line 1"
  accuracy: "CORRECT|INCORRECT|UNKNOWN"
2:
  analysis: "Brief analysis of line 2"
  accuracy: "CORRECT|INCORRECT|UNKNOWN"
...

# REFERENCE ARTICLE:
{wiki_content}

# AI-GENERATED ARTICLE (NUMBERED LINES):
{numbered_ai_content}
```

## Implementation Steps

1. Implement the `get_reference_article()` function to extract reference text from JSON
2. Create the `prepare_article_for_evaluation()` function to number article lines
3. Develop the `evaluate_article_accuracy()` function with OpenAI integration
4. Implement the `calculate_accuracy_score()` function
5. Update the `score()` method to incorporate accuracy evaluation
6. Extend `_assess_article_quality()` to include the new accuracy metrics
7. Update wandb logging to include accuracy statistics

## Accuracy Scoring Formula

The accuracy score will be calculated as follows:

```python
# Example scoring formula
def calculate_accuracy_score(evaluation_data):
    total_lines = len(evaluation_data)
    correct_count = sum(1 for item in evaluation_data.values() if item['accuracy'] == 'CORRECT')
    incorrect_count = sum(1 for item in evaluation_data.values() if item['accuracy'] == 'INCORRECT')

    # Calculate percentages
    pct_correct = correct_count / total_lines if total_lines > 0 else 0
    pct_incorrect = incorrect_count / total_lines if total_lines > 0 else 0

    # Convert to score between -1 and 1
    # Formula: correct% * 2 - 1, with a penalty for incorrect%
    score = pct_correct * 2 - 1 - (pct_incorrect * 0.5)

    # Ensure score is within [-1, 1] range
    return max(-1, min(1, score))
```
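As a sanity check, a worked example of the formula above on four verdicts (two CORRECT, one INCORRECT, one UNKNOWN):

```python
# Worked example of the scoring formula: 2 CORRECT, 1 INCORRECT, 1 UNKNOWN
verdicts = ["CORRECT", "CORRECT", "INCORRECT", "UNKNOWN"]
pct_correct = verdicts.count("CORRECT") / len(verdicts)      # 0.5
pct_incorrect = verdicts.count("INCORRECT") / len(verdicts)  # 0.25
score = max(-1, min(1, pct_correct * 2 - 1 - pct_incorrect * 0.5))
# 0.5 * 2 - 1 - 0.125 = -0.125: a half-correct article still scores below zero,
# since UNKNOWN lines count against pct_correct
```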

## Updates to wandb Logging

The existing wandb logging will be extended to include:

```python
# Add to article_quality_metrics
accuracy_metrics = {
    "pct_correct": percentage_correct,
    "pct_incorrect": percentage_incorrect,
    "pct_unknown": percentage_unknown,
    "accuracy_score": accuracy_score,
}
self.article_quality_metrics[-1].update(accuracy_metrics)

# Add to the wandb metrics table (add_column mutates the table in place
# and returns None, so assign the table itself to the metrics dict)
table.add_column("factual_accuracy", [
    m["accuracy_score"] for m in self.article_quality_metrics
])
wandb_metrics["train/article_quality"] = table
```
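The percentages logged above could be derived from the parsed evaluation data with a small helper; a sketch (the `summarize_verdicts` name is hypothetical):

```python
def summarize_verdicts(evaluation_data: dict) -> dict:
    """Aggregate per-line verdicts into the percentages used for logging."""
    total = max(len(evaluation_data), 1)  # avoid division by zero on empty input
    counts = {"CORRECT": 0, "INCORRECT": 0, "UNKNOWN": 0}
    for item in evaluation_data.values():
        counts[item["accuracy"]] = counts.get(item["accuracy"], 0) + 1
    return {
        "pct_correct": counts["CORRECT"] / total,
        "pct_incorrect": counts["INCORRECT"] / total,
        "pct_unknown": counts["UNKNOWN"] / total,
    }
```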

## Expected Benefits

1. More comprehensive evaluation of generated articles
2. Better feedback for the model on factual accuracy
3. Improved ability to detect hallucinations or fabricated information
4. Enhanced scoring mechanism that values factual correctness