
# Atropos-SmolaGents Integration

This integration enables the use of SmolaGents' agent capabilities with Atropos' server-based LLM architecture for high-quality training data generation.

## Overview

The integration consists of:

1. `SmolagentsEnv`: A full-fledged Atropos environment for generating high-quality agent trajectories
2. Process-based execution: Robust parallel execution of agents in isolated processes
3. Tools: File manipulation and web search tools for the agent to use
4. Scoring system: Automatic evaluation of agent responses based on correctness and efficiency

## Files

- `smolagents_env.py`: The complete Atropos environment for generating training data.
- `agent_process_runner.py`: Module for running agents in separate processes.
- `server_proxy.py`: Proxy mechanism for communication between child processes and the Atropos server.
- `smolagents_model.py`: Process-safe Atropos server model implementation for SmolaGents.
- `download_gaia.py`: Script to download and set up the GAIA benchmark dataset.

### Tools

- `tools/file_tools.py`: Tools for reading, writing, and appending to files
- `tools/tavily_tools.py`: Web search and page extraction tools powered by Tavily
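As an illustration of what the file tools provide, here is a plain-Python sketch of the read/write/append behavior. The actual tools in `tools/file_tools.py` expose equivalent functions to the agent as SmolaGents tools, so the names and return messages below are illustrative only:

```python
# Illustrative sketch of the file tools' behavior; in the integration,
# functions like these would be wrapped as SmolaGents tools.
def write_file(path: str, content: str) -> str:
    """Overwrite a text file with the given content."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"

def append_file(path: str, content: str) -> str:
    """Append content to the end of a text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(content)
    return f"Appended {len(content)} characters to {path}"

def read_file(path: str) -> str:
    """Return the full contents of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Returning short status strings (rather than `None`) gives the agent textual feedback it can reason over in subsequent steps.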

## Installation

1. First, make sure you have Atropos installed.

2. Install SmolaGents:

   ```bash
   pip install smolagents
   ```

3. Install the GAIA benchmark dependencies:

   ```bash
   pip install datasets pandas huggingface_hub
   ```

4. For the web tools, install Tavily:

   ```bash
   pip install tavily-python
   ```

5. Download the GAIA benchmark dataset:

   ```bash
   python -m environments.smolagents_integration.download_gaia
   ```

## Environment Variables

The integration uses the following environment variables:

- `OPENAI_API_KEY`: Required for OpenAI API access when using the LiteLLM model (test mode) or when using OpenAI models with Atropos.
- `TAVILY_API_KEY`: Required for the web search and page extraction tools. You can get a key from Tavily.

If you need a different API key per run, you can also provide it as a command-line argument:

```bash
--api-key your_api_key
```
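These variables are typically exported in your shell or placed in a `.env` file at the project root; for example (placeholder values, substitute your real keys):

```shell
# Placeholder values -- replace with your actual keys
export OPENAI_API_KEY="sk-your-openai-key"
export TAVILY_API_KEY="your-tavily-key"
```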

## Using the Integration

### Using SmolagentsEnv for Training Data Generation

Generate SFT training data with the following command:

```bash
atropos-sft-gen output.jsonl --tokenizer NousResearch/DeepHermes-3-Llama-3-8B-Preview \
  --save-messages --env smolagents
```

For local testing using the OpenAI API directly:

```bash
# Minimal test (processes just 2 examples)
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 1 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "gpt-4o" \
  --openai.base_url "https://api.openai.com/v1" \
  --openai.api_key x

# Standard run (processes 10 groups of 2 examples each = 20 total examples)
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 10 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "gpt-4o" \
  --openai.base_url "https://api.openai.com/v1" \
  --openai.api_key x
```

**Note:** The command syntax uses dots (`.`) to separate namespaces. The OpenAI API key should be set in your environment as `OPENAI_API_KEY` or in a `.env` file in the project root.

If you want to use a local server instead of the OpenAI API:

```bash
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 10 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "your-model-name" \
  --openai.base_url "http://localhost:8000/v1"
```

To serve the environment for a trainer:

```bash
python -m environments.smolagents_integration.smolagents_env serve \
  --env.rollout_server_url "http://localhost:8000" \
  --env.use_chat_completion true \
  --env.max_concurrent_processes 5 \
  --env.group_size 8 \
  --openai.model_name "your-model-name" \
  --openai.base_url "http://localhost:8000/v1"
```

## How It Works

### SmolagentsEnv

The `SmolagentsEnv` class provides a complete environment for generating high-quality agent trajectories:

1. Loads tasks from the GAIA benchmark dataset
2. Creates a process-safe model implementation for SmolaGents
3. Initializes a `CodeAgent` with configurable tools
4. Manages agent execution and trajectory collection in parallel processes
5. Scores trajectories based on correctness, efficiency, and reasoning quality
6. Integrates with the Atropos SFT generation pipeline
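The scoring step can be pictured with a minimal sketch: correctness dominates, with a small per-step penalty that rewards efficient trajectories. The function name and weights below are illustrative, not the exact logic in `smolagents_env.py`:

```python
# Hypothetical scoring sketch: a correct final answer earns 1.0, and each
# agent step beyond the first costs a small penalty, so shorter correct
# trajectories score higher. Weights here are illustrative only.
def score_trajectory(predicted: str, ground_truth: str, n_steps: int,
                     step_penalty: float = 0.05) -> float:
    correct = 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
    return max(0.0, correct - step_penalty * max(0, n_steps - 1))
```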

### Process-Based Isolation

The SmolaGents integration supports true parallel execution by running each agent in its own process via multiprocessing, which significantly improves throughput when many agents run simultaneously.

#### How It Works

The process-based isolation implementation:

  1. Creates a server proxy mechanism to communicate with the Atropos server from child processes
  2. Spawns separate Python processes for each agent execution
  3. Manages inter-process communication through queues
  4. Collects and processes results from all agents
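The steps above can be sketched with Python's standard `multiprocessing` module. `run_agents_in_parallel` and the echo stub are illustrative stand-ins for the real `agent_process_runner.py` logic, which would build a `CodeAgent` in each child and route LLM calls through the server proxy:

```python
import multiprocessing as mp

def _run_agent(task: str, result_queue) -> None:
    # Stand-in for the real agent run: the integration would construct a
    # CodeAgent here and talk to the Atropos server via the proxy.
    result_queue.put((task, f"answer for {task}"))

def run_agents_in_parallel(tasks, max_concurrent: int = 8, timeout: int = 240):
    """Run tasks in batches of child processes, collecting results via a queue."""
    ctx = mp.get_context("fork")  # fork context for simplicity; POSIX only
    queue = ctx.Queue()
    results = []
    for start in range(0, len(tasks), max_concurrent):
        batch = tasks[start:start + max_concurrent]
        procs = [ctx.Process(target=_run_agent, args=(t, queue)) for t in batch]
        for p in procs:
            p.start()
        for _ in batch:  # one result expected per child
            results.append(queue.get(timeout=timeout))
        for p in procs:
            p.join(timeout=timeout)
    return results
```

Batching by `max_concurrent` mirrors the `--env.max_concurrent_processes` option, and the `timeout` argument plays the role of `--env.process_timeout`.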

#### Configuration Options

The process-based isolation can be configured through the following options:

```bash
# Set the maximum number of concurrent processes (default: 8)
--env.max_concurrent_processes=8

# Set the timeout for agent processes in seconds (default: 240)
--env.process_timeout=240
```

## GAIA Benchmark Dataset

The SmolaGents integration uses the GAIA benchmark dataset for generating high-quality agent trajectories. The dataset includes a variety of tasks with file attachments that test reasoning, problem-solving, and tool usage capabilities.

### Downloading the Dataset

The integration includes a dedicated script for downloading and setting up the GAIA dataset:

```bash
# Download to the default location (data/gaia)
python -m environments.smolagents_integration.download_gaia

# Specify a custom output directory
python -m environments.smolagents_integration.download_gaia --output-dir /path/to/custom/dir

# Use the raw dataset version instead of the annotated version
python -m environments.smolagents_integration.download_gaia --use-raw
```

This script:

1. Downloads the appropriate GAIA dataset from HuggingFace
2. Creates a `GAIA.py` loader file that works with the HuggingFace `datasets` library
3. Validates that all necessary files were downloaded correctly

To use the dataset, you need to have access to the GAIA repository on HuggingFace. The script will automatically handle authentication using your HuggingFace credentials.

### Dataset Structure

The GAIA dataset is organized into:

- `test/` - Tasks for evaluation
- `validation/` - Additional tasks that can be used for development or validation

Each task includes:

- A question prompt
- A ground truth answer
- Associated files (images, documents, spreadsheets, etc.)
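Conceptually, each record can be thought of as a small structure like the following (the type and field names are hypothetical, not the dataset's actual column names):

```python
from dataclasses import dataclass, field

# Hypothetical record type for a single GAIA task; field names are
# illustrative and may not match the dataset's real columns.
@dataclass
class GaiaTask:
    question: str        # the question prompt
    ground_truth: str    # the reference answer used for scoring
    files: list = field(default_factory=list)  # paths to attached files

task = GaiaTask(
    question="What is the sum of column B in the attached spreadsheet?",
    ground_truth="42",
    files=["validation/example.xlsx"],
)
```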

## Troubleshooting

- **Process-related errors**: When using process-based isolation, ensure your code is serializable for multiprocessing, and check that proxy communication is working properly.
- **Message format errors**: Check that message conversions between SmolaGents and Atropos formats are correct.
- **Missing GAIA data**: Make sure you've downloaded the GAIA benchmark data correctly; if needed, run `python -m environments.smolagents_integration.download_gaia`. If you encounter permission errors, you may need to request access to the HuggingFace dataset.
- **Web tool errors**: If the Tavily tools aren't working, make sure you have set the `TAVILY_API_KEY` environment variable and installed the `tavily-python` package.
- **Tool import errors**: If you see errors about missing tool modules, ensure your working directory allows proper imports of the `tools` folder.
- **Permission errors with file tools**: Ensure your process has the correct permissions to read/write files in the directories being accessed.
- **Memory issues**: If you encounter memory problems, try lowering the `max_concurrent_processes` parameter.