
# Atropos-SmolaGents Integration

This integration enables the use of SmolaGents' agent capabilities with Atropos' server-based LLM architecture for high-quality training data generation.

## Overview

The integration consists of:

1. `SmolagentsEnv`: A full-fledged Atropos environment for generating high-quality agent trajectories
2. Process-based execution: Robust parallel execution of agents in isolated processes
3. Tools: File manipulation and web search tools for the agent to use
4. Scoring system: Automatic evaluation of agent responses based on correctness and efficiency

## Files

- `smolagents_env.py`: The complete Atropos environment for generating training data.
- `agent_process_runner.py`: Module for running agents in separate processes.
- `server_proxy.py`: Proxy mechanism for communication between child processes and the Atropos server.
- `smolagents_model.py`: Process-safe Atropos server model implementation for SmolaGents.
- `download_gaia.py`: Script to download and set up the GAIA benchmark dataset.

### Tools

- `tools/file_tools.py`: Tools for reading, writing, and appending to files
- `tools/tavily_tools.py`: Web search and page extraction tools powered by Tavily
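As an illustration of what the file tools provide, here is a plain-Python sketch of the read/write/append behavior. The actual tools in `tools/file_tools.py` expose equivalent functions to the agent as SmolaGents tools, so the names and return messages below are illustrative only:

```python
# Illustrative sketch of the file tools' behavior; in the integration,
# functions like these would be wrapped as SmolaGents tools.
def write_file(path: str, content: str) -> str:
    """Overwrite a text file with the given content."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"

def append_file(path: str, content: str) -> str:
    """Append content to the end of a text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(content)
    return f"Appended {len(content)} characters to {path}"

def read_file(path: str) -> str:
    """Return the full contents of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Returning short status strings (rather than `None`) gives the agent textual feedback it can reason over in subsequent steps.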

## Installation

1. First, make sure you have Atropos installed.

2. Install SmolaGents:

   ```bash
   pip install smolagents
   ```

3. Install the GAIA benchmark dependencies:

   ```bash
   pip install datasets pandas huggingface_hub
   ```

4. For the web tools, install Tavily:

   ```bash
   pip install tavily-python
   ```

5. Download the GAIA benchmark dataset:

   ```bash
   python -m environments.smolagents_integration.download_gaia
   ```

## Environment Variables

The integration uses the following environment variables:

- `OPENAI_API_KEY`: Required for OpenAI API access when using the LiteLLM model (test mode) or when using OpenAI models with Atropos.
- `TAVILY_API_KEY`: Required for the web search and page extraction tools. You can get a key from Tavily.

If you need a different API key per run, you can also provide it as a command-line argument:

```bash
--api-key your_api_key
```
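These variables are typically exported in your shell or placed in a `.env` file at the project root; for example (placeholder values, substitute your real keys):

```shell
# Placeholder values -- replace with your actual keys
export OPENAI_API_KEY="sk-your-openai-key"
export TAVILY_API_KEY="your-tavily-key"
```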

## Using the Integration

### Using SmolagentsEnv for Training Data Generation

Generate SFT training data with the following command:

```bash
atropos-sft-gen output.jsonl --tokenizer NousResearch/DeepHermes-3-Llama-3-8B-Preview \
  --save-messages --env smolagents
```

For local testing using the OpenAI API directly:

```bash
# Minimal test (processes just 2 examples)
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 1 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "gpt-4o" \
  --openai.base_url "https://api.openai.com/v1" \
  --openai.api_key x

# Standard run (processes 10 groups of 2 examples each = 20 total examples)
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 10 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "gpt-4o" \
  --openai.base_url "https://api.openai.com/v1" \
  --openai.api_key x
```

**Note:** The command syntax uses dots (`.`) to separate namespaces. The OpenAI API key should be set in your environment as `OPENAI_API_KEY` or in a `.env` file in the project root.

If you want to use a local server instead of the OpenAI API:

```bash
python -m environments.smolagents_integration.smolagents_env process \
  --env.data_path_to_save_groups output/gaia/smolagents_output.jsonl \
  --env.total_steps 10 \
  --env.group_size 2 \
  --env.include_messages true \
  --env.max_concurrent_processes 8 \
  --env.use_chat_completion true \
  --openai.model_name "your-model-name" \
  --openai.base_url "http://localhost:8000/v1"
```

To serve the environment for a trainer:

```bash
python -m environments.smolagents_integration.smolagents_env serve \
  --env.rollout_server_url "http://localhost:8000" \
  --env.use_chat_completion true \
  --env.max_concurrent_processes 5 \
  --env.group_size 8 \
  --openai.model_name "your-model-name" \
  --openai.base_url "http://localhost:8000/v1"
```

## How It Works

### SmolagentsEnv

The `SmolagentsEnv` class provides a complete environment for generating high-quality agent trajectories:

1. Loads tasks from the GAIA benchmark dataset
2. Creates a process-safe model implementation for SmolaGents
3. Initializes a `CodeAgent` with configurable tools
4. Manages agent execution and trajectory collection in parallel processes
5. Scores trajectories based on correctness, efficiency, and reasoning quality
6. Integrates with the Atropos SFT generation pipeline
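The scoring step can be pictured with a minimal sketch: correctness dominates, with a small per-step penalty that rewards efficient trajectories. The function name and weights below are illustrative, not the exact logic in `smolagents_env.py`:

```python
# Hypothetical scoring sketch: a correct final answer earns 1.0, and each
# agent step beyond the first costs a small penalty, so shorter correct
# trajectories score higher. Weights here are illustrative only.
def score_trajectory(predicted: str, ground_truth: str, n_steps: int,
                     step_penalty: float = 0.05) -> float:
    correct = 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
    return max(0.0, correct - step_penalty * max(0, n_steps - 1))
```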

### Process-Based Isolation

The SmolaGents integration supports true parallel execution by running each agent in its own process via multiprocessing, which significantly improves throughput when many agents run simultaneously.

#### How It Works

The process-based isolation implementation:

  1. Creates a server proxy mechanism to communicate with the Atropos server from child processes
  2. Spawns separate Python processes for each agent execution
  3. Manages inter-process communication through queues
  4. Collects and processes results from all agents
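The steps above can be sketched with Python's standard `multiprocessing` module. `run_agents_in_parallel` and the echo stub are illustrative stand-ins for the real `agent_process_runner.py` logic, which would build a `CodeAgent` in each child and route LLM calls through the server proxy:

```python
import multiprocessing as mp

def _run_agent(task: str, result_queue) -> None:
    # Stand-in for the real agent run: the integration would construct a
    # CodeAgent here and talk to the Atropos server via the proxy.
    result_queue.put((task, f"answer for {task}"))

def run_agents_in_parallel(tasks, max_concurrent: int = 8, timeout: int = 240):
    """Run tasks in batches of child processes, collecting results via a queue."""
    ctx = mp.get_context("fork")  # fork context for simplicity; POSIX only
    queue = ctx.Queue()
    results = []
    for start in range(0, len(tasks), max_concurrent):
        batch = tasks[start:start + max_concurrent]
        procs = [ctx.Process(target=_run_agent, args=(t, queue)) for t in batch]
        for p in procs:
            p.start()
        for _ in batch:  # one result expected per child
            results.append(queue.get(timeout=timeout))
        for p in procs:
            p.join(timeout=timeout)
    return results
```

Batching by `max_concurrent` mirrors the `--env.max_concurrent_processes` option, and the `timeout` argument plays the role of `--env.process_timeout`.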

#### Configuration Options

The process-based isolation can be configured through the following options:

```bash
# Set the maximum number of concurrent processes (default: 8)
--env.max_concurrent_processes=8

# Set the timeout for agent processes in seconds (default: 240)
--env.process_timeout=240
```

## GAIA Benchmark Dataset

The SmolaGents integration uses the GAIA benchmark dataset for generating high-quality agent trajectories. The dataset includes a variety of tasks with file attachments that test reasoning, problem-solving, and tool usage capabilities.

### Downloading the Dataset

The integration includes a dedicated script for downloading and setting up the GAIA dataset:

```bash
# Download to the default location (data/gaia)
python -m environments.smolagents_integration.download_gaia

# Specify a custom output directory
python -m environments.smolagents_integration.download_gaia --output-dir /path/to/custom/dir

# Use the raw dataset version instead of the annotated version
python -m environments.smolagents_integration.download_gaia --use-raw
```

This script:

1. Downloads the appropriate GAIA dataset from HuggingFace
2. Creates a `GAIA.py` loader file that works with the HuggingFace `datasets` library
3. Validates that all necessary files were downloaded correctly

To use the dataset, you need to have access to the GAIA repository on HuggingFace. The script will automatically handle authentication using your HuggingFace credentials.

### Dataset Structure

The GAIA dataset is organized into:

- `test/` - Tasks for evaluation
- `validation/` - Additional tasks that can be used for development or validation

Each task includes:

- A question prompt
- A ground truth answer
- Associated files (images, documents, spreadsheets, etc.)
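Conceptually, each record can be thought of as a small structure like the following (the type and field names are hypothetical, not the dataset's actual column names):

```python
from dataclasses import dataclass, field

# Hypothetical record type for a single GAIA task; field names are
# illustrative and may not match the dataset's real columns.
@dataclass
class GaiaTask:
    question: str        # the question prompt
    ground_truth: str    # the reference answer used for scoring
    files: list = field(default_factory=list)  # paths to attached files

task = GaiaTask(
    question="What is the sum of column B in the attached spreadsheet?",
    ground_truth="42",
    files=["validation/example.xlsx"],
)
```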

## Troubleshooting

- **Process-related errors**: When using process-based isolation, ensure your code is serializable for multiprocessing, and check that proxy communication is working properly.
- **Message format errors**: Check that message conversions between SmolaGents and Atropos formats are correct.
- **Missing GAIA data**: Make sure you've downloaded the GAIA benchmark data correctly; if needed, run `python -m environments.smolagents_integration.download_gaia`. If you encounter permission errors, you may need to request access to the HuggingFace dataset.
- **Web tool errors**: If the Tavily tools aren't working, make sure you have set the `TAVILY_API_KEY` environment variable and installed the `tavily-python` package.
- **Tool import errors**: If you see errors about missing tool modules, ensure your working directory allows proper imports of the `tools` folder.
- **Permission errors with file tools**: Ensure your process has the correct permissions to read/write files in the directories being accessed.
- **Memory issues**: If you encounter memory problems, try lowering the `max_concurrent_processes` parameter.