mirror of https://github.com/NousResearch/atropos.git synced 2026-04-19 12:57:58 +00:00

2025-12-24 21:27:53 +00:00

Code Execution Environment

An environment for training language models to solve coding problems through code generation and execution. It evaluates whether a model can generate Python code that passes each problem's test cases; the LLM-generated code is executed and validated on a Modal endpoint.

🎯 Overview

The Code Execution Environment evaluates models on:

Generating correct Python code solutions for coding problems
Passing test cases through actual code execution
Handling both function-based and stdin/stdout problem types
Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

Key Philosophy: Scoring depends only on code correctness. A response earns 1.0 when its code passes every test and -1.0 otherwise, including when no code block can be extracted. Models must therefore generate syntactically valid, executable Python code that produces the correct outputs for the given test cases.

✨ Key Features

🔧 Code Execution & Testing

Real code execution via Modal endpoint (lcb_modal_endpoint.py)
Support for function-based problems (LeetCode-style)
Support for stdin/stdout problems (competitive programming style)
Automatic test case validation
Error handling (timeouts, runtime errors, wrong answers)

📊 Comprehensive Evaluation

Pass@1 metric calculation using combinatorial estimation
Pass@group_size evaluation
Difficulty-level breakdowns (easy/medium/hard)
Completion length tracking (correct vs incorrect solutions)
Overlong ratio tracking (solutions exceeding token limits)

📚 Dataset Support

RLVR_Coding_Problems: Training dataset for reinforcement learning
DeepMind Code Contests: Alternative training dataset
LCB Test: Evaluation benchmark from LiveCodeBench
Automatic dataset format handling and conversion

⚙️ Safety & Reliability

Sandboxed code execution via Modal
Timeout protection
Resource limits (memory, CPU)
Reliability guards preventing destructive operations
Segmentation fault handling

🚀 Quick Start

Basic Configuration

from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment
# (server_configs must be defined by the caller; its construction is not shown here)
env = CodingEnv(config, server_configs)

Problem Types

Function-Based Problems

Problems where code defines a function that is called with test inputs:

from typing import List

# Problem specification includes function signature
def solve(nums: List[int]) -> int:
    # Model generates function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
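As a rough local illustration of how such a test specification could be checked, the sketch below calls the named function on each input and compares the result with the expected output. It is not the Modal endpoint's implementation: `run_function_tests` is a hypothetical helper, it assumes each `input` entry is a single positional argument, and it uses `exec()` without any sandboxing.

```python
from typing import Any, Dict, List


def run_function_tests(code: str, tests: Dict[str, Any]) -> List[bool]:
    """Run generated code against function-style tests (hypothetical helper).

    Assumption: each entry in tests["input"] is ONE positional argument.
    exec() is not sandboxed, so only use this with trusted code.
    """
    # Pre-populate List so annotations like List[int] resolve without an import.
    namespace: Dict[str, Any] = {"List": List}
    exec(code, namespace)  # defines the solution function inside `namespace`
    fn = namespace[tests["fn_name"]]

    results = []
    for arg, expected in zip(tests["input"], tests["output"]):
        try:
            results.append(fn(arg) == expected)
        except Exception:
            results.append(False)  # a runtime error counts as a failed test
    return results


code = "def solve(nums: List[int]) -> int:\n    return max(nums)\n"
tests = {"fn_name": "solve", "input": [[1, 2, 3], [5, 4, 3]], "output": [3, 5]}
print(run_function_tests(code, tests))  # [True, True]
```

A per-test list of booleans keeps the sketch easy to inspect; the environment itself only needs to know whether all tests passed.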

Standard Input/Output Problems

Problems where code reads from stdin and writes to stdout:

# Problem: Read integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)

# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
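A stdin/stdout test of this shape can be approximated locally by running the program in a subprocess, feeding each input on stdin, and comparing the captured stdout. The sketch below is an assumption-laden stand-in for the Modal endpoint: `run_stdio_tests` is a hypothetical helper, and comparing outputs after stripping surrounding whitespace is a guess at the comparison rule.

```python
import subprocess
import sys
from typing import Any, Dict, List


def run_stdio_tests(code: str, tests: Dict[str, Any], timeout: float = 15.0) -> List[bool]:
    """Run generated code against stdin/stdout tests (hypothetical helper)."""
    results = []
    for stdin_text, expected in zip(tests["input"], tests["output"]):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],  # run the program in a fresh interpreter
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            results.append(False)  # treat a timeout as a failed test
            continue
        # Assumption: outputs match if they are equal after stripping whitespace.
        passed = proc.returncode == 0 and proc.stdout.strip() == expected.strip()
        results.append(passed)
    return results


code = "a = int(input())\nb = int(input())\nprint(a + b)\n"
tests = {"fn_name": "none", "input": ["5\n3", "10\n20"], "output": ["8", "30"]}
print(run_stdio_tests(code, tests))  # [True, True]
```

A subprocess gives process-level isolation only; it does not replace the sandboxing the endpoint provides.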

📊 Evaluation Metrics

Pass@1 (Estimated)

Uses the combinatorial pass@k estimator with k = 1 to estimate the probability that a single sampled solution is correct:

pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions, and C(a, b) is the binomial coefficient. For k = 1 this simplifies to mean(c / n).
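The estimator above can be sketched in a few lines. `pass_at_k` is a hypothetical helper name, and the `(n, c)` pairs are made-up illustration data, not results from this environment.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# (n, c) pairs for three hypothetical problems with group_size = 8
groups = [(8, 5), (8, 0), (8, 8)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in groups) / len(groups)
print(pass_at_1)
```

With k = 1 each per-problem term equals c / n, so `pass_at_k(8, 5, 1)` is 0.625.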

Pass@group_size

Fraction of problems where at least one solution is correct:

pass@group_size = mean(num_correct > 0)
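In code, this is a count of problems with at least one correct sample divided by the number of problems. The helper name and the sample counts below are illustrative assumptions.

```python
from typing import List


def pass_at_group_size(num_correct_per_problem: List[int]) -> float:
    """Fraction of problems with at least one correct solution in the group."""
    solved = sum(c > 0 for c in num_correct_per_problem)
    return solved / len(num_correct_per_problem)


# Four hypothetical problems; only the second has no correct sample.
print(pass_at_group_size([5, 0, 8, 1]))  # 0.75
```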

Difficulty Breakdowns

Separate metrics for easy, medium, and hard problems:

eval/easy_pass_1
eval/medium_pass_1
eval/hard_pass_1
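One way such a breakdown could be computed is to group the per-problem pass@1 values by difficulty before averaging. In this sketch the `difficulty` field name and the `difficulty_breakdown` helper are assumptions; `num_correct` and `total` follow the log entry fields used elsewhere in this README.

```python
from collections import defaultdict
from typing import Dict, List


def difficulty_breakdown(records: List[dict]) -> Dict[str, float]:
    """Average per-problem pass@1 (c / n) separately for each difficulty level."""
    per_level = defaultdict(list)
    for rec in records:
        per_level[rec["difficulty"]].append(rec["num_correct"] / rec["total"])
    return {
        f"eval/{level}_pass_1": sum(vals) / len(vals)
        for level, vals in per_level.items()
    }


# Made-up records for illustration only.
records = [
    {"difficulty": "easy", "num_correct": 8, "total": 8},
    {"difficulty": "easy", "num_correct": 4, "total": 8},
    {"difficulty": "hard", "num_correct": 0, "total": 8},
]
print(difficulty_breakdown(records))
```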

Completion Analysis

eval/completion_length: Average completion length
eval/correct_completion_length: Average length of correct solutions
eval/incorrect_completion_length: Average length of incorrect solutions
eval/overlong_ratio: Fraction of solutions exceeding token limits
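These statistics could be derived from per-sample scores and lengths roughly as follows. The sketch assumes that a positive score marks a correct solution, that "overlong" means a length at or above the token limit, and that an empty group reports 0.0; `completion_stats` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List


def completion_stats(scores: List[float], lengths: List[int], max_tokens: int) -> Dict[str, float]:
    """Summarize completion lengths, split by correctness (hypothetical helper)."""
    correct = [n for s, n in zip(scores, lengths) if s > 0]
    incorrect = [n for s, n in zip(scores, lengths) if s <= 0]
    return {
        "eval/completion_length": mean(lengths),
        "eval/correct_completion_length": mean(correct) if correct else 0.0,
        "eval/incorrect_completion_length": mean(incorrect) if incorrect else 0.0,
        # Assumption: a completion is overlong once it reaches the token limit.
        "eval/overlong_ratio": sum(n >= max_tokens for n in lengths) / len(lengths),
    }


scores = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
lengths = [1200, 1180, 1220, 1190, 1210, 800, 750, 820]
stats = completion_stats(scores, lengths, max_tokens=40960)
print(stats)
```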

Training Metrics

train_rewards/rewards: Average reward (correctness rate)
train_rewards/pass@group_size: Training pass rate
train_rewards/overlong_ratio: Training overlong ratio
train/completion_lengths: Training completion statistics

🔍 Code Extraction & Scoring

Code Extraction

The environment extracts Python code from model responses using regex:

Looks for code blocks in markdown format: ```python ... ```
Takes the last code block if multiple are present
Returns None if no code block is found (scores -1.0)
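A minimal sketch of this extraction step is shown below. The exact regex used by the environment is not given here, so the pattern and the `extract_last_code_block` name are assumptions; the fence string is built programmatically only to avoid nesting literal fences inside this example.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
CODE_BLOCK_RE = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_last_code_block(response: str) -> Optional[str]:
    """Return the last python code block in a response, or None if there is none."""
    blocks = CODE_BLOCK_RE.findall(response)
    return blocks[-1].strip() if blocks else None


response = (
    f"First attempt:\n{FENCE}python\nx = 1\n{FENCE}\n"
    f"Corrected:\n{FENCE}python\nx = 2\n{FENCE}\n"
)
print(extract_last_code_block(response))  # x = 2
print(extract_last_code_block("no code here"))  # None
```

Taking the last block lets a model revise an earlier draft within the same response.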

Scoring Logic

if code is None:
    score = -1.0  # No code extracted
elif all_tests_pass(code, test_cases):
    score = 1.0   # All tests pass
else:
    score = -1.0  # Tests fail or error occurs

Test Execution

Code is executed via Modal endpoint with:

Timeout protection (default 15 seconds per test case)
Memory limits (5GB)
Reliability guards (disables file system, network, process operations)
Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)

🛠️ Advanced Features

Offline Filtering

Utility to identify problems where the model achieves perfect scores:

# Must run inside an async context (e.g. a coroutine driven by asyncio.run)
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt

Blacklist System

Problems in perfect_indices.txt are automatically blacklisted during training to focus on harder problems.
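The blacklist idea can be sketched as loading the saved indices and skipping them when choosing training problems. The file layout (whitespace-separated integers) and both helper names are assumptions, since the format of perfect_indices.txt is not described here.

```python
import tempfile
from pathlib import Path
from typing import List, Set


def load_blacklist(path: str = "perfect_indices.txt") -> Set[int]:
    """Read blacklisted problem indices; a missing file means nothing is blacklisted."""
    file = Path(path)
    if not file.exists():
        return set()
    # Assumption: indices are stored as whitespace-separated integers.
    return {int(token) for token in file.read_text().split()}


def trainable_indices(num_problems: int, blacklist: Set[int]) -> List[int]:
    """Indices of problems that remain available for training."""
    return [i for i in range(num_problems) if i not in blacklist]


# Demo with a temporary file so nothing is written to the working directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "perfect_indices.txt"
    demo_path.write_text("0\n2\n")
    demo_blacklist = load_blacklist(str(demo_path))

print(trainable_indices(4, demo_blacklist))  # [1, 3]
```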

Data Logging

Comprehensive logging to files:

Short logs: qwen_data_dump_{timestamp}.txt - Basic stats per problem
Long logs: qwen_data_dump_long_{timestamp}.txt - Includes code, errors, full outputs
Separate directories: train_logs/ and eval_logs/

Example Log Entry

{
    "cur_id": 42,
    "num_correct": 5,
    "total": 8,
    "scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
    "lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
    "errors": [...],
    "codes": [...],
    "gen": "assistant message content"
}

📝 Response Format Requirements

Expected Format

Model responses should contain Python code in markdown code blocks:

Here's my solution:

```python
def solve(nums):
    return max(nums)
```

🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)

The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the LiveCodeBench and rllm repositories.

Problem Type Support

Call-Based: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
Standard I/O: stdin/stdout problems (competitive programming style)

The code execution endpoint (lcb_modal_endpoint.py) implements extensive safety measures:

Reliability Guards

Disables file system operations (os.remove, os.chdir, etc.)
Blocks process operations (os.fork, subprocess.Popen)
Prevents network access
Limits memory usage (5GB default)
Sets recursion limit to prevent stack overflow
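The general idea behind such guards can be illustrated by replacing dangerous functions with None so that calling them fails. The sketch below is a reversible variant for demonstration, not the endpoint's code: the `guarded` name, the exact list of disabled functions, and the recursion limit value are assumptions, and memory limiting is left out.

```python
import os
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def guarded(recursion_limit: int = 10_000):
    """Temporarily disable a few destructive operations (illustrative only)."""
    targets = [(os, "remove"), (os, "chdir"), (subprocess, "Popen")]
    if hasattr(os, "fork"):  # os.fork does not exist on every platform
        targets.append((os, "fork"))

    saved = [(mod, name, getattr(mod, name)) for mod, name in targets]
    old_limit = sys.getrecursionlimit()
    try:
        for mod, name in targets:
            setattr(mod, name, None)  # calling it now raises TypeError
        sys.setrecursionlimit(recursion_limit)
        yield
    finally:
        # Restore everything so the host process keeps working.
        for mod, name, original in saved:
            setattr(mod, name, original)
        sys.setrecursionlimit(old_limit)


with guarded():
    print(os.remove is None)  # True
print(callable(os.remove))  # True
```

Inside a throwaway sandbox process the restore step is unnecessary, which is presumably why guards of this kind are usually applied once and never undone.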

Error Handling

Timeout exceptions (SIGALRM)
Segmentation fault detection (faulthandler)
Runtime error capture
Wrong answer detection with detailed comparison
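A signal-based timeout combined with simple error categorization might look like the sketch below. It works only on Unix and in the main thread, the `time_limit` and `classify` names are hypothetical, and the "Passed" label is added here for illustration; the other three labels mirror the categories listed in this README.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the guarded block runs past its time limit."""


@contextmanager
def time_limit(seconds: int):
    def handler(signum, frame):
        raise TimeoutException(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)  # deliver SIGALRM after `seconds`
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


def classify(run, expected, seconds: int = 1) -> str:
    """Run a zero-argument callable and categorize the outcome."""
    try:
        with time_limit(seconds):
            result = run()
    except TimeoutException:
        return "Time Limit Exceeded"
    except Exception:
        return "Runtime Error"
    return "Passed" if result == expected else "Wrong Answer"


print(classify(lambda: 1 + 1, 2))  # Passed
print(classify(lambda: 1 / 0, 2))  # Runtime Error
```

The timeout is caught before the generic exception handler so that a slow solution is not misreported as a runtime error.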

Execution Environment

Isolated Modal containers
Signal-based timeouts
Resource limits (CPU, memory)
Sandboxed imports (whitelisted standard library modules)

📄 License

This environment is part of the Atropos training framework. See the main repository for license information.
