Code Execution Environment
A comprehensive environment for training language models to solve coding problems through code generation and execution. It evaluates a model's ability to generate Python code that passes test cases; a Modal endpoint executes and validates the LLM-generated code.
🎯 Overview
The Code Execution Environment evaluates models on:
- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)
Key Philosophy: Scoring is binary and based on code correctness: 1.0 when the code passes all tests, -1.0 otherwise. Models must generate syntactically valid, executable Python code that produces the correct outputs for the given test cases.
✨ Key Features
🔧 Code Execution & Testing
- Real code execution via Modal endpoint (lcb_modal_endpoint.py)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)
📊 Comprehensive Evaluation
- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs incorrect solutions)
- Overlong ratio tracking (solutions exceeding token limits)
📚 Dataset Support
- RLVR_Coding_Problems: Training dataset for reinforcement learning
- DeepMind Code Contests: Alternative training dataset
- LCB Test: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion
⚙️ Safety & Reliability
- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling
🚀 Quick Start
Basic Configuration
from atropos.environments.code_execution_server import CodingEnv, CodeConfig
# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)
# Initialize environment (server_configs is not defined in this snippet and must be supplied by the caller)
env = CodingEnv(config, server_configs)
Problem Types
Function-Based Problems
Problems where code defines a function that is called with test inputs:
from typing import List
# Problem specification includes function signature
def solve(nums: List[int]) -> int:
    # Model generates function body
    return max(nums)
# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
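As a rough illustration of how such a test dictionary could be consumed, the sketch below executes the generated code and calls the named function once per input. The helper `run_function_tests` is hypothetical and not part of the environment, and it assumes each entry in `"input"` is the single positional argument for one call. The real endpoint runs inside a sandbox; a bare `exec` like this should never be used on untrusted code.

```python
from typing import Any, Dict, List


def run_function_tests(code: str, tests: Dict[str, Any]) -> List[bool]:
    """Hypothetical helper: run `code`, then call tests["fn_name"] per input."""
    namespace: Dict[str, Any] = {}
    exec(code, namespace)  # illustration only; unsafe outside a sandbox
    fn = namespace[tests["fn_name"]]
    results = []
    for arg, expected in zip(tests["input"], tests["output"]):
        try:
            # Assumption: each input entry is one positional argument.
            results.append(fn(arg) == expected)
        except Exception:
            results.append(False)  # a runtime error counts as a failed test
    return results


code = "def solve(nums):\n    return max(nums)\n"
tests = {"fn_name": "solve", "input": [[1, 2, 3], [5, 4, 3]], "output": [3, 5]}
print(run_function_tests(code, tests))  # [True, True]
```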
Standard Input/Output Problems
Problems where code reads from stdin and writes to stdout:
# Problem: Read integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)
# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
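A stdin/stdout test of this shape could be checked roughly as follows. This is a minimal sketch, not the endpoint's implementation: `run_stdio_tests` is a hypothetical helper, it runs each test in a fresh interpreter via `subprocess.run`, and it assumes outputs are compared after stripping surrounding whitespace.

```python
import subprocess
import sys
from typing import Any, Dict, List


def run_stdio_tests(code: str, tests: Dict[str, Any], timeout: float = 15.0) -> List[bool]:
    """Hypothetical helper: feed each input on stdin and compare stdout."""
    results = []
    for stdin_text, expected in zip(tests["input"], tests["output"]):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            results.append(False)  # treat a timeout as a failed test
            continue
        # Assumption: whitespace at the ends of the output is not significant.
        passed = proc.returncode == 0 and proc.stdout.strip() == expected.strip()
        results.append(passed)
    return results


stdio_code = "a = int(input())\nb = int(input())\nprint(a + b)\n"
stdio_tests = {"fn_name": "none", "input": ["5\n3", "10\n20"], "output": ["8", "30"]}
print(run_stdio_tests(stdio_code, stdio_tests))
```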
📊 Evaluation Metrics
Pass@1 (Estimated)
Uses combinatorial estimation to calculate the probability that a single sampled solution is correct:
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions, and C(a, b) is the binomial coefficient. For a single attempt, each term reduces to c / n.
Pass@group_size
Fraction of problems where at least one solution is correct:
pass@group_size = mean(num_correct > 0)
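The two formulas above can be sketched in a few lines. The function names here are illustrative, not the environment's own; `pass_at_k` generalizes the pass@1 expression to arbitrary k using the same combinatorial form, which is an assumption about how the estimate extends beyond k = 1.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    # comb returns 0 when fewer than k incorrect samples exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(num_correct: List[int], n: int, k: int) -> float:
    """Average the per-problem estimate across all problems."""
    return sum(pass_at_k(n, c, k) for c in num_correct) / len(num_correct)


def pass_at_group_size(num_correct: List[int]) -> float:
    """Fraction of problems with at least one correct solution."""
    return sum(c > 0 for c in num_correct) / len(num_correct)


# With group_size = 8 and 5 correct solutions, pass@1 is 5 / 8.
print(pass_at_k(8, 5, 1))  # 0.625
```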
Difficulty Breakdowns
Separate metrics for easy, medium, and hard problems:
- eval/easy_pass_1
- eval/medium_pass_1
- eval/hard_pass_1
Completion Analysis
- eval/completion_length: Average completion length
- eval/correct_completion_length: Average length of correct solutions
- eval/incorrect_completion_length: Average length of incorrect solutions
- eval/overlong_ratio: Fraction of solutions exceeding token limits
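To make these definitions concrete, the sketch below computes the four values from per-solution scores and token lengths. `completion_stats` is a hypothetical helper, and treating a solution as overlong when its length reaches `max_tokens` is an assumption about how the limit is applied.

```python
from statistics import mean
from typing import Dict, List


def completion_stats(scores: List[float], lengths: List[int], max_tokens: int) -> Dict[str, float]:
    """Hypothetical helper: length statistics split by correctness."""
    correct = [n for s, n in zip(scores, lengths) if s > 0]
    incorrect = [n for s, n in zip(scores, lengths) if s <= 0]
    return {
        "completion_length": mean(lengths),
        "correct_completion_length": mean(correct) if correct else 0.0,
        "incorrect_completion_length": mean(incorrect) if incorrect else 0.0,
        # Assumption: "overlong" means the length reached the token limit.
        "overlong_ratio": sum(n >= max_tokens for n in lengths) / len(lengths),
    }


scores = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
lengths = [1200, 1180, 1220, 1190, 1210, 800, 750, 820]
stats = completion_stats(scores, lengths, max_tokens=40960)
print(stats)
```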
Training Metrics
- train_rewards/rewards: Average reward (correctness rate)
- train_rewards/pass@group_size: Training pass rate
- train_rewards/overlong_ratio: Training overlong ratio
- train/completion_lengths: Training completion statistics
🔍 Code Extraction & Scoring
Code Extraction
The environment extracts Python code from model responses using regex:
- Looks for code blocks in markdown format: ```python ... ```
- Takes the last code block if multiple are present
- Returns None if no code block is found (scores -1.0)
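The extraction rules above could be implemented along these lines. This is a sketch with an assumed regex, not the environment's actual pattern; `extract_code` is a hypothetical name, and the fence marker is built from single backticks only so the example stays readable inside this document.

```python
import re
from typing import Optional

FENCE = "`" * 3  # the three-backtick markdown fence marker

# Assumed pattern: an opening fence tagged "python", then everything
# (lazily) up to the next closing fence.
CODE_BLOCK_RE = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last python code block in `response`, or None."""
    blocks = CODE_BLOCK_RE.findall(response)
    return blocks[-1].strip() if blocks else None


response = (
    "First attempt:\n" + FENCE + "python\nprint(1)\n" + FENCE + "\n"
    "Corrected:\n" + FENCE + "python\nprint(2)\n" + FENCE + "\n"
)
print(extract_code(response))  # print(2)
```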
Scoring Logic
if code is None:
    score = -1.0  # No code extracted
elif all_tests_pass(code, test_cases):
    score = 1.0  # All tests pass
else:
    score = -1.0  # Tests fail or error occurs
Test Execution
Code is executed via Modal endpoint with:
- Timeout protection (default 15 seconds per test case)
- Memory limits (5GB)
- Reliability guards (disables file system, network, process operations)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)
🛠️ Advanced Features
Offline Filtering
Utility to identify problems where the model achieves perfect scores:
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt
Blacklist System
Problems in perfect_indices.txt are automatically blacklisted during training to focus on harder problems.
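One way the blacklist could be applied is sketched below. Both helper names are hypothetical, and the file layout is an assumption: the sketch expects perfect_indices.txt to hold whitespace-separated integer indices, which may not match the format the environment actually writes.

```python
from pathlib import Path
from typing import List, Set


def load_blacklist(path: str = "perfect_indices.txt") -> Set[int]:
    """Hypothetical loader; assumes whitespace-separated integer indices."""
    file = Path(path)
    if not file.exists():
        return set()  # no offline filter run yet, so nothing is blacklisted
    return {int(token) for token in file.read_text().split()}


def filter_indices(indices: List[int], blacklist: Set[int]) -> List[int]:
    """Keep only the problem indices that are not blacklisted."""
    return [i for i in indices if i not in blacklist]
```

Calling `filter_indices(list(range(5)), {1, 3})` would leave `[0, 2, 4]` for training.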
Data Logging
Comprehensive logging to files:
- Short logs: qwen_data_dump_{timestamp}.txt (basic stats per problem)
- Long logs: qwen_data_dump_long_{timestamp}.txt (includes code, errors, full outputs)
- Separate directories: train_logs/ and eval_logs/
Example Log Entry
{
    "cur_id": 42,
    "num_correct": 5,
    "total": 8,
    "scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
    "lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
    "errors": [...],
    "codes": [...],
    "gen": "assistant message content"
}
📝 Response Format Requirements
Expected Format
Model responses should contain Python code in markdown code blocks:
Here's my solution:
```python
def solve(nums):
    return max(nums)
```
🔐 Code Execution Endpoint (lcb_modal_endpoint.py)
The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the LiveCodeBench and rllm repositories.
Problem Type Support
- Call-Based: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- Standard I/O: stdin/stdout problems (competitive programming style)
The code execution endpoint (lcb_modal_endpoint.py) implements extensive safety measures:
Reliability Guards
- Disables file system operations (os.remove, os.chdir, etc.)
- Blocks process operations (os.fork, subprocess.Popen)
- Prevents network access
- Limits memory usage (5GB default)
- Sets recursion limit to prevent stack overflow
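A guard of this kind is often written by overwriting dangerous attributes with None and capping the address space. The sketch below shows that general pattern under stated assumptions: the function names, the recursion limit value, and the exact set of disabled attributes are illustrative, and network blocking is not covered. The guard is only defined here, not called, because it permanently alters the running process.

```python
import os
import subprocess
import sys
from typing import Iterable, Optional


def disable_attrs(module: object, names: Iterable[str]) -> None:
    """Overwrite the named attributes with None so calls to them fail."""
    for name in names:
        if hasattr(module, name):
            setattr(module, name, None)


def reliability_guard(
    max_memory_bytes: Optional[int] = 5 * 1024**3,  # 5GB, per the list above
    recursion_limit: int = 10_000,  # hypothetical value
) -> None:
    """Illustrative guard; call only inside a disposable worker process."""
    if max_memory_bytes is not None and sys.platform != "win32":
        import resource  # Unix-only module

        # May be rejected on platforms that do not support RLIMIT_AS.
        resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    sys.setrecursionlimit(recursion_limit)
    disable_attrs(os, ["remove", "chdir", "fork"])
    disable_attrs(subprocess, ["Popen"])
```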
Error Handling
- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong answer detection with detailed comparison
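Signal-based timeouts with error categorization can be sketched as follows. `run_with_timeout` and `TimeoutException` are hypothetical names, the category strings follow the labels used earlier in this document, and the approach assumes a Unix system with the call made from the main thread, since SIGALRM is unavailable elsewhere.

```python
import signal
from typing import Any, Callable, Tuple


class TimeoutException(Exception):
    """Raised by the SIGALRM handler when the time budget runs out."""


def _raise_timeout(signum, frame):
    raise TimeoutException()


def run_with_timeout(fn: Callable[[], Any], seconds: int) -> Tuple[str, Any]:
    """Run fn() under an alarm and return (category, value or error text)."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)  # deliver SIGALRM after `seconds`
    try:
        return "ok", fn()
    except TimeoutException:
        return "Time Limit Exceeded", None
    except Exception as exc:
        return "Runtime Error", repr(exc)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


print(run_with_timeout(lambda: 1 + 1, seconds=1))  # ('ok', 2)
```

Wrong-answer detection would sit on top of this: a result in the "ok" category is then compared against the expected output.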
Execution Environment
- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard library modules)
📄 License
This environment is part of the Atropos training framework. See the main repository for license information.