mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
240 lines
7.6 KiB
Markdown
# Code Execution Environment
A comprehensive environment for training language models to solve coding problems through **code generation and execution**. This environment evaluates models on their ability to generate correct Python code that passes test cases, using a Modal endpoint to validate the LLM-generated code.

## 🎯 Overview

The Code Execution Environment evaluates models on:

- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

**Key Philosophy**: This environment scores based on **code correctness** (1.0 for passing all tests, -1.0 for failing). Models must generate syntactically valid, executable Python code that produces the correct outputs for the given test cases.
## ✨ Key Features
### 🔧 **Code Execution & Testing**
- Real code execution via Modal endpoint (`lcb_modal_endpoint.py`)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)

### 📊 **Comprehensive Evaluation**
- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs. incorrect solutions)
- Overlong ratio tracking (solutions exceeding token limits)

### 📚 **Dataset Support**
- **RLVR_Coding_Problems**: Training dataset for reinforcement learning
- **DeepMind Code Contests**: Alternative training dataset
- **LCB Test**: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion

### ⚙️ **Safety & Reliability**
- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling
## 🚀 Quick Start
### Basic Configuration

```python
from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment
env = CodingEnv(config, server_configs)
```
### Problem Types

#### **Function-Based Problems**
Problems where the code defines a function that is called with test inputs:

```python
from typing import List

# Problem specification includes a function signature
def solve(nums: List[int]) -> int:
    # Model generates the function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
```
#### **Standard Input/Output Problems**
Problems where the code reads from stdin and writes to stdout:

```python
# Problem: Read integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)

# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
```
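To illustrate how such stdin/stdout tests can be validated, here is a minimal local harness. It is a sketch only: the environment actually executes code through the Modal endpoint, and `run_stdio_tests` is a hypothetical helper, not part of the codebase.

```python
import subprocess
import sys

def run_stdio_tests(code: str, tests: dict, timeout: float = 15.0) -> bool:
    """Run `code` once per stdin input and compare trimmed stdout."""
    for stdin_data, expected in zip(tests["input"], tests["output"]):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # Time Limit Exceeded
        if proc.returncode != 0:
            return False  # Runtime Error
        if proc.stdout.strip() != expected.strip():
            return False  # Wrong Answer
    return True

code = "a = int(input())\nb = int(input())\nprint(a + b)"
tests = {"fn_name": "none", "input": ["5\n3", "10\n20"], "output": ["8", "30"]}
```

Note that outputs are compared after stripping surrounding whitespace, mirroring the lenient comparison typical of competitive-programming judges.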
## 📊 Evaluation Metrics

### **Pass@1 (Estimated)**
Uses combinatorial estimation to compute the probability that a single sampled solution is correct:

```
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions
```

### **Pass@group_size**
Fraction of problems where at least one solution is correct:

```
pass@group_size = mean(num_correct > 0)
```
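Both estimators can be sketched in a few lines of plain Python. This is illustrative only, assuming per-problem `(n, c)` counts; the environment's own implementation may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def summarize(groups):
    """groups: list of (n, c) pairs, one per problem."""
    pass_1 = sum(pass_at_k(n, c, 1) for n, c in groups) / len(groups)
    pass_group = sum(c > 0 for _, c in groups) / len(groups)
    return pass_1, pass_group

# Example: 8 samples per problem, with 5, 0, and 8 correct solutions
p1, pg = summarize([(8, 5), (8, 0), (8, 8)])
```

With `k = 1` the estimator reduces to `c / n`, so `pass@1` is simply the mean correctness rate across sampled groups.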
### **Difficulty Breakdowns**
Separate metrics for easy, medium, and hard problems:

- `eval/easy_pass_1`
- `eval/medium_pass_1`
- `eval/hard_pass_1`

### **Completion Analysis**
- `eval/completion_length`: Average completion length
- `eval/correct_completion_length`: Average length of correct solutions
- `eval/incorrect_completion_length`: Average length of incorrect solutions
- `eval/overlong_ratio`: Fraction of solutions exceeding token limits

### **Training Metrics**
- `train_rewards/rewards`: Average reward (correctness rate)
- `train_rewards/pass@group_size`: Training pass rate
- `train_rewards/overlong_ratio`: Training overlong ratio
- `train/completion_lengths`: Training completion statistics

## 🔍 Code Extraction & Scoring

### **Code Extraction**
The environment extracts Python code from model responses using a regex:

- Looks for code blocks in markdown format: ` ```python ... ``` `
- Takes the last code block if multiple are present
- Returns `None` if no code block is found (scores -1.0)
### **Scoring Logic**
```python
if code is None:
    score = -1.0  # No code extracted
elif all_tests_pass(code, test_cases):
    score = 1.0   # All tests pass
else:
    score = -1.0  # Tests fail or error occurs
```

### **Test Execution**
Code is executed via the Modal endpoint with:

- Timeout protection (default 15 seconds per test case)
- Memory limits (5GB)
- Reliability guards (disables file system, network, process operations)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)

## 🛠️ Advanced Features

### **Offline Filtering**
Utility to identify problems where the model achieves perfect scores:

```python
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt
```

### **Blacklist System**
Problems in `perfect_indices.txt` are automatically blacklisted during training to focus on harder problems.
### **Data Logging**
Comprehensive logging to files:

- **Short logs**: `qwen_data_dump_{timestamp}.txt` - Basic stats per problem
- **Long logs**: `qwen_data_dump_long_{timestamp}.txt` - Includes code, errors, full outputs
- Separate directories: `train_logs/` and `eval_logs/`

### **Example Log Entry**
```json
{
  "cur_id": 42,
  "num_correct": 5,
  "total": 8,
  "scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
  "lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
  "errors": [...],
  "codes": [...],
  "gen": "assistant message content"
}
```

## 📝 Response Format Requirements

### **Expected Format**
Model responses should contain Python code in markdown code blocks:

````
Here's my solution:

```python
def solve(nums):
    return max(nums)
```
````
## 🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)
The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) and [rllm](https://github.com/rllm-org/rllm) repositories.

#### **Problem Type Support**
- **Call-Based**: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- **Standard I/O**: stdin/stdout problems (competitive programming style)

The code execution endpoint (`lcb_modal_endpoint.py`) implements extensive safety measures:

### **Reliability Guards**
- Disables file system operations (`os.remove`, `os.chdir`, etc.)
- Blocks process operations (`os.fork`, `subprocess.Popen`)
- Prevents network access
- Limits memory usage (5GB default)
- Sets recursion limit to prevent stack overflow
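A condensed sketch of a LiveCodeBench-style guard covering the bullets above. It is illustrative, not the endpoint's exact code, and omits the network and import restrictions:

```python
import builtins
import os
import resource  # Unix-only
import sys

def reliability_guard(max_memory_bytes=None):
    """Neutralize destructive operations before running untrusted code."""
    if max_memory_bytes is not None:
        # Hard cap on address space (the endpoint uses roughly 5GB)
        resource.setrlimit(resource.RLIMIT_AS,
                           (max_memory_bytes, max_memory_bytes))

    # Bound recursion so runaway recursion cannot crash the worker
    sys.setrecursionlimit(10_000)

    # Disable destructive filesystem and process operations
    for name in ("remove", "removedirs", "rmdir", "chdir",
                 "fork", "kill", "system"):
        if hasattr(os, name):
            setattr(os, name, None)

    # Prevent the submitted code from exiting the harness
    builtins.exit = None
    builtins.quit = None
```

Replacing the functions with `None` makes any call raise a `TypeError`, which the harness then reports as a runtime error.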
### **Error Handling**
- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong answer detection with detailed comparison
### **Execution Environment**
- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard library modules)

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.