# Code Execution Environment
A comprehensive environment for training language models to solve coding problems through **code generation and execution**. The environment evaluates models on their ability to generate correct Python code that passes test cases, executing and validating the generated code through a Modal endpoint.
## 🎯 Overview
The Code Execution Environment evaluates models on:
- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

**Key Philosophy**: This environment scores based on **code correctness** (1.0 for passing all tests, -1.0 for failing). Models must generate syntactically valid, executable Python code that produces the correct outputs for given test cases.
## ✨ Key Features
### 🔧 **Code Execution & Testing**
- Real code execution via Modal endpoint (`lcb_modal_endpoint.py`)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)
### 📊 **Comprehensive Evaluation**
- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs incorrect solutions)
- Overlong ratio tracking (solutions exceeding token limits)
### 📚 **Dataset Support**
- **RLVR_Coding_Problems**: Training dataset for reinforcement learning
- **DeepMind Code Contests**: Alternative training dataset
- **LCB Test**: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion
### ⚙️ **Safety & Reliability**
- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling
## 🚀 Quick Start
### Basic Configuration
```python
from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment
env = CodingEnv(config, server_configs)
```
### Problem Types
#### **Function-Based Problems**
Problems where code defines a function that is called with test inputs:
```python
# Problem specification includes a function signature
from typing import List

def solve(nums: List[int]) -> int:
    # Model generates the function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
```
#### **Standard Input/Output Problems**
Problems where code reads from stdin and writes to stdout:
```python
# Problem: Read integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)
# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
```
## 📊 Evaluation Metrics
### **Pass@1 (Estimated)**
Uses combinatorial estimation to calculate the probability of at least one correct solution in a single attempt:
```
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions
```
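For k = 1 the ratio C(n-c, 1) / C(n, 1) reduces to (n-c)/n, so the estimate is simply the empirical fraction c/n; the general unbiased pass@k estimator can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: at least one correct is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 8 samples per problem, 2 correct: pass@1 = 2/8
print(pass_at_k(8, 2, 1))  # 0.25
```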
### **Pass@group_size**
Fraction of problems where at least one solution is correct:
```
pass@group_size = mean(num_correct > 0)
```
### **Difficulty Breakdowns**
Separate metrics for easy, medium, and hard problems:
- `eval/easy_pass_1`
- `eval/medium_pass_1`
- `eval/hard_pass_1`
### **Completion Analysis**
- `eval/completion_length`: Average completion length
- `eval/correct_completion_length`: Average length of correct solutions
- `eval/incorrect_completion_length`: Average length of incorrect solutions
- `eval/overlong_ratio`: Fraction of solutions exceeding token limits
### **Training Metrics**
- `train_rewards/rewards`: Average reward (correctness rate)
- `train_rewards/pass@group_size`: Training pass rate
- `train_rewards/overlong_ratio`: Training overlong ratio
- `train/completion_lengths`: Training completion statistics
## 🔍 Code Extraction & Scoring
### **Code Extraction**
The environment extracts Python code from model responses using regex:
- Looks for code blocks in markdown format: ` ```python ... ``` `
- Takes the last code block if multiple are present
- Returns `None` if no code block is found (scores -1.0)
### **Scoring Logic**
```python
if code is None:
score = -1.0 # No code extracted
elif all_tests_pass(code, test_cases):
score = 1.0 # All tests pass
else:
score = -1.0 # Tests fail or error occurs
```
### **Test Execution**
Code is executed via Modal endpoint with:
- Timeout protection (default 15 seconds per test case)
- Memory limits (5GB)
- Reliability guards (disables file system, network, process operations)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)
## 🛠️ Advanced Features
### **Offline Filtering**
Utility to identify problems where the model achieves perfect scores:
```python
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt
```
### **Blacklist System**
Problems in `perfect_indices.txt` are automatically blacklisted during training to focus on harder problems.
### **Data Logging**
Comprehensive logging to files:
- **Short logs**: `qwen_data_dump_{timestamp}.txt` - Basic stats per problem
- **Long logs**: `qwen_data_dump_long_{timestamp}.txt` - Includes code, errors, full outputs
- Separate directories: `train_logs/` and `eval_logs/`
### **Example Log Entry**
```json
{
"cur_id": 42,
"num_correct": 5,
"total": 8,
"scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
"lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
"errors": [...],
"codes": [...],
"gen": "assistant message content"
}
```
## 📝 Response Format Requirements
### **Expected Format**
Model responses should contain Python code in markdown code blocks:
````
Here's my solution:
```python
def solve(nums):
return max(nums)
```
````
## 🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)
The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) and [rllm](https://github.com/rllm-org/rllm) repositories.
#### **Problem Type Support**
- **Call-Based**: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- **Standard I/O**: stdin/stdout problems (competitive programming style)
The code execution endpoint (`lcb_modal_endpoint.py`) implements extensive safety measures:
### **Reliability Guards**
- Disables file system operations (`os.remove`, `os.chdir`, etc.)
- Blocks process operations (`os.fork`, `subprocess.Popen`)
- Prevents network access
- Limits memory usage (5GB default)
- Sets recursion limit to prevent stack overflow
### **Error Handling**
- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong answer detection with detailed comparison
### **Execution Environment**
- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard library modules)
## 📄 License
This environment is part of the Atropos training framework. See the main repository for license information.