# Code Execution Environment

A comprehensive environment for training language models to solve coding problems through **code generation and execution**. This environment evaluates models on their ability to generate correct Python code that passes test cases, using a Modal endpoint to execute and validate the LLM-generated code.

## 🎯 Overview

The Code Execution Environment evaluates models on:
- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

**Key Philosophy**: This environment scores purely on **code correctness** (1.0 for passing all tests, -1.0 for failing). Models must generate syntactically valid, executable Python code that produces the correct outputs for the given test cases.

## ✨ Key Features

### 🔧 **Code Execution & Testing**
- Real code execution via a Modal endpoint (`lcb_modal_endpoint.py`)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive-programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)

### 📊 **Comprehensive Evaluation**
- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs. incorrect solutions)
- Overlong-ratio tracking (solutions exceeding token limits)

### 📚 **Dataset Support**
- **RLVR_Coding_Problems**: Training dataset for reinforcement learning
- **DeepMind Code Contests**: Alternative training dataset
- **LCB Test**: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion

### ⚙️ **Safety & Reliability**
- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling

## 🚀 Quick Start

### Basic Configuration

```python
from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment (server_configs supplies the inference server settings)
env = CodingEnv(config, server_configs)
```

### Problem Types

#### **Function-Based Problems**
Problems where the code defines a function that is called with test inputs:
```python
from typing import List

# Problem specification includes the function signature
def solve(nums: List[int]) -> int:
    # Model generates the function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
```

#### **Standard Input/Output Problems**
Problems where the code reads from stdin and writes to stdout:
```python
# Problem: read two integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)

# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
```

## 📊 Evaluation Metrics

### **Pass@1 (Estimated)**
Uses the combinatorial (unbiased) estimator for the probability that a single sampled solution is correct:
```
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size and c = the number of correct solutions
```

### **Pass@group_size**
Fraction of problems where at least one solution in the group is correct:
```
pass@group_size = mean(num_correct > 0)
```
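
A minimal sketch of both metrics, assuming per-problem counts of correct samples (the function and variable names here are illustrative, not the environment's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: probability that at least one of k
    samples drawn from n (with c correct) passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results with group_size n = 4
results = [{"n": 4, "c": 2}, {"n": 4, "c": 0}, {"n": 4, "c": 4}]

pass_1 = sum(pass_at_k(r["n"], r["c"], 1) for r in results) / len(results)
pass_group = sum(r["c"] > 0 for r in results) / len(results)
print(pass_1, pass_group)  # 0.5 0.666...
```

For k = 1 the estimator reduces to c/n per problem, matching the formula above.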

### **Difficulty Breakdowns**
Separate metrics are reported for easy, medium, and hard problems:
- `eval/easy_pass_1`
- `eval/medium_pass_1`
- `eval/hard_pass_1`

### **Completion Analysis**
- `eval/completion_length`: Average completion length
- `eval/correct_completion_length`: Average length of correct solutions
- `eval/incorrect_completion_length`: Average length of incorrect solutions
- `eval/overlong_ratio`: Fraction of solutions exceeding the token limit

### **Training Metrics**
- `train_rewards/rewards`: Average reward (correctness rate)
- `train_rewards/pass@group_size`: Training pass rate
- `train_rewards/overlong_ratio`: Training overlong ratio
- `train/completion_lengths`: Training completion statistics

## 🔍 Code Extraction & Scoring

### **Code Extraction**
The environment extracts Python code from model responses using a regex:
- Looks for code blocks in markdown format: ` ```python ... ``` `
- Takes the last code block if multiple are present
- Returns `None` if no code block is found (scored -1.0); see the sketch below
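
A minimal sketch of this extraction logic (the exact regex in the environment may differ):

```python
import re

def extract_code(response: str) -> str | None:
    """Return the last ```python ...``` block in the response, or None."""
    blocks = re.findall(r"```python\n(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else None

print(extract_code("Here's my solution:\n```python\nprint('hi')\n```"))
```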

### **Scoring Logic**
```python
# Pseudocode for the reward assignment
if code is None:
    score = -1.0  # No code extracted
elif all_tests_pass(code, test_cases):
    score = 1.0   # All tests pass
else:
    score = -1.0  # Tests fail or an error occurs
```

### **Test Execution**
Code is executed via the Modal endpoint with:
- Timeout protection (default 15 seconds per test case)
- Memory limits (5 GB)
- Reliability guards (file system, network, and process operations disabled)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error), illustrated in the sketch below
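
A minimal local sketch of running one stdin/stdout test case and categorizing the outcome (the real endpoint executes inside a Modal container; the function name here is illustrative):

```python
import subprocess
import sys

def run_stdio_test(code: str, stdin: str, expected: str, timeout: float = 15.0) -> str:
    """Run candidate code against one stdin test and categorize the result."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "Time Limit Exceeded"
    if proc.returncode != 0:
        return "Runtime Error"
    if proc.stdout.strip() != expected.strip():
        return "Wrong Answer"
    return "Accepted"

print(run_stdio_test("a=int(input());b=int(input());print(a+b)", "5\n3", "8"))
```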

## 🛠️ Advanced Features

### **Offline Filtering**
A utility to identify problems the model already solves perfectly:
```python
await env.offline_filter()
# Saves the indices of perfectly solved problems to perfect_indices.txt
```

### **Blacklist System**
Problems listed in `perfect_indices.txt` are automatically blacklisted during training so that training focuses on harder problems; a sketch of the filtering follows.
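
A minimal sketch of this kind of filtering, assuming one problem index per line in the file (illustrative, not the environment's exact code):

```python
from pathlib import Path

def load_blacklist(path: str = "perfect_indices.txt") -> set[int]:
    """Read blacklisted problem indices, one integer per line."""
    p = Path(path)
    if not p.exists():
        return set()
    return {int(line) for line in p.read_text().splitlines() if line.strip()}

problems = list(range(100))  # stand-in for the real problem list
blacklist = load_blacklist()
problems = [ex for i, ex in enumerate(problems) if i not in blacklist]
```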

### **Data Logging**
Comprehensive logging to files:
- **Short logs**: `qwen_data_dump_{timestamp}.txt` - basic stats per problem
- **Long logs**: `qwen_data_dump_long_{timestamp}.txt` - includes code, errors, and full outputs
- Separate directories: `train_logs/` and `eval_logs/`

### **Example Log Entry**
```json
{
  "cur_id": 42,
  "num_correct": 5,
  "total": 8,
  "scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
  "lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
  "errors": [...],
  "codes": [...],
  "gen": "assistant message content"
}
```

## 📝 Response Format Requirements

### **Expected Format**
Model responses should contain Python code in a markdown code block:
````
Here's my solution:

```python
def solve(nums):
    return max(nums)
```
````

## 🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)

The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) and [rllm](https://github.com/rllm-org/rllm) repositories.

### **Problem Type Support**
- **Call-Based**: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- **Standard I/O**: stdin/stdout problems (competitive-programming style)

The endpoint implements extensive safety measures:

### **Reliability Guards**
- Disables file system operations (`os.remove`, `os.chdir`, etc.)
- Blocks process operations (`os.fork`, `subprocess.Popen`)
- Prevents network access
- Limits memory usage (5 GB by default)
- Sets a recursion limit to prevent stack overflow (see the sketch below)
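
A minimal, Unix-only sketch of a reliability guard in this style (illustrative; the endpoint's actual guard is more thorough):

```python
import builtins
import os
import resource
import sys

def reliability_guard(max_memory_bytes: int = 5 * 1024**3) -> None:
    """Disable destructive operations before executing untrusted code."""
    # Cap the address space (memory) and set a sane recursion limit
    resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    sys.setrecursionlimit(10_000)

    # Stub out destructive or escape-prone functions
    os.remove = None
    os.rmdir = None
    os.chdir = None
    os.fork = None
    builtins.exit = None
    builtins.quit = None

    import subprocess
    subprocess.Popen = None  # also blocks subprocess.run/call, which use Popen
```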

### **Error Handling**
- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong-answer detection with detailed output comparison (see the sketch below)
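
A minimal sketch of the signal-based timeout and segfault handling (Unix-only; illustrative rather than the endpoint's exact code):

```python
import faulthandler
import signal

faulthandler.enable()  # dumps a traceback if the executed code segfaults

class TimeoutException(Exception):
    pass

def _alarm_handler(signum, frame):
    raise TimeoutException("Time Limit Exceeded")

signal.signal(signal.SIGALRM, _alarm_handler)
signal.alarm(15)  # raise after 15 seconds
try:
    exec("while True: pass")  # untrusted code would run here
except TimeoutException as e:
    print(e)
finally:
    signal.alarm(0)  # cancel any pending alarm
```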

### **Execution Environment**
- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard-library modules; see the sketch below)
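
A minimal sketch of whitelist-based import sandboxing (the module whitelist here is an illustrative assumption, not the endpoint's actual list):

```python
import builtins

ALLOWED = {"math", "collections", "itertools", "heapq", "re", "sys", "functools"}
_real_import = builtins.__import__

def _guarded_import(name, *args, **kwargs):
    # Permit only whitelisted top-level modules
    if name.split(".")[0] not in ALLOWED:
        raise ImportError(f"import of {name!r} is not allowed")
    return _real_import(name, *args, **kwargs)

builtins.__import__ = _guarded_import
```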

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.