# Code Execution Environment
A comprehensive environment for training language models to solve coding problems through **code generation and execution**. The environment evaluates models on their ability to generate correct Python code that passes test cases, executing and validating the generated code through a Modal endpoint.
## 🎯 Overview
The Code Execution Environment evaluates models on:
- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

**Key Philosophy**: This environment scores based on **code correctness** (1.0 for passing all tests, -1.0 for failing). Models must generate syntactically valid, executable Python code that produces the correct outputs for given test cases.
## ✨ Key Features
### 🔧 **Code Execution & Testing**
- Real code execution via Modal endpoint (`lcb_modal_endpoint.py`)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)
### 📊 **Comprehensive Evaluation**
- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs incorrect solutions)
- Overlong ratio tracking (solutions exceeding token limits)
### 📚 **Dataset Support**
- **RLVR_Coding_Problems**: Training dataset for reinforcement learning
- **DeepMind Code Contests**: Alternative training dataset
- **LCB Test**: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion
### ⚙️ **Safety & Reliability**
- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling
## 🚀 Quick Start
### Basic Configuration
```python
from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment
env = CodingEnv(config, server_configs)
```
### Problem Types
#### **Function-Based Problems**
Problems where code defines a function that is called with test inputs:
```python
# Problem specification includes a function signature
from typing import List

def solve(nums: List[int]) -> int:
    # Model generates the function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5]
}
```
#### **Standard Input/Output Problems**
Problems where code reads from stdin and writes to stdout:
```python
# Problem: Read integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)
# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"]
}
```
## 📊 Evaluation Metrics
### **Pass@1 (Estimated)**
Uses combinatorial estimation to calculate the probability of at least one correct solution in a single attempt:
```
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions
```
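For k = 1 the ratio C(n-c, 1) / C(n, 1) reduces to (n-c)/n, so the estimate is simply the empirical fraction c/n; the general unbiased pass@k estimator can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: at least one correct is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 8 samples per problem, 2 correct: pass@1 = 2/8
print(pass_at_k(8, 2, 1))  # 0.25
```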
### **Pass@group_size**
Fraction of problems where at least one solution is correct:
```
pass@group_size = mean(num_correct > 0)
```
### **Difficulty Breakdowns**
Separate metrics for easy, medium, and hard problems:
- `eval/easy_pass_1`
- `eval/medium_pass_1`
- `eval/hard_pass_1`
### **Completion Analysis**
- `eval/completion_length`: Average completion length
- `eval/correct_completion_length`: Average length of correct solutions
- `eval/incorrect_completion_length`: Average length of incorrect solutions
- `eval/overlong_ratio`: Fraction of solutions exceeding token limits
### **Training Metrics**
- `train_rewards/rewards`: Average reward (correctness rate)
- `train_rewards/pass@group_size`: Training pass rate
- `train_rewards/overlong_ratio`: Training overlong ratio
- `train/completion_lengths`: Training completion statistics
## 🔍 Code Extraction & Scoring
### **Code Extraction**
The environment extracts Python code from model responses using regex:
- Looks for code blocks in markdown format: ` ```python ... ``` `
- Takes the last code block if multiple are present
- Returns `None` if no code block is found (scores -1.0)
### **Scoring Logic**
```python
if code is None:
score = -1.0 # No code extracted
elif all_tests_pass(code, test_cases):
score = 1.0 # All tests pass
else:
score = -1.0 # Tests fail or error occurs
```
### **Test Execution**
Code is executed via Modal endpoint with:
- Timeout protection (default 15 seconds per test case)
- Memory limits (5GB)
- Reliability guards (disables file system, network, process operations)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)
## 🛠️ Advanced Features
### **Offline Filtering**
Utility to identify problems where the model achieves perfect scores:
```python
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt
```
### **Blacklist System**
Problems in `perfect_indices.txt` are automatically blacklisted during training to focus on harder problems.
### **Data Logging**
Comprehensive logging to files:
- **Short logs**: `qwen_data_dump_{timestamp}.txt` - Basic stats per problem
- **Long logs**: `qwen_data_dump_long_{timestamp}.txt` - Includes code, errors, full outputs
- Separate directories: `train_logs/` and `eval_logs/`
### **Example Log Entry**
```json
{
"cur_id": 42,
"num_correct": 5,
"total": 8,
"scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
"lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
"errors": [...],
"codes": [...],
"gen": "assistant message content"
}
```
## 📝 Response Format Requirements
### **Expected Format**
Model responses should contain Python code in markdown code blocks:
````
Here's my solution:
```python
def solve(nums):
return max(nums)
```
````
## 🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)
The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) and [rllm](https://github.com/rllm-org/rllm) repositories.
#### **Problem Type Support**
- **Call-Based**: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- **Standard I/O**: stdin/stdout problems (competitive programming style)
The code execution endpoint (`lcb_modal_endpoint.py`) implements extensive safety measures:
### **Reliability Guards**
- Disables file system operations (`os.remove`, `os.chdir`, etc.)
- Blocks process operations (`os.fork`, `subprocess.Popen`)
- Prevents network access
- Limits memory usage (5GB default)
- Sets recursion limit to prevent stack overflow
### **Error Handling**
- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong answer detection with detailed comparison
### **Execution Environment**
- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard library modules)
## 📄 License
This environment is part of the Atropos training framework. See the main repository for license information.