# Code Execution Environment

A comprehensive environment for training language models to solve coding problems through **code generation and execution**. The environment evaluates models on their ability to generate correct Python code that passes test cases, executing the generated code against a Modal endpoint for validation.

## 🎯 Overview

The Code Execution Environment evaluates models on:

- Generating correct Python code solutions for coding problems
- Passing test cases through actual code execution
- Handling both function-based and stdin/stdout problem types
- Supporting multiple datasets (RLVR, DeepMind Code Contests, LCB Test)
- Providing comprehensive evaluation metrics (pass@1, pass@k, difficulty breakdowns)

**Key Philosophy**: This environment scores based on **code correctness** (1.0 for passing all tests, -1.0 for failing). Models must generate syntactically valid, executable Python code that produces the correct outputs for the given test cases.

## ✨ Key Features

### 🔧 **Code Execution & Testing**

- Real code execution via a Modal endpoint (`lcb_modal_endpoint.py`)
- Support for function-based problems (LeetCode-style)
- Support for stdin/stdout problems (competitive programming style)
- Automatic test case validation
- Error handling (timeouts, runtime errors, wrong answers)

### 📊 **Comprehensive Evaluation**

- Pass@1 metric calculation using combinatorial estimation
- Pass@group_size evaluation
- Difficulty-level breakdowns (easy/medium/hard)
- Completion length tracking (correct vs. incorrect solutions)
- Overlong ratio tracking (solutions exceeding token limits)

### 📚 **Dataset Support**

- **RLVR_Coding_Problems**: Training dataset for reinforcement learning
- **DeepMind Code Contests**: Alternative training dataset
- **LCB Test**: Evaluation benchmark from LiveCodeBench
- Automatic dataset format handling and conversion

### ⚙️ **Safety & Reliability**

- Sandboxed code execution via Modal
- Timeout protection
- Resource limits (memory, CPU)
- Reliability guards preventing destructive operations
- Segmentation fault handling

## 🚀 Quick Start

### Basic Configuration

```python
from atropos.environments.code_execution_server import CodingEnv, CodeConfig

# Configuration
config = CodeConfig(
    dataset_name="normal",  # or "deepmind"
    temperature=1.0,
    eval_temperature=0.6,
    top_p=1.0,
    eval_top_p=0.95,
    start_idx=0,
    max_eval_token_length=40960,
)

# Initialize environment
env = CodingEnv(config, server_configs)
```

### Problem Types

#### **Function-Based Problems**

Problems where the code defines a function that is called with test inputs:

```python
from typing import List

# Problem specification includes the function signature
def solve(nums: List[int]) -> int:
    # Model generates the function body
    return max(nums)

# Tests call the function directly
tests = {
    "fn_name": "solve",
    "input": [[1, 2, 3], [5, 4, 3]],
    "output": [3, 5],
}
```

#### **Standard Input/Output Problems**

Problems where the code reads from stdin and writes to stdout:

```python
# Problem: read two integers, output their sum
# Model generates:
a = int(input())
b = int(input())
print(a + b)

# Tests provide stdin inputs and expected stdout outputs
tests = {
    "fn_name": "none",
    "input": ["5\n3", "10\n20"],
    "output": ["8", "30"],
}
```

## 📊 Evaluation Metrics

### **Pass@1 (Estimated)**

Uses combinatorial estimation to calculate the probability of at least one correct solution in a single attempt:

```
pass@1 = mean(1 - C(n-c, 1) / C(n, 1)) across all problems
where n = group_size, c = number of correct solutions
```
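Since `C(n-c, 1) / C(n, 1)` simplifies to `(n-c)/n`, pass@1 reduces to the average fraction of correct samples per problem; the same combinatorial form gives the standard unbiased pass@k estimator for any `k`. A minimal sketch of the per-problem computation (the helper name and example counts are illustrative, not the environment's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn from the n generated ones is correct, given c correct."""
    if n - c < k:  # fewer than k incorrect samples: every draw contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem estimates averaged over the eval set (n = group_size)
counts = [(8, 5), (8, 0), (8, 8)]  # illustrative (n, c) pairs
pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in counts) / len(counts)
print(pass_at_1)  # ≈ 0.542
```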
### **Pass@group_size**

Fraction of problems where at least one solution is correct:

```
pass@group_size = mean(num_correct > 0)
```

### **Difficulty Breakdowns**

Separate metrics for easy, medium, and hard problems:

- `eval/easy_pass_1`
- `eval/medium_pass_1`
- `eval/hard_pass_1`

### **Completion Analysis**

- `eval/completion_length`: Average completion length
- `eval/correct_completion_length`: Average length of correct solutions
- `eval/incorrect_completion_length`: Average length of incorrect solutions
- `eval/overlong_ratio`: Fraction of solutions exceeding token limits

### **Training Metrics**

- `train_rewards/rewards`: Average reward (correctness rate)
- `train_rewards/pass@group_size`: Training pass rate
- `train_rewards/overlong_ratio`: Training overlong ratio
- `train/completion_lengths`: Training completion statistics

## 🔍 Code Extraction & Scoring

### **Code Extraction**

The environment extracts Python code from model responses using a regex:

- Looks for code blocks in markdown format: ` ```python ... ``` `
- Takes the last code block if multiple are present
- Returns `None` if no code block is found (scores -1.0)

### **Scoring Logic**

```python
if code is None:
    score = -1.0  # No code extracted
elif all_tests_pass(code, test_cases):
    score = 1.0   # All tests pass
else:
    score = -1.0  # Tests fail or an error occurs
```

### **Test Execution**

Code is executed via the Modal endpoint with:

- Timeout protection (default 15 seconds per test case)
- Memory limits (5 GB)
- Reliability guards (disable file system, network, and process operations)
- Error categorization (Wrong Answer, Time Limit Exceeded, Runtime Error)

## 🛠️ Advanced Features

### **Offline Filtering**

Utility to identify problems where the model achieves perfect scores:

```python
await env.offline_filter()
# Saves perfect problem indices to perfect_indices.txt
```

### **Blacklist System**

Problems listed in `perfect_indices.txt` are automatically blacklisted during training to focus on harder problems.

### **Data Logging**

Comprehensive logging to files:

- **Short logs**: `qwen_data_dump_{timestamp}.txt` - Basic stats per problem
- **Long logs**: `qwen_data_dump_long_{timestamp}.txt` - Includes code, errors, and full outputs
- Separate directories: `train_logs/` and `eval_logs/`

### **Example Log Entry**

```json
{
  "cur_id": 42,
  "num_correct": 5,
  "total": 8,
  "scores": [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
  "lengths": [1200, 1180, 1220, 1190, 1210, 800, 750, 820],
  "errors": [...],
  "codes": [...],
  "gen": "assistant message content"
}
```

## 📝 Response Format Requirements

### **Expected Format**

Model responses should contain Python code in a markdown code block:

````
Here's my solution:

```python
def solve(nums):
    return max(nums)
```
````

## 🔐 Code Execution Endpoint (`lcb_modal_endpoint.py`)

The Modal endpoint handles safe execution of generated Python code with comprehensive sandboxing and test validation. Most of the code is adapted from the [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) and [rllm](https://github.com/rllm-org/rllm) repositories.

### **Problem Type Support**

- **Call-Based**: Function calls with JSON-serialized inputs/outputs (LeetCode-style)
- **Standard I/O**: stdin/stdout problems (competitive programming style)

The endpoint implements extensive safety measures:

### **Reliability Guards**

- Disables file system operations (`os.remove`, `os.chdir`, etc.)
- Blocks process operations (`os.fork`, `subprocess.Popen`)
- Prevents network access
- Limits memory usage (5 GB default)
- Sets a recursion limit to prevent stack overflow
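The guard pattern works by overwriting dangerous primitives inside the sandboxed worker process before the generated code runs. A minimal sketch of the idea, modeled on LiveCodeBench-style guards (the exact set of disabled functions here is illustrative; the real endpoint disables considerably more, and network blocking is handled separately):

```python
import builtins
import os
import resource  # Unix-only; Modal containers run Linux
import subprocess
import sys

def reliability_guard(max_memory_bytes: int = 5 * 1024**3) -> None:
    """Neuter destructive primitives in the current process.

    Must run inside the isolated worker, since it permanently
    breaks the interpreter it runs in.
    """
    # Cap the address space (~5 GB) and bound recursion depth.
    resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    sys.setrecursionlimit(10_000)

    # Disable file system and process operations by overwriting them;
    # calling any of these afterwards raises TypeError.
    for name in ("remove", "removedirs", "rmdir", "rename",
                 "chdir", "fork", "kill", "system"):
        setattr(os, name, None)
    subprocess.Popen = None  # block spawning child processes

    # Block easy interpreter escapes.
    builtins.exit = None
    builtins.quit = None
```

Any attempt by generated code to call one of the overwritten functions would then surface as a Runtime Error in the error categorization above, scoring -1.0 rather than executing a destructive operation.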
### **Error Handling**

- Timeout exceptions (SIGALRM)
- Segmentation fault detection (faulthandler)
- Runtime error capture
- Wrong answer detection with detailed comparison

### **Execution Environment**

- Isolated Modal containers
- Signal-based timeouts
- Resource limits (CPU, memory)
- Sandboxed imports (whitelisted standard library modules)

## 📄 License

This environment is part of the Atropos training framework. See the main repository for license information.