## Overview
The `LeanEnv` class implements an environment designed for training and evaluating language models on Lean theorem-proving tasks. Its primary motivation is to facilitate research in automated reasoning and to develop LLMs capable of formal mathematical deduction. The environment orchestrates the interaction between an LLM, which generates proof attempts for given theorem statements, and a Lean code checker (currently mocked; intended to be PyPantograph), which validates these proofs. This interaction provides the feedback signal (rewards) needed for reinforcement learning. The environment uses the `brando/minif2f-lean4` dataset from Hugging Face, providing a standardized benchmark for Lean theorem proving.
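The generate → check → reward loop described above can be sketched in a few lines. Everything below is illustrative: `mock_lean_check`, `score_attempt`, and the example statement are assumptions for exposition, not the environment's actual API.

```python
# Minimal sketch of the generate -> check -> reward loop (all names hypothetical).

def mock_lean_check(proof: str) -> tuple[bool, str]:
    """Stand-in for the Lean checker (mocked in the real environment too):
    accept candidate proofs that are non-empty and no longer contain `sorry`."""
    if not proof.strip() or "sorry" in proof:
        return False, "proof incomplete: contains sorry"
    return True, "compiled successfully"

def score_attempt(statement: str, generated_proof: str) -> float:
    """Reward scheme from this README: 1.0 on compilation, -1.0 on failure."""
    candidate = statement.replace("sorry", generated_proof)
    ok, _msg = mock_lean_check(candidate)
    return 1.0 if ok else -1.0

statement = "theorem add_comm' (a b : Nat) : a + b = b + a := by sorry"
print(score_attempt(statement, "omega"))  # 1.0
print(score_attempt(statement, "sorry"))  # -1.0
```

The real environment substitutes the LLM's completion for `sorry` in the dataset statement and asks the checker to validate the result; the +1/-1 reward values are the ones documented under "Custom Metrics" below.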
## Quickstart / Usage
The `lean_env.py` script is designed to be run directly from the command line, leveraging the `BaseEnv.cli()` method for easy execution.
### 1. Serve Mode (Connecting to Atropos Trainer)
To run the environment as a server that the Atropos trainer can connect to for distributed training or inference:
```bash
python environments/lean_env.py serve
```

This command starts the environment and makes it available to the Atropos training infrastructure. Ensure your LLM API server (specified in `APIServerConfig`) is running and accessible.
### 2. Process Mode (Local Testing and Data Generation)
To run the environment locally for testing, debugging, data generation, or observing behavior with WandB (if enabled):
```bash
python environments/lean_env.py process
```
This mode is useful for:
- Verifying that the dataset loads correctly.
- Testing the proof generation and scoring logic with a small number of examples.
- Generating a sample WandB run to inspect logged metrics and tables (ensure `use_wandb=True` in `BaseEnvConfig`).
- Running a full evaluation pass locally if configured.
Additional arguments or configurations may be available via the CLI; refer to the `BaseEnv.cli()` implementation for details.
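As a concrete illustration of the WandB toggle mentioned above, here is a configuration sketch for a short local `process` run. Only `use_wandb` is named in this README; every other key is an assumed, typical field and not the actual `BaseEnvConfig` schema.

```python
# Illustrative-only configuration values for a local `process` run.
# `use_wandb` appears in this README; the other keys are assumptions
# about BaseEnvConfig-style fields, not the real class definition.
process_config = {
    "use_wandb": True,                    # log metrics and proof tables to Weights & Biases
    "wandb_name": "lean-env-local-test",  # assumed: run name shown in the WandB UI
    "total_steps": 10,                    # assumed: keep local smoke tests short
}

def summarize(cfg: dict) -> str:
    """Render a one-line, sorted summary of the run configuration for logs."""
    return ", ".join(f"{k}={v}" for k, v in sorted(cfg.items()))

print(summarize(process_config))
```

Check the actual config classes in `lean_env.py` for the real field names before relying on any of the assumed keys.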
## WandB Logging and Metrics
If `use_wandb` is enabled in the configuration, the environment logs detailed metrics and proof attempts to Weights & Biases, providing insight into the training and evaluation process.
### Example WandB Run
You can find an example of a public WandB run generated by executing the `process` command here:
[Link to your public WandB run] (Replace this with an actual link after running the environment).
### Custom Metrics
In addition to the standard metrics logged by `BaseEnv`, `LeanEnv` adds the following custom metrics:
- `train/batch_avg_percent_compiled` *(logged during training steps)*: the moving-average percentage of successfully compiled Lean proofs within recent training batches. It signals how well the model is learning to generate valid proofs during training.
- `eval/percent_compiled` *(logged during evaluation steps)*: the percentage of Lean problems in the test set for which the LLM generated a successfully compiled proof. This is a key indicator of the model's generalization to unseen problems.
- `train/lean_proof_attempts` *(a `wandb.Table`, logged during training steps if `num_rollouts_to_keep > 0`)*: a detailed log of individual proof attempts. Each row corresponds to one attempt and includes:
  - **Problem Statement**: the original Lean theorem statement (with `sorry`).
  - **Generated Proof**: the proof block generated by the LLM.
  - **Score**: the reward assigned to the attempt (1.0 for successful compilation, -1.0 for failure).
  - **Compilation Status**: a message indicating whether the proof compiled successfully, or the error encountered.
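Both percentage metrics reduce to simple counts over the per-attempt scores. A sketch, using the 1.0/-1.0 reward values documented above (the function names and the moving-average window size are illustrative assumptions):

```python
from collections import deque

def percent_compiled(scores: list[float]) -> float:
    """Fraction of attempts that compiled (score == 1.0), as a percentage."""
    if not scores:
        return 0.0
    return 100.0 * sum(1 for s in scores if s == 1.0) / len(scores)

# eval/percent_compiled over one evaluation pass
eval_scores = [1.0, -1.0, 1.0, 1.0]
print(percent_compiled(eval_scores))  # 75.0

# train/batch_avg_percent_compiled as a moving average over recent
# batches (window size 8 is an assumption, not the env's actual value)
recent = deque(maxlen=8)
for batch_scores in ([1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]):
    recent.append(percent_compiled(batch_scores))
print(sum(recent) / len(recent))  # 50.0
```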
These metrics are crucial for understanding the model's progress in learning to generate correct and efficient Lean proofs. The detailed table of attempts allows for qualitative analysis of the model's outputs.