atropos/environments/community/lean_proof_env

Overview

The LeanEnv class implements an environment designed for training and evaluating language models on Lean theorem proving tasks. Its primary motivation is to facilitate research in automated reasoning and to develop LLMs capable of formal mathematical deduction. The environment orchestrates the interaction between an LLM, which generates proof attempts for given theorem statements, and a Lean code checker (currently a mock standing in for PyPantograph), which validates these proofs. This interaction provides the feedback signal (rewards) needed for reinforcement learning. The environment uses the brando/minif2f-lean4 dataset from Hugging Face, providing a standardized benchmark for Lean theorem proving.
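The scoring interaction described above can be sketched roughly as follows. This is an illustrative mock, not the environment's actual API: `mock_compile` and `score_attempt` are hypothetical names standing in for the (currently mocked) PyPantograph checker and the reward assignment.

```python
# Hypothetical sketch of the reward step: substitute the model's proof for
# the trailing `sorry`, run the (mocked) checker, and award 1.0 on success
# or -1.0 on failure. All names here are illustrative.

def mock_compile(lean_source: str) -> tuple[bool, str]:
    """Pretend checker: reject any source that still contains `sorry`."""
    if "sorry" in lean_source:
        return False, "error: declaration uses 'sorry'"
    return True, "compiled successfully"

def score_attempt(statement: str, generated_proof: str) -> tuple[float, str]:
    # Replace the `sorry` placeholder with the generated proof block.
    candidate = statement.replace("sorry", generated_proof.strip(), 1)
    ok, message = mock_compile(candidate)
    return (1.0 if ok else -1.0), message

score, msg = score_attempt(
    "theorem add_zero (n : Nat) : n + 0 = n := by sorry",
    "simp",
)
print(score)  # 1.0 for a compiled proof, -1.0 on failure
```

A real checker would of course invoke Lean itself; the binary 1.0 / -1.0 reward shape matches the scores logged to the WandB table described below.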

Quickstart / Usage

The lean_env.py script is designed to be run directly from the command line, leveraging the BaseEnv.cli() method for easy execution.

1. Serve Mode (Connecting to Atropos Trainer)

To run the environment as a server that the Atropos trainer can connect to for distributed training or inference:

python environments/lean_env.py serve

This command will start the environment, making it available for the Atropos training infrastructure. Ensure your LLM API server (specified in APIServerConfig) is running and accessible.

2. Process Mode (Local Testing and Data Generation)

To run the environment locally for testing, debugging, data generation, or observing behavior with WandB (if enabled):

python environments/lean_env.py process

This mode is useful for:

  • Verifying that the dataset loads correctly.
  • Testing the proof generation and scoring logic with a small number of examples.
  • Generating a sample WandB run to inspect logged metrics and tables (ensure use_wandb=True in BaseEnvConfig).
  • Running a full evaluation pass locally if configured.

Additional arguments or configurations might be available via the CLI; refer to the BaseEnv.cli() implementation for more details if needed.

WandB Logging and Metrics

If use_wandb is enabled in the configuration, the environment logs detailed metrics and proof attempts to Weights & Biases, providing insights into the training and evaluation process.

Example WandB Run

You can find an example of a public WandB run generated by executing the process command here: [Link to your public WandB run] (Replace this with an actual link after running the environment).

Custom Metrics

In addition to the standard metrics logged by the BaseEnv, the LeanEnv adds the following custom metrics:

  • train/batch_avg_percent_compiled: (Logged during training steps) This metric shows the moving average percentage of successfully compiled Lean proofs within recent training batches. It provides a signal of how well the model is learning to generate valid proofs during the training process.
  • eval/percent_compiled: (Logged during evaluation steps) This metric reports the percentage of Lean problems from the test set for which the LLM generated a successfully compiled proof. It is a key indicator of the model's generalization performance on unseen problems.
  • train/lean_proof_attempts (wandb.Table): (Logged during training steps, if num_rollouts_to_keep > 0) This is a wandb.Table that provides a detailed log of individual proof attempts. Each row in the table corresponds to one attempt and includes:
    • Problem Statement: The original Lean theorem statement (with sorry).
    • Generated Proof: The proof block generated by the LLM.
    • Score: The reward assigned to the proof attempt (1.0 for compilation, -1.0 for failure).
    • Compilation Status: A message indicating whether the proof compiled successfully or the error encountered.
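The two compile-rate metrics above could be computed along these lines. This is a hedged sketch under assumed names; the window size and class names are illustrative, not fields of the actual environment.

```python
from collections import deque

# Illustrative computation of the compile-rate metrics described above,
# given per-attempt scores of +1.0 (compiled) / -1.0 (failed).
# Names and the window size are assumptions, not LeanEnv's real fields.

def percent_compiled(scores: list[float]) -> float:
    """Fraction of attempts that compiled (eval/percent_compiled)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > 0) / len(scores)

class MovingCompileRate:
    """Moving average of per-batch compile rates over recent batches
    (train/batch_avg_percent_compiled)."""

    def __init__(self, window: int = 10):
        self.batches = deque(maxlen=window)

    def update(self, batch_scores: list[float]) -> float:
        self.batches.append(percent_compiled(batch_scores))
        return sum(self.batches) / len(self.batches)

tracker = MovingCompileRate(window=2)
tracker.update([1.0, -1.0])        # batch rate 0.5
print(tracker.update([1.0, 1.0]))  # average of 0.5 and 1.0 -> 0.75
```

A bounded `deque` keeps the training metric responsive to recent batches instead of averaging over the whole run.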

These metrics are crucial for understanding the model's progress in learning to generate correct Lean proofs. The detailed table of attempts allows for qualitative analysis of the model's outputs.