mirror of https://github.com/NousResearch/atropos.git synced 2026-04-19 12:57:58 +00:00

Last commit: b6e5c81662 "Update README.md" by Ksenchi, 2025-11-12 07:35:07 +01:00

Rubik's Cube Environment for LLM Training

Click the image above to watch a 1-minute demonstration video

Environment Design & Motivation (150 words)

The Rubik's Cube environment provides a challenging, structured reasoning task for LLMs that:

Tests multi-step planning: Requires understanding cube mechanics and developing solving strategies
Exercises spatial reasoning: LLMs must mentally track 3D spatial relationships
Supports curriculum learning: Configurable difficulty based on scramble complexity
Provides granular rewards: Token-level feedback enhances learning signal
Enables interpretable measurements: Clear metrics to track progress (solve rate, move efficiency)

The environment is compelling because it is measurable, domain-specific, and demands structured reasoning, three qualities that we expect to accelerate LLM learning. It is designed around the principle that LLMs learn best when they can both "think aloud" and receive immediate feedback on their reasoning process.

Quickstart (100 words)

cd atropos/environments/hack0

pip install -r requirements.txt

(OPENAI_API_KEY="OPENAI_KEY" \
      python rubiks_cube_environment.py process \
      --slurm false \
      --openai.model_name gpt-4.1-nano \
      --env.tokenizer_name "NousResearch/DeepHermes-3-Llama-3-3B-Preview" \
      --env.use_wandb true \
      --env.group_size 4 \
      --env.max_steps 15 \
      --env.scramble_moves 5 \
      --env.data_path_to_save_groups "rubiks_process_results.jsonl" \
      --env.wandb_name "rubiks_cube_hackathon" \
      --env.debug_mode true \
      --env.use_curriculum true \
      --env.generate_visualizations true \
      --env.visualizations_dir "./rubiks_visualizations" \
      --env.provide_solving_strategies true)
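
Once the run finishes, the file passed to --env.data_path_to_save_groups can be inspected without knowing its schema in advance. The sketch below is a minimal helper, not part of the environment: it assumes only that the file is JSON Lines (one JSON value per line), and the function name summarize_jsonl is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file.

    Assumption: each non-empty line is one JSON value. The record schema
    is not documented here, so this only reports which keys appear.
    """
    count = 0
    keys = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys


# Example call, using the path from the command above:
# count, keys = summarize_jsonl("rubiks_process_results.jsonl")
```

Listing the keys first is a cheap way to find where scores and messages are stored before writing any analysis code against them.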

Performance Metrics & Training (150 words)

View WandB Run Results

Our environment tracks several key metrics:

Solve Rate: Percentage of cubes successfully solved
Move Efficiency: Ratio of moves used to the length of the optimal solution
Curriculum Progress: Rate of advancement through difficulty levels
Token Efficiency: Quality of generated tokens, as measured by token-level rewards
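
To make the first two metrics concrete, here is a minimal sketch of how they could be computed from per-episode records. The Episode fields and function names are hypothetical and not taken from the environment code. The sketch also assumes move efficiency is reported as optimal length divided by moves used, so that 1.0 means an optimal solve, and that it is averaged over solved episodes only.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative."""
    solved: bool
    moves_used: int
    optimal_moves: int


def solve_rate(episodes):
    """Fraction of episodes that ended with a solved cube."""
    if not episodes:
        return 0.0
    return sum(e.solved for e in episodes) / len(episodes)


def move_efficiency(episodes):
    """Mean of optimal_moves / moves_used over solved episodes.

    Assumption: 1.0 means the solve was optimal; lower values mean
    wasted moves. Unsolved episodes are excluded.
    """
    solved = [e for e in episodes if e.solved and e.moves_used > 0]
    if not solved:
        return 0.0
    return sum(e.optimal_moves / e.moves_used for e in solved) / len(solved)
```

For example, three solves with (optimal, used) move counts of (2, 4), (3, 3), and (5, 10) plus one failed attempt give a solve rate of 0.75 and a move efficiency of 2/3.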

Training shows consistent improvement across difficulty levels, with the model achieving:

97% solve rate on Level 1 (1-3 moves)
85% solve rate on Level 2 (4-7 moves)
72% solve rate on Level 3 (8-12 moves)
53% solve rate on Level 4 (13-17 moves)
31% solve rate on Level 5 (18-22 moves)
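
The level boundaries above map directly to a lookup table. The sketch below shows one way a curriculum could sample scramble lengths and advance between levels. Only the move ranges come from the list above. The promotion rule (advance once the solve rate over a fixed window reaches a threshold) and the names Curriculum, window, and promote_at are assumptions for illustration, not the logic in rubiks_cube_curriculum.py.

```python
import random
from collections import deque

# Scramble-length ranges per level, taken from the results list above.
LEVELS = {1: (1, 3), 2: (4, 7), 3: (8, 12), 4: (13, 17), 5: (18, 22)}


def level_for_scramble(n_moves):
    """Return the difficulty level whose range contains n_moves."""
    for level, (lo, hi) in LEVELS.items():
        if lo <= n_moves <= hi:
            return level
    raise ValueError(f"no level covers {n_moves} scramble moves")


class Curriculum:
    """Hypothetical promotion rule: move up one level once the solve
    rate over the last `window` episodes reaches `promote_at`."""

    def __init__(self, window=20, promote_at=0.8, seed=None):
        self.level = 1
        self.window = window
        self.promote_at = promote_at
        self.results = deque(maxlen=window)
        self.rng = random.Random(seed)

    def sample_scramble_length(self):
        lo, hi = LEVELS[self.level]
        return self.rng.randint(lo, hi)  # inclusive on both ends

    def record(self, solved):
        self.results.append(bool(solved))
        window_full = len(self.results) == self.window
        if (window_full
                and sum(self.results) / self.window >= self.promote_at
                and self.level < max(LEVELS)):
            self.level += 1
            self.results.clear()  # start a fresh window at the new level
```

Clearing the window on promotion means the model has to earn the next promotion on the harder level alone, rather than carrying over easy solves.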

The token-level reward system has been particularly effective: it reduced the number of training iterations by approximately 34% compared to episode-only rewards.
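
The difference between the two reward schemes can be sketched as follows. This is an illustration under stated assumptions, not the code in rubiks_token_rewards.py: it assumes each move in the response occupies a known half-open token span and has its own scalar reward, and the function names are hypothetical.

```python
def episode_only_rewards(num_tokens, episode_reward):
    """Every token receives the same final score."""
    return [episode_reward] * num_tokens


def token_level_rewards(num_tokens, move_spans, move_rewards, default=0.0):
    """Assign each move's reward to the tokens that produced it.

    move_spans: list of (start, end) half-open token index ranges.
    move_rewards: one scalar per span.
    Tokens outside any span keep `default`.
    """
    if len(move_spans) != len(move_rewards):
        raise ValueError("need exactly one reward per span")
    rewards = [default] * num_tokens
    for (start, end), reward in zip(move_spans, move_rewards):
        if not 0 <= start <= end <= num_tokens:
            raise ValueError(f"span {(start, end)} is out of range")
        for i in range(start, end):
            rewards[i] = reward
    return rewards
```

With six tokens, spans (1, 3) and (4, 6), and move rewards 0.5 and -0.2, the token-level vector is [0.0, 0.5, 0.5, 0.0, -0.2, -0.2], whereas the episode-only vector repeats a single value six times and gives no hint about which move helped or hurt.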

Advanced Features (100 words)

Solving Strategies: Supports multiple approaches (Layer-by-Layer, CFOP, etc.)
Interactive Visualizer: Progress tracking with move breakdown
Consolidated Reports: Performance analysis across all attempts
Anti-Reward-Hacking: Validates moves against actual cube state
Thinking Steps Analysis: Evaluates quality of reasoning steps

Reward Design

Our reward function combines:

Progress toward solution (correctly positioned cubies)
Recognition of patterns (cross formation, completed layers)
Move efficiency compared to optimal solve
Quality of reasoning in "thinking aloud" steps

This multi-faceted approach is designed to discourage reward hacking: because several independent signals are combined, the model cannot achieve a high score without genuinely improving at the task.
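
One simple way to combine the four components is a weighted average, sketched below. The weights, the RewardParts fields, and the assumption that every component is already normalized to [0, 1] are all hypothetical. The actual weighting used by the environment is not stated in this README.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Hypothetical component scores, each assumed to lie in [0, 1]."""
    progress: float    # fraction of correctly positioned cubies
    patterns: float    # recognized patterns (cross, completed layers)
    efficiency: float  # move efficiency relative to the optimal solve
    reasoning: float   # quality of the "thinking aloud" steps


# Illustrative weights only; not taken from the environment.
WEIGHTS = {"progress": 0.4, "patterns": 0.2, "efficiency": 0.2, "reasoning": 0.2}


def combined_reward(parts, weights=WEIGHTS):
    """Weighted average of the component scores, in [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        value = getattr(parts, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total / sum(weights.values())
```

Under these example weights, a response with excellent reasoning text but no progress on the cube scores at most 0.2, which is the property the paragraph above describes: no single component can carry the total.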