# Six-Seat No-Limit Hold'em Poker Environment
Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em Poker situations through reinforcement learning on expert hand history data.
## Overview
This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.
## Features
- Complete Poker Training Environment: Full implementation of the BaseEnv interface for Atropos
- Hugging Face Dataset Integration: Uses the yoniebans/6max-nlh-poker dataset with train/test splits
- Specialized Reward Functions: Custom reward components for action matching and bet sizing accuracy
- Comprehensive Evaluation: Tracking by game stage (preflop, flop, turn, river)
- Configurable Parameters: Easy customization via YAML configuration
## Files
- poker_env.py: Main environment implementation
- reward_fns/: Custom reward functions
  - action_match.py: Evaluates correctness of the action type
  - bet_sizing.py: Evaluates precision of the bet amount
  - combined_poker_reward.py: Combines both reward components
- DATASET.md: Detailed documentation of the dataset format
## Dataset
The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:
- Game state information (player positions, cards, current bets)
- Previous actions in the hand
- The winning player's action (used as the learning target)
See DATASET.md for detailed dataset documentation.
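To make the record structure above concrete, here is a minimal sketch of how a single hand situation might be turned into a model prompt. The field names and values are illustrative assumptions, not the actual schema — see DATASET.md for the real format.

```python
# Hypothetical record mirroring the fields described above: game state,
# prior actions, and the winning player's action as the learning target.
# All field names here are assumptions for illustration only.
record = {
    "game_state": {
        "positions": ["UTG", "MP", "CO", "BTN", "SB", "BB"],
        "hero_position": "BTN",
        "hero_cards": "Ah Kd",
        "pot": 7.5,       # in big blinds
        "to_call": 2.5,   # in big blinds
    },
    "action_history": ["UTG folds", "MP raises to 2.5bb", "CO folds"],
    "winning_action": "raise 7.5bb",  # target the model is rewarded for matching
}


def build_prompt(rec: dict) -> str:
    """Render a record into a plain-text decision prompt for the LLM."""
    gs = rec["game_state"]
    lines = [
        f"You are {gs['hero_position']} holding {gs['hero_cards']}.",
        f"Pot: {gs['pot']}bb, to call: {gs['to_call']}bb.",
        "Action so far: " + "; ".join(rec["action_history"]),
        "What is your action?",
    ]
    return "\n".join(lines)
```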
## Reward System
The reward function uses a dual evaluation approach:
1. Action Matching (60%): Evaluates whether the model chose the correct action type
   - Exact match: 1.0 score
   - Action type match: 0.7 score
   - Strategic intent match: 0.5 score
2. Bet Sizing (40%): Evaluates precision in bet amount selection
   - Perfect amount: 1.0 score
   - Linear decay as deviation increases
   - Zero score beyond 50% deviation
This balanced approach ensures the model learns both strategic correctness and numerical precision.
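The scoring above can be sketched as follows. This is an illustrative reimplementation, not the code in reward_fns/; the string-based action format and the grouping used for "strategic intent" (aggressive vs. passive actions) are assumptions.

```python
# Sketch of the dual reward: 60% action matching + 40% bet sizing.
# Assumes actions are strings like "raise 12bb", "call 2.5bb", "check", "fold".

def action_match_reward(predicted: str, target: str) -> float:
    """Tiered score for how closely the predicted action matches the target."""
    if predicted == target:
        return 1.0  # exact match, including the amount
    if predicted.split()[0] == target.split()[0]:
        return 0.7  # same action type, different size (e.g. both raises)
    # Strategic intent match (assumption): both aggressive or both passive.
    aggressive, passive = {"bet", "raise"}, {"check", "call"}
    p, t = predicted.split()[0], target.split()[0]
    if (p in aggressive and t in aggressive) or (p in passive and t in passive):
        return 0.5
    return 0.0


def bet_sizing_reward(predicted: float, target: float) -> float:
    """1.0 for the exact amount, decaying linearly to 0 at 50% deviation."""
    if target == 0:
        return 1.0 if predicted == 0 else 0.0
    deviation = abs(predicted - target) / target
    return max(0.0, 1.0 - deviation / 0.5)


def combined_reward(pred_action: str, tgt_action: str,
                    pred_amt: float, tgt_amt: float) -> float:
    """Weighted blend: 60% action correctness, 40% sizing precision."""
    return (0.6 * action_match_reward(pred_action, tgt_action)
            + 0.4 * bet_sizing_reward(pred_amt, tgt_amt))
```

For example, a model that raises to 11bb when the winning player raised to 10bb keeps the full 0.6 action component but loses part of the sizing component (10% deviation gives a 0.8 sizing score).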
## NOUS HACKATHON
- [1-minute video walkthrough](https://www.loom.com/share/7dda14bfc31b458eaa472a8d34e352c4)
- An explanation of the environment design and motivation
## Quick start

```bash
# run vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002

# run the environment
python poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B
```
## WandB runs
[TBD]