# Six-Seat No-Limit Hold'em Poker Environment
Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em Poker situations through reinforcement learning on expert hand history data.
## Overview
This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.
## Features
- Complete Poker Training Environment: Full implementation of the BaseEnv interface for Atropos
- Hugging Face Dataset Integration: Uses the yoniebans/6max-nlh-poker dataset with train/test splits
- Specialized Reward Functions: Custom reward components for action matching and bet sizing accuracy
- Comprehensive Evaluation: Tracking by game stage (preflop, flop, turn, river)
- Configurable Parameters: Easy customization via YAML configuration
## Files
- poker_env.py: Main environment implementation
- reward_fns/: Custom reward functions
  - action_match.py: Evaluates correctness of the action type
  - bet_sizing.py: Evaluates precision of the bet amount
  - combined_poker_reward.py: Combines both reward components
- DATASET.md: Detailed documentation of the dataset format
## Dataset
The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:
- Game state information (player positions, cards, current bets)
- Previous actions in the hand
- The winning player's action (used as the learning target)
See DATASET.md for detailed dataset documentation.
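To make the record structure above concrete, here is a minimal sketch of how a single hand situation might be turned into a model prompt. The field names and values are illustrative assumptions, not the actual schema — see DATASET.md for the real format.

```python
# Hypothetical record mirroring the fields described above: game state,
# prior actions, and the winning player's action as the learning target.
# All field names here are assumptions for illustration only.
record = {
    "game_state": {
        "positions": ["UTG", "MP", "CO", "BTN", "SB", "BB"],
        "hero_position": "BTN",
        "hero_cards": "Ah Kd",
        "pot": 7.5,       # in big blinds
        "to_call": 2.5,   # in big blinds
    },
    "action_history": ["UTG folds", "MP raises to 2.5bb", "CO folds"],
    "winning_action": "raise 7.5bb",  # target the model is rewarded for matching
}


def build_prompt(rec: dict) -> str:
    """Render a record into a plain-text decision prompt for the LLM."""
    gs = rec["game_state"]
    lines = [
        f"You are {gs['hero_position']} holding {gs['hero_cards']}.",
        f"Pot: {gs['pot']}bb, to call: {gs['to_call']}bb.",
        "Action so far: " + "; ".join(rec["action_history"]),
        "What is your action?",
    ]
    return "\n".join(lines)
```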
## Reward System
The reward function uses a dual evaluation approach:
1. Action Matching (60%): Evaluates whether the model chose the correct action type
   - Exact match: 1.0 score
   - Action type match: 0.7 score
   - Strategic intent match: 0.5 score
2. Bet Sizing (40%): Evaluates precision in bet amount selection
   - Perfect amount: 1.0 score
   - Linear decay as deviation increases
   - Zero score beyond 50% deviation
This balanced approach ensures the model learns both strategic correctness and numerical precision.
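The scoring above can be sketched as follows. This is an illustrative reimplementation, not the code in reward_fns/; the string-based action format and the grouping used for "strategic intent" (aggressive vs. passive actions) are assumptions.

```python
# Sketch of the dual reward: 60% action matching + 40% bet sizing.
# Assumes actions are strings like "raise 12bb", "call 2.5bb", "check", "fold".

def action_match_reward(predicted: str, target: str) -> float:
    """Tiered score for how closely the predicted action matches the target."""
    if predicted == target:
        return 1.0  # exact match, including the amount
    if predicted.split()[0] == target.split()[0]:
        return 0.7  # same action type, different size (e.g. both raises)
    # Strategic intent match (assumption): both aggressive or both passive.
    aggressive, passive = {"bet", "raise"}, {"check", "call"}
    p, t = predicted.split()[0], target.split()[0]
    if (p in aggressive and t in aggressive) or (p in passive and t in passive):
        return 0.5
    return 0.0


def bet_sizing_reward(predicted: float, target: float) -> float:
    """1.0 for the exact amount, decaying linearly to 0 at 50% deviation."""
    if target == 0:
        return 1.0 if predicted == 0 else 0.0
    deviation = abs(predicted - target) / target
    return max(0.0, 1.0 - deviation / 0.5)


def combined_reward(pred_action: str, tgt_action: str,
                    pred_amt: float, tgt_amt: float) -> float:
    """Weighted blend: 60% action correctness, 40% sizing precision."""
    return (0.6 * action_match_reward(pred_action, tgt_action)
            + 0.4 * bet_sizing_reward(pred_amt, tgt_amt))
```

For example, a model that raises to 11bb when the winning player raised to 10bb keeps the full 0.6 action component but loses part of the sizing component (10% deviation gives a 0.8 sizing score).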
## NOUS HACKATHON
- [1-minute video walkthrough](https://www.loom.com/share/7dda14bfc31b458eaa472a8d34e352c4)
- An explanation of the environment design and motivation
## Quick start

```bash
# run vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002

# run the environment
python poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B
```
## WandB runs
[TBD]