mirror of
https://github.com/NousResearch/atropos.git
synced 2026-04-24 17:04:55 +00:00
Integrate Caput Mundi poker environment from yoniebans

- Add Six-Seat No-Limit Hold'em poker training environment
- Features expert hand history training with a dual reward system
- Includes action matching and bet sizing evaluation components
- Supports multi-stage game analysis (preflop/flop/turn/river)
- Integrates with HuggingFace datasets and WandB monitoring
- Comprehensive documentation added to community README (#17)
- All code quality checks passing (black, isort, flake8)

Environment moved from hack0/poker to environments/community/poker_holdem/
Resolves PR #84 from yoniebans/atropos
This commit is contained in:
parent
04c06c3e20
commit
7cfd3af149
7 changed files with 1145 additions and 0 deletions
@ -786,6 +786,138 @@ bun dev:server

**Requirements**: Bun runtime, OpenAI API, React, TypeScript, chess.js, Vite

### 17. Caput Mundi - Six-Seat No-Limit Hold'em Poker Environment (`poker_holdem/`)

**Author**: [yoniebans](https://github.com/yoniebans)

**Purpose**: Train language models to make optimal poker decisions through reinforcement learning on expert hand history data

A comprehensive poker training environment that teaches LLMs to play No-Limit Hold'em like winning players. The environment uses processed hand histories from successful players to build a supervised learning framework in which models learn to match expert actions across a variety of game situations.

**Features**:

- **Expert Hand History Training**: Uses a curated dataset of winning-player decisions
- **Multi-Stage Game Analysis**: Separate tracking for preflop, flop, turn, and river decisions
- **Dual Reward System**: Combines action matching and bet sizing evaluation
- **Comprehensive Evaluation**: Stage-specific performance metrics and cumulative tracking
- **HuggingFace Integration**: Direct dataset loading with train/test splits
- **WandB Monitoring**: Detailed logging of training progress and poker-specific metrics

**Core Training Components**:

- **Dataset**: `yoniebans/6max-nlh-poker` with formatted poker situations and expert actions
- **Input Format**: Structured poker prompts with game state, player positions, and betting history
- **Target Actions**: Expert player decisions, including action type and bet sizing
- **Reward Functions**: Specialized evaluation of poker action correctness and bet precision
- **Evaluation Metrics**: Accuracy tracking by game stage and action distribution analysis

**Poker-Specific Features**:

- **Game Stage Tracking**: Separate analysis for preflop, flop, turn, and river decisions
- **Action Type Recognition**: Classification of fold, check, call, bet, raise, re-raise, and all-in actions
- **Bet Sizing Analysis**: Numerical precision evaluation for betting amounts
- **Position Awareness**: Training on positional play and strategic considerations
- **Hand History Format**: Realistic representation of poker situations

**Reward System Architecture**:

- **Action Match Reward (60%)**: Evaluates correctness of the chosen action type
  - Exact match: 1.0 score
  - Action type match: 0.7 score
  - Strategic intent match: 0.5 score
- **Bet Sizing Reward (40%)**: Evaluates precision of the bet amount
  - Perfect amount: 1.0 score
  - Linear decay with deviation from the expert amount
  - Zero score at or beyond 50% deviation
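
As a rough illustration, the weighting above can be sketched as a small scoring function. Only the 60/40 split, the tier scores, and the 50% cutoff come from the list above; the tier-matching logic and the exact decay slope here are assumptions (the worked example under Training Data Structure implies a gentler slope), and the function names are hypothetical.

```python
def sizing_score(model_amt: float, expert_amt: float) -> float:
    """Linear decay in relative deviation; zero at or beyond 50% deviation.
    The cutoff is from the README; the slope is an assumption."""
    if expert_amt == 0:
        return 1.0 if model_amt == 0 else 0.0
    deviation = abs(model_amt - expert_amt) / expert_amt
    return max(0.0, 1.0 - deviation / 0.5)


def action_score(model: str, expert: str) -> float:
    """Tiered action match: exact string (1.0), same action type (0.7);
    the 0.5 strategic-intent tier is omitted in this sketch."""
    if model.strip() == expert.strip():
        return 1.0
    m, e = model.split(), expert.split()
    if m and e and m[0].lower() == e[0].lower():
        return 0.7
    return 0.0


def combined_reward(model: str, expert: str) -> float:
    """60/40 weighting of action match and bet sizing, per the list above."""
    m, e = model.split(), expert.split()
    m_amt = float(m[1]) if len(m) > 1 else 0.0
    e_amt = float(e[1]) if len(e) > 1 else 0.0
    return 0.6 * action_score(model, expert) + 0.4 * sizing_score(m_amt, e_amt)
```

For example, `combined_reward("call 2.5", "call 2")` gives an action score of 0.7 (same action type) and, under this assumed slope, a sizing score of 0.5, for a combined 0.62.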

**Training Data Structure**:

```
Input: "Position: BTN, Stack: 100bb, Pot: 3bb, Action: Hero faces 2bb raise..."
Expert Action: "call 2"
Model Output: "call 2.5"
Action Score: 0.7 (correct action type)
Sizing Score: 0.8 (close bet amount)
Combined Score: 0.74 (0.6 x 0.7 + 0.4 x 0.8)
```

**Evaluation Framework**:

- **Stage-Specific Metrics**: Separate accuracy tracking for each betting round
- **Action Distribution**: Monitoring of fold/call/raise frequencies
- **Cumulative Performance**: Long-term learning progress across training epochs
- **Threshold-Based Accuracy**: Configurable correctness thresholds for evaluation
- **Sample-Based Testing**: Efficient evaluation on dataset subsets
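
A minimal sketch of what stage-specific, threshold-based accuracy tracking could look like; the class name and structure are illustrative, not taken from the environment's code.

```python
from collections import defaultdict


class StageAccuracyTracker:
    """Tracks per-stage (preflop/flop/turn/river) accuracy against a
    configurable correctness threshold, plus a cumulative overall figure."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, stage: str, score: float) -> None:
        # A decision counts as correct when its reward clears the threshold.
        self.total[stage] += 1
        if score >= self.threshold:
            self.correct[stage] += 1

    def accuracy(self, stage: str) -> float:
        return self.correct[stage] / self.total[stage] if self.total[stage] else 0.0

    def overall(self) -> float:
        total = sum(self.total.values())
        return sum(self.correct.values()) / total if total else 0.0
```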

**Dataset Features**:

- **Six-Max Format**: Optimized for 6-player No-Limit Hold'em games
- **Winning Player Focus**: Hand histories from profitable poker players
- **Structured Prompts**: Consistent formatting for game state representation
- **Action Formatting**: Standardized representation of expert actions
- **Train/Test Splits**: Proper data separation for training and evaluation

**WandB Integration**:

- **Training Metrics**: Epoch tracking, stage-specific scores, action distributions
- **Evaluation Tracking**: Cumulative accuracy, stage performance, threshold analysis
- **Poker Analytics**: Action frequency analysis, betting pattern recognition
- **Progress Visualization**: Learning curves and performance trends

**Example Training Flow**:

```
Epoch 1: Load shuffled hand histories
Hand 1: Preflop decision - Model matches expert fold (Score: 1.0)
Hand 2: Flop decision - Model bets 8bb vs expert 10bb (Score: 0.85)
Hand 3: River decision - Model calls vs expert raise (Score: 0.5)
Evaluation: 73% accuracy across all stages
```

**Configuration Options**:

- **Model Selection**: Configurable LLM for poker decision making (default: Qwen/Qwen3-1.7B)
- **Batch Processing**: Group size and batch size for efficient training
- **Evaluation Parameters**: Sample size, temperature, and correctness thresholds
- **Reward Weighting**: Adjustable balance between action matching and bet sizing
- **Dataset Management**: Epoch-based shuffling and queue management
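
The epoch-based shuffling and queue management mentioned above could be sketched like this; the class name and details are hypothetical, and the environment's actual queue logic may differ.

```python
import random
from collections import deque


class EpochShuffledQueue:
    """Refills a work queue with a freshly shuffled ordering of the dataset
    at each epoch boundary; only indices are held in the queue."""

    def __init__(self, dataset, seed: int = 0):
        self.dataset = dataset
        self.rng = random.Random(seed)
        self.queue = deque()
        self.epoch = 0

    def next_item(self):
        if not self.queue:
            # Epoch boundary: reshuffle index order and refill the queue.
            order = list(range(len(self.dataset)))
            self.rng.shuffle(order)
            self.queue.extend(order)
            self.epoch += 1
        return self.dataset[self.queue.popleft()]
```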

**Applications**:

- **Poker AI Development**: Training competitive poker-playing agents
- **Decision-Making Research**: Understanding strategic reasoning in uncertain environments
- **Game Theory Applications**: Learning optimal play in multi-agent competitive settings
- **Financial Modeling**: Risk assessment and decision making under uncertainty
- **Educational Tools**: Teaching poker strategy through AI demonstration

**Technical Implementation**:

- **Async Processing**: Non-blocking dataset loading and model inference
- **Memory Efficient**: Queue-based management of training data
- **Robust Parsing**: Action extraction from natural-language responses
- **Error Handling**: Graceful handling of malformed model outputs
- **Scalable Architecture**: Support for large-scale poker dataset training
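
A hedged sketch of what robust action parsing might look like: extracting a poker action (and optional amount) from free-form model text. The regex and function name are assumptions, not the environment's actual parser.

```python
import re

# Recognized action keywords; amounts are optional and taken as big blinds.
_PATTERN = re.compile(
    r"\b(fold|check|call|bet|raise|all-?in)\b(?:\s+(?:to\s+)?(\d+(?:\.\d+)?))?",
    re.IGNORECASE,
)


def parse_action(text: str):
    """Return (action, amount) parsed from free-form text, or None if no
    recognizable action is found. Malformed outputs yield None rather than
    raising, matching the graceful error handling described above."""
    match = _PATTERN.search(text)
    if not match:
        return None
    action = match.group(1).lower().replace("allin", "all-in")
    amount = float(match.group(2)) if match.group(2) else None
    return action, amount
```

For instance, `parse_action("I would raise to 12bb here")` yields `("raise", 12.0)`, while unrecognizable text yields `None`.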

**Performance Metrics**:

- **Overall Accuracy**: Primary measure of poker decision quality
- **Stage Accuracy**: Preflop/flop/turn/river-specific performance
- **Action Distribution**: Frequency analysis of different poker actions
- **Bet Sizing Precision**: Numerical accuracy in betting decisions
- **Learning Progress**: Improvement tracking across training epochs
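
The action-distribution metric can be illustrated in a few lines; this is an assumed helper, not the environment's implementation.

```python
from collections import Counter


def action_distribution(actions):
    """Relative frequency of each action type over a batch of decisions,
    e.g. how often the model folds vs. calls vs. raises."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}
```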

**Setup Requirements**:

1. HuggingFace Datasets library for data loading
2. Transformers library for tokenization
3. OpenAI-compatible LLM server for inference
4. WandB account for training monitoring

**Command Line Usage**:

```bash
# Start the vLLM inference server
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002

# Run the poker training environment
python environments/community/poker_holdem/poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B
```

**Data Pipeline**: A custom data processing pipeline is available at [poker-rl-data](https://github.com/yoniebans/poker-rl-data) for building poker training datasets from raw hand histories.

**Requirements**: datasets, transformers, wandb, atroposlib

---

## Support