atropos/environments/community/poker_holdem/README.md
# Six-Seat No-Limit Hold'em Poker Environment
Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em Poker situations through reinforcement learning on expert hand history data.
## Overview
This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.
## Features
- **Complete Poker Training Environment**: Full implementation of the BaseEnv interface for Atropos
- **Hugging Face Dataset Integration**: Uses `yoniebans/6max-nlh-poker` dataset with train/test splits
- **Specialized Reward Functions**: Custom reward components for action matching and bet sizing accuracy
- **Comprehensive Evaluation**: Tracking by game stage (preflop, flop, turn, river)
- **Configurable Parameters**: Easy customization via YAML configuration
## Files
- **poker_env.py**: Main environment implementation
- **reward_fns/**: Custom reward functions
- action_match.py: Evaluates correctness of action type
- bet_sizing.py: Evaluates precision of bet amount
- combined_poker_reward.py: Combines both reward components
- **DATASET.md**: Detailed documentation of the dataset format
## Dataset
The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:
- Game state information (player positions, cards, current bets)
- Previous actions in the hand
- The winning player's action (used as the learning target)
See `DATASET.md` for detailed dataset documentation.
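To make the record structure concrete, here is a sketch of what one training example might look like. Every field name below is an illustrative assumption, not the authoritative schema; `DATASET.md` remains the source of truth.

```python
# Hypothetical shape of a single training record (field names are
# illustrative assumptions; see DATASET.md for the real schema).
example_record = {
    "stage": "flop",                        # preflop / flop / turn / river
    "hero_position": "BTN",                 # seat of the player to act
    "hero_cards": ["Ah", "Kd"],             # hole cards
    "board": ["Ks", "7c", "2d"],            # community cards dealt so far
    "pot": 12.5,                            # current pot size in big blinds
    "previous_actions": [                   # actions earlier in the hand
        {"position": "CO", "action": "raise", "amount": 2.5},
        {"position": "BTN", "action": "call", "amount": 2.5},
    ],
    "winning_action": {"action": "bet", "amount": 8.0},  # learning target
}
```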
## Reward System
The reward function uses a dual evaluation approach:
1. **Action Matching (60%)**: Evaluates if the model chose the correct action type
- Exact match: 1.0 score
- Action type match: 0.7 score
- Strategic intent match: 0.5 score
2. **Bet Sizing (40%)**: Evaluates precision in bet amount selection
- Perfect amount: 1.0 score
- Linear decay as deviation increases
- Zero score beyond 50% deviation
This balanced approach ensures the model learns both strategic correctness and numerical precision.
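The scoring rules above can be sketched in a few lines. This is a minimal illustration of the 60/40 weighting and the linear-decay sizing score, not the exact code in `reward_fns/`:

```python
def bet_sizing_score(predicted: float, target: float) -> float:
    """Linear decay from 1.0 at a perfect match to 0.0 at 50% relative deviation."""
    if target <= 0:
        return 1.0 if predicted == target else 0.0
    deviation = abs(predicted - target) / target
    return max(0.0, 1.0 - deviation / 0.5)

def combined_reward(action_score: float, sizing_score: float) -> float:
    """Combine the two components: 60% action matching, 40% bet sizing."""
    return 0.6 * action_score + 0.4 * sizing_score
```

For example, a model that picks the right action type (0.7) but bets 125 when the target was 100 (25% deviation, sizing score 0.5) would receive a combined reward of 0.62.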
# NOUS HACKATHON
- [1-minute video overview (Loom)](https://www.loom.com/share/7dda14bfc31b458eaa472a8d34e352c4) explaining the environment design and motivation
## Quick start
```bash
# run vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002
```
```bash
# run the environment
python poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B
```
## WandB runs
[TBD]