# Six-Seat No-Limit Hold'em Poker Environment

An Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em poker situations through reinforcement learning on expert hand-history data.

## Overview

This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.

## Features

- **Complete Poker Training Environment**: Full implementation of the Atropos `BaseEnv` interface
- **Hugging Face Dataset Integration**: Uses the `yoniebans/6max-nlh-poker` dataset with train/test splits
- **Specialized Reward Functions**: Custom reward components for action matching and bet-sizing accuracy
- **Comprehensive Evaluation**: Metrics tracked by game stage (preflop, flop, turn, river)
- **Configurable Parameters**: Easy customization via YAML configuration

## Files

- **poker_env.py**: Main environment implementation
- **reward_fns/**: Custom reward functions
  - `action_match.py`: Evaluates correctness of the action type
  - `bet_sizing.py`: Evaluates precision of the bet amount
  - `combined_poker_reward.py`: Combines both reward components
- **DATASET.md**: Detailed documentation of the dataset format

## Dataset

The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:

- Game state information (player positions, cards, current bets)
- Previous actions in the hand
- The winning player's action (used as the learning target)

See `DATASET.md` for detailed dataset documentation.
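As an illustration only, a record with the fields listed above might look like the following. The field names and values here are hypothetical, not the actual dataset schema:

```python
# Hypothetical record shape -- field names are illustrative,
# NOT the actual yoniebans/6max-nlh-poker schema.
example_record = {
    "game_state": {
        "position": "BTN",        # hero's seat at the 6-max table
        "hole_cards": "Ah Kd",
        "board": "Qs 7h 2c",      # flop in this example
        "pot": 6.5,               # in big blinds
        "to_call": 2.0,
    },
    "action_history": ["UTG folds", "CO raises 2.5bb", "BTN calls"],
    "winning_action": "raise 7.5",  # the learning target
}

# The model is rewarded for reproducing the winning player's action:
print(example_record["winning_action"])
```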

## Reward System

The reward function uses a dual evaluation approach:

1. **Action Matching (60%)**: Evaluates whether the model chose the correct action type
   - Exact match: 1.0
   - Action type match: 0.7
   - Strategic intent match: 0.5
2. **Bet Sizing (40%)**: Evaluates precision in bet amount selection
   - Perfect amount: 1.0
   - Score decays linearly as the deviation from the target amount increases
   - Zero score beyond 50% deviation

This balanced approach ensures the model learns both strategic correctness and numerical precision.

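The weighting scheme above can be sketched in Python. This is a minimal illustration of the scoring rules described in this README, not the actual implementation in `reward_fns/` (function names and signatures are assumptions):

```python
def action_match_score(predicted: str, target: str) -> float:
    """Tiered action score: 1.0 exact, 0.7 same action type."""
    if predicted == target:
        return 1.0  # e.g. "raise 7.5" == "raise 7.5"
    if predicted.split()[0] == target.split()[0]:
        return 0.7  # same action type, e.g. both raises with different sizing
    # A full implementation would also detect "strategic intent" matches (0.5).
    return 0.0

def bet_sizing_score(predicted_amount: float, target_amount: float) -> float:
    """Linear decay from 1.0 at a perfect amount to 0.0 at 50% deviation."""
    if target_amount == 0:
        return 1.0 if predicted_amount == 0 else 0.0
    deviation = abs(predicted_amount - target_amount) / target_amount
    return max(0.0, 1.0 - deviation / 0.5)

def combined_reward(predicted: str, target: str,
                    predicted_amount: float, target_amount: float) -> float:
    """60/40 blend of action matching and bet sizing."""
    return (0.6 * action_match_score(predicted, target)
            + 0.4 * bet_sizing_score(predicted_amount, target_amount))
```

Note that the linear decay reaches zero exactly at a 50% deviation, so a bet of 150 against a target of 100 scores 0.0 on the sizing component.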
# NOUS HACKATHON

- [1-minute video walkthrough](https://www.loom.com/share/7dda14bfc31b458eaa472a8d34e352c4)
- An explanation of the environment design and motivation

## Quick start

```bash
# run vLLM
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-1.7B \
  --gpu-memory-utilization 0.95 \
  --dtype auto \
  --port 9002
```

```bash
# run the environment
python poker_env.py process \
  --env.data_path_to_save_groups poker_rollouts.jsonl \
  --openai.base_url http://localhost:9002/v1 \
  --openai.api_key EMPTY \
  --openai.model_name Qwen/Qwen3-1.7B
```

## WandB runs

[TBD]