
Six-Seat No-Limit Hold'em Poker Environment

Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em Poker situations through reinforcement learning on expert hand history data.

Overview

This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.

Features

  • Complete Poker Training Environment: Full implementation of the BaseEnv interface for Atropos
  • Hugging Face Dataset Integration: Uses yoniebans/6max-nlh-poker dataset with train/test splits
  • Specialized Reward Functions: Custom reward components for action matching and bet sizing accuracy
  • Comprehensive Evaluation: Tracking by game stage (preflop, flop, turn, river)
  • Configurable Parameters: Easy customization via YAML configuration

Files

  • poker_env.py: Main environment implementation
  • reward_fns/: Custom reward functions
    • action_match.py: Evaluates correctness of action type
    • bet_sizing.py: Evaluates precision of bet amount
    • combined_poker_reward.py: Combines both reward components
  • DATASET.md: Detailed documentation of the dataset format

Dataset

The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:

  • Game state information (player positions, cards, current bets)
  • Previous actions in the hand
  • The winning player's action (used as the learning target)

See DATASET.md for detailed dataset documentation.
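To make the record structure concrete, here is a minimal sketch of what one training example might look like. All field names and values below are illustrative assumptions, not the dataset's actual schema; consult DATASET.md for the real format.

```python
# Hypothetical record shape -- field names are illustrative only.
example_record = {
    "game_stage": "flop",                      # preflop / flop / turn / river
    "hero_position": "BTN",                    # seat of the player to act
    "hole_cards": ["Ah", "Kd"],                # hero's private cards
    "board": ["Ks", "7c", "2h"],               # community cards so far
    "pot_size": 12.5,                          # current pot in big blinds
    "previous_actions": [                      # actions earlier in the hand
        {"position": "UTG", "action": "fold"},
        {"position": "CO", "action": "call", "amount": 2.0},
    ],
    "winning_action": {                        # learning target
        "action": "raise",
        "amount": 9.0,
    },
}
```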

Reward System

The reward function uses a dual evaluation approach:

  1. Action Matching (60%): Evaluates whether the model chose the correct action type
    • Exact match: 1.0 score
    • Action type match: 0.7 score
    • Strategic intent match: 0.5 score
  2. Bet Sizing (40%): Evaluates precision in bet amount selection
    • Perfect amount: 1.0 score
    • Linear decay as deviation increases
    • Zero score beyond 50% deviation

This balanced approach ensures the model learns both strategic correctness and numerical precision.
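The actual reward components live in reward_fns/. The following is a minimal sketch of how the two parts could combine under the weights and thresholds described above; the function names, record shape, and the grouping used for "strategic intent" (aggressive vs. passive actions) are illustrative assumptions, not the real implementation.

```python
def action_match_reward(predicted: dict, expert: dict) -> float:
    """Score how closely the predicted action matches the expert's (sketch)."""
    if predicted == expert:
        return 1.0  # exact match: same action and same amount
    if predicted.get("action") == expert.get("action"):
        return 0.7  # same action type, different sizing
    # Assumed "strategic intent" grouping: both aggressive or both passive.
    aggressive = {"bet", "raise"}
    if (predicted.get("action") in aggressive) == (expert.get("action") in aggressive):
        return 0.5
    return 0.0

def bet_sizing_reward(predicted_amount: float, expert_amount: float) -> float:
    """1.0 for a perfect amount, linear decay, 0.0 beyond 50% deviation."""
    if expert_amount == 0:
        return 1.0 if predicted_amount == 0 else 0.0
    deviation = abs(predicted_amount - expert_amount) / expert_amount
    return max(0.0, 1.0 - deviation / 0.5)

def combined_reward(predicted: dict, expert: dict) -> float:
    """Weighted sum: 60% action matching, 40% bet sizing (when a size applies)."""
    reward = 0.6 * action_match_reward(predicted, expert)
    if expert.get("action") in {"bet", "raise"}:
        reward += 0.4 * bet_sizing_reward(
            predicted.get("amount", 0.0), expert["amount"]
        )
    return reward
```

For example, predicting a raise to 12 big blinds when the expert raised to 10 would earn the 0.7 action-type score plus a sizing score of 0.6 (20% deviation), for a combined 0.66 under these assumed weights.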

NOUS HACKATHON

Quick start

# run vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002
# run the environment
python poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B

WandB runs

[TBD]