atropos/environments/community/poker_holdem/README.md
# Six-Seat No-Limit Hold'em Poker Environment
Atropos environment for improving an LLM's ability to make optimal decisions in No-Limit Hold'em Poker situations through reinforcement learning on expert hand history data.
## Overview
This environment trains language models to make poker decisions like expert players. It takes a hand situation as input and rewards the model for matching the actions that winning players took in those situations.
## Features
- **Complete Poker Training Environment**: Full implementation of the BaseEnv interface for Atropos
- **Hugging Face Dataset Integration**: Uses `yoniebans/6max-nlh-poker` dataset with train/test splits
- **Specialized Reward Functions**: Custom reward components for action matching and bet sizing accuracy
- **Comprehensive Evaluation**: Tracking by game stage (preflop, flop, turn, river)
- **Configurable Parameters**: Easy customization via YAML configuration
## Files
- **poker_env.py**: Main environment implementation
- **reward_fns/**: Custom reward functions
- action_match.py: Evaluates correctness of action type
- bet_sizing.py: Evaluates precision of bet amount
- combined_poker_reward.py: Combines both reward components
- **DATASET.md**: Detailed documentation of the dataset format
## Dataset
The environment uses a specialized dataset containing poker hand situations and the corresponding actions taken by winning players. Each record includes:
- Game state information (player positions, cards, current bets)
- Previous actions in the hand
- The winning player's action (used as the learning target)
See `DATASET.md` for detailed dataset documentation.
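To make the record structure concrete, here is a sketch of what one training example might look like. Every field name below is an illustrative assumption, not the authoritative schema; `DATASET.md` remains the source of truth.

```python
# Hypothetical shape of a single training record (field names are
# illustrative assumptions; see DATASET.md for the real schema).
example_record = {
    "stage": "flop",                        # preflop / flop / turn / river
    "hero_position": "BTN",                 # seat of the player to act
    "hero_cards": ["Ah", "Kd"],             # hole cards
    "board": ["Ks", "7c", "2d"],            # community cards dealt so far
    "pot": 12.5,                            # current pot size in big blinds
    "previous_actions": [                   # actions earlier in the hand
        {"position": "CO", "action": "raise", "amount": 2.5},
        {"position": "BTN", "action": "call", "amount": 2.5},
    ],
    "winning_action": {"action": "bet", "amount": 8.0},  # learning target
}
```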
## Reward System
The reward function uses a dual evaluation approach:
1. **Action Matching (60%)**: Evaluates if the model chose the correct action type
- Exact match: 1.0 score
- Action type match: 0.7 score
- Strategic intent match: 0.5 score
2. **Bet Sizing (40%)**: Evaluates precision in bet amount selection
- Perfect amount: 1.0 score
- Linear decay as deviation increases
- Zero score beyond 50% deviation
This balanced approach ensures the model learns both strategic correctness and numerical precision.
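The scoring rules above can be sketched in a few lines. This is a minimal illustration of the 60/40 weighting and the linear-decay sizing score, not the exact code in `reward_fns/`:

```python
def bet_sizing_score(predicted: float, target: float) -> float:
    """Linear decay from 1.0 at a perfect match to 0.0 at 50% relative deviation."""
    if target <= 0:
        return 1.0 if predicted == target else 0.0
    deviation = abs(predicted - target) / target
    return max(0.0, 1.0 - deviation / 0.5)

def combined_reward(action_score: float, sizing_score: float) -> float:
    """Combine the two components: 60% action matching, 40% bet sizing."""
    return 0.6 * action_score + 0.4 * sizing_score
```

For example, a model that picks the right action type (0.7) but bets 125 when the target was 100 (25% deviation, sizing score 0.5) would receive a combined reward of 0.62.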
# NOUS HACKATHON
- [1-minute video overview (Loom)](https://www.loom.com/share/7dda14bfc31b458eaa472a8d34e352c4) explaining the environment design and motivation
## Quick start
```bash
# run vllm
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-1.7B \
    --gpu-memory-utilization 0.95 \
    --dtype auto \
    --port 9002
```
```bash
# run the environment
python poker_env.py process \
    --env.data_path_to_save_groups poker_rollouts.jsonl \
    --openai.base_url http://localhost:9002/v1 \
    --openai.api_key EMPTY \
    --openai.model_name Qwen/Qwen3-1.7B
```
## WandB runs
[TBD]