31 KiB
Environments
This directory contains various environments for training and evaluating language models on different tasks. Each environment implements a specific task with its own input format, reward function, and evaluation metrics.
Directory Structure
- Main Environments: Training-focused environments with comprehensive datasets
- Evaluation Environments: Benchmark-focused environments primarily designed for model evaluation (see eval_environments/README.md)
Available Environments
Letter Counting Environment (letter_counting_environment.py)
A comprehensive environment for training models to count letters in words, sentences, and text passages with configurable difficulty and data modes.
Input Format:
- Single letter counting: "How many 'a's are in the word 'banana'?"
- Multiple letter counting: "Count the occurrences of the letters 'e', 'o', and 't' in the following text: 'The quick brown fox jumps over the lazy dog'"
- Each item contains:
prompt: The counting question with instructionscorrect_counts: Dictionary mapping letters to their countstext: The source text (word, sentence, or passage)target_letters: List of letters to count
System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Data Modes:
- Word Mode: Uses NLTK's words corpus (236k+ English words)
- Mixed Mode: Combines words and text passages from OpenWebText-10k dataset
- Text Passage Mode: Uses OpenWebText-10k dataset with character-based text extraction
Key Features:
- Multi-letter counting: Configurable simultaneous counting of multiple letters with JSON responses
- Letter selection bias: Configurable bias toward letters present in the text (reduces zero-count questions)
- Random string generation: Optional random strings (80% alphabetical) mixed with real words
- Word capitalization: Optional uppercase and title case transformations
- Punctuation/space handling: Configurable inclusion in letter counting
- Training thresholds: Skip groups that are too easy based on group average scores
- Data dumping: Save rollouts from groups with appropriate difficulty to JSONL files
- Comprehensive metrics: Letter distribution, text lengths, error rates, group average scores
Answer Formats:
- Single letter:
<answer>3</answer> - Multiple letters:
<answer>{"e": 4, "o": 4, "t": 2}</answer>
Reward Function:
- Score of 1.0 if the model's answer exactly matches the expected count(s)
- Score of 0.0 if incorrect, malformed, or missing answer
- Groups with identical scores (no learning signal) return None
- Groups with average score >
max_group_average_for_trainingare skipped for training for difficulty control/curriculum
Configuration Options:
use_text_passages: Enable mixed mode with text passages (default: False)text_passage_percentage: Ratio of passages to words in mixed mode (default: 0.5)max_letters_to_count: Maximum simultaneous letters (default: 1)multi_letter_probability: Probability of multi-letter questions (default: 0.0)present_letter_bias: Bias toward letters present in text (default: 0.5)include_punctuation_in_count: Include punctuation in counting (default: True)include_spaces_in_count: Include spaces in counting (default: False)max_group_average_for_training: Skip easy groups threshold (default: 1.0)dump_rollouts: Save rollouts to JSONL files (default: False)debug_logging: Enable verbose per-item scoring details (default: False)
Evaluation Metrics:
eval/accuracy: Overall accuracy on test seteval/letter_distribution_entropy: Entropy of letter selection distributioneval/avg_word_length: Average length of test itemseval/format_error_rate: Rate of malformed responseseval/think_tag_usage: Percentage using think tagstrain/group_average_scores: Distribution of group difficulty scores
Dependencies:
nltk(for words corpus)datasets(for OpenWebText-10k when using text passages)
Usage Example:
# Word-only mode
python letter_counting_environment.py serve \
--env.use_text_passages=False \
--env.max_letters_to_count=1 \
--env.max_group-average-for-training=0.75
# Mixed mode with multi-letter counting
python letter_counting_environment.py serve \
--env.use_text_passages=True \
--env.text_passage_percentage=0.3 \
--env.max_letters_to_count=4 \
--env.multi_letter_probability=0.2
# Data dumping mode
python letter_counting_environment.py serve \
--env.dump_rollouts=True \
--env.dump_batch_size=100 \
--env.max_group_average_for_training=0.75
MCQA Thinking Environment (mcqa_thinking_env.py)
Multiple Choice Question Answering environment that requires models to think through problems systematically.
Input Format:
- Questions from the MMLU (Massive Multitask Language Understanding) dataset
- Each item contains:
prompt: The question textanswer: Index of correct answerground_truth: Letter (A, B, C, D) of correct answeroptions: List of possible answers
System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Reward Function:
- Score of 1.0 if the model's answer matches the ground truth letter
- Score of 0.0 if incorrect or invalid response (multiple think tags, malformed thinking sections)
- Length penalty applied if all responses are correct:
- No penalty for responses under 50% of max token length
- Linear penalty scaling from 1.0 down to 0.0 for responses between 50% and 100% of max length
- Returns None if all scores are identical (no learning signal)
GSM8K Environment (gsm8k_server.py)
Mathematical reasoning environment using the GSM8K dataset.
Input Format:
- Questions from GSM8K dataset
- Each item contains:
question: The math problemanswer: The numerical answer
System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
You are allocated a maximum of 2048 tokens, please strive to use less.
You will then provide your answer like this: \boxed{your answer here}
It is important that you provide your answer in the correct format.
If you do not, you will not receive credit for your answer.
So please end your answer with \boxed{your answer here}
Reward Function:
- Score of 1.0 if the model's answer matches the ground truth (using LaTeX verification)
- Score of 0.0 if incorrect or if ground truth is not parseable
- Length penalty applied if all responses are correct:
- No penalty for responses under 50% of max token length
- Linear penalty scaling from 1.0 down to 0.0 for responses between 50% and 100% of max length
- Returns None if all scores are identical (no learning signal)
Tool Calling Environment (tool_calling_server.py)
Environment for training models to make function calls in a structured format.
Input Format:
- Conversations from ShareGPT-Hermes function call dataset
- Each item contains:
conversations: List of messages with roles (system, human, gpt)- Expected tool calls in JSON format
System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Reward Function:
- Score of 1.0 if all expected tool calls are present and match exactly (including nested JSON fields)
- Score of 0.0 if any tool calls are missing, incorrect, or malformed
- Length penalty applied if all responses are correct:
- No penalty for responses under 50% of max token length
- Linear penalty scaling from 1.0 down to 0.0 for responses between 50% and 100% of max length
- Returns None if all scores are identical (no learning signal)
RLAIF Server Environment (rlaif_server.py)
Environment for Reinforcement Learning from AI Feedback (RLAIF). Used for aligning models to specific personalities or styles based on AI-generated preferences or reward signals.
Input Format:
- Typically involves prompts for which responses are generated and then evaluated by a reward model or preference model to guide the LLM's behavior. Specifics depend on the RLAIF setup.
System Prompt:
- Varies based on the desired personality/style (e.g., "Egregore," "Ascension Maze").
Reward Function:
- Based on the output of an AI judge/reward model, designed to score responses according to the target alignment criteria.
Financial Fundamentals Prediction Environment (fundamental_prediction_environment.py)
Environment for training models to predict financial fundamentals using the "NousResearch/company-fundamentals-prediction-lite" dataset.
Input Format:
- Items include
context(company fundamentals, news, macroeconomic data),fundamental_metric(e.g., revenue, EPS), and ground truthanswer("maintained", "raised", or "reduced") andmagnitude(percentage change). The model analyzes thecontextto predict theanswerandmagnitudefor the givenfundamental_metric.
Task:
- Predict directional changes and magnitude for company financial fundamentals.
Reward Function:
- Based on the accuracy of predictions for both direction and magnitude.
Math Server Environment (math_server.py)
A versatile math problem-solving environment supporting multiple datasets and operational modes.
Datasets:
- Integrates
gsm8k(various subsets),competition_math,math_qa, andMetaMathQA.
Operational Modes:
- Supports standard problem solving, RLAIF (Reinforcement Learning from AI Feedback) for preference learning between solutions, a "judge" mode for evaluating solution correctness, and a "retry/self-correct" mode utilizing feedback on previous attempts.
Input Format:
- Mathematical problems, varying slightly by operational mode (e.g., including solutions for judging/RLAIF).
System Prompt:
- Dynamically constructed based on the operational mode. For standard problem solving, the prompt focuses on the problem itself. Other modes include specific instructions for judging, preference selection, or self-correction.
Reward Function:
- Based on the correctness of the mathematical solution, with variations depending on the mode (e.g., preference scores in RLAIF).
Math Server Zero Environment (math_server_zero.py)
A math problem-solving environment using the "zwhe99/DeepMath-103K" dataset, with a structured prompt format inspired by the Open-Reasoner-Zero project.
Input Format:
- Mathematical problems from the "zwhe99/DeepMath-103K" dataset.
System Prompt Structure:
- Utilizes a specific conversational format where the AI is instructed to first think (using
<think> </think>tags) and then provide the answer (using<answer> </answer>tags, with the final numerical answer in\boxed{}). The overall prompt guides the model through this structured reasoning and response process.prompt_format = "A conversation between User and Assistant... User: {prompt}\nAssistant: <think>"problem_format = "You must put your answer inside <answer> </answer> tags... This is the problem:\n{problem}"
Reward Function:
- Based on the correctness of the mathematical solution within the
<answer>tag, verified using LaTeX parsing.
Coding Server Environment (code_execution_server/coding_server.py)
Environment for training models to generate and potentially execute code.
Input Format:
- Coding problems or prompts (e.g., from datasets like MBPP, HumanEval).
System Prompt:
- Instructs the model to generate code for a given problem.
Reward Function:
- Based on correctness of the generated code, often involving execution and unit test passing.
- The
code_execution_server/directory also contains aDockerfilefor containerized execution.
Dataset Environment (dataset_environment/dataset_env.py)
A highly configurable environment for working with Hugging Face datasets. For more details, see the Dataset Environment README.
Purpose:
- Allows users to easily define RL environments using existing datasets from Hugging Face Hub.
Input Format:
- Defined by the chosen Hugging Face dataset (user specifies prompt and answer fields).
System Prompt:
- Customizable by the user.
Reward Function:
- Highly flexible, supports a registry of predefined reward functions (e.g.,
accuracy,format,cosine_scaled) and allows users to create and register custom reward functions. Multiple reward functions can be combined with weights.
Configuration:
- Primarily through YAML files specifying dataset details, generation parameters, and reward functions.
Multimodal DPO Environments (multimodal_dpo/)
A collection of environments for Direct Preference Optimization (DPO) with multimodal inputs. These environments are designed for tasks that involve processing both text and images.
Files:
ocr_vqa.pypixmo_clocks.pypixmo_count.pypixmo_point_explanations.pyclevr_cogen_a_train.pyclevr_complex.py
Purpose:
- Training models on tasks such as Optical Character Recognition VQA, visual counting, and interpreting complex visual scenes (e.g., Clevr).
Input Format:
- Typically pairs of (image, text prompt) and corresponding preferred/dispreferred responses.
Reward Function:
- Based on the DPO mechanism, implicitly learned from preference data.
Game Environments (game_environments/)
This section covers environments based on interactive games.
Gymnasium Taxi (game_environments/gymnasium/gym_taxi.py)
- Game: Based on the classic Gymnasium Taxi-v3 environment.
- Task: The agent controls a taxi to pick up a passenger and drop them off at the correct location.
- Objective: Optimize for efficient navigation and task completion.
Gymnasium Blackjack (game_environments/gymnasium/blackjack/)
Two Blackjack environment implementations are provided. For more details, see the Blackjack README.
-
blackjack_env_no_thinking.py(Standard Blackjack):- Gameplay: A standard version of Blackjack.
- Objective: Achieve a hand total closer to 21 than the dealer without exceeding 21.
- Interaction: Designed for shorter episodes without complex intermediate "thinking" steps. Aiming to teach the LLM to be a better policy model in uncertain environments.
-
blackjack_env_thinking.py(Blackjack with Windowed Decision Making & Counterfactuals):- Gameplay: A more complex version designed for agents that produce long interaction sequences, including "thinking" steps.
- Features: Windowed decision making, local alternative generation, value-based pruning, and counterfactual data for training (GRPO).
- Use Case: Ideal for training LLMs that engage in explicit multi-step reasoning before action. Teaches the model to be more "confident" about selecting optimal moves & taking informed risks in uncertain environments, even with the knowledge that it might still lose with optimal play.
Instruction Following Environment (instruction_following_algorithm_environment.py)
Dependencies:
datasets(Hugging Face)langdetect
This environment was inspired by AllenAI's RLVR-IFEVAL environment and uses AllenAI's dataset from their Tulu3 paper and project:
- Dataset: https://huggingface.co/datasets/allenai/RLVR-IFeval
- Paper: https://arxiv.org/abs/2411.15124
Environment for training models to follow natural language instructions and constraints, based on the allenai/RLVR-IFeval dataset with advanced adaptive curriculum learning and comprehensive data management.
Input Format:
- Each item from the processed
allenai/RLVR-IFevaldataset contains:prompt: The user's instruction string.func_name: The string name of the verifier function (from a predefined map) used to check if the instruction is followed.args: A dictionary of arguments for the specified verifier function.
System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Reward Function:
- Score of 1.0 if the model's response correctly follows the instruction, as determined by the specific verifier function associated with the input prompt.
- Score of 0.0 if the response fails the verifier function or has malformed
<think>tags (must have exactly one opening and one closing tag). - Length penalty applied if all responses in a batch are correct (receive a score of 1.0 before penalty):
- No penalty for responses under 75% of max token length.
- Linear penalty scaling from 1.0 down to 0.0 for responses between 75% and 100% of max length.
- Returns None if all scores are identical after potential penalties (no learning signal).
Key Features:
1. Adaptive Curriculum System:
- Cycling Queue: Items are managed in an active training queue where solved items are removed from circulation
- Flexible Solving Criteria: Items can be marked as "solved" based on:
- Group average score >
max_group_average_for_training(default: 0.75) - too easy for training - Group average score ≥ 0.9 - mastered through high performance
- Single correct rollout when
solve_on_single_correct=True- immediate removal on any success
- Group average score >
- Attempt Tracking: Tracks how many times each item has been attempted
- Queue Reset: When all items are solved, the queue resets with previously solved items for continued training
- Comprehensive Logging: Shows task names, group average scores, solve reasons, and contextual messages
2. Dataset State Persistence:
- Automatic Dumping: Saves active queue every 100 iterations to
atropos/environments/datasets/remaining_unsolved.jsonl - Rich Metadata: Includes attempt counts, queue positions, iteration info, and curriculum state
- Resume Capability:
resume_from_unsolved_datasetconfig option to load from saved state - Conflict Handling: When both
dataset_nameandresume_from_unsolved_datasetare set:- Training items come from resume file (overrides dataset_name)
- Test/evaluation items come from dataset_name for consistent evaluation
- System validates compatibility and warns about mismatches
3. Data Dumping Infrastructure:
- Structured Conversations: Saves rollouts as proper chat conversations with role/content format
- Group Format: Data saved with group-level metadata including constraint details and group average scores
- Configurable Thresholds:
rollout_save_score_threshold(default: 0.7) for filtering quality rollouts - Failed Rollout Tracking: Separate
dump_failed_rolloutsoption for debugging constraint violations - Batch Processing: Automatic saving when buffers reach size limits (100 for rollouts, 50 for failed)
- Unique Identifiers: Each run gets a UUID for file organization
- Save Location:
atropos/environments/data_dumps/with descriptive filenames
4. Enhanced Logging and Monitoring:
- Log Suppression:
suppress_base_env_logs(default: True) reduces verbose base environment, httpx, and httpcore logs - Curriculum Metrics: WandB tracking of active items, solved items, percent solved, and average attempts
- Group-Level Insights: Shows which tasks are being mastered vs. which remain challenging
- Training Progress: Clear indication when groups are skipped for being too easy vs. used for training
Configuration Options (IFConfig):
dataset_name: Primary dataset (default: "allenai/RLVR-IFeval")dataset_config_name: Optional dataset configurationtest_set_ratio: Test set proportion (default: 0.05)dump_rollouts: Enable successful rollout saving (default: False)dump_failed_rollouts: Enable failed rollout saving for debugging (default: False)rollout_save_score_threshold: Minimum score for saving rollouts (default: 0.7)max_group_average_for_training: Skip groups above this score (default: 0.75)dataset_shuffle_seed: Reproducible dataset shuffling (default: 42)resume_from_unsolved_dataset: Path to resume file (default: None)suppress_base_env_logs: Reduce verbose logging (default: True)solve_on_single_correct: Mark item as solved if any rollout gets it correct (default: False)
Verifier Functions:
Comprehensive map of 24 verifier functions (IF_FUNCTIONS_MAP) covering diverse constraints:
- Content Requirements:
verify_keywords,verify_keyword_frequency,validate_forbidden_words - Format Constraints:
validate_json_format,validate_title,validate_quotation - Structure Requirements:
verify_paragraph_count,verify_bullet_points,validate_sections - Language Constraints:
validate_response_language,validate_uppercase,validate_lowercase - Length Requirements:
validate_word_constraint,verify_sentence_constraint - Special Formatting:
verify_postscript,validate_placeholders,validate_highlighted_sections - Response Patterns:
validate_repeat_prompt,validate_two_responses,validate_end - Character Constraints:
verify_letter_frequency,validate_no_commas - Advanced Features:
validate_choice,validate_frequency_capital_words
Usage Examples:
# Basic training
python instruction_following_algorithm_environment.py serve
# With data dumping enabled
python instruction_following_algorithm_environment.py serve \
--env.dump_rollouts=True \
--env.rollout_save_score_threshold=0.8
# Resume from previous session
python instruction_following_algorithm_environment.py serve \
--env.resume_from_unsolved_dataset="atropos/environments/datasets/remaining_unsolved.jsonl"
# Adjust difficulty threshold
python instruction_following_algorithm_environment.py serve \
--env.max_group_average_for_training=0.8
# Enable single-correct solving (remove items immediately when any rollout succeeds)
python instruction_following_algorithm_environment.py serve \
--env.solve_on_single_correct=True
Evaluation Metrics:
eval/percent_correct: Overall accuracy on test setcurriculum/active_items: Number of items still in training circulationcurriculum/solved_items: Number of items removed as solvedcurriculum/percent_solved: Percentage of total items solvedcurriculum/avg_attempts_active: Average attempts for items still in circulationtrain/percent_correct: Training accuracy with group-level insights
Specialized Dataset Processing:
- Robust parsing of
allenai/RLVR-IFevalformat with comprehensive error handling - Extraction of user instructions, verifier function names, and arguments
- Validation of verifier function availability in
IF_FUNCTIONS_MAP - Fallback to dummy dataset if primary dataset loading fails
- Configurable dataset shuffling for reproducible experiments
SWE-RL Environment (swe_rl_env.py)
Software Engineering Reinforcement Learning environment for training models to fix bugs based on issue descriptions and code context.
Dependencies:
datasets(Hugging Face)difflibwandbpydantic
Dataset:
- Default:
princeton-nlp/SWE-bench_Lite_oracle - Configurable via
SWERLEnvConfig(e.g.,dataset_name,dataset_split_train,dataset_split_eval).
Input Format (for the model via prompts):
problem_statement: The issue text.content: Relevant code segments from one or more files.
System Prompts:
- Thinking System Prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem. - Task System Prompt:
(Followed by instructions on the SEARCH/REPLACE format)A user will ask you to solve a task. You should generate the solution. Your response format must follow the template below:
User Prompt Template:
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{problem_statement}
--- END ISSUE ---
Below are some code segments, each from a relevant file. One or more of these files may contain bugs.
--- BEGIN FILE ---
``` {content} ```
--- END FILE ---
Please first localize the bug based on the issue statement, and then generate *SEARCH/REPLACE* edits to fix the issue.
Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE
Here is an example:
```python
### mathweb/flask/app.py
import math
from flask import Flask
Please note that the SEARCH/REPLACE edit REQUIRES PROPER INDENTATION. If you would like to add the line ’ print(x)’, you must fully write that out, with all those spaces before the code! Wrap each SEARCH/REPLACE edit in a code block as shown in the example above. If you have multiple SEARCH/REPLACE edits, use a separate code block for each one.
**Reward Function:**
- Primary reward is based on the `SequenceMatcher` ratio between the model's reconstructed generated patch and the oracle patch.
- A score of -1.0 is given initially.
- If the model's response has a `finish_reason` of "length", or if `<think>` tags are present but malformed, the reward remains -1.0 and advantage is set to zero for "length".
- If the SEARCH/REPLACE patch format is correctly parsed from the model's output (after potentially extracting content from `<think> </think>` tags):
- The `SequenceMatcher.ratio()` between the reconstructed predicted patch and the `oracle_patch_str` is used as the reward.
- Buffers track:
- `percent_format_correct_buffer`: Percentage of responses with correctly formatted patches.
- `similarity_score_buffer`: List of similarity scores for correctly formatted patches.
- `think_tags_present_buffer`: Percentage of responses where `<think>` tags were present.
- `think_tags_well_formed_buffer`: Percentage of responses where `<think>` tags were present AND well-formed.
**Evaluation Metrics:**
- `eval/avg_similarity_score_correct_patch_format`: Average similarity score for responses that had a correctly formatted patch.
- `eval/patch_format_accuracy`: Proportion of evaluation items where the patch was correctly formatted.
- `eval/pass_at_1`: Proportion of evaluation items where the patch was correct and achieved a similarity score of 1.0.
- `eval/avg_think_tags_present`: Average presence of think tags in evaluation responses.
- `eval/avg_think_tags_well_formed`: Average well-formedness of think tags in evaluation responses.
**Unique Configuration and Features:**
- **Dataset Handling:** Loads training and test data from Hugging Face datasets, specifically tailored for SWE-bench like formats.
- **Patch Parsing:** Implements robust parsing for a specific SEARCH/REPLACE patch format.
- **Thinking Tag Processing:** Extracts content after `<think> </think>`
---
### Text Reversal Environment (`text_reversal_environment.py`)
Environment for training and evaluating exact string reversal with optional thinking and split train/eval context lengths.
**Dataset:**
- `PrimeIntellect/Reverse-Text-SFT`
**Input Format:**
- Each item contains two `prompt` messages and one `completion` message:
- `prompt`: list of messages with roles {`system`, `user`}
- `completion`: list with a single assistant message containing the reversed text, wrapped in `<reversed_text>...</reversed_text>`
**Prompt Construction:**
- The dataset's system text is NOT used as a system message to the model.
- Instead, it is prepended to the user content with two newline separators and sent as the user turn:
- Effective user content: `"{dataset_system}\n\n{dataset_user}"`
- Optional thinking system prompt is included only when `use_thinking=True`.
**Reward Function:**
- Extract the model output after the first closing `</think>` tag (if present), trim whitespace.
- Score is 1.0 if the remaining output EXACTLY matches the dataset assistant `completion` content; otherwise 0.0.
**Configuration Options (`TextReversalEnvConfig`):**
- `use_thinking` (bool, default: False): include thinking system prompt.
- `dataset_name` (str, default: `PrimeIntellect/Reverse-Text-SFT`): training dataset.
- `eval_dataset_name` (Optional[str], default: None): static eval dataset to use (full split). If `None`, the environment samples `test_set_size` examples from the training dataset for eval.
- `test_set_size` (int, default: 100): number of samples for eval when `eval_dataset_name=None`.
- `max_train_token_length` (int, default: 16384): max tokens for training generations.
- `max_eval_token_length` (int, default: 32768): max tokens for eval generations.
**Usage Examples:**
```bash
# Basic training with default 16k train context, 32k eval context, and sampled eval set (100 examples)
python text_reversal_environment.py serve
# Enable thinking system prompt
python text_reversal_environment.py serve \
--env.use_thinking=True
# Use a static eval dataset instead of sampling from train
python text_reversal_environment.py serve \
--env.eval_dataset_name="someorg/Reverse-Text-EVAL"
# Override max token lengths if needed
python text_reversal_environment.py serve \
--env.max_train_token_length=12000 \
--env.max_eval_token_length=28000
Evaluation Metric:
eval/percent_correct: strict exact-match accuracy on the eval set.