Added new env info

2026-04-19 12:57:58 +00:00 · 2025-05-16 16:44:33 -07:00 · 2025-05-16 16:44:33 -07:00 · fd63c76a5c
commit fd63c76a5c
parent 8d0a326488
2 changed files with 164 additions and 2 deletions
--- a/environments/README.md
+++ b/environments/README.md
@@ -88,6 +88,160 @@ You are a deep thinking AI, you may use extremely long chains of thought to deep
  - Linear penalty scaling from 1.0 down to 0.0 for responses between 50% and 100% of max length
  - Returns None if all scores are identical (no learning signal)
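
The two context lines above describe a length penalty and a "no learning signal" check. A minimal sketch of how they might fit together, assuming the penalty is a multiplier applied to each score; the function names `length_multiplier` and `finalize_scores` are hypothetical and not taken from the repository:

```python
from typing import List, Optional


def length_multiplier(response_len: int, max_len: int) -> float:
    """Assumed schedule: 1.0 up to 50% of max_len, falling linearly to 0.0 at 100%."""
    threshold = 0.5 * max_len
    if response_len <= threshold:
        return 1.0
    if response_len >= max_len:
        return 0.0
    return 1.0 - (response_len - threshold) / (max_len - threshold)


def finalize_scores(scores: List[float]) -> Optional[List[float]]:
    """Return None when every score is identical, since the group carries no learning signal."""
    if len(set(scores)) <= 1:
        return None
    return scores
```

For example, with `max_len=100` a 75-token response would keep half of its score under this assumed schedule.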

+---
+
+### RLAIF Server Environment (`rlaif_server.py`)
+
+Environment for Reinforcement Learning from AI Feedback (RLAIF). Used for aligning models to specific personalities or styles based on AI-generated preferences or reward signals.
+
+**Input Format:**
+- Prompts for which the model generates responses; a reward model or preference model then scores those responses to guide the LLM's behavior. Specifics depend on the RLAIF setup.
+
+**System Prompt:**
+- Varies based on the desired personality/style (e.g., "Egregore," "Ascension Maze").
+
+**Reward Function:**
+- Based on the output of an AI judge/reward model, designed to score responses according to the target alignment criteria.
+
+---
+
+### Financial Fundamentals Prediction Environment (`fundamental_prediction_environment.py`)
+
+Environment for training models to predict financial fundamentals using the "NousResearch/company-fundamentals-prediction-lite" dataset.
+
+**Input Format:**
+- Each item contains:
+  - `context`: company fundamentals, news, and macroeconomic data
+  - `fundamental_metric`: the metric to predict (e.g., revenue, EPS)
+  - `answer`: the ground-truth direction ("maintained", "raised", or "reduced")
+  - `magnitude`: the ground-truth percentage change
+- The model analyzes the `context` to predict the `answer` and `magnitude` for the given `fundamental_metric`.
+
+**Task:**
+- Predict directional changes and magnitude for company financial fundamentals.
+
+**Reward Function:**
+- Based on the accuracy of predictions for both direction and magnitude.
+
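The README text above only says the reward depends on both direction and magnitude. One way such a score could be shaped is sketched below; the gating on direction, the `magnitude_weight` parameter, and the closeness formula are all assumptions made for illustration, not the environment's actual scoring:

```python
def score_prediction(
    pred_answer: str,
    pred_magnitude: float,
    true_answer: str,
    true_magnitude: float,
    magnitude_weight: float = 0.5,  # hypothetical weight for the magnitude bonus
) -> float:
    """Assumed scheme: a wrong direction scores 0; a correct one scores 1 plus a closeness bonus."""
    if pred_answer.strip().lower() != true_answer.strip().lower():
        return 0.0
    # Relative error of the predicted percentage change, floored to avoid dividing by ~0.
    error = abs(pred_magnitude - true_magnitude)
    scale = max(abs(true_magnitude), 1.0)
    closeness = max(0.0, 1.0 - error / scale)
    return 1.0 + magnitude_weight * closeness
```

Gating the magnitude bonus on a correct direction keeps a model from earning credit for a well-sized move in the wrong direction.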
+---
+
+### Math Server Environment (`math_server.py`)
+
+A versatile math problem-solving environment supporting multiple datasets and operational modes.
+
+**Datasets:**
+- Integrates `gsm8k` (various subsets), `competition_math`, `math_qa`, and `MetaMathQA`.
+
+**Operational Modes:**
+- Supports standard problem solving, RLAIF (Reinforcement Learning from AI Feedback) for preference learning between solutions, a "judge" mode for evaluating solution correctness, and a "retry/self-correct" mode utilizing feedback on previous attempts.
+
+**Input Format:**
+- Mathematical problems, varying slightly by operational mode (e.g., including solutions for judging/RLAIF).
+
+**System Prompt:**
+- Dynamically constructed based on the operational mode. For standard problem solving, the prompt focuses on the problem itself. Other modes include specific instructions for judging, preference selection, or self-correction.
+
+**Reward Function:**
+- Based on the correctness of the mathematical solution, with variations depending on the mode (e.g., preference scores in RLAIF).
+
+---
+
+### Math Server Zero Environment (`math_server_zero.py`)
+
+A math problem-solving environment using the "zwhe99/DeepMath-103K" dataset, with a structured prompt format inspired by the Open-Reasoner-Zero project.
+
+**Input Format:**
+- Mathematical problems from the "zwhe99/DeepMath-103K" dataset.
+
+**System Prompt Structure:**
+- Uses a conversational format in which the AI is instructed to think first (inside `<think> </think>` tags) and then give the answer (inside `<answer> </answer>` tags, with the final numerical answer in `\boxed{}`). The prompt is built from two templates:
+  - `prompt_format = "A conversation between User and Assistant... User: {prompt}\nAssistant: <think>"`
+  - `problem_format = "You must put your answer inside <answer> </answer> tags... This is the problem:\n{problem}"`
+
+**Reward Function:**
+- Based on the correctness of the mathematical solution within the `<answer>` tag, verified using LaTeX parsing.
+
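As an illustration of the `<think>` / `<answer>` / `\boxed{}` format described above, the following sketch checks a completion and compares the boxed answer to a reference. It is only a stand-in: the README says the environment verifies answers with LaTeX parsing, whereas this version uses whitespace-insensitive string equality, and the helper names `extract_boxed` and `score_completion` are hypothetical.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces


def score_completion(completion: str, gold: str) -> float:
    """1.0 if the format is respected and the boxed answer matches gold, else 0.0."""
    # The prompt already ends with "<think>", so the completion should close it exactly once.
    if completion.count("</think>") != 1:
        return 0.0
    after_think = completion.split("</think>", 1)[1]
    match = ANSWER_RE.search(after_think)
    if match is None:
        return 0.0
    boxed = extract_boxed(match.group(1))
    if boxed is None:
        return 0.0
    # Assumption: plain string comparison instead of real LaTeX equivalence checking.
    return 1.0 if boxed.replace(" ", "") == gold.replace(" ", "") else 0.0
```

String equality would mark `0.5` and `\frac{1}{2}` as different, which is exactly the gap a LaTeX-aware verifier closes.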
+---
+
+### Coding Server Environment (`code_execution_server/coding_server.py`)
+
+Environment for training models to generate and potentially execute code.
+
+**Input Format:**
+- Coding problems or prompts (e.g., from datasets like MBPP, HumanEval).
+
+**System Prompt:**
+- Instructs the model to generate code for a given problem.
+
+**Reward Function:**
+- Based on correctness of the generated code, often involving execution and unit test passing.
+- The `code_execution_server/` directory also contains a `Dockerfile` for containerized execution.
+
+---
+
+### Dataset Environment (`dataset_environment/dataset_env.py`)
+
+A highly configurable environment for working with Hugging Face datasets. For more details, see the [Dataset Environment README](dataset_environment/README.md).
+
+**Purpose:**
+- Allows users to easily define RL environments using existing datasets from Hugging Face Hub.
+
+**Input Format:**
+- Defined by the chosen Hugging Face dataset (user specifies prompt and answer fields).
+
+**System Prompt:**
+- Customizable by the user.
+
+**Reward Function:**
+- Highly flexible: supports a registry of predefined reward functions (e.g., `accuracy`, `format`, `cosine_scaled`) and lets users create and register custom ones. Multiple reward functions can be combined with weights.
+
+**Configuration:**
+- Primarily through YAML files specifying dataset details, generation parameters, and reward functions.
+
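The registry-plus-weights idea described above can be sketched as follows. The names `accuracy` and `format` come from the README, but their bodies here, the `register_reward` decorator, and the normalization by total weight are assumptions for illustration rather than the environment's real API:

```python
from typing import Callable, Dict, List, Tuple

RewardFn = Callable[[str, str], float]

# Hypothetical registry mapping a reward name to its scoring function.
REWARD_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(name: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that stores a reward function under the given name."""
    def decorator(fn: RewardFn) -> RewardFn:
        REWARD_REGISTRY[name] = fn
        return fn
    return decorator


@register_reward("accuracy")
def accuracy(completion: str, answer: str) -> float:
    # Toy check: the reference answer appears somewhere in the completion.
    return 1.0 if answer.strip() in completion else 0.0


@register_reward("format")
def format_reward(completion: str, answer: str) -> float:
    # Toy check: the completion wraps its answer in <answer> tags.
    return 1.0 if "<answer>" in completion and "</answer>" in completion else 0.0


def combined_reward(completion: str, answer: str, weighted: List[Tuple[str, float]]) -> float:
    """Weighted average of the named reward functions (normalization is an assumption)."""
    total_weight = sum(weight for _, weight in weighted)
    total = sum(weight * REWARD_REGISTRY[name](completion, answer) for name, weight in weighted)
    return total / total_weight
```

A call such as `combined_reward(text, "42", [("accuracy", 2.0), ("format", 1.0)])` would then weight correctness twice as heavily as formatting.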
+---
+
+### Multimodal DPO Environments (`multimodal_dpo/`)
+
+A collection of environments for Direct Preference Optimization (DPO) with multimodal inputs. These environments are designed for tasks that involve processing both text and images.
+
+**Files:**
+- `ocr_vqa.py`
+- `pixmo_clocks.py`
+- `pixmo_count.py`
+- `pixmo_point_explanations.py`
+- `clevr_cogen_a_train.py`
+- `clevr_complex.py`
+
+**Purpose:**
+- Training models on tasks such as Optical Character Recognition (OCR) VQA, visual counting, and interpreting complex visual scenes (e.g., CLEVR).
+
+**Input Format:**
+- Typically pairs of (image, text prompt) and corresponding preferred/dispreferred responses.
+
+**Reward Function:**
+- Based on the DPO mechanism, implicitly learned from preference data.
+
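For readers unfamiliar with how DPO turns a preference pair into a training signal, here is a small sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model. The function name and the default `beta` are illustrative choices, and the sketch does not show how these environments compute or log the loss:

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to keep exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses the loss is `log 2`; it falls as the policy shifts probability toward the preferred response relative to the reference.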
+---
+
+### Game Environments (`game_environments/`)
+
+This section covers environments based on interactive games.
+
+#### Gymnasium Taxi (`game_environments/gymnasium/gym_taxi.py`)
+
+- **Game:** Based on the classic Gymnasium Taxi-v3 environment.
+- **Task:** The agent controls a taxi to pick up a passenger and drop them off at the correct location.
+- **Objective:** Optimize for efficient navigation and task completion.
+
+#### Gymnasium Blackjack (`game_environments/gymnasium/blackjack/`)
+
+Two Blackjack environment implementations are provided. For more details, see the [Blackjack README](game_environments/gymnasium/blackjack/README.md).
+
+- **`blackjack_env_no_thinking.py` (Standard Blackjack):**
+    - **Gameplay:** A standard version of Blackjack.
+    - **Objective:** Achieve a hand total closer to 21 than the dealer without exceeding 21.
+    - **Interaction:** Designed for shorter episodes without complex intermediate "thinking" steps. Aims to teach the LLM to be a better policy model in uncertain environments.
+
+- **`blackjack_env_thinking.py` (Blackjack with Windowed Decision Making & Counterfactuals):**
+    - **Gameplay:** A more complex version designed for agents that produce long interaction sequences, including "thinking" steps.
+    - **Features:** Windowed decision making, local alternative generation, value-based pruning, and counterfactual data for training (GRPO).
+    - **Use Case:** Ideal for training LLMs that engage in explicit multi-step reasoning before acting. Teaches the model to be more "confident" about selecting optimal moves and taking informed risks in uncertain environments, even knowing that it might still lose with optimal play.
+
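The Blackjack objective stated above (get closer to 21 than the dealer without going over) can be made concrete with a short sketch. It assumes the common convention of representing an ace as 1 and face cards as 10, with one ace counted as 11 when that does not bust the hand; the function names and the +1 / 0 / -1 payoffs are illustrative and not taken from the environment files:

```python
from typing import List


def hand_total(cards: List[int]) -> int:
    """Best total for a hand where aces are stored as 1 and face cards as 10."""
    total = sum(cards)
    # Count a single ace as 11 (i.e. add 10) when that keeps the hand at 21 or below.
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total


def outcome(player: List[int], dealer: List[int]) -> float:
    """Final payoff from the player's perspective: +1 win, 0 draw, -1 loss."""
    player_total, dealer_total = hand_total(player), hand_total(dealer)
    if player_total > 21:
        return -1.0  # the player busts first and loses regardless of the dealer
    if dealer_total > 21 or player_total > dealer_total:
        return 1.0
    if player_total == dealer_total:
        return 0.0
    return -1.0
```

Because the payoff depends on cards not yet seen, even an optimal policy loses some hands, which is the uncertainty both Blackjack variants are meant to expose the model to.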
 ## Common Features

 All environments share these common features:
@@ -112,6 +266,12 @@ All environments share these common features:
   - Comprehensive metrics logging
   - Support for multiple model completions per prompt

+5. **Detailed Documentation:**
+   - Most environments, especially the more complex ones, include a detailed `README.md` in their subdirectory with specific context and usage instructions.
+
+6. **Additional Libraries:**
+   - If an environment requires specific libraries not covered by the main project dependencies, its subdirectory may include a `requirements.txt` file for easy installation via `pip`, or provide installation instructions in its `README.md`.
+
 ## Usage

 Each environment can be initialized with: