mirror of https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00
update environments readme
parent 2ab8905d4f
commit 90e235a3e9
1 changed file with 38 additions and 0 deletions
@ -88,6 +88,44 @@ You are a deep thinking AI, you may use extremely long chains of thought to deep
- Linear penalty scaling from 1.0 down to 0.0 for responses between 50% and 100% of max length
- Returns None if all scores are identical (no learning signal)

---

### Instruction Following Environment (`instruction_following_algorithm_environment.py`)

Environment for training models to follow natural language instructions and constraints, based on the `allenai/RLVR-IFeval` dataset and environment.

**Input Format:**
- Each item from the processed `allenai/RLVR-IFeval` dataset contains:
  - `prompt`: The user's instruction string.
  - `func_name`: The string name of the verifier function (from a predefined map) used to check if the instruction is followed.
  - `args`: A dictionary of arguments for the specified verifier function.
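
The item layout above can be illustrated with a minimal sketch of how `func_name` and `args` might be used to dispatch to a verifier. The two verifier functions, the `VERIFIERS` map, and `check_item` are hypothetical stand-ins written for this example; they are not the functions shipped with the environment.

```python
from typing import Any, Callable, Dict, List


# Hypothetical verifier: passes if every keyword appears in the response.
def verify_keywords(response: str, keyword_list: List[str]) -> bool:
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keyword_list)


# Hypothetical verifier: passes if the response has at most `max_words` words.
def verify_max_words(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words


# Stand-in for the predefined map from verifier name to verifier function.
VERIFIERS: Dict[str, Callable[..., bool]] = {
    "verify_keywords": verify_keywords,
    "verify_max_words": verify_max_words,
}


def check_item(item: Dict[str, Any], response: str) -> bool:
    """Look up the verifier named by `func_name` and call it with `args`."""
    verifier = VERIFIERS[item["func_name"]]
    return bool(verifier(response, **item["args"]))


# Example item using the three fields described above (values are made up).
item = {
    "prompt": "Describe the ocean and mention 'tide' and 'salt'.",
    "func_name": "verify_keywords",
    "args": {"keyword_list": ["tide", "salt"]},
}
print(check_item(item, "Salt water rises and falls with the tide."))  # True
```

Keeping the verifier name as a string in the dataset and resolving it through a map means each item carries its own check without storing code in the data.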

**System Prompt:**
```
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
```

**Reward Function:**
- Score of 1.0 if the model's response correctly follows the instruction, as determined by the specific verifier function associated with the input prompt.
- Score of 0.0 if the response fails the verifier function.
- Length penalty applied if all responses in a batch are correct (receive a score of 1.0 before penalty):
  - No penalty for responses under a certain percentage (e.g., 75%) of max token length.
  - Linear penalty scaling from 1.0 down to 0.0 for responses between the threshold and 100% of max length.
- Returns None if all scores are identical after potential penalties (no learning signal).
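
The scoring rules above can be sketched as a single function. This is an illustration under stated assumptions, not the environment's actual code: `score_group` and its parameter names are hypothetical, and the 0.75 threshold is only the example value mentioned in the list.

```python
from typing import List, Optional


def score_group(
    passed: List[bool],
    token_lengths: List[int],
    max_token_length: int,
    threshold: float = 0.75,  # assumed; the text only gives 75% as an example
) -> Optional[List[float]]:
    """Score one group of responses; return None when there is no signal."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")

    # Binary verifier outcome: 1.0 for a pass, 0.0 for a failure.
    scores = [1.0 if ok else 0.0 for ok in passed]

    # The length penalty only applies when every response passed.
    if all(score == 1.0 for score in scores):
        cutoff = threshold * max_token_length
        penalized = []
        for length in token_lengths:
            if length <= cutoff:
                penalized.append(1.0)  # short enough: no penalty
            else:
                # Fraction of the way from the cutoff to the max length.
                overshoot = (length - cutoff) / (max_token_length - cutoff)
                penalized.append(max(0.0, 1.0 - overshoot))
        scores = penalized

    # Identical scores carry no learning signal.
    if all(score == scores[0] for score in scores):
        return None
    return scores


print(score_group([True, False], [10, 10], 1000))   # [1.0, 0.0]
print(score_group([True, True], [500, 875], 1000))  # [1.0, 0.5]
print(score_group([True, True], [100, 200], 1000))  # None
```

In the second call the cutoff is 750 tokens, so the 875-token response sits halfway between the cutoff and the maximum and keeps half of its score.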

**Unique Configuration and Features:**
- **Dataset Configuration (`IFConfig`):**
  - `dataset_name`: Specifies the primary dataset to use (defaults to `allenai/RLVR-IFeval`).
  - `dataset_config_name`: Optional name for a specific configuration or subset of the dataset.
  - `test_set_ratio`: Defines the proportion of the dataset reserved for testing (defaults to 5%).

- **Verifier-Based Scoring:** Utilizes a comprehensive map of verifier functions (`IF_FUNCTIONS_MAP`) to evaluate whether the model's output adheres to diverse and specific constraints defined in the input instructions (e.g., keyword presence, response length, JSON format, etc.).

- **Specialized Dataset Processing:** The `setup` method is specifically designed to parse the `allenai/RLVR-IFeval` dataset, extracting user instructions, the corresponding verifier function name, and its arguments.

- **Fallback Mechanism:** Includes a fallback to a small, predefined dummy dataset if the primary dataset (`allenai/RLVR-IFeval`) cannot be loaded, ensuring operational continuity for testing or development.
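
The configuration fields, the fallback, and the test split described above might fit together roughly as follows. Everything here is a hypothetical sketch: `IFConfigSketch` is a plain dataclass rather than the real `IFConfig`, `DUMMY_ITEMS` is made up, and the dataset loader is passed in as a callable so the example runs offline (in practice it could wrap something like Hugging Face `datasets.load_dataset`).

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Item = Dict[str, Any]


@dataclass
class IFConfigSketch:
    """Hypothetical mirror of the three fields listed above."""
    dataset_name: str = "allenai/RLVR-IFeval"
    dataset_config_name: Optional[str] = None
    test_set_ratio: float = 0.05


# Made-up stand-in for the small predefined dummy dataset.
DUMMY_ITEMS: List[Item] = [
    {"prompt": "Answer in at most five words.",
     "func_name": "verify_max_words", "args": {"max_words": 5}},
    {"prompt": "Mention the word 'blue'.",
     "func_name": "verify_keywords", "args": {"keyword_list": ["blue"]}},
]


def load_items(
    config: IFConfigSketch,
    loader: Callable[[str, Optional[str]], Iterable[Item]],
) -> List[Item]:
    """Try the primary dataset; fall back to the dummy items on any error."""
    try:
        return list(loader(config.dataset_name, config.dataset_config_name))
    except Exception as error:
        print(f"Could not load {config.dataset_name}: {error}; using dummy data")
        return list(DUMMY_ITEMS)


def split_items(
    items: List[Item], test_set_ratio: float, seed: int = 0
) -> Tuple[List[Item], List[Item]]:
    """Shuffle deterministically and reserve a fraction for testing."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test item so evaluation is never empty.
    test_size = max(1, int(round(len(shuffled) * test_set_ratio)))
    return shuffled[test_size:], shuffled[:test_size]


def offline_loader(name: str, config_name: Optional[str]) -> Iterable[Item]:
    raise OSError("no network access")  # simulate a failed download


config = IFConfigSketch()
items = load_items(config, offline_loader)
train, test = split_items(items, config.test_set_ratio)
print(len(train), len(test))  # 1 1
```

Injecting the loader keeps the fallback path easy to exercise in tests without network access.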

## Common Features

All environments share these common features: