Add cycling curriculum, difficulty threshold, update datadumps

2026-04-19 12:57:58 +00:00 · 2025-06-14 07:44:47 -07:00 · 2025-06-14 07:44:47 -07:00 · ad1bdf7f80
commit ad1bdf7f80
parent ab9e90a173
2 changed files with 737 additions and 60 deletions
--- a/environments/README.md
+++ b/environments/README.md
@ -343,7 +343,7 @@ This environment was inspired by AllenAI's RLVR-IFEVAL environment and uses Alle
 - Dataset: https://huggingface.co/datasets/allenai/RLVR-IFeval
 - Paper: https://arxiv.org/abs/2411.15124

-Environment for training models to follow natural language instructions and constraints, based on the `allenai/RLVR-IFeval` dataset and environment.
+Environment for training models to follow natural language instructions and constraints, based on the `allenai/RLVR-IFeval` dataset with advanced adaptive curriculum learning and comprehensive data management.

 **Input Format:**
 - Each item from the processed `allenai/RLVR-IFeval` dataset contains:
@ -358,24 +358,102 @@ You are a deep thinking AI, you may use extremely long chains of thought to deep

 **Reward Function:**
 - Score of 1.0 if the model's response correctly follows the instruction, as determined by the specific verifier function associated with the input prompt.
- Score of 0.0 if the response fails the verifier function.
+- Score of 0.0 if the response fails the verifier function or has malformed `<think>` tags (must have exactly one opening and one closing tag).
 - Length penalty applied if all responses in a batch are correct (receive a score of 1.0 before penalty):
-  - No penalty for responses under a certain percentage (e.g., 75%) of max token length.
-  - Linear penalty scaling from 1.0 down to 0.0 for responses between the threshold and 100% of max length.
+  - No penalty for responses under 75% of max token length.
+  - Linear penalty scaling from 1.0 down to 0.0 for responses between 75% and 100% of max length.
  - Returns None if all scores are identical after potential penalties (no learning signal).

-**Unique Configuration and Features:**
- **Dataset Configuration (`IFConfig`):
-  - `dataset_name`: Specifies the primary dataset to use (defaults to `allenai/RLVR-IFeval`).
-  - `dataset_config_name`: Optional name for a specific configuration or subset of the dataset.
-  - `test_set_ratio`: Defines the proportion of the dataset reserved for testing (defaults to 5%).
+**Key Features:**

- **Verifier-Based Scoring:** Utilizes a comprehensive map of verifier functions (`IF_FUNCTIONS_MAP`) to evaluate whether the model's
-output adheres to diverse and specific constraints defined in the input instructions (e.g., keyword presence, response length, JSON format, etc.).
+**1. Adaptive Curriculum System:**
+- **Cycling Queue**: Items are managed in an active training queue where solved items are removed from circulation
+- **Difficulty-Based Removal**: Items with group average score > `max_group_average_for_training` (default: 0.75) OR score ≥ 0.9 are marked as "solved"
+- **Attempt Tracking**: Tracks how many times each item has been attempted
+- **Queue Reset**: When all items are solved, the queue resets with previously solved items for continued training
+- **Comprehensive Logging**: Shows task names, group average scores, and contextual messages (e.g., "All correct in this group!", "All failed - format/constraint violations!")

- **Specialized Dataset Processing:** The `setup` method is specifically designed to parse the `allenai/RLVR-IFeval` dataset, extracting user instructions, the corresponding verifier function name, and its arguments.
+**2. Dataset State Persistence:**
+- **Automatic Dumping**: Saves active queue every 100 iterations to `atropos/environments/datasets/remaining_unsolved.jsonl`
+- **Rich Metadata**: Includes attempt counts, queue positions, iteration info, and curriculum state
+- **Resume Capability**: `resume_from_unsolved_dataset` config option to load from saved state
+- **Conflict Handling**: When both `dataset_name` and `resume_from_unsolved_dataset` are set:
+  - Training items come from resume file (overrides dataset_name)
+  - Test/evaluation items come from dataset_name for consistent evaluation
+  - System validates compatibility and warns about mismatches

- **Fallback Mechanism:** Includes a fallback to a small, predefined dummy dataset if the primary dataset (`allenai/RLVR-IFeval`) cannot be loaded, ensuring operational continuity for testing or development.
+**3. Data Dumping Infrastructure:**
+- **Structured Conversations**: Saves rollouts as proper chat conversations with role/content format
+- **Group Format**: Data saved with group-level metadata including constraint details and group average scores
+- **Configurable Thresholds**: `rollout_save_score_threshold` (default: 0.7) for filtering quality rollouts
+- **Failed Rollout Tracking**: Separate `dump_failed_rollouts` option for debugging constraint violations
+- **Batch Processing**: Automatic saving when buffers reach size limits (100 for rollouts, 50 for failed)
+- **Unique Identifiers**: Each run gets a UUID for file organization
+- **Save Location**: `atropos/environments/data_dumps/` with descriptive filenames
+
+**4. Enhanced Logging and Monitoring:**
+- **Log Suppression**: `suppress_base_env_logs` (default: True) reduces verbose base environment, httpx, and httpcore logs
+- **Curriculum Metrics**: WandB tracking of active items, solved items, percent solved, and average attempts
+- **Group-Level Insights**: Shows which tasks are being mastered vs. which remain challenging
+- **Training Progress**: Clear indication when groups are skipped for being too easy vs. used for training
+
+**Configuration Options (`IFConfig`):**
+- `dataset_name`: Primary dataset (default: "allenai/RLVR-IFeval")
+- `dataset_config_name`: Optional dataset configuration
+- `test_set_ratio`: Test set proportion (default: 0.05)
+- `dump_rollouts`: Enable successful rollout saving (default: False)
+- `dump_failed_rollouts`: Enable failed rollout saving for debugging (default: False)
+- `rollout_save_score_threshold`: Minimum score for saving rollouts (default: 0.7)
+- `max_group_average_for_training`: Skip groups above this score (default: 0.75)
+- `dataset_shuffle_seed`: Reproducible dataset shuffling (default: 42)
+- `resume_from_unsolved_dataset`: Path to resume file (default: None)
+- `suppress_base_env_logs`: Reduce verbose logging (default: True)
+
+**Verifier Functions:**
+Comprehensive map of 24 verifier functions (`IF_FUNCTIONS_MAP`) covering diverse constraints:
+- **Content Requirements**: `verify_keywords`, `verify_keyword_frequency`, `validate_forbidden_words`
+- **Format Constraints**: `validate_json_format`, `validate_title`, `validate_quotation`
+- **Structure Requirements**: `verify_paragraph_count`, `verify_bullet_points`, `validate_sections`
+- **Language Constraints**: `validate_response_language`, `validate_uppercase`, `validate_lowercase`
+- **Length Requirements**: `validate_word_constraint`, `verify_sentence_constraint`
+- **Special Formatting**: `verify_postscript`, `validate_placeholders`, `validate_highlighted_sections`
+- **Response Patterns**: `validate_repeat_prompt`, `validate_two_responses`, `validate_end`
+- **Character Constraints**: `verify_letter_frequency`, `validate_no_commas`
+- **Advanced Features**: `validate_choice`, `validate_frequency_capital_words`
+
+**Usage Examples:**
+```bash
+# Basic training
+python instruction_following_algorithm_environment.py serve
+
+# With data dumping enabled
+python instruction_following_algorithm_environment.py serve \
+    --env.dump_rollouts=True \
+    --env.rollout_save_score_threshold=0.8
+
+# Resume from previous session
+python instruction_following_algorithm_environment.py serve \
+    --env.resume_from_unsolved_dataset="atropos/environments/datasets/remaining_unsolved.jsonl"
+
+# Adjust difficulty threshold
+python instruction_following_algorithm_environment.py serve \
+    --env.max_group_average_for_training=0.8
+```
+
+**Evaluation Metrics:**
+- `eval/percent_correct`: Overall accuracy on test set
+- `curriculum/active_items`: Number of items still in training circulation
+- `curriculum/solved_items`: Number of items removed as solved
+- `curriculum/percent_solved`: Percentage of total items solved
+- `curriculum/avg_attempts_active`: Average attempts for items still in circulation
+- `train/percent_correct`: Training accuracy with group-level insights
+
+**Specialized Dataset Processing:**
+- Robust parsing of `allenai/RLVR-IFeval` format with comprehensive error handling
+- Extraction of user instructions, verifier function names, and arguments
+- Validation of verifier function availability in `IF_FUNCTIONS_MAP`
+- Fallback to dummy dataset if primary dataset loading fails
+- Configurable dataset shuffling for reproducible experiments

 ---