Cavit Erginsoy 2025-02-03 11:35:30 +00:00
parent 1e27021e11
commit 6c564b3dd9
13 changed files with 305 additions and 317 deletions


@ -6,21 +6,21 @@ This project generates a dataset of word ladder puzzles and (optionally) submits
The project consists of several key components:

- **`main.py`**:
  Orchestrates the overall flow. It performs the following tasks:
  1. Generates a dataset of word ladder puzzles by calling functions from `utils/create_word_ladders.py`.
  2. (Optionally) Triggers the reasoning request process to augment puzzles with chain-of-thought reasoning via `utils/generate_reasoning.py`.
  3. (Planned) Additional steps such as checking results or uploading the final dataset.
  The configuration for the dataset parameters (e.g., word length, chain length, and dataset size) is centralized here, making it easy to adjust the settings as needed.
- **`utils/create_word_ladders.py`**:
  Contains functions to create and validate a word ladder dataset. It leverages underlying modules (e.g., `reasoning_gym`) to generate individual puzzles and ensures uniqueness across the dataset.
- **`utils/generate_reasoning.py`**:
  Reads the generated dataset (in JSONL format), then filters out puzzles that already have reasoning. For puzzles missing chain-of-thought data, it splits them into batches (with a default batch size that you can adjust) and submits each batch to Anthropic's Message Batches API. Each API request includes the puzzle along with a custom system prompt (read from `system_prompt.txt`), and the resulting metadata is stored for later retrieval and analysis.
- **`usage_stats.py`**:
  Analyzes API response files to compute detailed usage statistics. This script:
  - Extracts token usage metrics such as `input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`, and `output_tokens`.
  - Calculates costs based on pricing data and shows the savings achieved through prompt caching.
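As a rough sketch of the filter-and-batch step described above (the function names and the `reasoning` field are assumptions for illustration, not the project's actual code):

```python
import json
from pathlib import Path
from typing import Dict, List


def load_puzzles_without_reasoning(jsonl_path: str) -> List[Dict]:
    """Read the JSONL dataset and keep only puzzles lacking chain-of-thought data."""
    puzzles = []
    for line in Path(jsonl_path).read_text().splitlines():
        record = json.loads(line)
        if not record.get("reasoning"):  # assumed field name
            puzzles.append(record)
    return puzzles


def split_into_batches(items: List, batch_size: int = 100) -> List[List]:
    """Split the remaining puzzles into fixed-size batches for submission."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each batch would then be turned into one Message Batches request, with one entry per puzzle keyed by its `custom_id`.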
@ -29,23 +29,23 @@ The project consists of several key components:
## Warning

**Caution:**
Running large batches of requests via the Anthropic API (especially in `generate_reasoning.py`) can incur significant costs in Anthropic credits. **Please review and understand your API quota and budgeting before running the API call.** If you are just testing or working with a demo dataset, ensure you adjust the batch size or dataset size appropriately to avoid unexpected charges.

## Prerequisites

- **Python Version:** Python 3.7+
- **Dependencies:**
  - `tqdm`
  - `anthropic`
  - `reasoning_gym`
- **Environment Variables:**
  For generating reasoning batches, set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY=your_api_key_here
```
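In Python you can fail fast if the key is missing before submitting any batches; a minimal check (illustrative, not part of the repo):

```python
import os


def require_api_key() -> str:
    """Return the Anthropic API key from the environment, or raise a clear error."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; export it before running.")
    return key
```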
## Directory Structure
```
examples/word_ladder/
@ -62,15 +62,15 @@ examples/word_ladder/
The dataset generation parameters are centralized in `main.py` under the `config` dictionary. You can adjust settings like:

- **Word Length:**
  - `min_word_length`
  - `max_word_length`
- **Chain Length:**
  - `min_chain_length` (e.g., set to -1 for the shortest possible chain)
  - `max_chain_length`
- **Dataset Size:**
  - `size` — the number of puzzles to generate (e.g., `1000` for a demo)
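Using the example values above, the `config` dictionary might look like this (the values are the examples given here, not a recommendation):

```python
config = {
    "dataset_name": "word_ladder",
    "dataset_config": {
        "min_word_length": 3,
        "max_word_length": 5,
        "min_chain_length": -1,  # -1 requests the shortest possible chain
        "max_chain_length": 10,
        "size": 1000,  # demo-sized dataset
    },
}
```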
## How to Run
@ -114,19 +114,18 @@ The dataset generation parameters are centralized in `main.py` under the `config
## Troubleshooting

- **File Paths:**
  Verify that `system_prompt.txt` is in the `/examples/word_ladder` folder as expected. The modules use paths relative to their location.
- **Environment Variables:**
  Make sure your `ANTHROPIC_API_KEY` is set correctly when submitting API requests.
- **Output Directory Permissions:**
  Ensure the `output` directory exists and is writable by your user.
- **Cost Monitoring:**
  Check your Anthropic API usage and account balance before running large batches to avoid unexpected costs.
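One way to keep an eye on costs is to total the usage counters that `usage_stats.py` extracts. The per-million-token prices below are placeholders; verify them against Anthropic's current pricing before relying on the numbers:

```python
# Placeholder USD prices per million tokens -- verify against current pricing.
PRICES_PER_MTOK = {
    "input_tokens": 3.00,
    "cache_creation_input_tokens": 3.75,
    "cache_read_input_tokens": 0.30,
    "output_tokens": 15.00,
}


def response_cost(usage: dict) -> float:
    """USD cost of one API response's `usage` block."""
    return sum(usage.get(field, 0) / 1_000_000 * price
               for field, price in PRICES_PER_MTOK.items())
```

Note how cheaply cache reads are priced relative to regular input tokens; this is where the prompt-caching savings reported by `usage_stats.py` come from.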
## License

This project is licensed under the MIT License.


@ -6,10 +6,10 @@ main.py Orchestrates the overall flow:
3. Upload the final dataset to HuggingFace Hub (if needed)
"""
import sys
import uuid
from pathlib import Path
from typing import Any, Dict

from examples.word_ladder.utils import create_word_ladders, generate_reasoning
@ -17,7 +17,7 @@ from examples.word_ladder.utils import create_word_ladders, generate_reasoning
def create_dataset(jsonl_path: Path, config: Dict[str, Any]) -> bool:
    """
    Creates the word ladder dataset, handling potential exhaustion gracefully.
    Returns:
        bool: True if dataset was created (even if truncated), False if creation failed
    """
@ -25,7 +25,7 @@ def create_dataset(jsonl_path: Path, config: Dict[str, Any]) -> bool:
        print("Step 1: Algorithmically creating word ladder chains...")
        create_word_ladders.create_word_ladder_dataset(str(jsonl_path), config=config)
        return True
    except IndexError as e:
        # Dataset was exhausted but some examples were generated
        print("\nNote: Dataset generation stopped early due to exhaustion of unique puzzles.")
@ -34,23 +34,24 @@ def create_dataset(jsonl_path: Path, config: Dict[str, Any]) -> bool:
            print("Continuing with the partial dataset that was successfully generated.")
            return True
        return False
    except Exception as e:
        # Unexpected error during dataset creation
        print(f"\nError: Failed to create dataset: {str(e)}")
        return False
def main():
    # Centralized configuration for the dataset
    config = {
        "dataset_name": "word_ladder",
        "dataset_config": {
            "min_word_length": 3,
            "max_word_length": 3,
            "min_chain_length": -1,  # set to -1 for the shortest possible path
            "max_chain_length": 7,
            "size": 2000,  # Generate a small-ish dataset for demonstration
        },
    }
    # Generate a friendly unique identifier and compose the file path
@ -64,21 +65,20 @@ def main():
        print("Exiting due to dataset creation failure.")
        sys.exit(1)
    # Step 2: Generate reasoning
    '''
    try:
        print("\nStep 2: Submitting reasoning batches for the dataset...")
        generate_reasoning.submit_reasoning_batches(input_path=str(jsonl_path))
    except Exception as e:
        print(f"\nError: Failed to submit reasoning batches: {str(e)}")
        sys.exit(1)
    '''

    # Step 3: Check Anthropic batch results
    # Step 4: Upload to HuggingFace 🤗
    print("\nComplete!")
if __name__ == "__main__":
    main()


@ -2,4 +2,4 @@
{"custom_id":"CONKS_GRAYS_1","result":{"type":"succeeded","message":{"id":"msg_01JrxHJyxFxsTVsn8huufv8T","type":"message","role":"assistant","model":"claude-3-5-sonnet-20241022","content":[{"type":"text","text":"<think>\nAlright, let me try to transform CONKS into GRAYS by changing one letter at a time. 
First, let me write down both words to analyze what needs to change:\n\nCONKS\nGRAYS\n\nLooking at the letters that need to change:\nC needs to become G\nO needs to become R\nN needs to become A\nK needs to become Y\nS stays S\n\nHmm, four letters need to change, but S stays the same at the end. Let me start by looking at possible changes from CONKS.\n\nFirst letter options from CONKS:\n- Change C to B: BONKS (valid)\n- Change C to D: DONKS (valid)\n- Change C to G: GONKS (is this valid? Not sure)\n- Change C to H: HONKS (valid)\n\nSecond letter options:\n- Change O to A: CANKS (is this valid? Not sure)\n- Change O to I: CINKS (probably not valid)\n- Change O to O: stays same\n- Change O to U: CUNKS (not sure if valid)\n\nWait, maybe instead of changing the first letter right away to G, let's try working with the O. What if we change O to another letter that might help us get closer? \n\nLet me try: CONKS → COOKS (changing N to O)\nYes! COOKS is definitely a valid word. That feels like progress.\n\nFrom COOKS, what can we do?\n- Change first C to G? GOOKS (probably not appropriate)\n- Change second O to R? No, need to keep both O's for now\n- Change K to something? COOLS? No, moving away from target\n- Change S? No, need to keep S\n\nWhat about changing one of the O's to something? COOKS → COOPS (changing K to P)\nYes! COOPS is valid (like chicken coops). This feels promising.\n\nNow from COOPS:\n- Change C to G? GOOPS (not sure if valid)\n- Change first O to R? CROPS! Yes, that works!\n\nCROPS is a valid word. Now we're getting somewhere. Let's see what we can do from CROPS:\n- Change O to A? CRAPS! That's valid and gets us closer to GRAYS\n\nFrom CRAPS:\n- Change P to Y? CRAYS! Perfect!\n\nAnd finally, from CRAYS:\n- Change C to G: GRAYS! We've reached the target!\n\nLet me verify this path:\n1. CONKS → COOKS (N to O)\n2. COOKS → COOPS (K to P)\n3. COOPS → CROPS (O to R)\n4. CROPS → CRAPS (O to A)\n5. CRAPS → CRAYS (P to Y)\n6. 
CRAYS → GRAYS (C to G)\n\nLet me check each word again:\nCONKS - valid (to hit on head)\nCOOKS - valid\nCOOPS - valid (enclosures)\nCROPS - valid\nCRAPS - valid\nCRAYS - valid (types of fish)\nGRAYS - valid\n\nThis seems to work perfectly! Each step changes exactly one letter, and all words are valid English words. Let me double-check if there might be a shorter path.\n\nCould we go CONKS → CONES? No, that doesn't help get to GRAYS.\nWhat about CONKS → CRANKS? No, that's two letter changes.\nOr CONKS → CORKS? Again, doesn't seem to lead anywhere useful.\n\nThe path we found:\nCONKS → COOKS → COOPS → CROPS → CRAPS → CRAYS → GRAYS\nseems optimal because:\n1. Each step is a valid word\n2. Each change moves us closer to our target\n3. We maintain the final S throughout\n4. We gradually transform the word while keeping it valid at each step\n\nYes, I think this is the solution. Six steps, changing one letter at a time, all valid words, and each change gets us closer to our target GRAYS.\n</think>"}],"stop_reason":"end_turn","stop_sequence":null,"usage":{"input_tokens":64,"cache_creation_input_tokens":0,"cache_read_input_tokens":6434,"output_tokens":956}}}}
{"custom_id":"WINGS_WORMY_2","result":{"type":"succeeded","message":{"id":"msg_01KXJZXTZgQoTnLvMqHphEcS","type":"message","role":"assistant","model":"claude-3-5-sonnet-20241022","content":[{"type":"text","text":"<think>\nOkay, let me tackle this word ladder from WINGS to WORMY. 
First, I'll write both words down and analyze what letters need to change:\n\nWINGS\nWORMY\n\nLooking at the letters:\nW stays the same\nI needs to become O\nN needs to become R\nG needs to become M\nS needs to become Y\n\nSo four letters need to change, keeping only the W. Let me start exploring possible paths by changing one letter at a time.\n\nStarting with WINGS, what can I change? Let's look at each position:\n- First letter (W): stays the same, so no need to change\n- Second letter (I): could become A, E, O, U\n- Third letter (N): could become other consonants\n- Fourth letter (G): could become other consonants\n- Fifth letter (S): could become other letters\n\nLet me try changing I first since it needs to become O eventually:\nWINGS → WONGS? Is that a word? Not sure. \n\nMaybe change N first:\nWINGS → WIGS? No, that removes a letter. Wait, I need to keep it 5 letters.\nWINGS → WINGS (same)\nWINGS → WINKS (changing G to K)! Yes, WINKS is definitely a valid word.\n\nOkay, so WINGS → WINKS works. Now from WINKS, what's next? I still need to:\n- Change I to O\n- Change K to R\n- Change S to Y\n\nLet me try changing I to O in WINKS:\nWINKS → WONKS\nIs WONKS a word? I think it is - it's the plural of wonk, someone who studies a subject extensively. Yes, that works!\n\nSo now I have:\nWINGS → WINKS → WONKS\n\nFrom WONKS, I need to:\n- Change K to R\n- Change S to Y\n\nLet me try changing K to R:\nWONKS → WORKS\nPerfect! WORKS is definitely a valid word.\n\nSo now:\nWINGS → WINKS → WONKS → WORKS\n\nFrom WORKS, I still need to:\n- Change S to Y\n\nBut wait, looking at the target WORMY, I notice I need M before Y. So from WORKS, maybe I should change S to M first?\n\nWORKS → WORMS\nYes! WORMS is a valid word.\n\nNow I have:\nWINGS → WINKS → WONKS → WORKS → WORMS\n\nFinally, from WORMS to WORMY, just need to change S to Y:\nWORMS → WORMY\n\nLet me verify the complete path:\n1. WINGS → WINKS (change G to K)\n2. WINKS → WONKS (change I to O)\n3. 
WONKS → WORKS (change K to R)\n4. WORKS → WORMS (change K to M)\n5. WORMS → WORMY (change S to Y)\n\nLet me check if each word is valid:\n- WINGS: valid (plural of wing)\n- WINKS: valid (plural of wink)\n- WONKS: valid (plural of wonk)\n- WORKS: valid (plural of work)\n- WORMS: valid (plural of worm)\n- WORMY: valid (full of worms)\n\nCould there be a shorter path? Let me think...\nCould I go WINGS → WONGS (I to O)? But is WONGS a valid word? Not sure about that.\nOr WINGS → WRINGS (add R)? No, that would be adding a letter, not changing one.\nOr WINGS → WORMS directly? No, that would require changing multiple letters at once.\n\nThe path I found requires five steps, which makes sense because we needed to change four letters (I→O, N→R, G→M, S→Y) and couldn't combine any changes since we can only change one letter at a time.\n\nSo after exploring alternatives, I believe the path:\nWINGS → WINKS → WONKS → WORKS → WORMS → WORMY\nis optimal, with each step changing exactly one letter and forming a valid English word.\n</think>"}],"stop_reason":"end_turn","stop_sequence":null,"usage":{"input_tokens":58,"cache_creation_input_tokens":0,"cache_read_input_tokens":6434,"output_tokens":962}}}}
{"custom_id":"PINTO_RASPS_3","result":{"type":"succeeded","message":{"id":"msg_01GJQuw2NwcJQpx2civyj8Fu","type":"message","role":"assistant","model":"claude-3-5-sonnet-20241022","content":[{"type":"text","text":"<think>\nAlright, I need to transform PINTO into RASPS by changing one letter at a time. Let me analyze both words first to understand what needs to change.\n\nPINTO\nRASPS\n\nLooking at the letters:\nP -> R\nI -> A\nN -> S\nT -> P\nO -> S\n\nHmm, all five letters need to change. This might be tricky. Let me start by exploring possible changes from PINTO.\n\nFirst letter (P) possibilities:\n- MINTO (is that a word? Maybe a proper noun, but that's not allowed)\n- RINTO (not a word)\n- TINTO (might be Spanish, not valid English)\n\nSecond letter (I) possibilities:\n- PANTO (yes, that's valid!)\n- PENTO (not sure if that's valid)\n- PUNTO (Spanish, not English)\n\nThird letter (N) possibilities:\n- PINTO -> PINTS (changing O to S)\nThat's interesting! PINTS is definitely valid. Maybe that's a good first step since we need to end with 'S' anyway.\n\nLet me try PINTO -> PINTS first.\nNow from PINTS, what can we do? We need to eventually get to RASPS, so maybe changing that I to A would help?\nPINTS -> PANTS (changing I to A)\nYes! PANTS is valid. Now we have:\nPINTO -> PINTS -> PANTS\n\nFrom PANTS, looking at the letters compared to RASPS:\nP -> R\nA (matches!)\nN -> S\nT -> P\nS (matches!)\n\nMaybe we can change that T to R? No, that wouldn't help because we still need P in that position eventually. What about changing N to R? No, we need S there. \n\nLet me try changing T to something... PANTS -> PARTS (changing N to R)\nYes! PARTS is valid. So now:\nPINTO -> PINTS -> PANTS -> PARTS\n\nFrom PARTS to RASPS, we need to:\n- Change P to R at the start\n- Change T to P\n- Keep the S\n\nWait, but if I change P to R now, how would I get the P back later? Maybe I need an intermediate step.\n\nWhat if... PARTS -> WARTS (changing P to W)?\nYes! That's valid. 
Then from WARTS, maybe... WARPS (changing T to P)?\nWARTS -> WARPS works! \n\nSo now I have:\nPINTO -> PINTS -> PANTS -> PARTS -> WARTS -> WARPS\n\nGetting closer! From WARPS to RASPS, I need to:\n- Change W to R\n- Change P to S in the middle\n\nBut wait, if I change W to R directly, would that work? Let me try another approach...\nWARPS -> WASPS (changing R to S)\nYes! That works. Then finally:\nWASPS -> RASPS (changing W to R)\n\nLet me verify the complete path:\n1. PINTO -> PINTS (O to S)\n2. PINTS -> PANTS (I to A)\n3. PANTS -> PARTS (N to R)\n4. PARTS -> WARTS (P to W)\n5. WARTS -> WARPS (T to P)\n6. WARPS -> WASPS (R to S)\n7. WASPS -> RASPS (W to R)\n\nChecking each word:\nPINTO - valid (type of horse/car)\nPINTS - valid (plural of pint)\nPANTS - valid (clothing)\nPARTS - valid (components)\nWARTS - valid (skin growths)\nWARPS - valid (bends/distorts)\nWASPS - valid (insects)\nRASPS - valid (coarse files/harsh sounds)\n\nEach step changes exactly one letter and creates a valid English word. Seven steps in total. Let me double-check if there might be a shorter path...\n\nCould I have gone from PINTS directly to PARTS? No, that would be changing two letters (I to A and N to R).\nCould I have skipped WARTS? If I went PARTS -> WARPS directly, that would change two letters (P to W and T to P).\nCould I have gone from WARPS to RASPS directly? No, that would change two letters (W to R and P to S).\n\nNo, each step seems necessary. 
Therefore, this seven-step solution appears to be valid and efficient:\nPINTO -> PINTS -> PANTS -> PARTS -> WARTS -> WARPS -> WASPS -> RASPS\n</think>"}],"stop_reason":"end_turn","stop_sequence":null,"usage":{"input_tokens":67,"cache_creation_input_tokens":6434,"cache_read_input_tokens":0,"output_tokens":1081}}}}
{"custom_id":"GRACE_WAINS_4","result":{"type":"succeeded","message":{"id":"msg_01He4e1HXh1DXdxPz2j7K3GZ","type":"message","role":"assistant","model":"claude-3-5-sonnet-20241022","content":[{"type":"text","text":"<think>\nLet me tackle this word ladder puzzle. I need to transform GRACE into WAINS by changing just one letter at a time, making sure each step creates a valid English word.\n\nFirst, let me compare the starting and ending words:\nGRACE\nWAINS\n\nLooking at what needs to change:\nG needs to become W\nR needs to become A\nA needs to become I\nC needs to become N\nE needs to become S\n\nWow, all five letters need to change! This might be tricky. Let me start by looking at possible changes from GRACE.\n\nFrom GRACE, I could change:\n- First letter (G): BRACE, TRACE, SPACE... but I need to get to W eventually\n- Second letter (R): GRACE to GLACE? That's valid\n- Third letter (A): GRACE to GRECE? No, that's not a word\n- Fourth letter (C): GRACE to GRADE, GRAVE, GRAPE... wait, GRAPE is interesting\n\nLet me try GRACE → GRAPE (changing C to P)\nThat works! Now from GRAPE, I need to somehow get closer to WAINS.\n\nLooking at GRAPE vs WAINS:\nG still needs to become W\nR still needs to become A\nA needs to become I\nP needs to become N\nE needs to become S\n\nMaybe I can change the A to I next? GRAPE → GRIPE\nYes! GRIPE is a valid word. That's progress because now the middle letter matches our target word.\n\nFrom GRIPE:\nG still needs to become W\nR still needs to become A\nP needs to become N\nE needs to become S\n\nLooking at GRIPE... what can I change? The final E needs to become S eventually. But wait, looking at the P... if I change P to S, would that work? GRIPE → GRISE? No, that's not common. What about GRIPS? Yes! GRIPE → GRIPS works by changing E to S.\n\nNow I have GRIPS. Getting closer! Looking at what's left:\nG needs to become W\nR needs to become A\nI stays I\nP needs to become N\nS stays S\n\nFrom GRIPS, maybe I can change P to N? 
GRIPS → GRINS\nYes! That works. Now I have GRINS.\n\nLooking at GRINS vs WAINS:\nG needs to become W\nR needs to become A\nI stays I\nN stays N\nS stays S\n\nFrom GRINS, I need to change R to A. GRINS → GAINS\nPerfect! Now I just need to change G to W to reach the target.\n\nGAINS → WAINS\n\nLet me verify the complete path:\n1. GRACE → GRAPE (C→P)\n2. GRAPE → GRIPE (A→I)\n3. GRIPE → GRIPS (E→S)\n4. GRIPS → GRINS (P→N)\n5. GRINS → GAINS (R→A)\n6. GAINS → WAINS (G→W)\n\nLet me check each word:\nGRACE - valid\nGRAPE - valid (fruit)\nGRIPE - valid (complaint)\nGRIPS - valid (holds)\nGRINS - valid (smiles)\nGAINS - valid (increases)\nWAINS - valid (wagons)\n\nAnd each step only changes one letter. This works! Though let me just double-check if there might be a shorter path...\n\nCould I have gone from GRACE directly to WAINS in fewer steps? No, because five letters need to change, so minimum five steps needed. Could I have gone:\nGRACE → GRANS? (Not a word)\nGRACE → GRINS? (No, that would be two letters)\nGRACE → WRACE? (Not a word)\n\nNo, I think the path I found is optimal because:\n1. Each step is a valid word\n2. Each step changes only one letter\n3. We needed to change all five letters\n4. Each change gets us closer to the target\n\nTherefore, the solution path GRACE → GRAPE → GRIPE → GRIPS → GRINS → GAINS → WAINS is correct and efficient.\n</think>"}],"stop_reason":"end_turn","stop_sequence":null,"usage":{"input_tokens":61,"cache_creation_input_tokens":0,"cache_read_input_tokens":6434,"output_tokens":967}}}}
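The result lines above are newline-delimited JSON records from a Message Batches output file. The following is a minimal sketch of pulling the reasoning text back out of one such line, assuming only the fields visible above (`custom_id`, `result.message.content`); the helper name is illustrative, not part of the project:

```python
import json

def extract_reasoning(result_line: str) -> tuple:
    """Pull (custom_id, reasoning_text) out of one batch-result JSONL line.

    Assumes the record shape visible above: result.message.content is a
    list of {"type": "text", "text": ...} blocks, with the reasoning
    wrapped in <think>...</think> tags by the system prompt.
    """
    record = json.loads(result_line)
    message = record["result"]["message"]
    # Concatenate all text blocks in the response content.
    text = "".join(block["text"] for block in message["content"] if block["type"] == "text")
    if text.startswith("<think>") and text.endswith("</think>"):
        text = text[len("<think>"):-len("</think>")].strip()
    return record["custom_id"], text

# Example with a minimal record of the same shape:
line = json.dumps({
    "custom_id": "PINTO_RASPS_3",
    "result": {"type": "succeeded", "message": {"content": [
        {"type": "text", "text": "<think>\nPINTO -> PINTS\n</think>"}]}},
})
custom_id, reasoning = extract_reasoning(line)  # -> ("PINTO_RASPS_3", "PINTO -> PINTS")
```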


@@ -7,4 +7,4 @@
{"question": "Transform the word ladder 'LAIRS' to 'FUELS' by changing one letter at a time.", "answer": "LAIRS,FAIRS,FAILS,FARLS,FURLS,FUELS", "reasoning": null, "metadata": {"start_word": "LAIRS", "end_word": "FUELS", "word_length": 5, "chain_length": 6}}
{"question": "Transform the word ladder 'WINKS' to 'TENTH' by changing one letter at a time.", "answer": "WINKS,PINKS,PINTS,TINTS,TENTS,TENTH", "reasoning": null, "metadata": {"start_word": "WINKS", "end_word": "TENTH", "word_length": 5, "chain_length": 6}}
{"question": "Transform the word ladder 'VERBS' to 'DATES' by changing one letter at a time.", "answer": "VERBS,HERBS,HERDS,HARDS,HARES,HATES,DATES", "reasoning": null, "metadata": {"start_word": "VERBS", "end_word": "DATES", "word_length": 5, "chain_length": 7}}
{"question": "Transform the word ladder 'SCENT' to 'HEARS' by changing one letter at a time.", "answer": "SCENT,SCANT,SCART,SCARS,SEARS,HEARS", "reasoning": null, "metadata": {"start_word": "SCENT", "end_word": "HEARS", "word_length": 5, "chain_length": 6}}
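Each row above stores its solution as a comma-separated chain in the `answer` field. A quick sketch of the two structural rules such a chain must satisfy (constant word length, exactly one letter changed per step); dictionary membership is out of scope here, and the helper name is illustrative:

```python
def is_valid_ladder(answer: str) -> bool:
    """Check that a comma-separated chain like the 'answer' fields above
    keeps the word length constant and changes exactly one letter per step.
    (Whether each word is a real English word is not checked here.)"""
    words = answer.split(",")
    if len({len(w) for w in words}) != 1:
        return False  # word length must stay constant
    for a, b in zip(words, words[1:]):
        if sum(x != y for x, y in zip(a, b)) != 1:
            return False  # each step must change exactly one letter
    return True
```

The chains in the rows above pass this check; a two-letter jump such as `"GRACE,GRINS"` fails.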


@@ -1,6 +1,5 @@
# Required dependencies for the Word Ladder Puzzle Dataset Generator
# Python 3.7+
anthropic>=0.45.2 # Client library for interacting with Anthropic's API
tqdm>=4.67.1 # For progress bars


@@ -1,8 +1,8 @@
Word Ladder puzzles involve transforming a start word into an end word. You are allowed to change only one letter at a time, and you must keep the number of letters constant.
Each time you change one letter, the resulting word in the chain must be a valid English word. Plurals are allowed, but not proper nouns.
Your task will be to simulate a detailed step-by-step chain of thought reasoning of the transformation process from a start word to the end word of a word ladder.
You will be given the word ladder question, as well as the correct solution path. So you don't need to solve the problem; you already have the solution, which should be your 'cheat sheet' to refer to as you go along.
Your task is to provide a perfectly simulated chain of thought reasoning exactly in the style of the example below,
including considering multiple possibilities, backtracking where needed, validating words, and showing the final path.
Remember you're NOT trying to 'find' the shortest path necessarily. Make sure you're not lazy: show the user that you're working hard to find the intermediate words in the chain, and that you haven't simply gotten lucky in finding the steps.
For example, you may go down a path that's no longer getting you closer to the solution, so you backtrack and try a different path, which ends up working.
@@ -285,4 +285,4 @@ Yes, this works. Therefore, the sequence is BALL → BOLL → POLL → POLE → 
Finally, I've seen in your previous answers to similar questions that you sometimes get complacent and deviate from the hard work required. You must maintain the style and level of effort throughout the entirety of your answer until the end.
Now, the user will provide you with the word ladder question and the correct solution path. Provide the user with a very verbose, first-person chain of thought reasoning that simulates how a human would methodically think to solve the word ladder step by step.
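When batch requests are built, this system prompt is paired with each puzzle and its solution path. A sketch of assembling one request as a plain dict: the `custom_id` format mirrors the ids visible in the result files above (e.g. `PINTO_RASPS_3`) and the 0.5 temperature matches the value asserted in the tests, but the exact params layout of the anthropic SDK is not shown in this excerpt, so treat the shape (and the helper name) as an assumption:

```python
def build_request(entry: dict, system_prompt: str, index: int) -> dict:
    """Assemble one batch request as a plain dict (illustrative shape only).

    custom_id follows the START_END_index pattern seen in the result files;
    the params layout is a simplified assumption, not the SDK's exact types.
    """
    meta = entry["metadata"]
    return {
        "custom_id": f"{meta['start_word']}_{meta['end_word']}_{index}",
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "temperature": 0.5,  # matches the value asserted in the tests
            "system": system_prompt,
            # The puzzle plus its solution path form the user turn, since the
            # model is given the 'cheat sheet' rather than asked to solve.
            "messages": [{"role": "user",
                          "content": f"{entry['question']}\nSolution: {entry['answer']}"}],
        },
    }

entry = {"question": "Transform the word ladder 'LAIRS' to 'FUELS' by changing one letter at a time.",
         "answer": "LAIRS,FAIRS,FAILS,FARLS,FURLS,FUELS",
         "metadata": {"start_word": "LAIRS", "end_word": "FUELS"}}
req = build_request(entry, "system prompt text", 0)  # custom_id -> "LAIRS_FUELS_0"
```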


@@ -1,11 +1,12 @@
-import os
-import json
-import time
import datetime
-import pytest
+import json
+import os
+import time
from pathlib import Path
from types import SimpleNamespace
+import pytest
from examples.word_ladder import generate_reasoning
# We alias the functions and globals for easier usage in our tests.
@@ -16,6 +17,7 @@ COMMON_UUID = generate_reasoning.COMMON_UUID
BATCH_SIZE = generate_reasoning.BATCH_SIZE
client = generate_reasoning.client
# Define a mock batch response class mimicking Anthropic's API response.
class MockBatchResponse:
def __init__(self, batch_id="msgbatch_mock", processing_status="in_progress", fail=False):
@@ -23,13 +25,7 @@ class MockBatchResponse:
self.type = "message_batch"
self.processing_status = processing_status
# Make request_counts a SimpleNamespace object with the required attributes
-self.request_counts = SimpleNamespace(
-processing=0,
-succeeded=0,
-errored=0,
-canceled=0,
-expired=0
-)
+self.request_counts = SimpleNamespace(processing=0, succeeded=0, errored=0, canceled=0, expired=0)
self.ended_at = None
# Use datetime objects so that isoformat() is available
self.created_at = datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc)
@@ -37,52 +33,68 @@ class MockBatchResponse:
self.cancel_initiated_at = None
self.results_url = None
# Helper: Create a temporary system prompt file.
@pytest.fixture
def system_prompt_file(tmp_path, monkeypatch):
prompt_text = "This is a system prompt."
sys_file = tmp_path / "system_prompt.txt"
sys_file.write_text(prompt_text, encoding="utf-8")
# Monkeypatch the system prompt path
monkeypatch.setattr(generate_reasoning, "DEFAULT_SYSTEM_PROMPT", str(sys_file))
return sys_file
# Helper: Create necessary directories using a temporary location.
@pytest.fixture
def setup_directories(tmp_path, monkeypatch):
# Create output directory in temporary path
output_dir = tmp_path / "output"
output_dir = tmp_path / "output" output_dir = tmp_path / "output"
output_dir.mkdir(exist_ok=True) output_dir.mkdir(exist_ok=True)
# Monkeypatch the DEFAULT_OUTPUT_DIR to be a Path (temporary directory). # Monkeypatch the DEFAULT_OUTPUT_DIR to be a Path (temporary directory).
monkeypatch.setattr(generate_reasoning, "DEFAULT_OUTPUT_DIR", output_dir) monkeypatch.setattr(generate_reasoning, "DEFAULT_OUTPUT_DIR", output_dir)
# Ensure we're working in the temporary directory # Ensure we're working in the temporary directory
monkeypatch.chdir(tmp_path) monkeypatch.chdir(tmp_path)
return output_dir return output_dir
# Helper: Create a temporary input JSONL file with given entries. # Helper: Create a temporary input JSONL file with given entries.
@pytest.fixture @pytest.fixture
def input_jsonl_file(tmp_path, setup_directories, monkeypatch): def input_jsonl_file(tmp_path, setup_directories, monkeypatch):
# Create input file in temporary directory # Create input file in temporary directory
file_path = setup_directories / "word_ladder_examples.jsonl" file_path = setup_directories / "word_ladder_examples.jsonl"
entries = [
-{ "question": "Transform 'A' to 'B'", "answer": "A,X,B", "reasoning": None,
-"metadata": { "start_word": "A", "end_word": "B", "word_length": 1, "chain_length": 3 } },
-{ "question": "Transform 'C' to 'D'", "answer": "C,Y,D", "reasoning": "Some reasoning",
-"metadata": { "start_word": "C", "end_word": "D", "word_length": 1, "chain_length": 3 } },
-{ "question": "Transform 'E' to 'F'", "answer": "E,Z,F", "reasoning": None,
-"metadata": { "start_word": "E", "end_word": "F", "word_length": 1, "chain_length": 3 } }
+{
+    "question": "Transform 'A' to 'B'",
+    "answer": "A,X,B",
+    "reasoning": None,
+    "metadata": {"start_word": "A", "end_word": "B", "word_length": 1, "chain_length": 3},
+},
+{
+    "question": "Transform 'C' to 'D'",
+    "answer": "C,Y,D",
+    "reasoning": "Some reasoning",
+    "metadata": {"start_word": "C", "end_word": "D", "word_length": 1, "chain_length": 3},
+},
+{
+    "question": "Transform 'E' to 'F'",
+    "answer": "E,Z,F",
+    "reasoning": None,
+    "metadata": {"start_word": "E", "end_word": "F", "word_length": 1, "chain_length": 3},
+},
]
with file_path.open("w", encoding="utf-8") as f:
for entry in entries:
f.write(json.dumps(entry) + "\n")
# Monkeypatch DEFAULT_INPUT_JSONL to point to our temporary test file.
monkeypatch.setattr(generate_reasoning, "DEFAULT_INPUT_JSONL", str(file_path))
return file_path
# Test that submit_reasoning_batches builds a batch skipping entries with existing reasoning.
def test_submit_batches_success(system_prompt_file, input_jsonl_file, setup_directories, monkeypatch):
def fake_create(requests):
@@ -112,17 +124,16 @@ def test_submit_batches_success(system_prompt_file, input_jsonl_file, setup_dire
assert temperature == 0.5, "Incorrect temperature"
assert "C_D_" not in custom_id
return MockBatchResponse(batch_id="msgbatch_test_success")
monkeypatch.setattr(client.messages.batches, "create", fake_create)
batch_metadata_prefix = "test_metadata"
-submit_reasoning_batches(input_path=str(input_jsonl_file),
-batch_metadata_prefix=batch_metadata_prefix)
+submit_reasoning_batches(input_path=str(input_jsonl_file), batch_metadata_prefix=batch_metadata_prefix)
metadata_filename = f"{COMMON_UUID}_{batch_metadata_prefix}.jsonl"
meta_file_path = setup_directories / metadata_filename
assert meta_file_path.exists(), "Metadata file was not created as expected."
with meta_file_path.open("r", encoding="utf-8") as f:
lines = f.readlines()
# Expecting only those entries that did not already have a reasoning.
@@ -136,61 +147,65 @@ def test_submit_batches_success(system_prompt_file, input_jsonl_file, setup_dire
custom_ids = metadata["custom_ids"]
assert len(custom_ids) == 2
# Test that _submit_single_batch retries once and eventually succeeds.
def test_retry_logic(system_prompt_file, setup_directories, monkeypatch):
call_count = {"count": 0}
def fake_create_retry(requests):
if call_count["count"] == 0:
call_count["count"] += 1
raise Exception("Temporary failure")
return MockBatchResponse(batch_id="msgbatch_retry_success")
monkeypatch.setattr(client.messages.batches, "create", fake_create_retry)
dummy_request = type("DummyRequest", (), {"custom_id": "dummy_1"})()
batch_requests = [dummy_request]
custom_ids = ["dummy_1"]
_submit_single_batch(batch_requests, custom_ids, 0, "test_retry", "dummy_input.jsonl")
metadata_filename = f"{COMMON_UUID}_test_retry.jsonl"
meta_file_path = setup_directories / metadata_filename
assert meta_file_path.exists(), "Retry metadata file was not created."
with meta_file_path.open("r", encoding="utf-8") as f:
metadata = json.loads(f.read())
assert metadata["api_response"]["id"] == "msgbatch_retry_success"
assert call_count["count"] == 1
# Test that when all attempts to submit a batch fail, the error is logged to the failed file.
def test_failed_batch(system_prompt_file, setup_directories, monkeypatch):
def fake_create_fail(requests):
raise Exception("Permanent failure")
monkeypatch.setattr(client.messages.batches, "create", fake_create_fail)
dummy_request = type("DummyRequest", (), {"custom_id": "dummy_fail"})()
batch_requests = [dummy_request]
custom_ids = ["dummy_fail"]
_submit_single_batch(batch_requests, custom_ids, 0, "test_failed", "dummy_input.jsonl")
error_filename = f"{COMMON_UUID}_failed_batches.jsonl"
error_file_path = setup_directories / error_filename
assert error_file_path.exists(), "Failed batch log file was not created."
with error_file_path.open("r", encoding="utf-8") as f:
error_entry = json.loads(f.readline())
assert error_entry["batch_number"] == 0
assert "Permanent failure" in error_entry["error"]
assert error_entry["batch_requests"] == ["dummy_fail"]
# Test batching behavior when multiple batches are needed.
def test_multiple_batches(system_prompt_file, setup_directories, monkeypatch):
test_batch_size = 2
monkeypatch.setattr(generate_reasoning, "BATCH_SIZE", test_batch_size)
# Create input file
input_file = setup_directories / "word_ladder_examples.jsonl"
entries = [
@@ -198,42 +213,42 @@ def test_multiple_batches(system_prompt_file, setup_directories, monkeypatch):
"question": f"Transform word ladder {idx}",
"answer": f"start,mid,end_{idx}",
"reasoning": None,
-"metadata": {"start_word": f"start_{idx}", "end_word": f"end_{idx}"}
+"metadata": {"start_word": f"start_{idx}", "end_word": f"end_{idx}"},
}
for idx in range(5)
]
with input_file.open("w", encoding="utf-8") as f:
for entry in entries:
f.write(json.dumps(entry) + "\n")
# Monkeypatch DEFAULT_INPUT_JSONL.
monkeypatch.setattr(generate_reasoning, "DEFAULT_INPUT_JSONL", str(input_file))
batch_ids = []
def fake_create(requests):
new_id = f"msgbatch_batch_{len(batch_ids)}"
batch_ids.append(new_id)
return MockBatchResponse(batch_id=new_id)
monkeypatch.setattr(client.messages.batches, "create", fake_create)
batch_metadata_prefix = "test_multi"
-submit_reasoning_batches(input_path=str(input_file),
-batch_metadata_prefix=batch_metadata_prefix)
+submit_reasoning_batches(input_path=str(input_file), batch_metadata_prefix=batch_metadata_prefix)
metadata_filename = f"{COMMON_UUID}_{batch_metadata_prefix}.jsonl"
meta_file_path = setup_directories / metadata_filename
assert meta_file_path.exists(), "Multiple batch metadata file was not created."
with meta_file_path.open("r", encoding="utf-8") as f:
metadata_lines = f.readlines()
# With 5 qualifying entries and a batch size of 2 we expect 3 batches.
assert len(metadata_lines) == 3
seen_custom_ids = []
for line in metadata_lines:
metadata = json.loads(line)
seen_custom_ids.extend(metadata["custom_ids"])
assert metadata["api_response"]["id"] in batch_ids
assert len(seen_custom_ids) == 5
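`test_multiple_batches` expects 5 qualifying entries at a batch size of 2 to produce 3 batches, i.e. ceiling division. The slicing idiom that behavior implies can be sketched as follows (the helper name is illustrative, not the module's actual function):

```python
def chunk(entries: list, batch_size: int) -> list:
    """Split entries into consecutive batches of at most batch_size each,
    so n entries yield ceil(n / batch_size) batches."""
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

batches = chunk(list(range(5)), 2)  # -> [[0, 1], [2, 3], [4]]
```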


@@ -11,29 +11,30 @@ from tqdm import tqdm
import reasoning_gym
def check_duplicates(jsonl_path: str) -> tuple[bool, dict]:
"""
Check for duplicate word pairs in a word ladder JSONL file.
Returns:
tuple[bool, dict]: (has_duplicates, valid_entries) where:
- has_duplicates: True if any duplicates were found
- valid_entries: Dict mapping line_number -> data for non-duplicate entries
Note: A pair is considered duplicate if either (word1, word2) or (word2, word1)
already exists, since word ladder paths are bidirectional.
"""
pairs_seen = {} # (start, end) -> (line_number, data)
valid_entries = {}
duplicates_found = False
-with open(jsonl_path, 'r', encoding='utf-8') as f:
+with open(jsonl_path, "r", encoding="utf-8") as f:
for line_num, line in enumerate(f):
data = json.loads(line)
-metadata = data['metadata']
+metadata = data["metadata"]
-pair = (metadata['start_word'], metadata['end_word'])
+pair = (metadata["start_word"], metadata["end_word"])
-reverse_pair = (metadata['end_word'], metadata['start_word'])
+reverse_pair = (metadata["end_word"], metadata["start_word"])
# Check both orientations of the pair
if pair in pairs_seen or reverse_pair in pairs_seen:
duplicates_found = True
@@ -43,9 +44,10 @@ def check_duplicates(jsonl_path: str) -> tuple[bool, dict]:
# Store both the line number and data for valid entries
pairs_seen[pair] = (line_num, data)
valid_entries[line_num] = data
return duplicates_found, valid_entries
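`check_duplicates` treats `(start, end)` and `(end, start)` as the same pair by testing both orientations explicitly. The same order-insensitive check can be expressed with `frozenset` keys; this is an illustrative alternative, not the project's actual implementation:

```python
def is_duplicate_pair(pair: tuple, seen: set) -> bool:
    """Order-insensitive duplicate check equivalent to the pair /
    reverse_pair test above: (A, B) and (B, A) collide because word
    ladder paths are bidirectional."""
    key = frozenset(pair)  # frozenset ignores element order and is hashable
    if key in seen:
        return True
    seen.add(key)
    return False

seen = set()
first = is_duplicate_pair(("GRACE", "WAINS"), seen)   # False: pair is new
second = is_duplicate_pair(("WAINS", "GRACE"), seen)  # True: reversed pair collides
```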
def create_word_ladder_dataset(jsonl_path: str = None, config: dict = None) -> None:
"""
Creates a word ladder dataset and writes each sample as a JSON line.
@@ -65,67 +67,67 @@ def create_word_ladder_dataset(jsonl_path: str = None, config: dict = None) -> N
jsonl_path = output_dir / f"word_ladders_{unique_id}.jsonl"
else:
jsonl_path = Path(jsonl_path)
-target_size = config['dataset_config']['size']
+target_size = config["dataset_config"]["size"]
current_size = 0
max_attempts = 3 # Limit total regeneration attempts
attempt = 0
# Initial generation
-dataset = reasoning_gym.create_dataset(config['dataset_name'], **config['dataset_config'])
+dataset = reasoning_gym.create_dataset(config["dataset_name"], **config["dataset_config"])
-with open(jsonl_path, 'w', encoding='utf-8') as f:
+with open(jsonl_path, "w", encoding="utf-8") as f:
for item in tqdm(dataset, desc="Generating initial ladder examples"):
row = {
-'question': item['question'],
+"question": item["question"],
-'answer': item['answer'],
+"answer": item["answer"],
-'reasoning': None,
+"reasoning": None,
-'metadata': item.get('metadata', {})
+"metadata": item.get("metadata", {}),
}
-f.write(json.dumps(row) + '\n')
+f.write(json.dumps(row) + "\n")
while attempt < max_attempts: while attempt < max_attempts:
# Check entire file for duplicates # Check entire file for duplicates
has_duplicates, valid_entries = check_duplicates(jsonl_path) has_duplicates, valid_entries = check_duplicates(jsonl_path)
current_size = len(valid_entries) current_size = len(valid_entries)
if not has_duplicates and current_size == target_size: if not has_duplicates and current_size == target_size:
print(f"\nSuccessfully created dataset with {current_size} unique examples.") print(f"\nSuccessfully created dataset with {current_size} unique examples.")
return return
# If we have duplicates or not enough entries, regenerate the missing amount # If we have duplicates or not enough entries, regenerate the missing amount
needed = target_size - current_size needed = target_size - current_size
if needed > 0: if needed > 0:
print(f"\nAttempt {attempt + 1}: Regenerating {needed} examples to replace duplicates/missing entries...") print(f"\nAttempt {attempt + 1}: Regenerating {needed} examples to replace duplicates/missing entries...")
# Generate additional examples # Generate additional examples
config['dataset_config']['size'] = needed config["dataset_config"]["size"] = needed
additional_dataset = reasoning_gym.create_dataset(config['dataset_name'], **config['dataset_config']) additional_dataset = reasoning_gym.create_dataset(config["dataset_name"], **config["dataset_config"])
# Write all entries to a temporary file # Write all entries to a temporary file
temp_path = jsonl_path.with_suffix('.tmp') temp_path = jsonl_path.with_suffix(".tmp")
with open(temp_path, 'w', encoding='utf-8') as f: with open(temp_path, "w", encoding="utf-8") as f:
# Write existing valid entries # Write existing valid entries
for data in valid_entries.values(): for data in valid_entries.values():
f.write(json.dumps(data) + '\n') f.write(json.dumps(data) + "\n")
# Write new entries # Write new entries
for item in additional_dataset: for item in additional_dataset:
row = { row = {
'question': item['question'], "question": item["question"],
'answer': item['answer'], "answer": item["answer"],
'reasoning': None, "reasoning": None,
'metadata': item.get('metadata', {}) "metadata": item.get("metadata", {}),
} }
f.write(json.dumps(row) + '\n') f.write(json.dumps(row) + "\n")
# Replace original file with temporary file # Replace original file with temporary file
temp_path.replace(jsonl_path) temp_path.replace(jsonl_path)
# Note: We'll check for duplicates again at the start of the next loop # Note: We'll check for duplicates again at the start of the next loop
attempt += 1 attempt += 1
if current_size < target_size: if current_size < target_size:
print(f"\nWarning: Could only generate {current_size} unique examples after {max_attempts} attempts.") print(f"\nWarning: Could only generate {current_size} unique examples after {max_attempts} attempts.")
else: else:
print(f"\nSuccessfully created dataset with {current_size} unique examples.") print(f"\nSuccessfully created dataset with {current_size} unique examples.")
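The bidirectional duplicate check above hinges on treating `(start, end)` and `(end, start)` as the same puzzle. A minimal standalone sketch of that idea, using a canonical key instead of checking both orientations (function names and row shape here are illustrative, not the project's API):

```python
def canonical_pair(start: str, end: str) -> tuple[str, str]:
    """Order a (start, end) pair so both orientations map to the same key."""
    return (start, end) if start <= end else (end, start)


def find_duplicate_lines(rows: list[dict]) -> list[int]:
    """Return indices of rows whose word pair repeats in either direction."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        meta = row["metadata"]
        key = canonical_pair(meta["start_word"], meta["end_word"])
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


rows = [
    {"metadata": {"start_word": "COLD", "end_word": "WARM"}},
    {"metadata": {"start_word": "WARM", "end_word": "COLD"}},  # reversed duplicate
    {"metadata": {"start_word": "HEAD", "end_word": "TAIL"}},
]
print(find_duplicate_lines(rows))  # → [1]
```

Canonicalizing the key keeps the seen-set half the size and avoids the double lookup the in-place version performs.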


@@ -14,17 +14,16 @@ In our informal testing, Sonnet was deemed best performance value.
You can swap out to another API, but this will need a rewrite to remove anthropic-specific code.
"""

import json
import os
import time
import uuid
from pathlib import Path

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request
from tqdm import tqdm

# Updated default output directory to use the parent directory.
DEFAULT_OUTPUT_DIR = Path(__file__).resolve().parent.parent / "output"
@@ -36,18 +35,19 @@ BATCH_SIZE = 2500
COMMON_UUID = uuid.uuid4().hex[:8]

# Set up the Anthropic client (ensure the API key is set in the environment)
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def submit_reasoning_batches(
    input_path: str = DEFAULT_INPUT_JSONL,
    batch_metadata_prefix: str = "batch_metadata",
    system_prompt_path: str = DEFAULT_SYSTEM_PROMPT,
) -> None:
    """
    Reads the input JSONL file of word ladder examples, builds batch requests for any example that
    does not have reasoning, splits them into groups of BATCH_SIZE, and submits each batch using
    Anthropic's Message Batches API.

    Args:
        input_path: Path to input JSONL file
        batch_metadata_prefix: Prefix for batch metadata files
@@ -59,34 +59,36 @@ def submit_reasoning_batches(
    # Read the system prompt from file (used as a preamble for every request)
    with open(system_prompt_path, "r", encoding="utf-8") as sys_file:
        system_message = [
            {
                "type": "text",
                "text": sys_file.read(),
                "cache_control": {"type": "ephemeral"},  # Enable anthropic prompt caching
            }
        ]

    batch_requests = []
    custom_ids = []  # List of custom_ids for the current batch
    batch_num = 0

    # Get the total number of lines in advance for tqdm progress bar.
    total_lines = sum(1 for _ in open(input_path))

    with open(input_path, "r", encoding="utf-8") as infile:
        for idx, line in tqdm(enumerate(infile), desc="Preparing batch requests", total=total_lines):
            data = json.loads(line)
            # Skip example if 'reasoning' already exists.
            if not data.get("reasoning"):
                # Build a custom id. Here we use the row position and the start/end words:
                metadata = data.get("metadata", {})
                start = metadata.get("start_word", "unknown")
                end = metadata.get("end_word", "unknown")
                custom_id = f"{start}_{end}_{idx}"
                custom_ids.append(custom_id)
                # Build the prompt text exactly as before.
                prompt = f"{data['question']}. The correct solution is {data['answer']}. "
                # Build the request payload using Request and MessageCreateParamsNonStreaming.
                request_payload = Request(
                    custom_id=custom_id,
@@ -95,10 +97,8 @@ def submit_reasoning_batches(
                        max_tokens=8192,
                        temperature=0.5,
                        system=system_message,
                        messages=[{"role": "user", "content": prompt}],
                    ),
                )
                # Instead of wrapping in SimpleNamespace, simply ensure custom_id is set.
                if isinstance(request_payload, dict):
@@ -106,7 +106,7 @@ def submit_reasoning_batches(
                else:
                    request_payload.custom_id = custom_id
                batch_requests.append(request_payload)

                # If we have reached our batch size limit, submit the current batch.
                if len(batch_requests) >= BATCH_SIZE:
                    _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_prefix, input_path)
@@ -114,7 +114,7 @@ def submit_reasoning_batches(
                    # Reset for the next batch
                    batch_requests = []
                    custom_ids = []

    # Submit any remaining requests that didn't complete a full batch.
    if batch_requests:
        _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_prefix, input_path)
@@ -141,11 +141,11 @@ def _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_p
        if dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None:
            iso_str = iso_str.replace("+00:00", "Z")
        return iso_str

    def extract_custom_id(req):
        # Safely extract the custom_id attribute whether req is an object or a dict.
        return req.custom_id if hasattr(req, "custom_id") else req.get("custom_id")

    max_attempts = 2
    attempt = 0
    last_exception = None
@@ -153,9 +153,7 @@ def _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_p
    while attempt < max_attempts:
        try:
            print(f"Submitting batch {batch_num} with {len(batch_requests)} requests... (attempt {attempt+1})")
            message_batch = client.messages.batches.create(requests=batch_requests)
            time.sleep(1)
            print(f"Batch {batch_num} submitted with ID: {message_batch.id}")
            break  # Success: exit the loop.
@@ -166,17 +164,21 @@ def _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_p
            if attempt < max_attempts:
                print("Retrying...")
                time.sleep(1)

    if message_batch is None:
        error_filename = output_dir / f"{COMMON_UUID}_failed_batches.jsonl"
        error_msg = (
            f"{str(last_exception)} after {max_attempts} attempts"
            if last_exception
            else f"Failed after {max_attempts} attempts"
        )
        failed_info = {
            "batch_number": batch_num,
            "error": error_msg,
            "batch_requests": [extract_custom_id(req) for req in batch_requests],
            "input_file": input_path,
        }
        with open(error_filename, "a", encoding="utf-8") as error_file:
            error_file.write(json.dumps(failed_info) + "\n")
        print(f"Batch {batch_num} permanently failed. Logged to {error_filename}.")
        return
@@ -198,14 +200,15 @@ def _submit_single_batch(batch_requests, custom_ids, batch_num, batch_metadata_p
        "batch_id": message_batch.id,
        "api_response": api_response,
        "custom_ids": custom_ids,
        "input_file": os.path.basename(input_path),
    }
    metadata_filename = output_dir / f"{COMMON_UUID}_{batch_metadata_prefix}.jsonl"
    with open(metadata_filename, "a", encoding="utf-8") as meta_file:
        meta_file.write(json.dumps(batch_metadata) + "\n")
    print(f"Batch metadata for batch {batch_num} appended to {metadata_filename}.")


if __name__ == "__main__":
    # When running this module directly, submit the reasoning batches.
    submit_reasoning_batches()
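The submit loop above follows a common flush-on-threshold batching pattern: accumulate requests, submit once `BATCH_SIZE` is reached, then submit whatever remains at the end. The same control flow, extracted into a generic sketch (the `chunked` helper is hypothetical, not part of this repo):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items, flushing the remainder."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:  # threshold reached: emit a full batch
            yield batch
            batch = []
    if batch:  # emit any leftover items that didn't fill a batch
        yield batch


sizes = [len(b) for b in chunked(range(7), 3)]
print(sizes)  # → [3, 3, 1]
```

Streaming the input this way keeps memory bounded at one batch, which matters when the JSONL file holds tens of thousands of examples.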


@@ -9,7 +9,7 @@ It then calculates and prints statistics for each usage token field:
- cache_creation_input_tokens
- cache_read_input_tokens
- output_tokens
+pricing calculations
+calculates the savings from caching (vs if we hadn't done any caching)
+forecasts costs for 10,000, 20,000 and 50,000 jobs based on tokens per query
@@ -17,17 +17,14 @@ Usage:
    python usage_stats.py path/to/msgbatch_01X9LgZNVkLFhzrrBd9LNgWb_results.jsonl
"""

import argparse
import json
from statistics import mean


def main():
    parser = argparse.ArgumentParser(description="Compute usage token statistics from a JSONL file.")
    parser.add_argument("file", help="Path to the JSONL file containing usage token data.")
    args = parser.parse_args()

    # Usage token fields that we want to track
@@ -43,7 +40,7 @@ def main():
    pricing = {
        "input_tokens": base_input_rate,
        "cache_creation_input_tokens": base_input_rate * 1.25,  # More expensive for initial computation
        "cache_read_input_tokens": base_input_rate * 0.1,  # Cheaper for cache-read tokens
        "output_tokens": 7.50,
    }
@@ -82,7 +79,7 @@ def main():
    print(f"\nProcessed {total_lines} lines with {error_count} error(s).\n")
    print("Usage Tokens Statistics:")
    print("-" * 40)

    grand_total_cost = 0.0
    # Calculate and print stats for each token type
    for key in usage_fields:
@@ -115,7 +112,7 @@ def main():
    # Without caching, all tokens would have been charged at the standard input rate.
    #
    # Baseline cost (if no caching were used):
    #   = (input_tokens + cache_creation_input_tokens + cache_read_input_tokens)
    #     / 1_000_000 * base_input_rate
    #
    # Actual cost (with caching):
@@ -129,9 +126,11 @@ def main():
    sum_cache_read = sum(usage_data["cache_read_input_tokens"])
    baseline_input_cost = (sum_input + sum_cache_creation + sum_cache_read) / 1_000_000 * pricing["input_tokens"]
    actual_input_cost = (
        (sum_input) / 1_000_000 * pricing["input_tokens"]
        + (sum_cache_creation) / 1_000_000 * pricing["cache_creation_input_tokens"]
        + (sum_cache_read) / 1_000_000 * pricing["cache_read_input_tokens"]
    )
    caching_savings = baseline_input_cost - actual_input_cost
    print(f"Caching Savings (input-related tokens): ${caching_savings:.2f}")
@@ -172,12 +171,16 @@ def main():
        forecast_output = avg_output_tokens * jobs

        # Forecast actual cost (with caching applied for input tokens):
        actual_input_cost_forecast = (
            (forecast_input) / 1_000_000 * pricing["input_tokens"]
            + (forecast_cache_creation) / 1_000_000 * pricing["cache_creation_input_tokens"]
            + (forecast_cache_read) / 1_000_000 * pricing["cache_read_input_tokens"]
        )
        # Without caching, all input-related tokens would be at base_input_rate:
        baseline_input_cost_forecast = (
            (forecast_input + forecast_cache_creation + forecast_cache_read) / 1_000_000 * pricing["input_tokens"]
        )
        caching_savings_forecast = baseline_input_cost_forecast - actual_input_cost_forecast
@@ -198,5 +201,6 @@ def main():
    else:
        print("No valid jobs to forecast future costs.")


if __name__ == "__main__":
    main()
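The savings calculation above is plain rate arithmetic: price each token class at its own per-million rate, then compare against pricing everything at the base input rate. A self-contained example with hypothetical token counts and a hypothetical $3.00/MTok base rate (the script's real `base_input_rate` is defined elsewhere in the file):

```python
def input_cost(tokens: dict, pricing: dict) -> float:
    """Dollar cost of input-related tokens, with a per-MTok rate per token class."""
    return sum(tokens[k] / 1_000_000 * pricing[k] for k in tokens)


base = 3.00  # hypothetical $/MTok base input rate, for illustration only
pricing = {
    "input_tokens": base,
    "cache_creation_input_tokens": base * 1.25,  # cache writes cost a premium
    "cache_read_input_tokens": base * 0.1,       # cache reads are heavily discounted
}
tokens = {
    "input_tokens": 1_000_000,
    "cache_creation_input_tokens": 2_000_000,
    "cache_read_input_tokens": 10_000_000,
}

actual = input_cost(tokens, pricing)                    # 3.00 + 7.50 + 3.00
baseline = sum(tokens.values()) / 1_000_000 * base      # 13 MTok at the base rate
print(f"${actual:.2f} vs ${baseline:.2f}, saving ${baseline - actual:.2f}")
# → $13.50 vs $39.00, saving $25.50
```

The break-even intuition: a cached prompt must be read back often enough that the 10x read discount outweighs the 1.25x write premium, which here happens after a handful of reuses.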


@@ -13,24 +13,24 @@ from ..factory import ProceduralDataset, register_dataset
@dataclass
class WordLadderConfig:
    """Configuration for word ladder task generation"""

    min_word_length: int = 4  # Minimum word length
    max_word_length: int = 4  # Maximum word length
    min_chain_length: int = -1  # Set to -1 for shortest path or a minimum of 3
    max_chain_length: int = -1  # Set to -1 for shortest path or a max
    seed: Optional[int] = None
    size: int = 500

    def validate(self) -> None:
        """Validate configuration parameters"""
        assert self.min_word_length >= 3, "min_word_length must be >= 3"
        assert self.max_word_length >= self.min_word_length, "max_word_length must be >= min_word_length"
        assert self.max_word_length <= 5, "max_word_length must be <= 5"
        # Add size validation
        if self.size > 20000:  # Add reasonable upper limit
            raise ValueError("Dataset size too large for this algorithm and constraints")
        # Modified validation logic
        if self.min_chain_length == -1:
            if self.max_chain_length != -1:
@@ -50,10 +50,12 @@ class WordLadderConfig:
            if self.max_chain_length == -1:
                return length >= 3
            return 3 <= length <= self.max_chain_length

        # Otherwise check against both min and max
        return (
            self.min_chain_length <= length <= (self.max_chain_length if self.max_chain_length != -1 else float("inf"))
        )


class WordLadderDataset(ProceduralDataset):
    """Generates word ladder transformation tasks"""
@@ -62,27 +64,25 @@ class WordLadderDataset(ProceduralDataset):
        self.config = config
        self.word_sets = {}
        self.word_graphs = {}

        # Load words from CSV
        self.word_sets = self._load_words_from_csv(
            min_length=self.config.min_word_length, max_length=self.config.max_word_length
        )

        # Precompute word graphs for all lengths
        for length in range(self.config.min_word_length, self.config.max_word_length + 1):
            self.word_graphs[length] = self._build_word_graph(length)

        config.validate()
        super().__init__(config=config, seed=config.seed, size=config.size)

    @classmethod
    def _load_words_from_csv(cls, min_length: int = 3, max_length: int = 5) -> Dict[int, Set[str]]:
        """Load words from CSV file organized by length"""
        # Validate length range before processing
        assert 3 <= min_length <= max_length <= 5, "Word length must be between 3 and 5 inclusive"

        import csv
        from io import StringIO
@@ -99,14 +99,14 @@ class WordLadderDataset(ProceduralDataset):
            for row in reader:
                # Process each word length column using config range
                for length in range(min_length, max_length + 1):
                    col_name = f"{length}_letter"
                    word = row.get(col_name, "")
                    if not word:  # Skip empty entries
                        continue
                    word_sets.setdefault(length, set()).add(word.upper())
        except Exception as e:
            raise RuntimeError(f"Error processing words.csv content: {e}") from e
@@ -122,12 +122,12 @@ class WordLadderDataset(ProceduralDataset):
        # Try precomputed graph first
        if len(word) in self.word_graphs and word in self.word_graphs[len(word)]:
            return self.word_graphs[len(word)].get(word, set())

        # Fall back to computing neighbors directly for custom word sets
        neighbors = set()
        for i in range(len(word)):
            for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                neighbor = word[:i] + c + word[i + 1 :]
                if neighbor != word and neighbor in word_set:
                    neighbors.add(neighbor)
        return neighbors
@@ -137,21 +137,21 @@ class WordLadderDataset(ProceduralDataset):
        # Return cached graph if it exists
        if word_length in self.word_graphs:
            return self.word_graphs[word_length]

        # Build new graph
        word_set = self.word_sets[word_length]
        graph = {}

        # Build connections
        for word in word_set:
            neighbors = set()
            for i in range(word_length):
                for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                    neighbor = word[:i] + c + word[i + 1 :]
                    if neighbor != word and neighbor in word_set:
                        neighbors.add(neighbor)
            graph[word] = neighbors

        # Cache and return
        self.word_graphs[word_length] = graph
        return self.word_graphs[word_length]
@@ -161,7 +161,7 @@ class WordLadderDataset(ProceduralDataset):
        # Early exit if words are direct neighbors
        if end in self._get_neighbors(start, word_set):
            return [start, end]

        # Use basic BFS for shortest path
        queue = deque([(start, [start])])
        visited = {start}
@@ -172,13 +172,13 @@ class WordLadderDataset(ProceduralDataset):
                if self.config.is_valid_path_length(len(path)):
                    return path
                return None

            for neighbor in self._get_neighbors(current, word_set):
                if neighbor not in visited:
                    visited.add(neighbor)
                    new_path = path + [neighbor]
                    queue.append((neighbor, new_path))

        return None

    def _generate_word_pair(self, rng: Random, length: int) -> Tuple[str, str, List[str]]:
@@ -186,25 +186,25 @@ class WordLadderDataset(ProceduralDataset):
        word_set = self.word_sets[length]
        words_list = sorted(word_set)
        max_attempts = 100

        for _ in range(max_attempts):
            start = rng.choice(words_list)
            end = rng.choice(words_list)
            if start == end:
                continue
            path = self._find_path(start, end, word_set)
            if path:
                return start, end, path

        raise RuntimeError(f"Failed to find valid pair for length {length}")

    def __getitem__(self, idx: int) -> dict:
        """Generate a single word ladder task"""
        if idx >= self.size:
            raise IndexError(f"Dataset index {idx} out of range for size {self.size}")

        try:
            rng = Random(self.seed + idx)
            length = rng.randint(self.config.min_word_length, self.config.max_word_length)
@@ -213,11 +213,12 @@ class WordLadderDataset(ProceduralDataset):
            # If we run out of valid paths, adjust the virtual size
            self.size = idx
            raise IndexError(f"Dataset exhausted at index {idx}. {str(e)}")

        return {
            "question": f"Transform the word ladder '{start}' to '{end}' by changing one letter at a time.",
            "answer": ",".join(path),
            "metadata": {"start_word": start, "end_word": end, "word_length": length, "chain_length": len(path)},
        }


register_dataset("word_ladder", WordLadderDataset, WordLadderConfig)


@@ -5305,4 +5305,4 @@ ZZZ,FEST,CAULS
,,ZONAL
,,ZONED
,,ZONES
,,ZOOMS


@ -1,6 +1,7 @@
import pytest
from random import Random
import time import time
from random import Random
import pytest
from reasoning_gym.algorithmic.word_ladder import WordLadderConfig, WordLadderDataset from reasoning_gym.algorithmic.word_ladder import WordLadderConfig, WordLadderDataset
@ -63,14 +64,14 @@ def test_word_ladder_dataset_unique_pairs():
"""Test that generated word pairs are unique"""
config = WordLadderConfig(size=50, seed=42)
dataset = WordLadderDataset(config)
# Track all generated pairs
seen_pairs = set()
for i in range(len(dataset)):
item = dataset[i]
pair = (
min(item["metadata"]["start_word"], item["metadata"]["end_word"]),
max(item["metadata"]["start_word"], item["metadata"]["end_word"]),
)
assert pair not in seen_pairs, f"Duplicate pair found: {pair}"
seen_pairs.add(pair)
@ -80,23 +81,13 @@ def test_word_ladder_dataset_items():
"""Test basic properties of generated items"""
# Test with specific chain length constraints
config1 = WordLadderConfig(
min_word_length=3, max_word_length=5, min_chain_length=3, max_chain_length=5, size=10, seed=42
)
dataset1 = WordLadderDataset(config1)
# Test with shortest path mode
config2 = WordLadderConfig(
min_word_length=3, max_word_length=5, min_chain_length=-1, max_chain_length=-1, size=10, seed=42
)
dataset2 = WordLadderDataset(config2)
@ -124,13 +115,13 @@ def test_word_ladder_dataset_items():
# Verify solution chain from answer
solution_chain = item["answer"].split(",")
# Verify chain length based on config
if dataset.config.min_chain_length == -1:
assert len(solution_chain) >= 3
else:
assert dataset.config.min_chain_length <= len(solution_chain) <= dataset.config.max_chain_length
assert solution_chain[0] == metadata["start_word"]
assert solution_chain[-1] == metadata["end_word"]
@ -171,35 +162,32 @@ def test_word_ladder_path_finding():
min_chain_length=-1,  # Shortest path mode
max_chain_length=-1,
size=10,
seed=42,
)
dataset = WordLadderDataset(config)
# Test finding path between known words
word_set = dataset.word_sets[4]
path = dataset._find_path("WORD", "FIND", word_set)
# Verify path properties
assert path is not None
assert path[0] == "WORD"
assert path[-1] == "FIND"
assert len(path) >= 3
# Verify each step differs by only one letter
for i in range(len(path) - 1):
current = path[i]
next_word = path[i + 1]
assert next_word in dataset._get_neighbors(current, word_set)
def test_word_ladder_csv_loading():
"""Test word loading from CSV"""
config = WordLadderConfig(min_word_length=3, max_word_length=5)
dataset = WordLadderDataset(config)
# Verify word sets for each length
for length in range(3, 6):
assert length in dataset.word_sets
@ -210,7 +198,7 @@ def test_word_ladder_csv_loading():
assert len(word) == length
assert word.isupper()
assert word.isalpha()
# Test invalid length range
with pytest.raises(AssertionError):
bad_config = WordLadderConfig(min_word_length=2, max_word_length=7)
@ -219,23 +207,18 @@ def test_word_ladder_csv_loading():
def test_word_ladder_pair_generation():
"""Test word pair generation logic"""
config = WordLadderConfig(min_word_length=4, max_word_length=4, size=10, seed=42)
dataset = WordLadderDataset(config)
# Test pair generation
rng = Random(42)
start, end, path = dataset._generate_word_pair(rng, 4)
# Verify path properties
assert start == path[0]
assert end == path[-1]
assert len(path) >= 3
# Verify path is valid (each step differs by one letter)
for i in range(len(path) - 1):
current = path[i]
@ -247,17 +230,17 @@ def test_word_graph_caching():
"""Test word graph caching functionality"""
config = WordLadderConfig(seed=42)
dataset = WordLadderDataset(config)
# Verify initial graphs are precomputed
assert len(dataset.word_graphs) > 0
# Get initial graph for length 4
graph_4 = dataset.word_graphs[4]
# Verify cached graph is returned
cached_graph = dataset._build_word_graph(4)
assert cached_graph is dataset.word_graphs[4]
# Verify graph structure
for word, neighbors in graph_4.items():
assert len(word) == 4
@ -269,33 +252,24 @@ def test_word_graph_caching():
def test_word_ladder_path_validation():
"""Test path length validation logic"""
config = WordLadderConfig(min_chain_length=4, max_chain_length=6)
# Test specific length mode
assert config.is_valid_path_length(4)  # Min length
assert config.is_valid_path_length(5)  # Middle length
assert config.is_valid_path_length(6)  # Max length
assert not config.is_valid_path_length(3)  # Too short
assert not config.is_valid_path_length(7)  # Too long
# Test shortest path mode
config_shortest = WordLadderConfig(min_chain_length=-1, max_chain_length=-1)
assert config_shortest.is_valid_path_length(3)
assert config_shortest.is_valid_path_length(4)
assert config_shortest.is_valid_path_length(10)
assert not config_shortest.is_valid_path_length(2)
# Test mixed mode (shortest with max limit)
config_mixed = WordLadderConfig(min_chain_length=-1, max_chain_length=5)
assert config_mixed.is_valid_path_length(3)
assert config_mixed.is_valid_path_length(4)
assert config_mixed.is_valid_path_length(5)
@ -305,44 +279,41 @@ def test_word_ladder_path_validation():
def test_word_ladder_solution_optimality():
"""Test that generated solutions are optimal when min_chain_length=-1"""
config = WordLadderConfig(
min_word_length=4, max_word_length=4, min_chain_length=-1, max_chain_length=-1, size=20, seed=42
)
dataset = WordLadderDataset(config)
for i in range(len(dataset)):
item = dataset[i]
solution_chain = item["answer"].split(",")
start_word = item["metadata"]["start_word"]
end_word = item["metadata"]["end_word"]
# Verify this is the shortest possible path
word_set = dataset.word_sets[len(start_word)]
# Build graph and use BFS to find shortest path
from collections import deque
queue = deque([(start_word, [start_word])])
visited = {start_word}
shortest_path = None
while queue and not shortest_path:
current_word, path = queue.popleft()
if current_word == end_word:
shortest_path = path
break
for neighbor in dataset._get_neighbors(current_word, word_set):
if neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, path + [neighbor]))
assert shortest_path is not None, f"No path found between {start_word} and {end_word}"
assert len(solution_chain) == len(shortest_path), f"Solution {solution_chain} is not optimal. Shortest path: {shortest_path}"
def test_word_ladder_performance():
@ -351,13 +322,13 @@ def test_word_ladder_performance():
start_time = time.time()
dataset = WordLadderDataset(config)
init_time = time.time() - start_time
# Test item generation time
start_time = time.time()
for i in range(len(dataset)):
_ = dataset[i]
access_time = time.time() - start_time
# These thresholds should be adjusted based on requirements
assert init_time < 2.0, f"Initialization took too long: {init_time:.2f}s"
assert access_time < 1.0, f"Data access took too long: {access_time:.2f}s"
@ -369,24 +340,18 @@ def test_word_ladder_edge_cases():
config = WordLadderConfig(size=1)
dataset = WordLadderDataset(config)
assert len(dataset) == 1
# Test with same start/end word length but maximum distance
config = WordLadderConfig(min_word_length=4, max_word_length=4, min_chain_length=-1, max_chain_length=-1, size=10)
dataset = WordLadderDataset(config)
# Find the pair with longest solution
max_length = 0
for i in range(len(dataset)):
item = dataset[i]
chain_length = len(item["answer"].split(","))
max_length = max(max_length, chain_length)
assert max_length > 3, "No challenging word pairs generated"