Add example output

2026-04-24 17:04:55 +00:00 · 2025-05-18 17:43:38 -04:00 · 09eed6584d
commit 09eed6584d
parent 71a87edd50
1 changed file with 26 additions and 6 deletions
--- a/environments/hack0/README.md
+++ b/environments/hack0/README.md
@@ -1,22 +1,38 @@
 # Punchline VR-CLI Environment

-This fork contains an Atropos environment designed to train a large language model to generate humorous punchlines for jokes. The environment utilizes a Reinforcement Learning (RL) technique called Verifiable Rewards via Completion Likelihood Improvement (VR-CLI), taken from the paper "Learning to Reason for Long-Form Story Generation" (Gurung & Lapata, 2025) [https://arxiv.org/html/2503.22828v1#S8](https://arxiv.org/html/2503.22828v1#S8).
+This fork contains an Atropos environment designed to train a large language model to generate humorous punchlines for jokes. The environment utilizes a Reinforcement Learning (RL) technique called Verifiable Rewards via Completion Likelihood Improvement (VR-CLI), taken from the paper "Learning to Reason for Long-Form Story Generation" (Gurung & Lapata, 2025) [https://arxiv.org/html/2503.22828v1](https://arxiv.org/html/2503.22828v1).

 ## Environment Design and Motivation

 The core idea is to teach a model not just to produce a punchline, but to first generate a "reasoning" or "thought process" that leads to a good punchline. The quality of this reasoning is then "verified" by measuring how much it improves the likelihood (reduces the perplexity) of the *actual* punchline from the dataset, as assessed by a separate, fixed reference model. This greatly reduces overfitting, as the model does not have access to the ground-truth answer.

+## Example
+
+Here's an example of how the model generates a punchline with reasoning:
+
+```
+Question: What do you call a herd of cows masturbating?
+
+<think>
+Okay, the user is asking, "What do you call a herd of cows masturbating?" Hmm, that's a play on words. Let me think. The key here is the word "masturbating" and the animal "cows." The answer needs to be a pun or a play on words.
+
+First, I need to connect "herd" with "masturbating." A herd of cows is a group, so maybe something related to a group. "Masturbating" is a term that's not typically used for animals, but maybe there's a word that combines the two.........
+</think>
+
+Beef strokin off!
+```
+
 ### Key Components:

 *   **Dataset:** The environment uses the `"SocialGrep/one-million-reddit-jokes"` dataset, filtering for jokes with a question-answer format (setup and punchline) and a minimum number of upvotes.
-*   **Task:** Given the setup of a joke (the "question"), the model is prompted to generate a thinking process (`<think>...</think>`) followed by the punchline.
+*   **Task:** Given the setup of a joke (the "question"), the model is prompted to generate a thinking process `<think>...</think>` followed by the punchline.
 *   **Reward (VR-CLI):**
-    1.  A base perplexity of the "golden" punchline is calculated given only the joke's setup, using a reference LLM (`Qwen/Qwen3-1.7B-Base`).
+    1.  A base perplexity of the "golden" punchline is calculated given only the joke's setup, using a reference LLM `Qwen/Qwen3-1.7B-Base`.
    2.  A new perplexity of the golden punchline is calculated, this time conditioned on both the joke's setup AND the model-generated reasoning.
-    3.  The reward is proportional to the improvement in perplexity ( `(base_perplexity - plus_perplexity) / base_perplexity`). A positive reward indicates the reasoning was helpful.
+    3.  The reward is proportional to the improvement in perplexity `(base_perplexity - plus_perplexity) / base_perplexity`. A positive reward indicates the reasoning was helpful.
 *   **Models:**
    *   The environment is configured to use `Qwen/Qwen3-1.7B` for generating trajectories.
-    *   A reference model (`Qwen/Qwen3-1.7B-Base`) is used locally to calculate the VR-CLI reward.
+    *   A reference model `Qwen/Qwen3-1.7B-Base` is used locally to calculate the VR-CLI reward.

 The motivation is to guide the LLM towards generating more creative and contextually relevant punchlines by explicitly rewarding the intermediate reasoning steps that make a punchline "work." Typical fine-tuning fails to do this, as it makes the models memorize the jokes rather than gain an understanding of what makes them funny.

@@ -55,4 +71,8 @@ python punchline_env.py process \

 You will need to have vLLM serving the model on port 9001 for this to work.

-[Weights & Biases link](https://wandb.ai/jaboggs-nous-hackathon-nc-state-university/uncategorized/runs/c24sz5t5)
+[Weights & Biases link](https://wandb.ai/jaboggs-nous-hackathon-nc-state-university/uncategorized/runs/c24sz5t5)
+
+#### Output
+
+Zip placeholder
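
The VR-CLI reward described under "Key Components" in the README can be sketched as follows. This is a minimal illustration, not the code in `punchline_env.py`: the function names and the example log-probabilities are hypothetical, and it assumes the reference model's per-token log-probabilities for the golden punchline have already been obtained (once conditioned on the setup alone, once on the setup plus the generated reasoning). The actual environment may compute or post-process the reward differently.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def vr_cli_reward(
    base_logprobs: Sequence[float],
    plus_logprobs: Sequence[float],
) -> float:
    """Relative perplexity improvement of the golden punchline.

    base_logprobs: log-probs of the punchline tokens given only the setup.
    plus_logprobs: log-probs of the same tokens given setup + reasoning.
    (Both argument names are assumptions made for this sketch.)
    """
    base_perplexity = perplexity(base_logprobs)
    plus_perplexity = perplexity(plus_logprobs)
    return (base_perplexity - plus_perplexity) / base_perplexity


# Hypothetical numbers: the reasoning makes each punchline token more likely.
reward = vr_cli_reward(base_logprobs=[-2.0, -2.0], plus_logprobs=[-1.0, -1.0])
print(f"reward = {reward:.3f}")  # positive, so the reasoning helped
```

A reward above zero means the reasoning lowered the punchline's perplexity; a reward below zero means it made the golden punchline less likely under the reference model.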