mirror of https://github.com/NousResearch/atropos.git
synced 2026-04-19 12:57:58 +00:00

Integrate subrahmanyam cybersecurity (#142)

* cybersecurity env for offline RL trajectories
* output file addition
* jsonl outputs
* code cleanup
* pulled out outputs and fixing .gitignore
* removed zip file
* gitignore typo fix
* Integrate cybersecurity Sigma rule generation environment

Co-authored-by: Subrahmanyam Arunachalam <subrahmanyam.arunachalam@FVFGK0VTQ05P.local>

This commit is contained in:
parent b33070f56b
commit b774e97215

4 changed files with 890 additions and 0 deletions
139 environments/community/cybersecurity_sigma/README.md (new file)
# Cybersecurity Sigma Rule Generation Environment

This environment trains LLMs to generate semantically correct Sigma detection rules from threat-hunting prompts. It provides two different reward mechanisms for evaluating generated rules.

## Overview

The environment focuses on structured generation tasks where outputs must be valid YAML conforming to the Sigma detection rule schema. It includes two implementations with different reward functions:

1. **Jaccard Similarity Reward** (`jaccard_reward_env.py`) - uses token-based similarity scoring
2. **LLM Judge Reward** (`llm_judge_env.py`) - uses LLM-based semantic evaluation

## Core Features

### Dataset Integration

- Uses the `mmaisel1/nous-rl-hackathon-sigma` dataset from Hugging Face
- Contains threat-hunting prompts paired with corresponding Sigma rules
- Automatic train/test split with shuffling for reproducibility

### Structured Output Format

- Enforces a specific output format with `<think>...</think>` reasoning tags
- Requires YAML output wrapped in a LaTeX `\boxed{...}` environment
- Validates YAML syntax and Sigma rule structure

### Dual Reward Mechanisms

#### Jaccard Similarity Scoring

- Compares the flattened key paths of the gold and generated YAML under the `detection:` section
- Uses scikit-learn's Jaccard similarity for token-based matching
- Tends to produce low, sparse rewards because of structural mismatches

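The key-path comparison above can be sketched in pure Python. This is a minimal illustration rather than the repository's implementation (which uses scikit-learn): `flatten_paths` and `jaccard_reward` are hypothetical names, and both functions assume the rules have already been parsed into Python dicts.

```python
def flatten_paths(node, prefix=""):
    """Flatten a parsed YAML node into a set of dotted key paths (leaf values included)."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            paths |= flatten_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for item in node:
            paths |= flatten_paths(item, prefix)
    else:
        paths.add(f"{prefix}{node}")
    return paths


def jaccard_reward(gold_rule, generated_rule):
    """Jaccard similarity |A & B| / |A | B| over detection-section key paths."""
    gold = flatten_paths(gold_rule.get("detection", {}))
    gen = flatten_paths(generated_rule.get("detection", {}))
    if not gold and not gen:
        return 1.0
    return len(gold & gen) / len(gold | gen)
```

Because every key path must match exactly, a rule that is semantically right but names its selection block differently scores near zero, which is one reason rewards from this scorer tend to be sparse.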
#### LLM-as-a-Judge Scoring

- Uses a binary LLM evaluation to assess semantic equivalence
- Returns 1.0 if the generated rule is functionally equivalent to the gold standard, 0.0 otherwise
- Provides higher-fidelity supervision even when the structure varies

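The binary judging step amounts to a single prompt-and-parse round trip, sketched below. This is illustrative, not the environment's actual prompt: `JUDGE_PROMPT`, `parse_verdict`, and `judge_reward` are hypothetical names, and the model call is abstracted behind a `complete` callable so any chat-API wrapper can be plugged in.

```python
# Hypothetical judge prompt; the environment's real prompt lives in llm_judge_env.py.
JUDGE_PROMPT = (
    "You are a Sigma rule reviewer. Reply with exactly YES if the candidate rule "
    "is functionally equivalent to the reference rule, otherwise reply NO.\n\n"
    "Reference rule:\n{gold}\n\nCandidate rule:\n{candidate}\n"
)


def parse_verdict(reply):
    """Map the judge's free-text reply onto the binary 0.0 / 1.0 reward."""
    return 1.0 if reply.strip().upper().startswith("YES") else 0.0


def judge_reward(complete, gold, candidate):
    """`complete` is any prompt -> reply callable, e.g. a chat-completion wrapper."""
    return parse_verdict(complete(JUDGE_PROMPT.format(gold=gold, candidate=candidate)))
```

Pinning the judge to a YES/NO vocabulary keeps parsing trivial, at the cost of losing any partial-credit signal.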
### Advanced Features

- A length penalty for overly verbose outputs
- Comprehensive tracking of evaluation metrics
- W&B integration for experiment monitoring
- Configurable token limits and batch sizes

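The length penalty's exact schedule lives in the environment code; the sketch below shows one common shape such a penalty could take, with hypothetical `max_tokens` and `penalty_scale` parameters (a linear ramp-down once the output exceeds its token budget).

```python
def apply_length_penalty(reward, num_tokens, max_tokens=2048, penalty_scale=0.5):
    """Linearly scale the reward toward zero once the output exceeds max_tokens."""
    if num_tokens <= max_tokens:
        return reward  # within budget: reward is untouched
    overflow = (num_tokens - max_tokens) / max_tokens
    return reward * max(0.0, 1.0 - penalty_scale * overflow)
```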
## Technical Implementation

### Environment Configuration

- **Model**: NousResearch/DeepHermes-3-Llama-3-3B-Preview
- **Max Token Length**: 2048 tokens
- **Group Size**: 8 completions per prompt
- **Batch Size**: 12 items per batch
- **Evaluation Frequency**: every 100 steps

### System Prompt Structure

The environment uses a detailed system prompt that:

- Enforces structured reasoning with `<think>` tags
- Requires YAML output inside a `\boxed{}` environment
- Provides Sigma rule best practices and examples
- Specifies exact formatting requirements for parser compatibility

### Scoring Pipeline

1. **Extraction**: parse the YAML out of the `\boxed{}` wrapper using a regex
2. **Validation**: attempt YAML parsing and structure validation
3. **Evaluation**: apply either Jaccard similarity or LLM judge scoring
4. **Aggregation**: collect scores for batch-level reward computation

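The extraction step can be sketched with a brace-balancing scan (the repository delegates this to `math_verify`/`latex2sympy2_extended`; `extract_boxed` is a hypothetical stand-in). A plain non-greedy regex would truncate at the first `}`, so nested braces are counted explicitly.

```python
import re


def extract_boxed(completion):
    """Return the contents of the first \\boxed{...}, balancing nested braces."""
    match = re.search(r"\\boxed\{", completion)
    if match is None:
        return None  # model never produced the wrapper
    depth, start = 1, match.end()
    for i in range(start, len(completion)):
        if completion[i] == "{":
            depth += 1
        elif completion[i] == "}":
            depth -= 1
            if depth == 0:
                return completion[start:i]
    return None  # unbalanced braces: treat as a failed extraction
```

The returned string is then handed to a YAML parser; a `None` here (missing wrapper or unbalanced braces) would typically short-circuit to a zero reward.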
## Setup and Usage

### Environment Variables

```bash
export OPENAI_API_KEY="your-openai-api-key"  # For LLM judge (optional)
export NOUS_API_KEY="your-nous-api-key"      # For model inference
```

### Command Line Usage

```bash
# Jaccard similarity reward
python environments/community/cybersecurity_sigma/jaccard_reward_env.py

# LLM judge reward
python environments/community/cybersecurity_sigma/llm_judge_env.py
```

### Dependencies

- `datasets` - Hugging Face dataset loading
- `scikit-learn` - Jaccard similarity computation (`jaccard_reward_env` only)
- `latex2sympy2_extended` - LaTeX parsing utilities
- `math_verify` - YAML extraction from LaTeX boxes
- `openai` - LLM judge API calls (`llm_judge_env` only)

## Research Applications

### Cybersecurity Training

- Train models to understand threat-detection patterns
- Generate rules for various attack vectors and techniques
- Develop automated threat-hunting capabilities

### Structured Generation Research

- Study LLM performance on constrained output formats
- Compare token-based and semantic evaluation methods
- Investigate reasoning quality in the cybersecurity domain

### Evaluation Methodology Development

- Benchmark different reward-function approaches
- Analyze the correlation between structural and semantic correctness
- Develop better automated evaluation metrics for domain-specific tasks

## Performance Characteristics

### Jaccard Similarity Results

- **Typical Rewards**: 0.1-0.3 range, due to structural sensitivity
- **Strengths**: fast computation, deterministic scoring
- **Limitations**: sensitive to formatting differences, low reward density

### LLM Judge Results

- **Typical Rewards**: binary 0.0/1.0, with higher success rates
- **Strengths**: semantic understanding, format flexibility
- **Limitations**: API latency, potential inconsistency, cost

## Example Outputs

### Input Prompt

```
DotNET Assembly DLL Loaded Via Office Application: Detects any assembly DLL being loaded by an Office Product
```

### Expected Sigma Rule Format

```yaml
detection:
  condition: selection
  selection:
    process_name:
      - excel.exe
      - word.exe
      - powerpnt.exe
    dll_loaded: "*.dll"
logsource:
  category: process
  product: windows
```

The environment provides a robust framework for training LLMs on cybersecurity detection-rule generation, with evaluation mechanisms flexible enough to suit different research objectives.