Mirror of https://github.com/NousResearch/atropos.git (synced 2026-04-19 12:57:58 +00:00)
# Cybersecurity Sigma Rule Generation Environment
This environment trains LLMs to generate semantically correct Sigma detection rules from threat-hunting prompts. It provides two different reward mechanisms for evaluating generated rules.
## Overview
The environment focuses on structured generation tasks where outputs must be valid YAML conforming to Sigma detection rule schemas. It includes two implementations with different reward functions:
- **Jaccard Similarity Reward** (`jaccard_reward_env.py`) - uses token-based similarity scoring
- **LLM Judge Reward** (`llm_judge_env.py`) - uses LLM-based semantic evaluation
## Core Features

### Dataset Integration

- Uses the `mmaisel1/nous-rl-hackathon-sigma` dataset from Hugging Face
- Contains threat-hunting prompts paired with corresponding Sigma rules
- Automatic train/test split with shuffling for reproducibility
### Structured Output Format

- Enforces a specific output format with `<think>...</think>` reasoning tags
- Requires YAML output wrapped in a LaTeX `\boxed{...}` environment
- Validates YAML syntax and Sigma rule structure
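A completion satisfying these constraints might look like the following (the reasoning text here is illustrative only; the rule body echoes the example rule shown later in this README):

```
<think>
Office applications loading assembly DLLs can indicate tooling abuse, so the
rule should match DLL load events from Office binaries.
</think>
\boxed{
detection:
  condition: selection
  selection:
    process_name:
      - excel.exe
    dll_loaded: "*.dll"
logsource:
  category: process
  product: windows
}
```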
### Dual Reward Mechanisms

#### Jaccard Similarity Scoring

- Compares flattened key paths of the gold and generated YAML under the `detection:` section
- Uses scikit-learn's Jaccard similarity for token-based matching
- Tends to produce low, sparse rewards due to structural mismatches
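As a rough illustration of this scoring scheme, here is a minimal sketch: the function names are hypothetical, plain Python sets stand in for scikit-learn's Jaccard computation, and already-parsed dicts stand in for the YAML parsing step.

```python
def flatten_keys(node, prefix=""):
    """Recursively collect dotted key paths from a parsed rule mapping."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.add(path)
            paths |= flatten_keys(value, path)
    elif isinstance(node, list):
        for item in node:
            paths |= flatten_keys(item, prefix)
    return paths

def jaccard_reward(gold_detection, generated_detection):
    """Set-based Jaccard similarity between two detection sections."""
    gold = flatten_keys(gold_detection)
    gen = flatten_keys(generated_detection)
    union = gold | gen
    return len(gold & gen) / len(union) if union else 1.0

# A generated rule that drops one field scores below 1.0.
gold = {"selection": {"process_name": ["excel.exe"], "dll_loaded": "*.dll"},
        "condition": "selection"}
gen = {"selection": {"process_name": ["excel.exe"]}, "condition": "selection"}
print(jaccard_reward(gold, gen))  # → 0.75
```

Because any structural deviation shrinks the intersection of key paths, this style of scoring is strict, which is consistent with the low, sparse rewards noted above.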
#### LLM-as-a-Judge Scoring
- Uses binary LLM evaluation for semantic equivalence assessment
- Returns 1.0 if the generated rule is functionally equivalent to the gold standard, 0.0 otherwise
- Provides higher-fidelity supervision even when structure varies
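The judge pattern can be sketched as follows; the prompt wording and function names are hypothetical (the actual prompt lives in `llm_judge_env.py`), and a stub callable stands in for the OpenAI API call.

```python
def build_judge_prompt(gold_rule, generated_rule):
    """Hypothetical judge prompt asking for a binary verdict."""
    return (
        "You are an expert in Sigma detection rules. Decide whether the "
        "candidate rule is functionally equivalent to the reference rule.\n\n"
        f"Reference rule:\n{gold_rule}\n\n"
        f"Candidate rule:\n{generated_rule}\n\n"
        "Reply with a single character: 1 if equivalent, 0 otherwise."
    )

def judge_reward(gold_rule, generated_rule, ask):
    """Binary reward; `ask` maps a prompt string to the judge model's reply."""
    reply = ask(build_judge_prompt(gold_rule, generated_rule)).strip()
    return 1.0 if reply.startswith("1") else 0.0

# A stub judge stands in for a chat-completion call here.
print(judge_reward("detection: ...", "detection: ...", lambda prompt: "1"))
```

Injecting the `ask` callable keeps the reward logic testable offline; in the real environment the callable would wrap an API client.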
### Advanced Features
- Length penalty system for overly verbose outputs
- Comprehensive evaluation metrics tracking
- W&B integration for experiment monitoring
- Configurable token limits and batch sizes
## Technical Implementation

### Environment Configuration

- **Model**: `NousResearch/DeepHermes-3-Llama-3-3B-Preview`
- **Max Token Length**: 2048 tokens
- **Group Size**: 8 completions per prompt
- **Batch Size**: 12 items per batch
- **Evaluation Frequency**: every 100 steps
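For reference, these settings could be grouped into a single container like the sketch below; the class and field names are illustrative, not the environment's actual config object.

```python
from dataclasses import dataclass

@dataclass
class SigmaEnvConfig:
    """Illustrative container for the settings listed above."""
    model_name: str = "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
    max_token_length: int = 2048
    group_size: int = 8            # completions sampled per prompt
    batch_size: int = 12           # items per training batch
    eval_every_n_steps: int = 100  # evaluation frequency

config = SigmaEnvConfig()
# Completions generated per training batch under these settings:
print(config.group_size * config.batch_size)  # → 96
```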
### System Prompt Structure
The environment uses a detailed system prompt that:
- Enforces structured reasoning with `<think>` tags
- Requires YAML output in a `\boxed{}` environment
- Provides Sigma rule best practices and examples
- Specifies exact formatting requirements for parser compatibility
### Scoring Pipeline

1. **Extraction**: parse YAML from the `\boxed{}` wrapper using regex
2. **Validation**: attempt YAML parsing and structure validation
3. **Evaluation**: apply either Jaccard similarity or LLM judge scoring
4. **Aggregation**: collect scores for batch-level reward computation
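The extraction step can be sketched with a simple regex; this is a simplified stand-in for the environment's `math_verify`-based extraction, and the function name is hypothetical.

```python
import re

# Greedy match grabs everything between \boxed{ and the final closing brace.
BOXED_RE = re.compile(r"\\boxed\{(.*)\}", re.DOTALL)

def extract_boxed(completion):
    """Return the raw text inside the last \\boxed{...} wrapper, or None."""
    matches = BOXED_RE.findall(completion)
    return matches[-1].strip() if matches else None

completion = (
    "<think>The rule needs a selection on Office processes.</think>\n"
    "\\boxed{detection:\n  condition: selection}"
)
print(extract_boxed(completion))
```

The extracted string would then be handed to a YAML parser for the validation step.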
## Setup and Usage

### Environment Variables

```bash
export OPENAI_API_KEY="your-openai-api-key"  # For LLM judge (optional)
export NOUS_API_KEY="your-nous-api-key"      # For model inference
```
### Command Line Usage

```bash
# Jaccard similarity reward
python environments/community/cybersecurity_sigma/jaccard_reward_env.py

# LLM judge reward
python environments/community/cybersecurity_sigma/llm_judge_env.py
```
### Dependencies

- `datasets` - Hugging Face dataset loading
- `scikit-learn` - Jaccard similarity computation (`jaccard_reward_env` only)
- `latex2sympy2_extended` - LaTeX parsing utilities
- `math_verify` - YAML extraction from LaTeX boxes
- `openai` - LLM judge API calls (`llm_judge_env` only)
## Research Applications

### Cybersecurity Training
- Train models to understand threat detection patterns
- Generate rules for various attack vectors and techniques
- Develop automated threat hunting capabilities
### Structured Generation Research
- Study LLM performance on constrained output formats
- Compare token-based vs. semantic evaluation methods
- Investigate reasoning quality in cybersecurity domains
### Evaluation Methodology Development
- Benchmark different reward function approaches
- Analyze correlation between structural and semantic correctness
- Develop better automated evaluation metrics for domain-specific tasks
## Performance Characteristics

### Jaccard Similarity Results

- **Typical Rewards**: 0.1-0.3 range due to structural sensitivity
- **Strengths**: fast computation, deterministic scoring
- **Limitations**: sensitive to formatting differences, low reward density
### LLM Judge Results

- **Typical Rewards**: binary 0.0/1.0 with higher success rates
- **Strengths**: semantic understanding, format flexibility
- **Limitations**: API latency, potential inconsistency, cost considerations
## Example Outputs

### Input Prompt

```
DotNET Assembly DLL Loaded Via Office Application: Detects any assembly DLL being loaded by an Office Product
```
### Expected Sigma Rule Format

```yaml
detection:
  condition: selection
  selection:
    process_name:
      - excel.exe
      - word.exe
      - powerpnt.exe
    dll_loaded: "*.dll"
logsource:
  category: process
  product: windows
```
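A minimal structural check in the spirit of the validation step described earlier might look like this; the function name and the exact checks are illustrative, and an already-parsed dict stands in for PyYAML output.

```python
def validate_rule_structure(rule):
    """Minimal check: a detection mapping with a condition, plus a logsource."""
    detection = rule.get("detection")
    if not isinstance(detection, dict) or "condition" not in detection:
        return False
    logsource = rule.get("logsource")
    return isinstance(logsource, dict) and bool(logsource)

# The example rule above, as it would look after YAML parsing.
example = {
    "detection": {
        "condition": "selection",
        "selection": {
            "process_name": ["excel.exe", "word.exe", "powerpnt.exe"],
            "dll_loaded": "*.dll",
        },
    },
    "logsource": {"category": "process", "product": "windows"},
}
print(validate_rule_structure(example))  # → True
```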
The environment provides a robust framework for training LLMs on cybersecurity detection rule generation with flexible evaluation mechanisms suited for different research objectives.