# Cybersecurity Sigma Rule Generation Environment
This environment trains LLMs to generate semantically correct Sigma detection rules from threat-hunting prompts. It provides two different reward mechanisms for evaluating generated rules.
## Overview
The environment focuses on structured generation tasks where outputs must be valid YAML conforming to Sigma detection rule schemas. It includes two implementations with different reward functions:
1. **Jaccard Similarity Reward** (`jaccard_reward_env.py`) - Uses token-based similarity scoring
2. **LLM Judge Reward** (`llm_judge_env.py`) - Uses LLM-based semantic evaluation
## Core Features
### Dataset Integration
- Uses the `mmaisel1/nous-rl-hackathon-sigma` dataset from Hugging Face
- Contains threat-hunting prompts paired with corresponding Sigma rules
- Automatic train/test split with shuffling for reproducibility
### Structured Output Format
- Enforces specific output format with `...` reasoning tags
- Requires YAML output wrapped in LaTeX `\boxed{...}` environment
- Validates YAML syntax and Sigma rule structure
### Dual Reward Mechanisms
#### Jaccard Similarity Scoring
- Compares flattened key paths of gold and generated YAML under `detection:` section
- Uses scikit-learn's Jaccard similarity for token-based matching
- Tends to produce low and sparse rewards due to structural mismatches
#### LLM-as-a-Judge Scoring
- Uses binary LLM evaluation for semantic equivalence assessment
- Returns 1.0 if generated rule is functionally equivalent to gold standard
- Provides higher-fidelity supervision even when structure varies
### Advanced Features
- Length penalty system for overly verbose outputs
- Comprehensive evaluation metrics tracking
- W&B integration for experiment monitoring
- Configurable token limits and batch sizes
## Technical Implementation
### Environment Configuration
- **Model**: NousResearch/DeepHermes-3-Llama-3-3B-Preview
- **Max Token Length**: 2048 tokens
- **Group Size**: 8 completions per prompt
- **Batch Size**: 12 items per batch
- **Evaluation Frequency**: Every 100 steps
### System Prompt Structure
The environment uses a detailed system prompt that:
- Enforces structured reasoning with `` tags
- Requires YAML output in `\boxed{}` environment
- Provides Sigma rule best practices and examples
- Specifies exact formatting requirements for parser compatibility
### Scoring Pipeline
1. **Extraction**: Parse YAML from `\boxed{}` wrapper using regex
2. **Validation**: Attempt YAML parsing and structure validation
3. **Evaluation**: Apply either Jaccard similarity or LLM judge scoring
4. **Aggregation**: Collect scores for batch-level reward computation
## Setup and Usage
### Environment Variables
```bash
export OPENAI_API_KEY="your-openai-api-key" # For LLM judge (optional)
export NOUS_API_KEY="your-nous-api-key" # For model inference
```
### Command Line Usage
```bash
# Jaccard similarity reward
python environments/community/cybersecurity_sigma/jaccard_reward_env.py
# LLM judge reward
python environments/community/cybersecurity_sigma/llm_judge_env.py
```
### Dependencies
- `datasets` - Hugging Face dataset loading
- `scikit-learn` - Jaccard similarity computation (jaccard_reward_env only)
- `latex2sympy2_extended` - LaTeX parsing utilities
- `math_verify` - YAML extraction from LaTeX boxes
- `openai` - LLM judge API calls (llm_judge_env only)
## Research Applications
### Cybersecurity Training
- Train models to understand threat detection patterns
- Generate rules for various attack vectors and techniques
- Develop automated threat hunting capabilities
### Structured Generation Research
- Study LLM performance on constrained output formats
- Compare token-based vs. semantic evaluation methods
- Investigate reasoning quality in cybersecurity domains
### Evaluation Methodology Development
- Benchmark different reward function approaches
- Analyze correlation between structural and semantic correctness
- Develop better automated evaluation metrics for domain-specific tasks
## Performance Characteristics
### Jaccard Similarity Results
- **Typical Rewards**: 0.1-0.3 range due to structural sensitivity
- **Strengths**: Fast computation, deterministic scoring
- **Limitations**: Sensitive to formatting differences, low reward density
### LLM Judge Results
- **Typical Rewards**: Binary 0.0/1.0 with higher success rates
- **Strengths**: Semantic understanding, format flexibility
- **Limitations**: API latency, potential inconsistency, cost considerations
## Example Outputs
### Input Prompt
```
DotNET Assembly DLL Loaded Via Office Application: Detects any assembly DLL being loaded by an Office Product
```
### Expected Sigma Rule Format
```yaml
detection:
condition: selection
selection:
process_name:
- excel.exe
- word.exe
- powerpnt.exe
dll_loaded: "*.dll"
logsource:
category: process
product: windows
```
The environment provides a robust framework for training LLMs on cybersecurity detection rule generation with flexible evaluation mechanisms suited for different research objectives.