Cybersecurity Sigma Rule Generation Environment

This environment trains LLMs to generate semantically correct Sigma detection rules from threat-hunting prompts. It provides two different reward mechanisms for evaluating generated rules.

Overview

The environment focuses on structured generation tasks where outputs must be valid YAML conforming to Sigma detection rule schemas. It includes two implementations with different reward functions:

  1. Jaccard Similarity Reward (jaccard_reward_env.py) - Uses token-based similarity scoring
  2. LLM Judge Reward (llm_judge_env.py) - Uses LLM-based semantic evaluation

Core Features

Dataset Integration

  • Uses the mmaisel1/nous-rl-hackathon-sigma dataset from Hugging Face
  • Contains threat-hunting prompts paired with corresponding Sigma rules
  • Automatic train/test split with seeded shuffling for reproducibility
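
A minimal loading sketch using the datasets library; the split fraction and seed below are illustrative assumptions, not values taken from the environment:

from datasets import load_dataset

# Load the prompt/rule pairs from Hugging Face.
dataset = load_dataset("mmaisel1/nous-rl-hackathon-sigma", split="train")

# Seeded shuffle plus a held-out slice, mirroring the automatic train/test split.
split = dataset.shuffle(seed=42).train_test_split(test_size=0.1)
train_data, test_data = split["train"], split["test"]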

Structured Output Format

  • Enforces specific output format with <think>...</think> reasoning tags
  • Requires YAML output wrapped in a LaTeX-style \boxed{...} command
  • Validates YAML syntax and Sigma rule structure
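
For concreteness, a schematic completion in the expected shape (the rule body here is illustrative, not dataset content):

<think>
...step-by-step reasoning about the threat-hunting prompt...
</think>
\boxed{
title: Example Rule
logsource:
  category: process
  product: windows
detection:
  selection:
    process_name: excel.exe
  condition: selection
}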

Dual Reward Mechanisms

Jaccard Similarity Scoring

  • Compares the flattened key paths of the gold and generated YAML under the detection: section
  • Uses scikit-learn's Jaccard similarity for token-based matching
  • Tends to produce low and sparse rewards due to structural mismatches
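
A hedged sketch of the flatten-and-compare idea; the dotted-path scheme and helper names are assumptions rather than the environment's exact code:

from sklearn.metrics import jaccard_score
import yaml

def flatten_keys(node, prefix=""):
    """Recursively collect dotted key paths from a parsed YAML mapping."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.add(path)
            paths |= flatten_keys(value, path)
    return paths

def jaccard_reward(gold_yaml: str, generated_yaml: str) -> float:
    gold = flatten_keys((yaml.safe_load(gold_yaml) or {}).get("detection", {}))
    generated = flatten_keys((yaml.safe_load(generated_yaml) or {}).get("detection", {}))
    vocab = sorted(gold | generated)
    if not vocab:
        return 0.0
    # Binary indicator vectors over the union of paths, scored with scikit-learn.
    gold_vec = [int(p in gold) for p in vocab]
    gen_vec = [int(p in generated) for p in vocab]
    return jaccard_score(gold_vec, gen_vec)

Because any mismatched key path shrinks the intersection, scores cluster near zero, which is the reward sparsity noted above.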

LLM-as-a-Judge Scoring

  • Uses binary LLM evaluation for semantic equivalence assessment
  • Returns 1.0 if the generated rule is functionally equivalent to the gold standard, 0.0 otherwise
  • Provides higher-fidelity supervision even when structure varies
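
A hedged sketch of a binary judge call; the prompt wording and judge model are illustrative assumptions, not the environment's exact implementation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = (
    "You review Sigma detection rules. Answer with exactly YES if the candidate "
    "rule is functionally equivalent to the gold rule, otherwise answer NO.\n\n"
    "Gold rule:\n{gold}\n\nCandidate rule:\n{candidate}"
)

def judge_reward(gold: str, candidate: str, model: str = "gpt-4o-mini") -> float:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(gold=gold, candidate=candidate)}],
        temperature=0.0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0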

Advanced Features

  • Length penalty system for overly verbose outputs (sketched after this list)
  • Comprehensive evaluation metrics tracking
  • W&B integration for experiment monitoring
  • Configurable token limits and batch sizes
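
One way the length penalty could work; the soft/hard thresholds and the linear ramp below are assumptions:

def length_penalty(num_tokens: int, soft_limit: int = 1536, hard_limit: int = 2048) -> float:
    """Multiplier that decays linearly once a completion exceeds the soft limit."""
    if num_tokens <= soft_limit:
        return 1.0
    if num_tokens >= hard_limit:
        return 0.0
    return 1.0 - (num_tokens - soft_limit) / (hard_limit - soft_limit)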

Technical Implementation

Environment Configuration

  • Model: NousResearch/DeepHermes-3-Llama-3-3B-Preview
  • Max Token Length: 2048 tokens
  • Group Size: 8 completions per prompt
  • Batch Size: 12 items per batch
  • Evaluation Frequency: Every 100 steps
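
The same settings as an illustrative Python config; the field names are assumptions, not Atropos's exact schema:

config = dict(
    model_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
    max_token_length=2048,   # maximum tokens per completion
    group_size=8,            # completions sampled per prompt
    batch_size=12,           # items per batch
    steps_per_eval=100,      # run evaluation every 100 steps
)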

System Prompt Structure

The environment uses a detailed system prompt that:

  • Enforces structured reasoning with <think> tags
  • Requires YAML output inside a \boxed{} wrapper
  • Provides Sigma rule best practices and examples
  • Specifies exact formatting requirements for parser compatibility
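
An abridged, illustrative prompt in this style (the environment's actual wording is longer and differs):

SYSTEM_PROMPT = (
    "You are a detection engineer writing Sigma rules. First reason step by "
    "step inside <think>...</think> tags. Then output ONLY the final rule as "
    "valid YAML wrapped in \\boxed{...}, with logsource and detection sections, "
    "so the parser can extract it."
)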

Scoring Pipeline

  1. Extraction: Parse YAML from \boxed{} wrapper using regex
  2. Validation: Attempt YAML parsing and structure validation
  3. Evaluation: Apply either Jaccard similarity or LLM judge scoring
  4. Aggregation: Collect scores for batch-level reward computation
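
A minimal stand-in for the extraction and validation steps; the environment relies on math_verify for box extraction, so the plain regex here is a simplification:

import re
import yaml

BOXED_RE = re.compile(r"\\boxed\{(.*)\}", re.DOTALL)

def extract_rule(completion: str):
    """Return the parsed YAML rule from \\boxed{...}, or None if invalid."""
    match = BOXED_RE.search(completion)
    if match is None:
        return None
    try:
        parsed = yaml.safe_load(match.group(1))
    except yaml.YAMLError:
        return None
    # Basic structure check: a Sigma rule needs a detection section.
    return parsed if isinstance(parsed, dict) and "detection" in parsed else None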

Setup and Usage

Environment Variables

export OPENAI_API_KEY="your-openai-api-key"  # For LLM judge (optional)
export NOUS_API_KEY="your-nous-api-key"     # For model inference

Command Line Usage

# Jaccard similarity reward
python environments/community/cybersecurity_sigma/jaccard_reward_env.py

# LLM judge reward
python environments/community/cybersecurity_sigma/llm_judge_env.py

Dependencies

  • datasets - Hugging Face dataset loading
  • scikit-learn - Jaccard similarity computation (jaccard_reward_env only)
  • latex2sympy2_extended - LaTeX parsing utilities
  • math_verify - YAML extraction from LaTeX boxes
  • openai - LLM judge API calls (llm_judge_env only)

Research Applications

Cybersecurity Training

  • Train models to understand threat detection patterns
  • Generate rules for various attack vectors and techniques
  • Develop automated threat hunting capabilities

Structured Generation Research

  • Study LLM performance on constrained output formats
  • Compare token-based vs. semantic evaluation methods
  • Investigate reasoning quality in cybersecurity domains

Evaluation Methodology Development

  • Benchmark different reward function approaches
  • Analyze correlation between structural and semantic correctness
  • Develop better automated evaluation metrics for domain-specific tasks

Performance Characteristics

Jaccard Similarity Results

  • Typical Rewards: 0.1-0.3 range due to structural sensitivity
  • Strengths: Fast computation, deterministic scoring
  • Limitations: Sensitive to formatting differences, low reward density

LLM Judge Results

  • Typical Rewards: Binary 0.0/1.0 with higher success rates
  • Strengths: Semantic understanding, format flexibility
  • Limitations: API latency, potential inconsistency, cost considerations

Example Outputs

Input Prompt

DotNET Assembly DLL Loaded Via Office Application: Detects any assembly DLL being loaded by an Office Product

Expected Sigma Rule Format

detection:
  condition: selection
  selection:
    process_name:
      - excel.exe
      - word.exe
      - powerpnt.exe
    dll_loaded: "*.dll"
logsource:
  category: process
  product: windows

The environment provides a robust framework for training LLMs on cybersecurity detection rule generation with flexible evaluation mechanisms suited for different research objectives.