LANL Dataset Replay

The seerflow.lanl module ships a parser, converter, host-mapper, and validator for the LANL Unified Host and Network Dataset v2. It exists for one reason: to let you verify that the correlation engine actually catches the red-team activity the dataset labels as malicious.

If you contribute to the detection or correlation code, run LANL validation before opening a PR. It is the closest thing Seerflow has to an end-to-end ground-truth benchmark.

Three event types + one label file:

File                       Content
auth.txt / redauth.txt     Authentication events (success / failure)
proc.txt / redproc.txt     Process start / stop events
flows.txt / redflows.txt   Network flow events
redteam.txt                Hand-labeled red-team compromise events (ground truth)

The dataset is gzip-compressed CSV. Records use anonymized identifiers (U13, C457, etc.) for users and hosts.
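To get a feel for the raw files before touching the Seerflow parsers, the sketch below streams a gzip-compressed CSV with the standard library only and counts fields that look like anonymized identifiers. The sample rows are made up, and the `ANON_ID` pattern is an assumption based on the `U13` / `C457` examples above; the real files may use other identifier forms and a different column layout.

```python
import csv
import gzip
import io
import re
from collections import Counter
from typing import Iterable, Iterator, List, TextIO

# Assumption: identifiers look like "U13" (user) or "C457" (host).
ANON_ID = re.compile(r"^[UC]\d+$")


def iter_rows(fh: TextIO) -> Iterator[List[str]]:
    """Yield one CSV row at a time so memory use stays flat."""
    yield from csv.reader(fh)


def count_identifiers(rows: Iterable[List[str]]) -> Counter:
    """Count every field that looks like an anonymized user or host ID."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(field for field in row if ANON_ID.match(field))
    return counts


# Build a tiny gzip-compressed CSV in memory; these rows are invented.
raw = "1,U13,C457,C457\n2,U13,C12,C457\n"
blob = gzip.compress(raw.encode("utf-8"))

# gzip.open accepts a file object as well as a path.
with gzip.open(io.BytesIO(blob), "rt", newline="") as fh:
    counts = count_identifiers(iter_rows(fh))

print(counts.most_common())
```

For a file on disk, pass the path to `gzip.open` in place of the `io.BytesIO` object.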

import gzip
from pathlib import Path

from seerflow.lanl import (
    parse_auth_line, parse_flow_line, parse_proc_line, parse_redteam_line,
    convert_auth_record, convert_flow_record, convert_proc_record,
    host_to_ip,
    run_validation,
)

with gzip.open(Path("auth.txt.gz"), "rt") as fh:
    for line in fh:
        rec = parse_auth_line(line)
        # rec is a frozen AuthRecord

The parse_*_line functions return frozen, slotted dataclasses, which are safe to share across threads and to use as dict / set keys. The parsers keep nothing beyond the record they return, so streaming a multi-GB file line by line runs in constant memory.

from seerflow.lanl import convert_auth_record
event = convert_auth_record(rec)
# event is a SeerflowEvent ready to feed into the pipeline

The converter maps LANL’s anonymized identifiers to deterministic UUID5 entity IDs via host_to_ip, so the same C457 always resolves to the same entity across runs.
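The determinism comes from how UUID5 works: it hashes a namespace plus a name, so the same input always yields the same ID. The sketch below shows that property with the standard `uuid` module. `DEMO_NAMESPACE` and `demo_entity_id` are hypothetical names; the namespace Seerflow actually uses, and the internals of `host_to_ip`, are not specified here.

```python
import uuid

# Hypothetical namespace for illustration; not Seerflow's actual namespace.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lanl.example.invalid")


def demo_entity_id(anon_id: str) -> uuid.UUID:
    """Derive a stable ID from an anonymized identifier such as 'C457'."""
    return uuid.uuid5(DEMO_NAMESPACE, anon_id)


first = demo_entity_id("C457")
second = demo_entity_id("C457")
other = demo_entity_id("C458")

# Same name in the same namespace gives the same UUID on every run.
print(first == second, first == other, first.version)
```

Because no random state is involved, IDs computed in separate processes or on separate machines agree as long as the namespace and the identifier string are the same.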

from seerflow.lanl import run_validation
result = run_validation(
    auth_path="data/auth.txt.gz",
    proc_path="data/proc.txt.gz",
    flow_path="data/flows.txt.gz",
    redteam_path="data/redteam.txt.gz",
)
print(result.precision, result.recall, result.f1)
print(result.detected_compromise_count, "/", result.total_compromise_count)

ValidationResult aggregates precision, recall, F1, and per-entity detection traces against the red-team labels. Use it to flag regressions in the correlation engine or in Sigma rules.
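As a reminder of what those three numbers mean, here is a minimal sketch that scores a set of detected entities against a set of labelled ones. `DemoMetrics` and `score` are hypothetical names, and the validator may match detections to labels differently (for example by time window), so treat this as the textbook definitions only.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class DemoMetrics:
    """Hypothetical container; not the real ValidationResult."""
    precision: float
    recall: float
    f1: float


def score(detected: Set[str], labelled: Set[str]) -> DemoMetrics:
    """Compute set-based precision, recall, and F1."""
    true_positives = len(detected & labelled)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    if true_positives:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return DemoMetrics(precision, recall, f1)


# Invented example: three hosts flagged, four hosts labelled, two overlap.
metrics = score({"C1", "C2", "C3"}, {"C2", "C3", "C4", "C5"})
print(f"{metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")
```

With two true positives, one false positive, and two false negatives, precision is 2/3, recall is 1/2, and F1 is 4/7.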

There is no top-level seerflow lanl command (the dataset workflow is for contributors, not operators). Drive it from a Python script in scripts/ or a notebook.
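A contributor script could look roughly like the sketch below. The `--data-dir` flag and the function names are hypothetical; the `run_validation` keyword arguments, file names, and result attributes are the ones shown in the example above.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical validation script."""
    parser = argparse.ArgumentParser(
        description="Replay LANL data and print detection metrics."
    )
    parser.add_argument(
        "--data-dir",
        type=Path,
        default=Path("data"),
        help="directory holding the gzip-compressed LANL files",
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)

    # Imported lazily so that --help works without Seerflow installed.
    from seerflow.lanl import run_validation

    result = run_validation(
        auth_path=str(args.data_dir / "auth.txt.gz"),
        proc_path=str(args.data_dir / "proc.txt.gz"),
        flow_path=str(args.data_dir / "flows.txt.gz"),
        redteam_path=str(args.data_dir / "redteam.txt.gz"),
    )
    print(result.precision, result.recall, result.f1)
    print(result.detected_compromise_count, "/", result.total_compromise_count)
    return 0


# In a real script, finish with:
# if __name__ == "__main__":
#     raise SystemExit(main())
```

Keeping the argument parsing in `build_parser` makes the script easy to call from a notebook as `main(["--data-dir", "path/to/lanl"])`.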

  • src/seerflow/lanl/parser.py: record types
  • src/seerflow/lanl/converter.py: SeerflowEvent mapping
  • src/seerflow/lanl/validator.py: metric computation
  • LANL paper: Turcotte et al., Unified Host and Network Data Set, 2017