LANL Dataset Replay

The seerflow.lanl module ships a parser, converter, host-mapper, and validator for the LANL Unified Host and Network Dataset v2. It exists for one reason: to let you verify that the correlation engine actually catches the red-team activity the dataset labels as malicious.

If you contribute to the detection or correlation code, run LANL validation before opening a PR. It is the closest thing Seerflow has to an end-to-end ground-truth benchmark.

What the dataset is

The dataset contains three event types plus one label file:

| File | Content |
| --- | --- |
| `auth.txt` / `redauth.txt` | Authentication events (success / failure) |
| `proc.txt` / `redproc.txt` | Process start / stop events |
| `flows.txt` / `redflows.txt` | Network flow events |
| `redteam.txt` | Hand-labeled red-team compromise events (ground truth) |

Each file is distributed as gzip-compressed CSV (for example, auth.txt.gz). Records use anonymized identifiers (U13, C457, etc.) for users and hosts.
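Before involving the parser, it can help to look at a few raw rows. The sketch below uses only the standard library; `peek_rows` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the dataset, so the column layout shown is an assumption.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice


def peek_rows(path, limit=5):
    """Return the first `limit` comma-separated rows of a gzip-compressed file."""
    # "rt" decodes the gzip stream to text lazily, so only the
    # requested rows are decompressed, not the whole file.
    with gzip.open(path, "rt", newline="") as fh:
        return list(islice(csv.reader(fh), limit))


# Demonstration on a throwaway file with made-up rows (not real LANL records).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt.gz")
    with gzip.open(sample, "wt") as fh:
        fh.write("1,U13,C457\n2,U13,C457\n")
    rows = peek_rows(sample, limit=1)
```

Because `islice` stops after `limit` rows, the same call is cheap even on a multi-GB file.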

API

from seerflow.lanl import (
    parse_auth_line, parse_flow_line, parse_proc_line, parse_redteam_line,
    convert_auth_record, convert_flow_record, convert_proc_record,
    host_to_ip,
    run_validation,
)

Parsing

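from pathlib import Path
import gzip

from seerflow.lanl import parse_auth_line

# Open in text mode ("rt") so each iteration yields one decoded line.
with gzip.open(Path("auth.txt.gz"), "rt") as fh:
    for line in fh:
        rec = parse_auth_line(line)
        # rec is a frozen AuthRecord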

The parse_*_line functions return frozen, slotted dataclasses, which are safe to share across threads and usable as dict or set keys. They allocate nothing beyond the record itself, so streaming a multi-GB file runs in constant memory as long as you do not hold on to the records.

Conversion to `SeerflowEvent`

from seerflow.lanl import convert_auth_record

# rec is an AuthRecord returned by parse_auth_line (see Parsing)
event = convert_auth_record(rec)
# event is a SeerflowEvent ready to feed into the pipeline

The converter handles the mapping from LANL’s anonymized identifiers to deterministic UUID5 entity IDs via host_to_ip, so the same C457 always resolves to the same entity across runs.
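The determinism comes from UUID5 itself, which hashes a namespace and a name instead of drawing random bits. The sketch below shows the property with the standard library only; `EXAMPLE_NAMESPACE` and `entity_id` are hypothetical names, and the namespace Seerflow actually uses is not stated here.

```python
import uuid

# Hypothetical namespace for illustration; Seerflow's real one may differ.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lanl.example.invalid")


def entity_id(identifier: str) -> uuid.UUID:
    """Derive a stable entity ID from an anonymized LANL identifier."""
    # UUID5 is a SHA-1 hash of namespace + name, so there is no
    # randomness and no lookup table to persist between runs.
    return uuid.uuid5(EXAMPLE_NAMESPACE, identifier)


first = entity_id("C457")
second = entity_id("C457")
other = entity_id("C458")
```

Here `first` and `second` are equal while `other` differs, which is the behavior that lets two separate validation runs agree on entity identity.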

Validation

from seerflow.lanl import run_validation

result = run_validation(
    auth_path="data/auth.txt.gz",
    proc_path="data/proc.txt.gz",
    flow_path="data/flows.txt.gz",
    redteam_path="data/redteam.txt.gz",
)
print(result.precision, result.recall, result.f1)
print(result.detected_compromise_count, "/", result.total_compromise_count)

ValidationResult aggregates precision, recall, F1, and per-entity detection traces against the red-team labels. Use it to flag regressions in the correlation engine or in Sigma rules.
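For readers who want to sanity-check the reported numbers, the standard definitions of precision, recall, and F1 can be sketched as follows. Scoring at entity granularity and the `score` helper are assumptions for illustration; how ValidationResult computes its fields internally is not specified here.

```python
def score(detected: set[str], labeled: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for detected vs. ground-truth entities."""
    true_positives = len(detected & labeled)
    # Guard each ratio against an empty denominator.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up identifiers: 2 of 3 detections are correct,
# and 2 of 4 labeled entities are found.
precision, recall, f1 = score(
    detected={"C457", "C17", "C92"},
    labeled={"C457", "C17", "C1003", "C5"},
)
```

A drop in recall between two runs on the same data means the engine is missing compromises it used to catch; a drop in precision means it has started flagging benign entities.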

CLI

There is no top-level seerflow lanl command (the dataset workflow is for contributors, not operators). Drive it from a Python script in scripts/ or a notebook.
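A contributor script might look like the sketch below. The `--data-dir` and `--min-recall` flags and the non-zero exit on low recall are assumptions added for illustration; only the `run_validation` keyword arguments and the `result` attributes come from the Validation section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical LANL validation script."""
    parser = argparse.ArgumentParser(description="Run LANL validation.")
    parser.add_argument("--data-dir", default="data",
                        help="directory holding the gzip-compressed files")
    parser.add_argument("--min-recall", type=float, default=0.0,
                        help="exit non-zero if recall falls below this value")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Imported lazily so --help works even without Seerflow installed.
    from seerflow.lanl import run_validation

    result = run_validation(
        auth_path=f"{args.data_dir}/auth.txt.gz",
        proc_path=f"{args.data_dir}/proc.txt.gz",
        flow_path=f"{args.data_dir}/flows.txt.gz",
        redteam_path=f"{args.data_dir}/redteam.txt.gz",
    )
    print(result.precision, result.recall, result.f1)
    # Hypothetical regression gate: fail when recall drops too low.
    return 0 if result.recall >= args.min_recall else 1
```

Saved under scripts/ with a name of your choosing, the file would end with the usual `if __name__ == "__main__":` guard calling `raise SystemExit(main())`, so the exit status reflects the gate.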
