Agentic Triage
A finished scan typically produces far more matches than you have time to read. Most are real, some are noise (test fixtures, sample data, vendor licence files), and the only way to a clean report is to label every one of them. The TUI's review mode is fast for an experienced user, but it still costs human attention per finding.
Agentic triage moves that loop to a language model. PII Crawler ships a JSON-first CLI under piicrawler findings that is designed to be driven by an LLM agent: pull unreviewed matches as JSON, classify each one, and write the verdicts back in a single atomic batch. The verdicts land in the same scan_false_positives table the TUI and web UI use, so you can hand off mid-flight without losing work.
This guide walks the full loop: the CLI surface, a worked example, and the practical tips for getting good signal out of an LLM.
Keep the data on infrastructure you trust
The whole point of triage is to look at sensitive matches: SSNs, payroll records, customer email addresses, the surrounding sentences they appear in. Sending that context to a third-party LLM endpoint means handing real PII to whoever runs the endpoint. Treat the model as part of your data perimeter, not an external service:
- Local models (Ollama, llama.cpp, vLLM, LM Studio) keep every byte on the machine running the scan. This is the default recommendation, especially for regulated data (HIPAA, GDPR, CCPA, internal compliance regimes).
- Private cloud endpoints (AWS Bedrock, Azure OpenAI, Google Vertex AI, Anthropic via your own enterprise tenancy) keep the data inside an account you control, under a written agreement that prohibits training on your inputs. Confirm "no training" and data-residency terms in writing before pointing the agent at production findings.
- Public consumer APIs are not appropriate for raw scan context. If you can't avoid one, pass --redact so SSN and DOB digits never leave the binary in plaintext (see Redacting sensitive terms below). Redaction is a backstop, not a replacement for keeping the data inside your perimeter.
The Python example below uses the public Anthropic API for brevity. For a real triage pass, swap the Anthropic() client for a Bedrock, Vertex, or local-runtime client of your choosing — the JSON contract on either side of the model is identical.
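For a local runtime the swap is small: Ollama and vLLM both expose OpenAI-compatible HTTP endpoints, so an OpenAI-SDK client pointed at localhost does the job. A minimal sketch, assuming Ollama's default port and a locally pulled model (both are placeholders to adjust):

from openai import OpenAI

# Everything below stays on this machine: Ollama serves an
# OpenAI-compatible API on its default port.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

prompt = "..."  # the same triage prompt the worked example builds below

resp = client.chat.completions.create(
    model="llama3.1",  # assumption: whatever model you have pulled locally
    messages=[{"role": "user", "content": prompt}],
)
text = resp.choices[0].message.content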
When to use an agent vs. the TUI
An LLM is a good fit when the call is largely about context: "is this 123-45-6789 in a unit test, an HR spreadsheet, or a customer letter?" The model reads the surrounding text and decides. It is a poor fit when the call is about company knowledge ("our test SSN is 987-65-4321, anything else is real") — encode those rules at text scope yourself first, then let the agent grind through the long tail.
A pragmatic split:
- You mark obvious noise files at file scope (F in the TUI) — vendor SDKs, fixtures, node_modules.
- You mark known placeholder values at text scope — 123-45-6789, [email protected].
- The agent triages whatever is left, one match at a time, with the surrounding context as input.
- You spot-check the model's verdicts and export the clean CSV.
The CLI is symmetric: findings unmark reverses anything the agent gets wrong, and findings stats lets the agent (and you) check progress without re-reading every match.
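Concretely, reversing one bad verdict and re-checking progress takes two commands (match 101 is the example finding shown in the findings list output below):

piicrawler findings unmark --scan 42 --match 101 --verdict fp
piicrawler findings stats --scan 42 --json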
The CLI surface
Four commands cover the whole loop. The full flag reference lives in the piicrawler findings CLI reference; the shapes below are the parts an agent needs.
findings list — read
piicrawler findings list --scan 42 --json --limit 50
Emits one JSON object per match. Defaults to --verdict unreviewed, which is the natural starting point for a triage pass. --context surrounding (the default) returns a small snippet of text around the match — enough for an LLM to decide, without burning tokens on full pages.
{
"id": 101,
"scan_id": 42,
"file_id": 7,
"file_path": "/data/hr/2026-q1.csv",
"pii_type": "ssn",
"term": "123-45-6789",
"start": 1024,
"end": 1035,
"context": { "surrounding": "...Employee SSN: 123-45-6789 filed on 2026-02-04..." },
"verdict": "unreviewed"
}
Use --limit and --offset to page through large scans. --pii-type ssn --pii-type email narrows the batch to one or two types so you can specialise the prompt.
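For example, the second page of an SSN-only pass:

piicrawler findings list --scan 42 --json --pii-type ssn --limit 50 --offset 50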
findings mark — write
Three selectors mirror the TUI scopes:
| Selector | Scope | Use when |
|---|---|---|
| --match <id> | one finding | The model has read this exact context |
| --text <pii_type> <term> | every match of that value across the scan | A placeholder like 000-00-0000 is FP everywhere |
| --file <id> | every match in a file | The whole file is noise (a fixture, a vendor manifest) |
For an agent that decides one finding at a time, --match is the right selector. The text and file scopes are useful when a human (or a smarter agent) can make a generalisation across many matches in one call.
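Spelled out, the three scopes look like this, using the --verdict flag that mark shares with unmark:

piicrawler findings mark --scan 42 --match 101 --verdict fp
piicrawler findings mark --scan 42 --text ssn 000-00-0000 --verdict fp
piicrawler findings mark --scan 42 --file 9 --verdict fp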
findings mark --from-json — bulk write
This is the path agents should default to. One file, one transaction:
piicrawler findings mark --scan 42 --from-json verdicts.json
[
{ "match_id": 101, "verdict": "fp" },
{ "match_id": 102, "verdict": "tp" },
{ "text": { "pii_type": "ssn", "term": "000-00-0000" }, "verdict": "fp" },
{ "file_id": 9, "verdict": "tp" }
]
Each entry carries its own verdict and exactly one selector. If any entry is malformed, the CLI exits non-zero and no verdicts are written. That all-or-nothing behaviour means a flaky agent run never leaves the database half-updated.
--from-json - reads JSON from stdin, which is what you want when the agent is the upstream process:
my-agent --scan 42 | piicrawler findings mark --scan 42 --from-json -
findings stats — check progress
piicrawler findings stats --scan 42 --json
{
"scan_id": 42,
"totals": { "unreviewed": 120, "false_positive": 32, "true_positive": 8 },
"by_pii_type": [
{ "pii_type": "ssn", "unreviewed": 50, "false_positive": 10, "true_positive": 5 },
{ "pii_type": "email", "unreviewed": 70, "false_positive": 22, "true_positive": 3 }
]
}
A natural agent loop terminates when totals.unreviewed == 0, or when a per-type budget is hit.
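A sketch of that termination check, shelling out to stats (the helper name is ours, not part of the CLI):

import json, subprocess

def unreviewed_remaining(scan_id: int) -> int:
    out = subprocess.check_output(
        ["piicrawler", "findings", "stats", "--scan", str(scan_id), "--json"],
        text=True,
    )
    # Shape matches the stats JSON above: totals.unreviewed counts what's left.
    return json.loads(out)["totals"]["unreviewed"]

# Stop the agent loop once nothing is left (or once a budget runs out).
if unreviewed_remaining(42) == 0:
    print("triage complete")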
A complete worked example
The agent below uses the Anthropic SDK to triage the scan batch by batch until no unreviewed matches remain. It is deliberately small so the moving parts are visible.
import json, subprocess
from anthropic import Anthropic
SCAN_ID = 42
BATCH = 25
client = Anthropic()
def fetch_unreviewed():
out = subprocess.check_output([
"piicrawler", "findings", "list",
"--scan", str(SCAN_ID),
"--json", "--limit", str(BATCH),
"--context", "surrounding",
# Drop --redact when running against a local or private-cloud model.
"--redact",
])
    # `findings list --json` emits one JSON object per match; parse each
    # line independently rather than the stream as a whole.
    return [json.loads(line) for line in out.splitlines() if line.strip()]
def classify(matches):
prompt = (
"For each match, decide if it is real PII (tp) or noise (fp). "
"Noise includes test fixtures, sample data, vendor licences, and "
"obvious placeholders. Reply with a JSON array of "
'{"match_id": <id>, "verdict": "fp"|"tp"} entries — nothing else.\n\n'
+ json.dumps(matches, indent=2)
)
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
)
return json.loads(resp.content[0].text)
def apply(verdicts):
proc = subprocess.run(
["piicrawler", "findings", "mark",
"--scan", str(SCAN_ID), "--from-json", "-"],
input=json.dumps(verdicts),
text=True, check=True, capture_output=True,
)
print(proc.stdout)
while True:
batch = fetch_unreviewed()
if not batch:
break
apply(classify(batch))
Run it, then check progress:
piicrawler findings stats --scan 42
Two implementation notes that matter in practice:
- Pin the JSON shape in the prompt. Models drift if you ask them to "return verdicts." Specifying the exact array shape, and parsing it strictly on return (a minimal parser is sketched after this list), surfaces drift early instead of letting a malformed batch silently corrupt the run.
- One transaction per batch. --from-json rolls the whole batch back if any entry is malformed. That is the property you want — a partial write would mean re-pulling a mixed list of "already done by the agent" and "not yet" findings on the next iteration.
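The strict parse the first note asks for can be a few lines. A sketch, where matches is the batch sent to the model and reply_text is the model's raw reply:

import json

def parse_verdicts(reply_text: str, matches: list) -> list:
    ids = {m["id"] for m in matches}
    verdicts = json.loads(reply_text)  # raises on anything non-JSON
    for v in verdicts:
        # Reject unknown verdicts and IDs outside the batch before they
        # ever reach --from-json.
        if v["verdict"] not in ("fp", "tp") or v["match_id"] not in ids:
            raise ValueError(f"model drifted: {v!r}")
    return verdicts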
How the verdict model behaves
Verdicts written from the CLI are read by every other surface (TUI, web UI, HTML report, dsar). A few rules to keep in mind when an agent is making decisions:
- Specificity wins on read. A match-scope verdict overrides a text-scope verdict, which overrides a file-scope verdict at the same location. Mass-marking --text ssn 000-00-0000 fp and then later overriding one occurrence with --match 42 tp works the way you expect.
- unmark takes the same selectors as mark. It also requires --verdict <fp|tp>, because the FP and TP rows are stored separately.
- --from-json ignores any --verdict flag on the command line. Each entry's own verdict field is the source of truth.
- Re-running a scan reapplies text- and file-scope verdicts. Match-scope verdicts only attach to the specific match they were written for, so a re-extracted file produces fresh unreviewed matches at match scope. This is intentional: cheap rules generalise, precise rules don't.
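Because specificity wins on read, both of these entries can coexist in one --from-json batch: the text-scope rule blankets the scan, and the match-scope entry carves out the one occurrence the model judged real:

[
  { "text": { "pii_type": "ssn", "term": "000-00-0000" }, "verdict": "fp" },
  { "match_id": 42, "verdict": "tp" }
]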
Redacting sensitive terms before they leave the machine
The first line of defense is keeping the data on infrastructure you trust. The second line, for cases where you can't avoid an external endpoint, is format-preserving redaction: PII Crawler can rewrite the most sensitive terms in findings list output before they ever leave the binary, so what reaches the model is structurally identical but no longer the original digits.
Pass --redact to opt in:
piicrawler findings list --scan 42 --redact --json --limit 50
What changes in the output:
- Every ssn term has its digits replaced with hash-derived digits, preserving dashes (123-45-6789 becomes something like 847-23-9182). Length and separator positions are unchanged.
- Every dob term has its digits replaced the same way, preserving slashes, dashes, or dots (01/15/1985 becomes something like 93/47/2068, January 15, 1985 becomes January XX, XXXX). Date validity is not preserved; the agent doesn't need it.
- Every occurrence of those plaintext terms inside context.surrounding (and line/paragraph under --context full) is rewritten with the same redacted form, so a snippet like "Employee SSN: 123-45-6789 filed..." becomes "Employee SSN: 847-23-9182 filed...". The agent still reads the surrounding sentence and can decide.
- Redaction works at two levels for context strings:
  - The substitution table is seeded with every distinct ssn/dob term recorded across the whole scan, not just the paginated subset. So an SSN that lives in match A's surrounding text but is itself match #500 (outside --limit 50) still gets redacted.
  - Each context string is additionally swept with the SSN/DOB detection regexes. Anything matching the dashed SSN pattern (XXX-XX-XXXX) or a date shape (US/ISO/month-name) gets redacted on the fly, even if it was never recorded as a match. This catches placeholders the original scanner filtered out (famous test SSNs like 987-65-4321, formats the anchored detector didn't trust) before they can leak.
- All other PII types (email, phone, credit card, etc.) pass through unchanged. Redaction is intentionally narrow — broaden it only when a real triage workflow requires it.
The mapping is deterministic per database: a 32-byte secret is generated the first time --redact is used, stored in the local kv_store, and reused for every subsequent run. That means:
- The same 123-45-6789 always redacts to the same value within your database, so the agent still sees "this exact value repeats 400 times across the scan" — the placeholder signal you actually want.
- Two different databases produce different mappings, so a redacted output leaked from one machine is useless for fingerprinting values in another.
- The secret never leaves the machine. If the local SQLite file is compromised, the redaction is moot — but in that case the scan itself is the bigger problem.
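The mechanism is easy to picture with a sketch of hash-derived, format-preserving digit replacement keyed by a per-database secret. Illustrative only, not the shipped algorithm:

import hmac, hashlib

def redact_digits(term: str, secret: bytes) -> str:
    # Derive a stable digest from the secret and the original term, then
    # replace each digit with a digest-derived digit. Separators (dashes,
    # slashes, dots) pass through untouched, so the shape survives.
    digest = hmac.new(secret, term.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for ch in term:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)

# Deterministic per secret: repeats redact identically, so frequency
# signals ("this value appears 400 times") survive redaction.
secret = b"\x00" * 32  # in the real tool this lives in the local kv_store
assert redact_digits("123-45-6789", secret) == redact_digits("123-45-6789", secret)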
What redaction does not protect
- file_path. Paths often contain identifying information (/srv/hr/jane-doe-i9.pdf) and pass through unchanged. If your paths embed PII, sanitize them upstream or strip them from the JSON before sending (a one-line strip is sketched after this list).
- Other PII types. Email addresses, phone numbers, names, and credit card numbers are not redacted in v1. If your context contains these and you can't run a local model, post-process the JSON before it leaves the machine.
- SSNs in non-dashed form. The context-level regex pass covers the dashed XXX-XX-XXXX shape only. Bare 9-digit SSNs (123456789) are redacted when they were themselves recorded as matches in the scan, but a bare 9-digit SSN that only appears inside another match's surrounding text and was never recorded is not caught — the bare 9-digit pattern matches too many non-SSN integer sequences to safely redact unanchored.
- Inferred values from surrounding sentences. If a context snippet says "Jane Doe, born January 1985," the model still sees the name and the partial date even when the explicit dob term is rewritten. Redaction is structural, not semantic.
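If paths are the concern, dropping the field before a batch reaches the model is a one-liner in the worked example's terms (batch is the list fetch_unreviewed returns):

batch = [{k: v for k, v in m.items() if k != "file_path"} for m in batch]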
Agent constraints when redaction is on
The agent must write back via match-scope selectors only:
[{ "match_id": 42, "verdict": "fp" }]
Text-scope (--text ssn 123-45-6789) round-trips the term through the database, and the database only knows the original. A redacted term won't match anything. File-scope (--file 7) still works because it references a numeric file ID. In practice, an LLM agent inspecting one finding at a time should be using match-scope anyway.
findings stats and findings unmark --match work normally with redacted runs — the verdict and stats tables operate on IDs, not terms.
Tuning the agent
A few knobs that change cost and quality:
- Pick a context mode that fits the budget. --context surrounding (the default) is the right balance for most agents. --context full adds the line and paragraph the match was extracted from, useful for narrative documents where one sentence isn't enough. --context none is useful for cheap counts and second-pass classifiers that only need IDs.
- Filter by PII type. A prompt that triages only ssn can be sharper than a prompt that handles every type. Run multiple passes with different prompts: --pii-type ssn, then --pii-type email, then everything else.
- Page in batches. Don't pull 10,000 matches in one findings list. A batch of 25 to 100 fits comfortably in a single LLM call and keeps the cost of a bad classification small.
- Spot-check the agent's verdicts. Open the TUI's findings view, press h to show false positives, and skim the FP rows the agent produced. unmark lets you correct anything that looks wrong without rerunning the whole pass.
See also
- Triaging Findings — the human-driven workflow this guide builds on
- piicrawler findings CLI reference — every flag and JSON shape
- Results Storage — schema for the shared scan_false_positives table
- DSAR Walkthrough — turn the triaged data into a person-centric report