Results Storage
PII Crawler stores everything it produces — scans, findings, triage verdicts, terms lists, custom regex patterns, proximity groups, identity associations, watch policies and violations, SMB credentials — in a single SQLite database:
| Platform | Database location |
|---|---|
| Linux | ~/.piicrawler/piicrawler.db |
| macOS | ~/.piicrawler/piicrawler.db |
| Windows | %USERPROFILE%\.piicrawler\piicrawler.db (for example C:\Users\Alice\.piicrawler\piicrawler.db) |
The shell examples on this page use the Linux/macOS path; on Windows substitute %USERPROFILE%\.piicrawler\piicrawler.db.
There is one database per user, not one per scan. Use the scan_id foreign key on each table to scope a query to a single scan.
PII Crawler intentionally exposes this database so you can:
- Run custom queries against findings without going through the TUI or web UI
- Build alerts and pipelines on top of scan results and watch violations
- Back up, archive, or sync the file with standard SQLite tooling
- Inspect exactly what the application persists
You can think of the database as the application's API surface.
Schema and migrations
The schema is owned by an embedded migration runner (rusqlite_migration) and tracked in PRAGMA user_version. PII Crawler applies any pending migrations on every startup, so the file is always at the latest schema for the binary running on the machine.
To dump the current schema authoritatively, ask SQLite directly:
sqlite3 ~/.piicrawler/piicrawler.db .schema
The CREATE TABLE statements quoted in this document reflect the schema at the time of writing. Treat .schema as the source of truth — this page covers the load-bearing tables and their relationships, not every column verbatim.
Conceptual model
A scans row is the parent record for one invocation of "scan this path." Hung off it:
scan_files— one row per file the scan walked.scan_matches— one row per PII finding, FK to bothscansandscan_files.scan_false_positives— triage verdicts for matches in this scan (FP and TP).scan_pii_types,scan_exclusions,scan_extensions— per-scan configuration captured at scan-creation time.
Configuration that is shared across scans (terms lists, custom regex patterns, proximity groups, SMB credentials) lives in standalone tables. Each scan that uses one is linked through a join table — scan_custom_regex_patterns, scan_proximity_groups — so the same regex or proximity group can be reused across many scans without duplication.
Two subsystems hang off scans with their own parallel hierarchies:
- Identity scans (DSAR linking) —
identity_runsis hung off a scan, withidentity_names,identity_associations,identity_unassociated, andidentity_false_positivesunderneath. - Watch mode is independent of scans —
watch_policiesis global;watch_violationsare produced by the watch daemon and FK back to a policy.
Core scan tables
scans
CREATE TABLE scans (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
path TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
files_total INTEGER NOT NULL DEFAULT 0,
files_scanned INTEGER NOT NULL DEFAULT 0,
findings_total INTEGER NOT NULL DEFAULT 0,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now')),
enumeration_started_at TEXT,
enumeration_finished_at TEXT,
scan_started_at TEXT,
scan_finished_at TEXT,
terms_list_id INTEGER REFERENCES terms_lists(id) ON DELETE SET NULL,
scan_type TEXT NOT NULL DEFAULT 'local',
smb_server TEXT,
smb_share TEXT,
smb_subfolder TEXT,
smb_credential_id INTEGER,
max_file_size_bytes INTEGER,
disable_ocr INTEGER NOT NULL DEFAULT 0,
accumulated_duration_ms INTEGER NOT NULL DEFAULT 0,
session_started_at TEXT
);
Notes:
statuscycles throughpending→enumerating→scanning→complete(orerror/cancelled).scan_typeislocalfor a filesystem scan;smbplussmb_server/smb_share/smb_credential_idfor a network share scan.accumulated_duration_msplussession_started_attrack wall-clock time across pause/resume — query their sum to get total scan time.
scan_files
CREATE TABLE scan_files (
id INTEGER PRIMARY KEY AUTOINCREMENT,
scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
path TEXT NOT NULL,
size INTEGER,
extension TEXT,
status TEXT NOT NULL DEFAULT 'pending',
error TEXT,
scanned_at TEXT,
duration_ms INTEGER,
deleted_at TEXT,
UNIQUE(scan_id, path)
);
One row per file the scan walked. status is pending / scanning / done / error. error holds the extraction or scan error message when status = 'error'. deleted_at is set when the file is deleted from disk through the app, and kept in sync when the Files view notices a path has disappeared (or reappeared); it powers the Files view's Deleted visibility filter.
scan_matches
CREATE TABLE scan_matches (
id INTEGER PRIMARY KEY AUTOINCREMENT,
scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
file_id INTEGER NOT NULL REFERENCES scan_files(id) ON DELETE CASCADE,
pii_type TEXT NOT NULL,
term TEXT NOT NULL,
start_pos INTEGER NOT NULL,
end_pos INTEGER NOT NULL,
context_line TEXT,
context_paragraph TEXT,
context_surrounding TEXT
);
One row per PII finding. pii_type uses the slugs documented in Watch Mode & Policies → PII type slugs (ssn, email, regex-<slug>, etc.). term is the actual matched string (unmasked — see Security for what data leaves the machine; this column never does). start_pos / end_pos are character offsets into the extracted text. The three context_* columns are progressively wider snippets from the source for triage and reporting.
scan_pii_types, scan_exclusions, scan_extensions
Per-scan configuration captured when the scan was created.
| Table | Purpose |
|---|---|
scan_pii_types |
Which PII types this scan was configured to detect (subset of the data types). |
scan_exclusions |
Path or glob patterns excluded from this scan — see Exclusion Patterns. |
scan_extensions |
File extensions the scan was restricted to (empty = all supported types). |
All three FK back to scans.id with ON DELETE CASCADE.
Triage: scan_false_positives
Triage verdicts (FP and TP) live in scan_false_positives. Despite the name, the table holds both verdicts since migration M003 — read the verdict column to distinguish them.
CREATE TABLE scan_false_positives (
id INTEGER PRIMARY KEY AUTOINCREMENT,
scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
scope TEXT NOT NULL,
match_id INTEGER REFERENCES scan_matches(id) ON DELETE CASCADE,
file_id INTEGER REFERENCES scan_files(id) ON DELETE CASCADE,
pii_type TEXT,
term TEXT,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
verdict TEXT NOT NULL DEFAULT 'false_positive'
CHECK (verdict IN ('false_positive', 'true_positive')),
reviewed_at TEXT
);
scope is one of:
match— the verdict applies to one specific finding (match_idis set).text— the verdict applies to every finding with the same(pii_type, term)in this scan.file— the verdict applies to every finding infile_id.
See Triaging Findings → The verdict model for how scopes interact at read time.
A partial index speeds up the text-scope subquery used by FP-filtered counts:
CREATE INDEX idx_scan_fp_text_scope
ON scan_false_positives(pii_type, term, verdict)
WHERE scope = 'text';
Shared configuration
These tables hold configuration that survives across scans.
| Table | Purpose |
|---|---|
terms_lists, terms |
Named keyword lists for terms-list scans. terms rows FK to a list; the list itself is referenced by scans.terms_list_id. |
custom_regex_patterns |
User-defined regex rules — see Custom Regex. Linked to scans via scan_custom_regex_patterns. |
proximity_groups, proximity_group_patterns |
Proximity regex groups. Linked to scans via scan_proximity_groups. |
smb_credentials |
Credentials for SMB / network-share scans, with UNIQUE(server, share_name). Referenced by scans.smb_credential_id. |
kv_store |
Generic key/value table the application uses for small persistent settings (e.g. last-used preferences). Schema is (key TEXT PRIMARY KEY, value TEXT NOT NULL). |
Identity scans (DSAR)
Identity scans link PII findings to the names of the people they belong to. The hierarchy is hung off a scans row through identity_runs.scan_id.
| Table | Purpose |
|---|---|
identity_runs |
One run of identity association over a scan. Holds counts, status, and the linking method (line distance vs. paragraph). |
identity_names |
Every name extracted from the run's source text, with the file it was found in and a normalized form for lookup. |
identity_associations |
A name linked to a scan_matches row, with the method and distance used. is_ambiguous = 1 when more than one name was equally close. |
identity_unassociated |
Findings that did not link to any name. |
identity_false_positives |
Triage verdicts scoped to associations: specific (one association), exact_text (one term value), name_type_combo (a name + PII type pair). |
Unique partial indexes enforce one verdict per scope per run on identity_false_positives — see the migrations file for the exact predicates. Note this table does not carry a verdict column; presence implies false positive (the table name is literal here).
See PII Identity Scan for the user-facing workflow.
Watch mode
The watch daemon's persistent state.
CREATE TABLE watch_policies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
pii_type TEXT,
path_pattern TEXT,
max_risk TEXT,
action TEXT NOT NULL DEFAULT 'deny',
severity TEXT NOT NULL DEFAULT 'high',
created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX idx_watch_policies_name ON watch_policies(name);
CREATE TABLE watch_violations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
policy_id INTEGER NOT NULL REFERENCES watch_policies(id) ON DELETE CASCADE,
file_path TEXT NOT NULL,
pii_type TEXT NOT NULL,
term TEXT NOT NULL,
severity TEXT NOT NULL,
created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
watch_policies.name is unique — --policy <file> upserts on conflict, so reloading the same file is idempotent. See Watch Mode & Policies for the full lifecycle and policy-file format.
watch_violations.term holds the unmasked matched string (it never leaves the machine); the masked form is what's emitted to webhooks and stdout.
Useful queries
Count remaining true positives by PII type for a scan, after triage (the EXISTS subquery is the FP filter):
SELECT m.pii_type, COUNT(*) AS n
FROM scan_matches m
WHERE m.scan_id = ?
AND NOT EXISTS (
SELECT 1 FROM scan_false_positives fp
WHERE fp.scan_id = m.scan_id
AND fp.verdict = 'false_positive'
AND (
(fp.scope = 'match' AND fp.match_id = m.id)
OR (fp.scope = 'text' AND fp.pii_type = m.pii_type AND fp.term = m.term)
OR (fp.scope = 'file' AND fp.file_id = m.file_id)
)
)
GROUP BY m.pii_type
ORDER BY n DESC;
Recent watch violations with their policy name:
SELECT v.created_at, p.name AS policy, v.file_path, v.pii_type, v.severity
FROM watch_violations v
JOIN watch_policies p ON p.id = v.policy_id
ORDER BY v.created_at DESC
LIMIT 50;
Files in a scan that produced any PII at all:
SELECT f.path, COUNT(m.id) AS findings
FROM scan_files f
JOIN scan_matches m ON m.file_id = f.id
WHERE f.scan_id = ?
GROUP BY f.id
ORDER BY findings DESC;
Deleting scans and disk space
Deleting a scan removes it from the scan list immediately, while the underlying rows (which can number in the millions for a large scan) are removed by a background worker. You can keep browsing other scans and their findings while the cleanup runs. If the app shuts down mid-cleanup, the deletion is finished automatically on the next startup.
After the rows are deleted, PII Crawler compacts the database and returns the freed space to the operating system, so deleting a large scan shrinks piicrawler.db on disk.
Databases created by older versions of PII Crawler are compacted with a one-time full rebuild on the first deletion after upgrading. On a database holding very large scans this can take a few minutes and needs free disk space comparable to the database's size; subsequent deletions reclaim space quickly.
Backups and external access
The database is a normal SQLite file — back it up by copying it (when no PII Crawler process has it open) or use SQLite's online backup:
sqlite3 ~/.piicrawler/piicrawler.db ".backup '/backups/piicrawler-$(date +%F).db'"
.backup is safe to run while PII Crawler is using the database.
For read-only access from another tool while a scan is running, open the database with the mode=ro URI parameter so you don't compete for the writer lock:
sqlite3 'file:/home/you/.piicrawler/piicrawler.db?mode=ro&immutable=0' \
"SELECT id, name, status FROM scans ORDER BY id DESC LIMIT 5"