Reference

Results Storage

Last updated July 2026

PII Crawler stores everything it produces — scans, findings, triage verdicts, terms lists, custom regex patterns, proximity groups, identity associations, watch policies and violations, SMB credentials — in a single SQLite database:

Platform	Database location
Linux	`~/.piicrawler/piicrawler.db`
macOS	`~/.piicrawler/piicrawler.db`
Windows	`%USERPROFILE%\.piicrawler\piicrawler.db` (for example `C:\Users\Alice\.piicrawler\piicrawler.db`)

The shell examples on this page use the Linux/macOS path; on Windows substitute %USERPROFILE%\.piicrawler\piicrawler.db.

There is one database per user, not one per scan. Use the scan_id foreign key on each table to scope a query to a single scan.

PII Crawler intentionally exposes this database so you can:

Run custom queries against findings without going through the TUI or web UI
Build alerts and pipelines on top of scan results and watch violations
Back up, archive, or sync the file with standard SQLite tooling
Inspect exactly what the application persists

You can think of the database as the application's API surface.

Schema and migrations

The schema is owned by an embedded migration runner (rusqlite_migration) and tracked in PRAGMA user_version. PII Crawler applies any pending migrations on every startup, so the file is always at the latest schema for the binary running on the machine.

To dump the current schema authoritatively, ask SQLite directly:

sqlite3 ~/.piicrawler/piicrawler.db .schema

The CREATE TABLE statements quoted in this document reflect the schema at the time of writing. Treat .schema as the source of truth — this page covers the load-bearing tables and their relationships, not every column verbatim.

Conceptual model

A scans row is the parent record for one invocation of "scan this path." Hung off it:

scan_files — one row per file the scan walked.
scan_matches — one row per PII finding, FK to both scans and scan_files.
scan_false_positives — triage verdicts for matches in this scan (FP and TP).
scan_pii_types, scan_exclusions, scan_extensions — per-scan configuration captured at scan-creation time.

Configuration that is shared across scans (terms lists, custom regex patterns, proximity groups, SMB credentials) lives in standalone tables. Each scan that uses one is linked through a join table — scan_custom_regex_patterns, scan_proximity_groups — so the same regex or proximity group can be reused across many scans without duplication.

Two subsystems hang off scans with their own parallel hierarchies:

Identity scans (DSAR linking) — identity_runs is hung off a scan, with identity_names, identity_associations, identity_unassociated, and identity_false_positives underneath.
Watch mode is independent of scans — watch_policies is global; watch_violations are produced by the watch daemon and FK back to a policy.

Core scan tables

`scans`

CREATE TABLE scans (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    path TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    files_total INTEGER NOT NULL DEFAULT 0,
    files_scanned INTEGER NOT NULL DEFAULT 0,
    findings_total INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now')),
    enumeration_started_at TEXT,
    enumeration_finished_at TEXT,
    scan_started_at TEXT,
    scan_finished_at TEXT,
    terms_list_id INTEGER REFERENCES terms_lists(id) ON DELETE SET NULL,
    scan_type TEXT NOT NULL DEFAULT 'local',
    smb_server TEXT,
    smb_share TEXT,
    smb_subfolder TEXT,
    smb_credential_id INTEGER,
    max_file_size_bytes INTEGER,
    disable_ocr INTEGER NOT NULL DEFAULT 0,
    accumulated_duration_ms INTEGER NOT NULL DEFAULT 0,
    session_started_at TEXT
);

Notes:

status cycles through pending → enumerating → scanning → complete (or error / cancelled).
scan_type is local for a filesystem scan; smb plus smb_server / smb_share / smb_credential_id for a network share scan.
accumulated_duration_ms plus session_started_at track wall-clock time across pause/resume — query their sum to get total scan time.

`scan_files`

CREATE TABLE scan_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    path TEXT NOT NULL,
    size INTEGER,
    extension TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    error TEXT,
    scanned_at TEXT,
    duration_ms INTEGER,
    deleted_at TEXT,
    UNIQUE(scan_id, path)
);

One row per file the scan walked. status is pending / scanning / done / error. error holds the extraction or scan error message when status = 'error'. deleted_at is set when the file is deleted from disk through the app, and kept in sync when the Files view notices a path has disappeared (or reappeared); it powers the Files view's Deleted visibility filter.

`scan_matches`

CREATE TABLE scan_matches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    file_id INTEGER NOT NULL REFERENCES scan_files(id) ON DELETE CASCADE,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos INTEGER NOT NULL,
    context_line TEXT,
    context_paragraph TEXT,
    context_surrounding TEXT
);

One row per PII finding. pii_type uses the slugs documented in Watch Mode & Policies → PII type slugs (ssn, email, regex-<slug>, etc.). term is the actual matched string (unmasked — see Security for what data leaves the machine; this column never does). start_pos / end_pos are character offsets into the extracted text. The three context_* columns are progressively wider snippets from the source for triage and reporting.

`scan_pii_types`, `scan_exclusions`, `scan_extensions`

Per-scan configuration captured when the scan was created.

Table	Purpose
`scan_pii_types`	Which PII types this scan was configured to detect (subset of the data types).
`scan_exclusions`	Path or glob patterns excluded from this scan — see Exclusion Patterns.
`scan_extensions`	File extensions the scan was restricted to (empty = all supported types).

All three FK back to scans.id with ON DELETE CASCADE.

Triage: `scan_false_positives`

Triage verdicts (FP and TP) live in scan_false_positives. Despite the name, the table holds both verdicts since migration M003 — read the verdict column to distinguish them.

CREATE TABLE scan_false_positives (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    scope TEXT NOT NULL,
    match_id INTEGER REFERENCES scan_matches(id) ON DELETE CASCADE,
    file_id INTEGER REFERENCES scan_files(id) ON DELETE CASCADE,
    pii_type TEXT,
    term TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    verdict TEXT NOT NULL DEFAULT 'false_positive'
        CHECK (verdict IN ('false_positive', 'true_positive')),
    reviewed_at TEXT
);

scope is one of:

match — the verdict applies to one specific finding (match_id is set).
text — the verdict applies to every finding with the same (pii_type, term) in this scan.
file — the verdict applies to every finding in file_id.

See Triaging Findings → The verdict model for how scopes interact at read time.

A partial index speeds up the text-scope subquery used by FP-filtered counts:

CREATE INDEX idx_scan_fp_text_scope
    ON scan_false_positives(pii_type, term, verdict)
    WHERE scope = 'text';

Shared configuration

These tables hold configuration that survives across scans.

Table	Purpose
`terms_lists`, `terms`	Named keyword lists for terms-list scans. `terms` rows FK to a list; the list itself is referenced by `scans.terms_list_id`.
`custom_regex_patterns`	User-defined regex rules — see Custom Regex. Linked to scans via `scan_custom_regex_patterns`.
`proximity_groups`, `proximity_group_patterns`	Proximity regex groups. Linked to scans via `scan_proximity_groups`.
`smb_credentials`	Credentials for SMB / network-share scans, with `UNIQUE(server, share_name)`. Referenced by `scans.smb_credential_id`.
`kv_store`	Generic key/value table the application uses for small persistent settings (e.g. last-used preferences). Schema is `(key TEXT PRIMARY KEY, value TEXT NOT NULL)`.

Identity scans (DSAR)

Identity scans link PII findings to the names of the people they belong to. The hierarchy is hung off a scans row through identity_runs.scan_id.

Table	Purpose
`identity_runs`	One run of identity association over a scan. Holds counts, status, and the linking method (`line` distance vs. `paragraph`).
`identity_names`	Every name extracted from the run's source text, with the file it was found in and a `normalized` form for lookup.
`identity_associations`	A name linked to a `scan_matches` row, with the `method` and `distance` used. `is_ambiguous = 1` when more than one name was equally close.
`identity_unassociated`	Findings that did not link to any name.
`identity_false_positives`	Triage verdicts scoped to associations: `specific` (one association), `exact_text` (one term value), `name_type_combo` (a name + PII type pair).

Unique partial indexes enforce one verdict per scope per run on identity_false_positives — see the migrations file for the exact predicates. Note this table does not carry a verdict column; presence implies false positive (the table name is literal here).

See PII Identity Scan for the user-facing workflow.

Watch mode

The watch daemon's persistent state.

CREATE TABLE watch_policies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    pii_type TEXT,
    path_pattern TEXT,
    max_risk TEXT,
    action TEXT NOT NULL DEFAULT 'deny',
    severity TEXT NOT NULL DEFAULT 'high',
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE UNIQUE INDEX idx_watch_policies_name ON watch_policies(name);

CREATE TABLE watch_violations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    policy_id INTEGER NOT NULL REFERENCES watch_policies(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    severity TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

watch_policies.name is unique — --policy <file> upserts on conflict, so reloading the same file is idempotent. See Watch Mode & Policies for the full lifecycle and policy-file format.

watch_violations.term holds the unmasked matched string (it never leaves the machine); the masked form is what's emitted to webhooks and stdout.

Useful queries

Count remaining true positives by PII type for a scan, after triage (the EXISTS subquery is the FP filter):

SELECT m.pii_type, COUNT(*) AS n
FROM scan_matches m
WHERE m.scan_id = ?
  AND NOT EXISTS (
    SELECT 1 FROM scan_false_positives fp
    WHERE fp.scan_id = m.scan_id
      AND fp.verdict = 'false_positive'
      AND (
        (fp.scope = 'match' AND fp.match_id = m.id)
        OR (fp.scope = 'text' AND fp.pii_type = m.pii_type AND fp.term = m.term)
        OR (fp.scope = 'file' AND fp.file_id = m.file_id)
      )
  )
GROUP BY m.pii_type
ORDER BY n DESC;

Recent watch violations with their policy name:

SELECT v.created_at, p.name AS policy, v.file_path, v.pii_type, v.severity
FROM watch_violations v
JOIN watch_policies p ON p.id = v.policy_id
ORDER BY v.created_at DESC
LIMIT 50;

Files in a scan that produced any PII at all:

SELECT f.path, COUNT(m.id) AS findings
FROM scan_files f
JOIN scan_matches m ON m.file_id = f.id
WHERE f.scan_id = ?
GROUP BY f.id
ORDER BY findings DESC;

Deleting scans and disk space

Deleting a scan removes it from the scan list immediately, while the underlying rows (which can number in the millions for a large scan) are removed by a background worker. You can keep browsing other scans and their findings while the cleanup runs. If the app shuts down mid-cleanup, the deletion is finished automatically on the next startup.

After the rows are deleted, PII Crawler compacts the database and returns the freed space to the operating system, so deleting a large scan shrinks piicrawler.db on disk.

Databases created by older versions of PII Crawler are compacted with a one-time full rebuild on the first deletion after upgrading. On a database holding very large scans this can take a few minutes and needs free disk space comparable to the database's size; subsequent deletions reclaim space quickly.

Backups and external access

The database is a normal SQLite file — back it up by copying it (when no PII Crawler process has it open) or use SQLite's online backup:

sqlite3 ~/.piicrawler/piicrawler.db ".backup '/backups/piicrawler-$(date +%F).db'"

.backup is safe to run while PII Crawler is using the database.

For read-only access from another tool while a scan is running, open the database with the mode=ro URI parameter so you don't compete for the writer lock:

sqlite3 'file:/home/you/.piicrawler/piicrawler.db?mode=ro&immutable=0' \
  "SELECT id, name, status FROM scans ORDER BY id DESC LIMIT 5"

← Previous

Tax Form Detection

CLI Reference

Was this page helpful?