Reference

Results Storage

Last updated June 2026

PII Crawler stores everything it produces — scans, findings, triage verdicts, terms lists, custom regex patterns, proximity groups, identity associations, watch policies and violations, SMB credentials — in a single SQLite database:

Platform Database location
Linux ~/.piicrawler/piicrawler.db
macOS ~/.piicrawler/piicrawler.db
Windows %USERPROFILE%\.piicrawler\piicrawler.db (for example C:\Users\Alice\.piicrawler\piicrawler.db)

The shell examples on this page use the Linux/macOS path; on Windows substitute %USERPROFILE%\.piicrawler\piicrawler.db.

There is one database per user, not one per scan. Use the scan_id foreign key on each table to scope a query to a single scan.

PII Crawler intentionally exposes this database so you can:

  • Run custom queries against findings without going through the TUI or web UI
  • Build alerts and pipelines on top of scan results and watch violations
  • Back up, archive, or sync the file with standard SQLite tooling
  • Inspect exactly what the application persists

You can think of the database as the application's API surface.

Schema and migrations

The schema is owned by an embedded migration runner (rusqlite_migration) and tracked in PRAGMA user_version. PII Crawler applies any pending migrations on every startup, so the file is always at the latest schema for the binary running on the machine.

To dump the current schema authoritatively, ask SQLite directly:

sqlite3 ~/.piicrawler/piicrawler.db .schema

The CREATE TABLE statements quoted in this document reflect the schema at the time of writing. Treat .schema as the source of truth — this page covers the load-bearing tables and their relationships, not every column verbatim.

Conceptual model

A scans row is the parent record for one invocation of "scan this path." Hung off it:

  • scan_files — one row per file the scan walked.
  • scan_matches — one row per PII finding, FK to both scans and scan_files.
  • scan_false_positives — triage verdicts for matches in this scan (FP and TP).
  • scan_pii_types, scan_exclusions, scan_extensions — per-scan configuration captured at scan-creation time.

Configuration that is shared across scans (terms lists, custom regex patterns, proximity groups, SMB credentials) lives in standalone tables. Each scan that uses one is linked through a join table — scan_custom_regex_patterns, scan_proximity_groups — so the same regex or proximity group can be reused across many scans without duplication.

Two subsystems hang off scans with their own parallel hierarchies:

  • Identity scans (DSAR linking) — identity_runs is hung off a scan, with identity_names, identity_associations, identity_unassociated, and identity_false_positives underneath.
  • Watch mode is independent of scans — watch_policies is global; watch_violations are produced by the watch daemon and FK back to a policy.

Core scan tables

scans

CREATE TABLE scans (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    path TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    files_total INTEGER NOT NULL DEFAULT 0,
    files_scanned INTEGER NOT NULL DEFAULT 0,
    findings_total INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now')),
    enumeration_started_at TEXT,
    enumeration_finished_at TEXT,
    scan_started_at TEXT,
    scan_finished_at TEXT,
    terms_list_id INTEGER REFERENCES terms_lists(id) ON DELETE SET NULL,
    scan_type TEXT NOT NULL DEFAULT 'local',
    smb_server TEXT,
    smb_share TEXT,
    smb_subfolder TEXT,
    smb_credential_id INTEGER,
    max_file_size_bytes INTEGER,
    disable_ocr INTEGER NOT NULL DEFAULT 0,
    accumulated_duration_ms INTEGER NOT NULL DEFAULT 0,
    session_started_at TEXT
);

Notes:

  • status cycles through pendingenumeratingscanningcomplete (or error / cancelled).
  • scan_type is local for a filesystem scan; smb plus smb_server / smb_share / smb_credential_id for a network share scan.
  • accumulated_duration_ms plus session_started_at track wall-clock time across pause/resume — query their sum to get total scan time.

scan_files

CREATE TABLE scan_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    path TEXT NOT NULL,
    size INTEGER,
    extension TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    error TEXT,
    scanned_at TEXT,
    duration_ms INTEGER,
    deleted_at TEXT,
    UNIQUE(scan_id, path)
);

One row per file the scan walked. status is pending / scanning / done / error. error holds the extraction or scan error message when status = 'error'. deleted_at is set when the file is deleted from disk through the app, and kept in sync when the Files view notices a path has disappeared (or reappeared); it powers the Files view's Deleted visibility filter.

scan_matches

CREATE TABLE scan_matches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    file_id INTEGER NOT NULL REFERENCES scan_files(id) ON DELETE CASCADE,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos INTEGER NOT NULL,
    context_line TEXT,
    context_paragraph TEXT,
    context_surrounding TEXT
);

One row per PII finding. pii_type uses the slugs documented in Watch Mode & Policies → PII type slugs (ssn, email, regex-<slug>, etc.). term is the actual matched string (unmasked — see Security for what data leaves the machine; this column never does). start_pos / end_pos are character offsets into the extracted text. The three context_* columns are progressively wider snippets from the source for triage and reporting.

scan_pii_types, scan_exclusions, scan_extensions

Per-scan configuration captured when the scan was created.

Table Purpose
scan_pii_types Which PII types this scan was configured to detect (subset of the data types).
scan_exclusions Path or glob patterns excluded from this scan — see Exclusion Patterns.
scan_extensions File extensions the scan was restricted to (empty = all supported types).

All three FK back to scans.id with ON DELETE CASCADE.

Triage: scan_false_positives

Triage verdicts (FP and TP) live in scan_false_positives. Despite the name, the table holds both verdicts since migration M003 — read the verdict column to distinguish them.

CREATE TABLE scan_false_positives (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    scan_id INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    scope TEXT NOT NULL,
    match_id INTEGER REFERENCES scan_matches(id) ON DELETE CASCADE,
    file_id INTEGER REFERENCES scan_files(id) ON DELETE CASCADE,
    pii_type TEXT,
    term TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    verdict TEXT NOT NULL DEFAULT 'false_positive'
        CHECK (verdict IN ('false_positive', 'true_positive')),
    reviewed_at TEXT
);

scope is one of:

  • match — the verdict applies to one specific finding (match_id is set).
  • text — the verdict applies to every finding with the same (pii_type, term) in this scan.
  • file — the verdict applies to every finding in file_id.

See Triaging Findings → The verdict model for how scopes interact at read time.

A partial index speeds up the text-scope subquery used by FP-filtered counts:

CREATE INDEX idx_scan_fp_text_scope
    ON scan_false_positives(pii_type, term, verdict)
    WHERE scope = 'text';

Shared configuration

These tables hold configuration that survives across scans.

Table Purpose
terms_lists, terms Named keyword lists for terms-list scans. terms rows FK to a list; the list itself is referenced by scans.terms_list_id.
custom_regex_patterns User-defined regex rules — see Custom Regex. Linked to scans via scan_custom_regex_patterns.
proximity_groups, proximity_group_patterns Proximity regex groups. Linked to scans via scan_proximity_groups.
smb_credentials Credentials for SMB / network-share scans, with UNIQUE(server, share_name). Referenced by scans.smb_credential_id.
kv_store Generic key/value table the application uses for small persistent settings (e.g. last-used preferences). Schema is (key TEXT PRIMARY KEY, value TEXT NOT NULL).

Identity scans (DSAR)

Identity scans link PII findings to the names of the people they belong to. The hierarchy is hung off a scans row through identity_runs.scan_id.

Table Purpose
identity_runs One run of identity association over a scan. Holds counts, status, and the linking method (line distance vs. paragraph).
identity_names Every name extracted from the run's source text, with the file it was found in and a normalized form for lookup.
identity_associations A name linked to a scan_matches row, with the method and distance used. is_ambiguous = 1 when more than one name was equally close.
identity_unassociated Findings that did not link to any name.
identity_false_positives Triage verdicts scoped to associations: specific (one association), exact_text (one term value), name_type_combo (a name + PII type pair).

Unique partial indexes enforce one verdict per scope per run on identity_false_positives — see the migrations file for the exact predicates. Note this table does not carry a verdict column; presence implies false positive (the table name is literal here).

See PII Identity Scan for the user-facing workflow.

Watch mode

The watch daemon's persistent state.

CREATE TABLE watch_policies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    pii_type TEXT,
    path_pattern TEXT,
    max_risk TEXT,
    action TEXT NOT NULL DEFAULT 'deny',
    severity TEXT NOT NULL DEFAULT 'high',
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE UNIQUE INDEX idx_watch_policies_name ON watch_policies(name);

CREATE TABLE watch_violations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    policy_id INTEGER NOT NULL REFERENCES watch_policies(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    severity TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

watch_policies.name is unique — --policy <file> upserts on conflict, so reloading the same file is idempotent. See Watch Mode & Policies for the full lifecycle and policy-file format.

watch_violations.term holds the unmasked matched string (it never leaves the machine); the masked form is what's emitted to webhooks and stdout.

Useful queries

Count remaining true positives by PII type for a scan, after triage (the EXISTS subquery is the FP filter):

SELECT m.pii_type, COUNT(*) AS n
FROM scan_matches m
WHERE m.scan_id = ?
  AND NOT EXISTS (
    SELECT 1 FROM scan_false_positives fp
    WHERE fp.scan_id = m.scan_id
      AND fp.verdict = 'false_positive'
      AND (
        (fp.scope = 'match' AND fp.match_id = m.id)
        OR (fp.scope = 'text' AND fp.pii_type = m.pii_type AND fp.term = m.term)
        OR (fp.scope = 'file' AND fp.file_id = m.file_id)
      )
  )
GROUP BY m.pii_type
ORDER BY n DESC;

Recent watch violations with their policy name:

SELECT v.created_at, p.name AS policy, v.file_path, v.pii_type, v.severity
FROM watch_violations v
JOIN watch_policies p ON p.id = v.policy_id
ORDER BY v.created_at DESC
LIMIT 50;

Files in a scan that produced any PII at all:

SELECT f.path, COUNT(m.id) AS findings
FROM scan_files f
JOIN scan_matches m ON m.file_id = f.id
WHERE f.scan_id = ?
GROUP BY f.id
ORDER BY findings DESC;

Deleting scans and disk space

Deleting a scan removes it from the scan list immediately, while the underlying rows (which can number in the millions for a large scan) are removed by a background worker. You can keep browsing other scans and their findings while the cleanup runs. If the app shuts down mid-cleanup, the deletion is finished automatically on the next startup.

After the rows are deleted, PII Crawler compacts the database and returns the freed space to the operating system, so deleting a large scan shrinks piicrawler.db on disk.

Databases created by older versions of PII Crawler are compacted with a one-time full rebuild on the first deletion after upgrading. On a database holding very large scans this can take a few minutes and needs free disk space comparable to the database's size; subsequent deletions reclaim space quickly.

Backups and external access

The database is a normal SQLite file — back it up by copying it (when no PII Crawler process has it open) or use SQLite's online backup:

sqlite3 ~/.piicrawler/piicrawler.db ".backup '/backups/piicrawler-$(date +%F).db'"

.backup is safe to run while PII Crawler is using the database.

For read-only access from another tool while a scan is running, open the database with the mode=ro URI parameter so you don't compete for the writer lock:

sqlite3 'file:/home/you/.piicrawler/piicrawler.db?mode=ro&immutable=0' \
  "SELECT id, name, status FROM scans ORDER BY id DESC LIMIT 5"
Was this page helpful?