Results Storage

PII Crawler stores scan results in an SQLite database. SQLite is a single-file database, and most programming languages either ship with or have readily available libraries for connecting to it. By default, results are stored in the user's home directory under .piicrawler/scans/<scan-id>.sqlite.
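As a rough illustration of connecting to a scan database, the sketch below uses Python's built-in sqlite3 module. The helper names scan_db_path and open_scan are hypothetical and not part of PII Crawler, and the scan ID must be replaced with one that actually exists on your machine.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def scan_db_path(scan_id: str, home: Optional[Path] = None) -> Path:
    """Build the default location of a scan database (hypothetical helper)."""
    base = home if home is not None else Path.home()
    return base / ".piicrawler" / "scans" / f"{scan_id}.sqlite"


def open_scan(db_path: Path) -> sqlite3.Connection:
    """Open an existing scan database.

    sqlite3.connect() would silently create a new, empty file for a
    mistyped path, so check that the file exists first.
    """
    if not db_path.is_file():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(str(db_path))
    conn.row_factory = sqlite3.Row  # allows access by column name
    return conn


# Example use (replace the placeholder with a real scan ID):
#   conn = open_scan(scan_db_path("<scan-id>"))
#   for row in conn.execute("SELECT path, size FROM files LIMIT 5"):
#       print(row["path"], row["size"])
```

Checking for the file before connecting is a deliberate choice here: it keeps a typo in the scan ID from leaving stray empty databases in the scans directory.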

PII Crawler intentionally exposes the application's database to the user so the user can manually or programmatically:

  • View the data that is being collected
  • Run custom queries on the data
  • Create alerts from findings
  • Customize which files are scanned
  • Customize how files are scanned
  • Customize the behavior of PII Crawler

You can think of this database as an API to PII Crawler.

Schema

Files Table

CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,
    sha256 TEXT,
    size INTEGER NOT NULL,
    extension TEXT NOT NULL,
    mime_type TEXT NOT NULL DEFAULT '',
    skip BOOLEAN DEFAULT 0 NOT NULL,
    last_modified INTEGER DEFAULT 0 NOT NULL,
    parent_path TEXT DEFAULT '' NOT NULL,
    last_scanned_at INTEGER DEFAULT 0 NOT NULL,
    scan_attempts INTEGER DEFAULT 0 NOT NULL,
    last_error TEXT,
    scan_started_at INTEGER,
    scan_finished_at INTEGER
);

Column            Description
path              Absolute path to file (primary key)
sha256            SHA256 hash of file content (used for deduplication)
size              Size of file in bytes
extension         File extension (ex: .pdf, .csv)
mime_type         Detected file MIME type (ex: application/json, image/jpeg)
skip              Boolean flag - if true, file will not be scanned
last_modified     Unix timestamp of when the file was last modified
parent_path       Path to the parent directory
last_scanned_at   Unix timestamp of the last scan attempt
scan_attempts     Number of times scanning has been attempted for this file
last_error        Error message from the last failed scan attempt (if any)
scan_started_at   Unix timestamp when the current/last scan started
scan_finished_at  Unix timestamp when the current/last scan finished
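
The skip and last_error columns lend themselves to small scripts. The following is a minimal sketch, assuming extensions are stored with a leading dot (as the examples above suggest) and that the database is edited while no scan is running. The function names skip_extension and failed_files are hypothetical.

```python
import sqlite3


def skip_extension(conn: sqlite3.Connection, extension: str) -> int:
    """Flag every not-yet-skipped file with this extension as skipped.

    Returns the number of rows updated.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE files SET skip = 1 WHERE extension = ? AND skip = 0",
            (extension,),
        )
    return cur.rowcount


def failed_files(conn: sqlite3.Connection) -> list:
    """Return (path, scan_attempts, last_error) for files with a recorded error."""
    return conn.execute(
        "SELECT path, scan_attempts, last_error FROM files "
        "WHERE last_error IS NOT NULL "
        "ORDER BY scan_attempts DESC, path"
    ).fetchall()


# Example use, given an open connection `conn` to a scan database:
#   skip_extension(conn, ".log")
#   for path, attempts, error in failed_files(conn):
#       print(path, attempts, error)
```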

Matches Table

CREATE TABLE IF NOT EXISTS matches (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    kind TEXT NOT NULL,
    ignored INTEGER DEFAULT 0 NOT NULL,
    UNIQUE(text, kind)
);

Column   Description
id       Auto-incrementing primary key
text     The matched PII text (e.g., actual email address, name, etc.)
kind     Type of PII match (e.g., "emails", "names", "ssns", "addresses")
ignored  Boolean flag (0 or 1) - if 1, this match will be excluded from results
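
To illustrate the ignored flag, here is a hedged sketch that marks a single match as ignored and counts the remaining matches per kind. The names ignore_match and active_counts_by_kind are hypothetical, and the sample values in the usage comment are made up.

```python
import sqlite3


def ignore_match(conn: sqlite3.Connection, text: str, kind: str) -> bool:
    """Set ignored = 1 for the match identified by (text, kind).

    UNIQUE(text, kind) means at most one row can be affected.
    Returns True if such a match exists, False otherwise.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE matches SET ignored = 1 WHERE text = ? AND kind = ?",
            (text, kind),
        )
    return cur.rowcount == 1


def active_counts_by_kind(conn: sqlite3.Connection) -> dict:
    """Return {kind: count} over matches that are not ignored."""
    rows = conn.execute(
        "SELECT kind, COUNT(*) FROM matches WHERE ignored = 0 GROUP BY kind"
    ).fetchall()
    return {kind: count for kind, count in rows}


# Example use, given an open connection `conn` to a scan database:
#   ignore_match(conn, "noreply@example.com", "emails")
#   print(active_counts_by_kind(conn))
```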

File Matches Table

CREATE TABLE IF NOT EXISTS file_matches (
    sha256 TEXT,
    match_id INTEGER,
    PRIMARY KEY (sha256, match_id)
);

Column    Description
sha256    SHA256 hash of file content (references files.sha256)
match_id  ID of the match (references matches.id)

This junction table links files to their PII matches by content hash (sha256) rather than by path, which deduplicates results: multiple files with identical content share the same matches.
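
A custom query over the three tables could look like the sketch below, which joins files to matches through file_matches and groups the non-ignored findings by file path. The function name findings_by_file is hypothetical. Files with identical content will list the same findings, since the join goes through sha256.

```python
import sqlite3
from collections import defaultdict

# Join each file to its matches via the shared content hash,
# leaving out matches that have been flagged as ignored.
FINDINGS_QUERY = """
SELECT f.path, m.kind, m.text
FROM files AS f
JOIN file_matches AS fm ON fm.sha256 = f.sha256
JOIN matches AS m ON m.id = fm.match_id
WHERE m.ignored = 0
ORDER BY f.path, m.kind, m.text
"""


def findings_by_file(conn: sqlite3.Connection) -> dict:
    """Return {path: [(kind, text), ...]} for all non-ignored matches."""
    findings = defaultdict(list)
    for path, kind, text in conn.execute(FINDINGS_QUERY):
        findings[path].append((kind, text))
    return dict(findings)


# Example use, given an open connection `conn` to a scan database:
#   for path, items in findings_by_file(conn).items():
#       print(path, len(items))
```

A result like this could serve as the starting point for an alert, for example by flagging any path whose list contains a particular kind of match.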