CLI Reference

Last updated May 2026

Overview

piicrawler is a single binary that runs in three modes:

  • An interactive terminal UI (no arguments)
  • A web UI server (piicrawler serve)
  • A set of command line subcommands for one-off scans, real-time monitoring, DSAR lookups, and HTML report generation

Run piicrawler help (or -h / --help) to print a usage summary at any time. Each subcommand also accepts --help (e.g. piicrawler scan --help).

Synopsis

piicrawler                                              Launch interactive TUI
piicrawler [scan] <path> [--workers <n>] [--out <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr]  Scan a file or directory
piicrawler serve [port]                                 Start web UI (default port 3001)
piicrawler watch <path>... [options]                    Monitor directories for PII in real-time
piicrawler dsar <name> [options]                        Search for a person's PII across all scans
piicrawler report <scan_id>                             Generate an HTML report for a scan
piicrawler textextract <path> [--no-ocr] [--out <file>]  Print extracted text from a file, directory, or container
piicrawler update [--yes] [--force]                     Download and install the latest build in place
piicrawler version                                      Print the PII Crawler version
piicrawler help                                         Show the built-in help message

Commands

(no arguments) — Interactive TUI

piicrawler

Launches the interactive terminal UI for browsing scans, viewing findings, and managing the local database. This is the default mode when you run the binary with no arguments.

[scan] <path> — One-shot scan

piicrawler [scan] <path> [--workers <n>] [-j <n>] [--out <file>] [-o <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr]

Scans a single file, an archive container (e.g. .zip, .tar.gz), or a directory tree, and writes results to stdout as pretty-printed JSON. The leading scan keyword is optional — piicrawler ~/share and piicrawler scan ~/share are equivalent. Progress messages are written to stderr so you can pipe stdout safely:

piicrawler ~/Documents > findings.json

Requires a registered installation. If no license is present, the command exits with an error directing you to register via piicrawler serve.

Options:

  • --workers <n>, -j <n> — Number of worker threads to use when scanning a directory. Defaults to 4. Capped at the number of files found.
  • --out <file>, -o <file> — Stream results to <file> instead of stdout (truncates if it exists). The format is auto-detected from the file extension: .csv → CSV (one row per finding, see below), anything else → JSONL (one ScanResult object per line). Each result is written and flushed as soon as its file finishes scanning, so memory does not grow with the size of the tree; this is recommended for large directory scans. Stdout JSON output is suppressed when this flag is used; stderr progress is unaffected.
  • --format jsonl|csv — Force the output format, overriding extension detection. Requires --out. Useful when writing CSV to a file without a .csv extension, or forcing JSONL output to a .csv filename.
  • --progress tui|plain|none — Control the progress display on stderr. Default: auto-detected, tui when stderr is an interactive terminal and plain otherwise (e.g. CI logs, piped stderr).
    • tui — In-place status block: counts, a progress bar, and the files currently scanning. Updates in place using ANSI cursor controls and never scrolls.
    • plain — Scrolling per-file [start N/T] / [done N/T] / [running K] lines. Use this for CI logs or anywhere the ANSI-aware mode would cause garbled output.
    • none — Suppress all progress output (equivalent to --quiet).
  • --quiet — Alias for --progress=none.
  • --no-ocr — Skip OCR on images and scanned PDFs. Speeds up scans of mostly-text trees.

Compatibility note. Progress is always written to stderr, never stdout, so piicrawler scan ~/share > findings.json and piicrawler scan ~/share --out report.csv keep working unchanged regardless of the progress mode. The --out file content is identical across modes.

CSV format. When CSV is selected, the file starts with a header row and then carries one row per finding:

file_path,pii_type,term,start,end,error
/srv/share/contracts/2026-q1.pdf,ssn,123-45-6789,1024,1035,
/srv/share/contracts/2026-q1.pdf,email,[email protected],2110,2126,
/srv/share/contracts/clean.txt,,,,,
/srv/share/contracts/locked.pdf,,,,,decryption failed

Files with no findings still produce one row (with empty PII columns) so that scanned-but-clean and unreadable files are visible in the report. Values containing commas, quotes, or newlines are quoted per RFC 4180.
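
A minimal sketch of consuming this CSV from Python, assuming the column names shown in the header above. The sample rows are copied from the example; the summarize helper is hypothetical and not part of PII Crawler.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the CSV example above.
SAMPLE = """\
file_path,pii_type,term,start,end,error
/srv/share/contracts/2026-q1.pdf,ssn,123-45-6789,1024,1035,
/srv/share/contracts/clean.txt,,,,,
/srv/share/contracts/locked.pdf,,,,,decryption failed
"""

def summarize(rows):
    """Split report rows into finding counts, clean files, and failed files."""
    findings = Counter()  # pii_type -> number of findings
    clean, failed = [], {}
    for row in rows:
        if row["error"]:
            failed[row["file_path"]] = row["error"]
        elif row["pii_type"]:
            findings[row["pii_type"]] += 1
        else:
            clean.append(row["file_path"])
    return findings, clean, failed

# csv.DictReader handles RFC 4180 quoting of commas, quotes, and newlines.
findings, clean, failed = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(dict(findings), clean, failed)
```

To read a real report, pass csv.DictReader(open("report.csv", newline="")) instead of the in-memory sample; newline="" lets the csv module handle quoted newlines itself.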

Behaviour:

  • File: extracts text, runs PII detection, prints a single ScanResult JSON object (or one JSONL line with --out).
  • Container: extracts each entry and scans it, prints an array of ScanResult objects (or one JSONL line per entry with --out). Containers include .zip, .7z, .tar.gz / .tgz, and .mbox. For .mbox files each message becomes its own entry; the message ordinal (and Message-ID: when present) is appended to file_path, e.g. mail.mbox::message-000042::<[email protected]>. Output streams as each entry completes, so tail -f works on the JSONL file mid-scan.
  • Directory: recursively walks the tree (skipping symlinks) and scans every supported file type or container in parallel, prints an array of ScanResult objects (or one JSONL line per file as it completes with --out).
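
Because a single-file scan prints one object while container and directory scans print an array, a consumer of the stdout JSON may want to normalize both cases. The following is a sketch under that assumption; load_results is a hypothetical helper and the sample documents are made up.

```python
import json

def load_results(text):
    """Return a list of ScanResult dicts from piicrawler's stdout JSON.

    A single-file scan prints one object; container and directory scans
    print an array, so the single-object case is wrapped in a list.
    """
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]

# Hypothetical sample documents for illustration.
single = '{"file_path": "/tmp/a.txt", "findings": [], "error": null}'
many = (
    '[{"file_path": "/tmp/a.txt", "findings": [], "error": null},'
    ' {"file_path": "/tmp/b.txt", "findings": [], "error": null}]'
)
print(len(load_results(single)), len(load_results(many)))  # prints: 1 2
```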

serve — Web UI

piicrawler serve [port]

Starts an HTTP server on the given port (default 3001) and serves the web UI at http://localhost:<port>. The web UI is where you create and manage scans, register your license, and review findings in a browser.

watch — Real-time monitoring

piicrawler watch <path>... [--webhook <url>] [--policy <file>] [--no-json] [--no-ocr] [--debounce <ms>]

Watches one or more directories for file system changes and scans newly created or modified files for PII as they appear. Results are streamed as JSON to stdout by default and recorded in the local database. Press Ctrl+C to stop the daemon.

Options:

  • --webhook <url> — POST findings to the given webhook URL as they are produced.
  • --policy <file> — Load alert policies from a config file (TOML-style [[policy]] tables). Each policy can match on pii_type and path_pattern, and tags every matching violation with an action and severity label. Loaded policies are appended to the local database for the daemon to consume — see Watch Mode & Policies for the file format, the webhook payload schema, and how to safely reload policies.
  • --no-json — Disable the JSON stdout stream (use this when you only want webhook delivery or database persistence).
  • --no-ocr — Skip OCR on images and scanned PDFs.
  • --debounce <ms> — Debounce window for file events in milliseconds. Defaults to 500. Useful when editors save in bursts.
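
For local testing of --webhook, a small receiver can be sketched with Python's standard library. This is an illustration only: the payload is treated as opaque JSON because its schema is documented elsewhere, and the port, the handler, and the 204/400 responses are assumptions rather than anything PII Crawler requires.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def decode_payload(body):
    """Return the decoded JSON payload, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        return None

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = decode_payload(self.rfile.read(length))
        if payload is None:
            self.send_response(400)
        else:
            # Replace with real alert routing; the payload is opaque here.
            print(json.dumps(payload))
            self.send_response(204)
        self.end_headers()

def serve(port=8080):
    """Listen on localhost until interrupted (hypothetical port)."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling serve() and then running piicrawler watch /srv/uploads --webhook http://127.0.0.1:8080/ should print each delivered payload, assuming the deliveries are JSON bodies.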

dsar — Data Subject Access Request

piicrawler dsar "Person Name" [--assert-clean] [--report <file>] [--json]

Searches every recorded scan in the local database for PII associated with the given person and prints a summary to stderr. Use this to fulfil GDPR/CCPA right-to-know requests or to check whether a specific person's data has leaked into a watched location. For an end-to-end walkthrough (scan → identity → DSAR report), see the DSAR Walkthrough.

Options:

  • --assert-clean — Exit with status 1 if any findings are returned (and 0 with a CLEAN: line if not). Designed for use in CI pipelines.
  • --report <file> — Write a self-contained HTML report to <file>.
  • --json — Print structured findings as JSON to stdout in addition to the stderr summary.
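
A CI gate built on --assert-clean could look like the sketch below. It assumes only the two exit statuses described above (0 for clean, 1 for findings) and treats anything else as an unexpected failure; whether other errors also exit with status 1 is not stated here. The dsar_is_clean helper and its run parameter (a seam for testing) are hypothetical.

```python
import subprocess

def dsar_is_clean(name, run=subprocess.run):
    """Return True if `piicrawler dsar NAME --assert-clean` exits 0, False on 1."""
    proc = run(["piicrawler", "dsar", name, "--assert-clean"])
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    raise RuntimeError(f"unexpected exit status {proc.returncode}")
```

A pipeline step might then call dsar_is_clean("Jane Doe") and fail the job when it returns False.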

report — HTML risk report

piicrawler report <scan_id>

Generates a standalone HTML risk report for the scan with the given numeric ID and writes it to piicrawler-report-<scan_id>.html in the current working directory. The scan ID can be found in the TUI or web UI.

textextract <path> — Print extracted text

piicrawler textextract <path> [--no-ocr] [--out <file>] [-o <file>]

Prints the raw text that PII Crawler's extractors would feed into the scanner, without running PII detection. Useful for verifying that a format is parsed correctly, debugging unexpected scan results, or piping cleaned text into another tool.

<path> may be:

  • a single file (PDF, Office document, image, plain text, etc.)
  • a directory (walked recursively, every text-extractable file is emitted)
  • a container archive (.zip, .7z, .tar.gz, .mbox) — each inner entry is emitted with the <archive>!/<inner-path> convention used by scan

Output goes to stdout by default, with one block per file:

===== /path/to/file.txt =====
<extracted text>

===== archive.zip!/inner/notes.docx =====
<extracted text>

===== /path/to/scan.pdf =====
[error: Extraction failed: encrypted PDF]

Files that fail extraction are still listed, with the error written in place of the body so the failure is visible in the output.
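
The block layout above can be split back into per-file text with a short parser. This is a sketch that assumes header lines always have the exact ===== <path> ===== form; a document whose own text contains such a line, or begins with [error:, would be misread. split_blocks is a hypothetical helper.

```python
import re

HEADER = re.compile(r"^===== (.+) =====$")

def split_blocks(text):
    """Map each file path to its extracted text (or its [error: ...] line)."""
    blocks, path, body = {}, None, []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            if path is not None:
                blocks[path] = "\n".join(body).strip()
            path, body = match.group(1), []
        elif path is not None:
            body.append(line)
    if path is not None:
        blocks[path] = "\n".join(body).strip()
    return blocks

sample = """\
===== /path/to/file.txt =====
hello world

===== /path/to/scan.pdf =====
[error: Extraction failed: encrypted PDF]
"""
blocks = split_blocks(sample)
failed = [p for p, b in blocks.items() if b.startswith("[error:")]
print(failed)  # prints: ['/path/to/scan.pdf']
```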

Options:

  • --no-ocr — Skip OCR on images and scanned PDFs. Speeds up extraction when you only care about text-bearing files.
  • --out <file>, -o <file> — Write output to <file> instead of stdout (truncates if it exists).

update — In-place binary upgrade

piicrawler update [--yes] [--force]

Checks downloads.eligian.com for the latest build for your platform, compares it to the running binary, and (if newer) downloads the matching archive, verifies its SHA-256, extracts the binary, and atomically swaps it into place. Your database, license, terms lists, and triage verdicts in ~/.piicrawler/ are never touched.

Options:

  • --yes, -y — Skip the interactive [y/N] confirmation. Useful for scripted upgrades.
  • --force, -f — Reinstall even when the local build is already at or newer than the published one.

Behaviour by platform:

  • Linux and macOS swap the binary atomically (rename(2)). Any already-running piicrawler processes keep using the old binary until they exit; new invocations pick up the new build.
  • Windows cannot overwrite a running .exe, so the live binary is renamed to piicrawler.exe.old next to the original and the new bytes are written at the original path. The .old file can be deleted once no piicrawler.exe processes remain. update always pulls the Azure-Trusted-Signing-signed piicrawler-cli-windows-signed.zip, so the binary you end up with after an upgrade is signed by the same publisher as your initial install.

If your platform or architecture is not currently published (e.g. Linux ARM), update exits with a friendly error pointing you to the download page.
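
The verify-then-swap idea behind update can be illustrated in a few lines. This is not the updater's actual implementation, only a sketch of the general technique (SHA-256 check followed by an atomic rename) using temporary files; sha256_of and verified_swap are hypothetical names.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verified_swap(new_path, target_path, expected_sha256):
    """Replace target_path with new_path only if the checksum matches."""
    actual = sha256_of(new_path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch: got {actual}")
    # On POSIX, os.replace is an atomic rename when both paths share a filesystem.
    os.replace(new_path, target_path)

# Self-contained demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    old, new = os.path.join(tmp, "tool"), os.path.join(tmp, "tool.new")
    with open(old, "wb") as f:
        f.write(b"old build")
    with open(new, "wb") as f:
        f.write(b"new build")
    verified_swap(new, old, hashlib.sha256(b"new build").hexdigest())
    with open(old, "rb") as f:
        swapped = f.read()
print(swapped)
```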

version — Print version

piicrawler version
piicrawler -V
piicrawler --version

Prints the running build's version string (e.g. 26.0507.1432) to stdout and exits. Stdout-only output keeps it pipe-friendly for shell scripts that need to read the version.
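
A script that needs to compare versions could parse the string into a tuple. This sketch assumes every dot-separated component is numeric and that later builds compare greater component by component; the versioning scheme itself is not specified here, and the second version string below is made up.

```python
def parse_version(text):
    """Turn a dotted numeric version string into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

current = parse_version("26.0507.1432")
print(current)                                  # prints: (26, 507, 1432)
print(current < parse_version("26.0612.0900"))  # prints: True
```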

help

piicrawler help
piicrawler -h
piicrawler --help

Prints the built-in usage summary to stdout. Pass a subcommand name (e.g. piicrawler help scan) for command-specific help.

Output

piicrawler <path> prints a JSON document with one entry per scanned file. Each entry has the shape:

{
  "file_path": "/absolute/path/to/file.pdf",
  "findings": [ ... ],
  "full_names": [ ... ],
  "char_count": 12345,
  "error": null
}

If extraction fails for a file, error is set to a short message and findings is empty. Container scans return the same shape, one entry per archive member.
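
A sketch of triaging a JSONL file written with --out, assuming the entry shape shown above. The sample lines are made up, each finding is left as an opaque object, and iter_results and triage are hypothetical helpers.

```python
import io
import json

def iter_results(lines):
    """Yield one ScanResult dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def triage(lines):
    """Return (paths with findings, {path: error}) for a JSONL stream."""
    with_findings, errors = [], {}
    for result in iter_results(lines):
        if result.get("error"):
            errors[result["file_path"]] = result["error"]
        elif result.get("findings"):
            with_findings.append(result["file_path"])
    return with_findings, errors

# Hypothetical sample data in the documented shape.
sample = io.StringIO(
    '{"file_path": "/data/a.pdf", "findings": [{}], "full_names": [], "char_count": 10, "error": null}\n'
    '{"file_path": "/data/b.txt", "findings": [], "full_names": [], "char_count": 5, "error": null}\n'
    '{"file_path": "/data/c.pdf", "findings": [], "full_names": [], "char_count": 0, "error": "encrypted PDF"}\n'
)
hits, errors = triage(sample)
print(hits, errors)
```

For a real scan, pass an open file handle instead, for example triage(open("findings.jsonl")); lines are processed one at a time, so memory use stays flat.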

See PII Data Types for the structure of individual findings and Results Storage for the database schema used by serve, watch, and the TUI.

Environment variables

PII Crawler reads a small set of environment variables on startup. None of them are required for normal use; they're escape hatches for daemonized, headless, or noisy deployments.

Logging

  • PIICRAWLER_LOG_FILE — If set to a non-empty path, structured logs are appended to the given file in addition to being shown in the TUI Logs view. Useful when running watch or serve as a long-lived daemon. Failure to open the file is logged and the binary continues without file logging.
  • PIICRAWLER_LOG_FILTER — Accepts any value parseable by log::LevelFilter: off, error, warn, info, debug, trace. Defaults to info. Set to debug or trace to surface SMB protocol traces during a network-share scan.

Credential store (headless / CI)

PII Crawler keeps SMB credentials encrypted under a credential password — see Scan an SMB Network Share → How credentials are protected. For headless deployments where you can't type the password interactively, set one of:

  • PIICRAWLER_CRED_PASSWORD — Auto-unlock the credential store with this password the first time anything in the session touches a stored credential. The password is never written back to disk.
  • PIICRAWLER_CRED_KEY_BASE64 — Inject a 32-byte base64-encoded data-encryption key (DEK) directly. Skips the password prompt entirely. Intended for CI tests against an isolated database where the DEK is managed externally; do not use in production.

If both are set, PIICRAWLER_CRED_KEY_BASE64 wins. If neither is set, the credential store stays locked until the user unlocks it via the TUI, the web UI, or POST /api/cred/unlock.
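
For an isolated CI database, a suitable value for PIICRAWLER_CRED_KEY_BASE64 could be generated as sketched below. The use of the standard (not URL-safe) base64 alphabet is an assumption, and generate_key_base64 is a hypothetical helper.

```python
import base64
import secrets

def generate_key_base64():
    """Return 32 random bytes encoded with the standard base64 alphabet."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")

key = generate_key_base64()
print(len(key))  # prints: 44 (includes one "=" of padding)
```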

Crash reporting

  • SENTRY_DSN — Overrides the built-in Sentry DSN that receives unhandled-exception reports (see Security → Error Reporting). Crash reports are the only way we learn about crashing bugs in the field, so please leave reporting enabled if possible. If you are in an air-gapped environment or want to disable all outbound calls, set it to an empty string (SENTRY_DSN="") to disable error reporting entirely. Reports never include scan results, file paths, or PII.

Examples

Scan a directory and save the findings to a file:

piicrawler ~/Downloads --workers 8 > findings.json

Stream a large directory scan to a JSONL file (results are appended as each file finishes, so memory stays flat):

piicrawler /srv/shared --workers 8 --out findings.jsonl
jq -c 'select(.findings | length > 0)' findings.jsonl

Export findings as CSV for spreadsheet review (auto-detected from the .csv extension):

piicrawler scan ~/share --workers 8 --out report.csv

Force CSV format when the output file has a non-standard extension:

piicrawler scan ~/share --out report.txt --format csv

Scan a single archive without OCR and pipe to jq:

piicrawler backups/2026-04.zip --no-ocr --quiet | jq '.[] | select(.findings | length > 0)'

Watch two directories with a webhook and a policy file:

piicrawler watch /srv/uploads /srv/exports \
  --webhook https://alerts.example.com/piicrawler \
  --policy ./policies.toml \
  --debounce 1000

Fail a CI job if any PII is found for a given person:

piicrawler dsar "Jane Doe" --assert-clean

Generate an HTML report for scan ID 42:

piicrawler report 42