CLI Reference
Overview
piicrawler is a single binary that ships in three modes:
- An interactive terminal UI (no arguments)
- A web UI server (piicrawler serve)
- A set of command-line subcommands for one-off scans, real-time monitoring, DSAR lookups, and HTML report generation
Run piicrawler help (or -h / --help) to print a usage summary at any time. Each subcommand also accepts --help (e.g. piicrawler scan --help).
Synopsis
piicrawler                                               Launch interactive TUI
piicrawler [scan] <path> [--workers <n>] [--out <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr]
                                                         Scan a file or directory
piicrawler serve [port]                                  Start web UI (default port 3001)
piicrawler watch <path>... [options]                     Monitor directories for PII in real-time
piicrawler dsar <name> [options]                         Search for a person's PII across all scans
piicrawler report <scan_id>                              Generate an HTML report for a scan
piicrawler textextract <path> [--no-ocr] [--out <file>]  Print extracted text from a file, directory, or container
piicrawler update [--yes] [--force]                      Download and install the latest build in place
piicrawler version                                       Print the PII Crawler version
piicrawler help                                          Show the built-in help message
Commands
(no arguments) — Interactive TUI
piicrawler
Launches the interactive terminal UI for browsing scans, viewing findings, and managing the local database. This is the default mode when you run the binary with no arguments.
[scan] <path> — One-shot scan
piicrawler [scan] <path> [--workers <n>] [-j <n>] [--out <file>] [-o <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr]
Scans a single file, an archive container (e.g. .zip, .tar.gz), or a directory tree, and writes results to stdout as pretty-printed JSON. The leading scan keyword is optional — piicrawler ~/share and piicrawler scan ~/share are equivalent. Progress messages are written to stderr so you can pipe stdout safely:
piicrawler ~/Documents > findings.json
Requires a registered installation. If no license is present, the command exits with an error directing you to register via piicrawler serve.
Options:
- --workers <n>, -j <n>: Number of worker threads to use when scanning a directory. Defaults to 4. Capped at the number of files found.
- --out <file>, -o <file>: Stream results to <file> instead of stdout (truncates if it exists). The format is auto-detected from the file extension: .csv → CSV (one row per finding, see below), anything else → JSONL (one ScanResult object per line). Each result is written and flushed as soon as its file finishes scanning, so memory does not grow with the size of the tree; this is recommended for large directory scans. Stdout JSON output is suppressed when this flag is used; stderr progress is unaffected.
- --format jsonl|csv: Force the output format, overriding extension detection. Requires --out. Useful when piping CSV to a file without a .csv extension, or forcing JSONL output to a .csv filename.
- --progress tui|plain|none: Control the progress display on stderr. Default: auto (tui when stderr is an interactive terminal, plain otherwise, e.g. CI logs or piped stderr).
  - tui: In-place status block with counts, a progress bar, and the files currently scanning. Updates in place using ANSI cursor controls and never scrolls.
  - plain: Scrolling per-file [start N/T] / [done N/T] / [running K] lines. Use this for CI logs or anywhere the ANSI-aware mode would cause garbled output.
  - none: Suppress all progress output (equivalent to --quiet).
- --quiet: Alias for --progress=none.
- --no-ocr: Skip OCR on images and scanned PDFs. Speeds up scans of mostly-text trees.
Compatibility note. Progress is always written to stderr, never stdout, so piicrawler scan ~/share > findings.json and piicrawler scan ~/share --out report.csv keep working unchanged regardless of the progress mode. The --out file content is identical across modes.
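The format and progress defaults described above can be summarised as a small decision function. This is only a sketch of the documented rules, not PII Crawler's own code: the function names resolve_output_format and default_progress_mode are hypothetical, and the case-insensitive extension match is an assumption the reference does not confirm.

```python
from pathlib import Path
from typing import Optional


def resolve_output_format(out_path: Optional[str],
                          format_flag: Optional[str] = None) -> Optional[str]:
    """Return "csv" or "jsonl" for an --out file, or None for stdout JSON."""
    if out_path is None:
        if format_flag is not None:
            # The reference states that --format requires --out.
            raise ValueError("--format requires --out")
        return None
    if format_flag is not None:
        if format_flag not in ("jsonl", "csv"):
            raise ValueError(f"unsupported format: {format_flag}")
        # An explicit --format overrides extension detection.
        return format_flag
    # Assumption: the extension comparison ignores case.
    return "csv" if Path(out_path).suffix.lower() == ".csv" else "jsonl"


def default_progress_mode(stderr_is_tty: bool) -> str:
    """Auto mode: tui on an interactive stderr, plain otherwise."""
    return "tui" if stderr_is_tty else "plain"
```

For instance, resolve_output_format("report.txt", "csv") yields "csv", which mirrors the --out report.txt --format csv combination.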
CSV format. When CSV is selected, the file starts with a header row and then carries one row per finding:
file_path,pii_type,term,start,end,error
/srv/share/contracts/2026-q1.pdf,ssn,123-45-6789,1024,1035,
/srv/share/contracts/2026-q1.pdf,email,[email protected],2110,2126,
/srv/share/contracts/clean.txt,,,,,
/srv/share/contracts/locked.pdf,,,,,decryption failed
Files with no findings still produce one row (with empty PII columns) so that scanned-but-clean and unreadable files are visible in the report. Values containing commas, quotes, or newlines are quoted per RFC 4180.
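As an illustration of how the CSV layout above might be consumed, the following sketch groups rows into findings, clean files, and unreadable files. The helper name summarize_csv is hypothetical, and it assumes that a row with a non-empty error column carries no finding, which matches the sample rows but is not stated explicitly.

```python
import csv
import io
from collections import defaultdict

SAMPLE = (
    "file_path,pii_type,term,start,end,error\n"
    "/srv/share/contracts/2026-q1.pdf,ssn,123-45-6789,1024,1035,\n"
    "/srv/share/contracts/clean.txt,,,,,\n"
    "/srv/share/contracts/locked.pdf,,,,,decryption failed\n"
)


def summarize_csv(handle):
    """Split report rows into (findings by file, clean files, errors by file)."""
    findings = defaultdict(list)
    clean = []
    errors = {}
    # csv.DictReader handles RFC 4180 quoting of commas, quotes, and newlines.
    for row in csv.DictReader(handle):
        path = row["file_path"]
        if row["error"]:
            # Assumption: an error row has empty PII columns.
            errors[path] = row["error"]
        elif row["pii_type"]:
            findings[path].append(
                (row["pii_type"], row["term"], int(row["start"]), int(row["end"]))
            )
        else:
            # Scanned, but nothing was found.
            clean.append(path)
    return findings, clean, errors


if __name__ == "__main__":
    found, clean_files, failed = summarize_csv(io.StringIO(SAMPLE))
    print(dict(found), clean_files, failed)
```

When reading a real report from disk, open it with newline="" so that quoted multi-line values are parsed correctly.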
Behaviour:
- File: extracts text, runs PII detection, prints a single ScanResult JSON object (or one JSONL line with --out).
- Container: extracts each entry and scans it, prints an array of ScanResult objects (or one JSONL line per entry with --out). Containers include .zip, .7z, .tar.gz/.tgz, and .mbox. For .mbox files each message becomes its own entry; the message ordinal (and Message-ID: when present) is appended to file_path, e.g. mail.mbox::message-000042::<[email protected]>. Output streams as each entry completes, so tail -f works on the JSONL file mid-scan.
- Directory: recursively walks the tree (skipping symlinks) and scans every supported file type or container in parallel, prints an array of ScanResult objects (or one JSONL line per file as it completes with --out).
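When post-processing .mbox results, the file_path convention described above can be split back into its parts. The sketch below is based only on the single example shown in this reference; the function name parse_mbox_entry is hypothetical, and it assumes the ordinal is always written as message-<digits>.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the documented example:
#   <container>::message-<ordinal>[::<Message-ID>]
MBOX_ENTRY = re.compile(
    r"^(?P<container>.+?)::message-(?P<ordinal>\d+)(?:::(?P<message_id>.*))?$"
)


def parse_mbox_entry(file_path: str) -> Optional[Tuple[str, int, Optional[str]]]:
    """Return (container, ordinal, message_id) or None for non-mbox paths."""
    match = MBOX_ENTRY.match(file_path)
    if match is None:
        return None
    return (
        match.group("container"),
        int(match.group("ordinal")),
        match.group("message_id"),  # None when no Message-ID was appended
    )
```

Paths that do not follow the pattern, such as ordinary files, return None, so the helper can be applied to every file_path in a mixed result set.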
serve — Web UI
piicrawler serve [port]
Starts the web UI on the given port (default 3001) and opens an HTTP server you can reach at http://localhost:<port>. The web UI is where you create and manage scans, register your license, and review findings in a browser.
watch — Real-time monitoring
piicrawler watch <path>... [--webhook <url>] [--policy <file>] [--no-json] [--no-ocr] [--debounce <ms>]
Watches one or more directories for file system changes and scans newly created or modified files for PII as they appear. Results are streamed as JSON to stdout by default and recorded in the local database. Press Ctrl+C to stop the daemon.
Options:
- --webhook <url>: POST findings to the given webhook URL as they are produced.
- --policy <file>: Load alert policies from a config file (TOML-style [[policy]] tables). Each policy can match on pii_type and path_pattern, and tags every matching violation with an action and severity label. Loaded policies are appended to the local database for the daemon to consume; see Watch Mode & Policies for the file format, the webhook payload schema, and how to safely reload policies.
- --no-json: Disable the JSON stdout stream (use this when you only want webhook delivery or database persistence).
- --no-ocr: Skip OCR on images and scanned PDFs.
- --debounce <ms>: Debounce window for file events in milliseconds. Defaults to 500. Useful when editors save in bursts.
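To illustrate what a debounce window does with bursty saves, here is a small simulation. It is a conceptual sketch only: it assumes a trailing-edge debounce (a path is scanned once the window has elapsed since its last event), which this reference does not confirm as PII Crawler's actual strategy, and the function name debounce is hypothetical.

```python
from typing import Iterable, List, Tuple


def debounce(events: Iterable[Tuple[int, str]],
             window_ms: int = 500) -> List[Tuple[int, str]]:
    """Collapse bursts of (timestamp_ms, path) events, given in time order.

    Returns (fire_time_ms, path) pairs: one per burst, fired window_ms
    after the last event of that burst.
    """
    pending = {}  # path -> timestamp of the latest event in the current burst
    fired = []
    for timestamp, path in events:
        last = pending.get(path)
        if last is not None and timestamp - last >= window_ms:
            # The previous burst went quiet for a full window, so it fired.
            fired.append((last + window_ms, path))
        pending[path] = timestamp
    # Flush bursts that were still pending when the event stream ended.
    for path, last in pending.items():
        fired.append((last + window_ms, path))
    return sorted(fired)
```

With the default 500 ms window, three saves of the same file at 0, 100, and 200 ms would collapse into a single scan in this model, while a later save at 1000 ms would trigger a second one.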
dsar — Data Subject Access Request
piicrawler dsar "Person Name" [--assert-clean] [--report <file>] [--json]
Searches every recorded scan in the local database for PII associated with the given person and prints a summary to stderr. Use this to fulfil GDPR/CCPA right-to-know requests or to check whether a specific person's data has leaked into a watched location. For an end-to-end walkthrough (scan → identity → DSAR report), see the DSAR Walkthrough.
Options:
- --assert-clean: Exit with status 1 if any findings are returned (and 0 with a CLEAN: line if not). Designed for use in CI pipelines.
- --report <file>: Write a self-contained HTML report to <file>.
- --json: Print structured findings as JSON to stdout in addition to the stderr summary.
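In a CI job written in Python, the --assert-clean exit status could be wrapped as follows. This is a sketch under assumptions: the helper names are hypothetical, piicrawler is assumed to be on PATH, and any exit status other than 0 or 1 is treated as an unexpected failure because this reference does not document other codes.

```python
import subprocess
from typing import List, Optional


def build_dsar_command(name: str, assert_clean: bool = False,
                       report: Optional[str] = None, json_output: bool = False,
                       binary: str = "piicrawler") -> List[str]:
    """Assemble the argv for a dsar lookup using only documented flags."""
    command = [binary, "dsar", name]
    if assert_clean:
        command.append("--assert-clean")
    if report is not None:
        command += ["--report", report]
    if json_output:
        command.append("--json")
    return command


def person_is_clean(name: str, binary: str = "piicrawler") -> bool:
    """Run dsar --assert-clean and map the exit status to a boolean."""
    completed = subprocess.run(
        build_dsar_command(name, assert_clean=True, binary=binary),
        capture_output=True,
        text=True,
    )
    if completed.returncode == 0:
        return True
    if completed.returncode == 1:
        return False
    # Assumption: other statuses indicate a failure unrelated to findings.
    raise RuntimeError(f"dsar exited with {completed.returncode}: {completed.stderr}")
```

Passing the name as a single list element avoids shell quoting problems with names that contain spaces.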
report — HTML risk report
piicrawler report <scan_id>
Generates a standalone HTML risk report for the scan with the given numeric ID and writes it to piicrawler-report-<scan_id>.html in the current working directory. The scan ID can be found in the TUI or web UI.
textextract <path> — Print extracted text
piicrawler textextract <path> [--no-ocr] [--out <file>] [-o <file>]
Prints the raw text that PII Crawler's extractors would feed into the scanner, without running PII detection. Useful for verifying that a format is parsed correctly, debugging unexpected scan results, or piping cleaned text into another tool.
<path> may be:
- a single file (PDF, Office document, image, plain text, etc.)
- a directory (walked recursively, every text-extractable file is emitted)
- a container archive (.zip, .7z, .tar.gz, .mbox); each inner entry is emitted with the <archive>!/<inner-path> convention used by scan
Output goes to stdout by default, with one block per file:
===== /path/to/file.txt =====
<extracted text>
===== archive.zip!/inner/notes.docx =====
<extracted text>
===== /path/to/scan.pdf =====
[error: Extraction failed: encrypted PDF]
Files that fail extraction are still listed, with the error written in place of the body so the failure is visible in the output.
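If the output needs to be processed further, the block layout above can be split back into per-file text. The following is a minimal sketch with hypothetical helper names; it assumes every header has the exact form ===== <path> ===== on its own line, so extracted text that happens to contain such a line would be misread as a new block.

```python
import re
from typing import Dict

HEADER = re.compile(r"^===== (.+) =====$")


def split_blocks(output: str) -> Dict[str, str]:
    """Map each emitted path to its extracted text (or error line)."""
    blocks: Dict[str, str] = {}
    current = None
    body = []
    for line in output.splitlines():
        match = HEADER.match(line)
        if match:
            if current is not None:
                blocks[current] = "\n".join(body)
            current, body = match.group(1), []
        elif current is not None:
            body.append(line)
    if current is not None:
        blocks[current] = "\n".join(body)
    return blocks


def failed_blocks(blocks: Dict[str, str]) -> Dict[str, str]:
    """Select entries whose body is an in-place error marker."""
    return {path: text for path, text in blocks.items()
            if text.startswith("[error:")}
```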
Options:
- --no-ocr: Skip OCR on images and scanned PDFs. Speeds up extraction when you only care about text-bearing files.
- --out <file>, -o <file>: Write output to <file> instead of stdout (truncates if it exists).
update — In-place binary upgrade
piicrawler update [--yes] [--force]
Checks downloads.eligian.com for the latest build for your platform, compares it to the running binary, and (if newer) downloads the matching archive, verifies its SHA-256, extracts the binary, and atomically swaps it into place. Your database, license, terms lists, and triage verdicts in ~/.piicrawler/ are never touched.
Options:
- --yes, -y: Skip the interactive [y/N] confirmation. Useful for scripted upgrades.
- --force, -f: Reinstall even when the local build is already at or newer than the published one.
Behaviour by platform:
- Linux and macOS swap the binary atomically (rename(2)). Any already-running piicrawler processes keep using the old binary until they exit; new invocations pick up the new build.
- Windows cannot overwrite a running .exe, so the live binary is renamed to piicrawler.exe.old next to the original and the new bytes are written at the original path. The .old file can be deleted once no piicrawler.exe processes remain. update always pulls the Azure-Trusted-Signing-signed piicrawler-cli-windows-signed.zip, so the binary you end up with after an upgrade is signed by the same publisher as your initial install.
If your platform or architecture is not currently published (e.g. Linux ARM), update exits with a friendly error pointing you to the download page.
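The verify-then-swap pattern described for Linux and macOS can be sketched in a few lines. This is an illustration of the general technique, not the updater's actual code: the function names are hypothetical, the checksum is applied directly to the new bytes rather than to a downloaded archive, and archive extraction and file permissions are omitted.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_verified(new_bytes: bytes, expected_sha256: str, target: str) -> None:
    """Stage next to the target, verify the checksum, then swap atomically."""
    directory = os.path.dirname(os.path.abspath(target))
    # Staging in the same directory keeps the final rename on one filesystem.
    fd, staged = tempfile.mkstemp(dir=directory, prefix=".staged-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(new_bytes)
            handle.flush()
            os.fsync(handle.fileno())
        if sha256_of(staged) != expected_sha256.lower():
            raise ValueError("SHA-256 mismatch; refusing to install")
        # os.replace maps to rename(2) on POSIX and is atomic there.
        os.replace(staged, target)
    except BaseException:
        # Leave the existing target untouched and remove the staged file.
        if os.path.exists(staged):
            os.unlink(staged)
        raise
```

Because the swap is a rename, a failed checksum leaves the existing binary in place.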
version — Print version
piicrawler version
piicrawler -V
piicrawler --version
Prints the running build's version string (e.g. 26.0507.1432) to stdout and exits. Stdout-only output keeps it pipe-friendly for shell scripts that need to read the version.
help
piicrawler help
piicrawler -h
piicrawler --help
Prints the built-in usage summary to stdout. Pass a subcommand name (e.g. piicrawler help scan) for command-specific help.
Output
piicrawler <path> prints a JSON document with one entry per scanned file. Each entry has the shape:
{
"file_path": "/absolute/path/to/file.pdf",
"findings": [ ... ],
"full_names": [ ... ],
"char_count": 12345,
"error": null
}
If extraction fails for a file, error is set to a short message and findings is empty. Container scans return the same shape, one entry per archive member.
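Because a single-file scan prints one object while container and directory scans print an array, a consumer may want to normalise both cases. The sketch below uses hypothetical helper names and relies only on the file_path, findings, and error fields shown above.

```python
import json
from typing import Dict, List, Tuple


def load_results(stdout_text: str) -> List[dict]:
    """Parse scan stdout and always return a list of result entries."""
    document = json.loads(stdout_text)
    return document if isinstance(document, list) else [document]


def partition(results: List[dict]) -> Tuple[List[str], Dict[str, str]]:
    """Return (paths with findings, {path: error message})."""
    with_findings = [r["file_path"] for r in results if r.get("findings")]
    failed = {r["file_path"]: r["error"] for r in results if r.get("error")}
    return with_findings, failed
```

The text passed to load_results would typically be the contents of a file such as findings.json produced by redirecting stdout.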
See PII Data Types for the structure of individual findings and Results Storage for the database schema used by serve, watch, and the TUI.
Environment variables
PII Crawler reads a small set of environment variables on startup. None of them are required for normal use; they're escape hatches for daemonized, headless, or noisy deployments.
Logging
- PIICRAWLER_LOG_FILE: If set to a non-empty path, structured logs are appended to the given file in addition to being shown in the TUI Logs view. Useful when running watch or serve as a long-lived daemon. Failure to open the file is logged and the binary continues without file logging.
- PIICRAWLER_LOG_FILTER: Accepts any value parseable by log::LevelFilter: off, error, warn, info, debug, trace. Defaults to info. Set to debug or trace to surface SMB protocol traces during a network-share scan.
Credential store (headless / CI)
PII Crawler keeps SMB credentials encrypted under a credential password — see Scan an SMB Network Share → How credentials are protected. For headless deployments where you can't type the password interactively, set one of:
- PIICRAWLER_CRED_PASSWORD: Auto-unlock the credential store with this password the first time anything in the session touches a stored credential. The password is never written back to disk.
- PIICRAWLER_CRED_KEY_BASE64: Inject a 32-byte base64-encoded data-encryption key (DEK) directly. Skips the password prompt entirely. Intended for CI tests against an isolated database where the DEK is managed externally; do not use in production.
If both are set, PIICRAWLER_CRED_KEY_BASE64 wins. If neither is set, the credential store stays locked until the user unlocks it via the TUI, the web UI, or POST /api/cred/unlock.
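For an isolated CI database, a throwaway key of the documented size could be produced and sanity-checked like this. It is a sketch with hypothetical helper names, and it assumes the standard base64 alphabet with padding; this reference does not say whether the URL-safe variant is also accepted.

```python
import base64
import binascii
import secrets


def generate_dek_base64() -> str:
    """Create a random 32-byte key, base64-encoded (44 characters)."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def is_valid_dek(value: str) -> bool:
    """Check that a value decodes to exactly 32 bytes of key material."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32
```

The generated string would then be exported as PIICRAWLER_CRED_KEY_BASE64 for the test run only, in line with the do-not-use-in-production guidance above.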
Crash reporting
- SENTRY_DSN: Overrides the built-in Sentry DSN that receives unhandled-exception reports (see Security → Error Reporting). Crash reports are the only way we learn about crashing bugs in the wild, so please leave reporting enabled if possible. If you are in an air-gapped environment or want to disable all outbound calls, set it to an empty string (SENTRY_DSN="") to disable error reporting entirely. Reports never include scan results, file paths, or PII.
Examples
Scan a directory and save the findings to a file:
piicrawler ~/Downloads --workers 8 > findings.json
Stream a large directory scan to a JSONL file (results are appended as each file finishes, so memory stays flat):
piicrawler /srv/shared --workers 8 --out findings.jsonl
jq -c 'select(.findings | length > 0)' findings.jsonl
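Where jq is not available, the same filter can be approximated in Python. This is a sketch with a hypothetical function name; it assumes each complete result ends with a newline, so an unterminated final line (as might be seen while a scan is still writing) is treated as incomplete and skipped.

```python
import json
from typing import Iterable, Iterator


def iter_results_with_findings(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSONL results that contain at least one finding."""
    for raw in lines:
        if not raw.strip():
            continue
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            if raw.endswith("\n"):
                # A complete line that is not valid JSON is a real error.
                raise
            # Assumption: a partial last line means the scan is still running.
            break
        if result.get("findings"):
            yield result
```

An open file handle can be passed directly as lines, since iterating a text file yields one line at a time and keeps memory use flat.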
Export findings as CSV for spreadsheet review (auto-detected from the .csv extension):
piicrawler scan ~/share --workers 8 --out report.csv
Force CSV format when the output file has a non-standard extension:
piicrawler scan ~/share --out report.txt --format csv
Scan a single archive without OCR and pipe to jq:
piicrawler backups/2026-04.zip --no-ocr --quiet | jq '.[] | select(.findings | length > 0)'
Watch two directories with a webhook and a policy file:
piicrawler watch /srv/uploads /srv/exports \
--webhook https://alerts.example.com/piicrawler \
--policy ./policies.toml \
--debounce 1000
Fail a CI job if any PII is found for a given person:
piicrawler dsar "Jane Doe" --assert-clean
Generate an HTML report for scan ID 42:
piicrawler report 42