CLI Reference

Last updated June 2026

Overview

piicrawler is a single binary that ships in three modes:

  • An interactive terminal UI (no arguments)
  • A web UI server (piicrawler serve)
  • A set of command line subcommands for one-off scans, real-time monitoring, DSAR lookups, and HTML report generation

Run piicrawler help (or -h / --help) to print a usage summary at any time. Each subcommand also accepts --help (e.g. piicrawler scan --help).

Synopsis

piicrawler                                              Launch interactive TUI
piicrawler demo                                         Scan throwaway synthetic data (no setup, no registration)
piicrawler [scan] <path>... [--workers <n>] [--out <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr]  Scan files or directories
piicrawler smb <server> <share> [-u <user>] [-p <pass>] [--subfolder <path>] [options]  Scan an SMB network share (saves to the database)
piicrawler serve [port]                                 Start web UI (default port 3001)
piicrawler watch <path>... [options]                    Monitor directories for PII in real-time
piicrawler dsar <name> [options]                        Search for a person's PII across all scans
piicrawler scans [--json]                               List recorded scans and their IDs
piicrawler report <scan_id> [--out <file>]              Generate an HTML report for a scan
piicrawler export <scan_id> [--out <file>] [--exclude-fp]  Export a scan's findings as CSV
piicrawler textextract <path> [--no-ocr] [--out <file>]  Print extracted text from a file, directory, or container
piicrawler findings list   --scan <id> [filters] [--json]   List findings for a scan (LLM triage)
piicrawler findings mark   --scan <id> --verdict <fp|tp> <selector>   Mark findings as fp or tp
piicrawler findings unmark --scan <id> --verdict <fp|tp> <selector>   Clear a verdict (back to unreviewed)
piicrawler findings stats  --scan <id> [--json]          Show verdict tallies for a scan
piicrawler register <email> [--timeout <secs>] [--force]  Register this installation by email
piicrawler update [--yes] [--force]                     Download and install the latest build in place
piicrawler version                                      Print the PII Crawler version
piicrawler completions <shell>                          Print a shell completion script (bash, zsh, fish, ...)
piicrawler help                                         Show the built-in help message

Global options

These flags work with any command (and on their own, e.g. before launching the TUI).

  • --log-level <level> — Minimum log level to record: error, warn, info, debug, trace, or off. Defaults to info. Logs appear in the TUI's Logs view and, when PIICRAWLER_LOG_FILE is set, are also appended to that file. This flag overrides the PIICRAWLER_LOG_FILTER environment variable when both are set. Use debug or trace to surface SMB protocol traces during a network-share scan. A misspelled level fails immediately with the list of valid values.

    piicrawler --log-level debug                 # launch the TUI with debug logging
    piicrawler scan ~/share --log-level trace    # trace a one-shot scan
    
  • --color <when> — Control color in human-facing output (the scan report card, doctor, the registration notice): auto (the default), always, or never. Under auto, color is on when the output is an interactive terminal and off when it is piped or redirected. The conventional NO_COLOR environment variable disables color, and CLICOLOR_FORCE enables it; an explicit --color always overrides NO_COLOR. Color never touches machine output (stdout JSON, CSV, or --out files), so it is always safe to leave on.

Commands

(no arguments) — Interactive TUI

piicrawler

Launches the interactive terminal UI for browsing scans, viewing findings, and managing the local database. This is the default mode when you run the binary with no arguments.

demo — See it work on sample data

piicrawler demo

The fastest way to see what PII Crawler does. It generates a set of synthetic files (a fake HR export, a customer list, a payroll memo) seeded with obviously made-up but realistically formatted PII, alongside a few hundred ordinary PII-free files. It scans them all with the live progress bar on, so you watch it work for a beat, then prints the same report card a real scan produces. The sample files are written to a temporary directory and deleted as soon as the command finishes, so nothing on your disk is read or left behind.

demo needs no setup: it does not require registration and never opens your database. It is the recommended first command for a new install.

  PII Crawler demo
  Generating a sample data breach — all synthetic, nothing on your disk is touched.

  synthetic sample data
  ✖  CRITICAL RISK   52 findings · 3 of 304 files   1.0s

     email           15  ██████████████████
     ssn             13  ████████████████
     name            12  ██████████████
     address          3  ████
     city_state_zip   3  ████
     phone            3  ████
     credit-card      2  ██
     dob              1  █

  Hottest files
     36  hr/employee-records.csv
     11  exports/customer-export.txt
      5  notes/payroll-memo.txt

  That came from throwaway files we generated and just deleted.
  Point it at something real →  piicrawler ~/Documents

When you are ready to scan your own files, point scan at a real folder. Color follows the --color setting.

[scan] <path>... — One-shot scan

piicrawler [scan] <path>... [--workers <n>] [-j <n>] [--out <file>] [-o <file>] [--format jsonl|csv] [--progress tui|plain|none] [--quiet] [--no-ocr] [--exclude <regex>]... [--ext <list>] [--max-size <MB>] [--only <types>] [--exclude-type <types>] [--all] [--regex <label=pattern>]... [--terms-file <path>] [--detect-forms] [--summary] [--json] [--fail-on-findings] [--fail-on-risk low|medium|high|critical] [--save] [--name <label>]

Scans one or more files, archive containers (e.g. .zip, .tar.gz), or directory trees. The leading scan keyword is optional, so piicrawler ~/share and piicrawler scan ~/share are equivalent.

What it prints depends on where the output goes. On an interactive terminal it shows a colored report card on stderr: a risk verdict, a per-type breakdown, the files with the most findings, and the suggested next command (see Interactive report card below). When stdout is piped or redirected it writes the full results to stdout as JSON, so scripts keep working unchanged. Pass --json to force JSON on a terminal too. Progress is always written to stderr, so piping stdout is always safe:

piicrawler ~/Documents > findings.json   # stdout not a terminal, so JSON is written
piicrawler ~/Documents                   # terminal, so the report card is shown

Multiple paths and stdin. scan accepts several targets in one invocation, scanned in the order given. A literal - reads newline-separated paths from stdin (blank lines ignored), so it composes with tools like find:

piicrawler scan ~/share ~/Downloads          # several targets at once
find ~/share -name '*.pdf' | piicrawler scan -   # paths from stdin

A - may be mixed with literal paths (the stdin paths are appended after them). The JSON shape adapts to the input: a single plain file prints one ScanResult object (the same shape as in earlier versions), while multiple paths, a directory, or a container print a JSON array of results.

Requires a registered installation. If no license is present, the command exits with an error. Register from the command line with piicrawler register <email> (documented below), or interactively via piicrawler serve (web UI) or the TUI.

Options:

  • --workers <n>, -j <n> — Number of worker threads to use when scanning a directory. Defaults to 4. Capped at the number of files found.
  • --out <file>, -o <file> — Stream results to <file> instead of stdout (truncates if it exists). The format is auto-detected from the file extension: .csv → CSV (one row per finding, see below), anything else → JSONL (one ScanResult object per line). Each result is written and flushed as soon as its file finishes scanning, so memory does not grow with the size of the tree, which makes this the recommended mode for large directory scans. Stdout JSON output is suppressed when this flag is used; stderr progress is unaffected.
  • --format jsonl|csv — Force the output format, overriding extension detection. Requires --out. Useful when piping CSV to a file without a .csv extension, or forcing JSONL output to a .csv filename.
  • --progress tui|plain|none — Control the progress display on stderr. Default: auto, meaning tui when stderr is an interactive terminal and plain otherwise (e.g. CI logs, piped stderr).
    • tui — In-place status block: a spinner, counts, a progress bar, live throughput (files per second), an estimated time remaining, and the files currently scanning. Updates in place using ANSI cursor controls and never scrolls. The throughput and ETA appear once a couple of files are done and disappear at completion. File counts are comma-grouped (e.g. 155,693) for readability on large scans. Before the bar can appear, the whole tree is walked to count what will be scanned; during that phase a live Discovering files under … N line counts up, so a large directory (a whole home folder can hold hundreds of thousands of files) shows activity immediately instead of looking hung.
    • plain — Scrolling per-file [start N/T] / [done N/T] / [running K] lines, preceded by a Discovering files under … line and occasional …N files so far milestones while the tree is enumerated. Use this for CI logs or anywhere the ANSI-aware mode would cause garbled output.
    • none — Suppress all progress output (equivalent to --quiet).
  • --quiet — Alias for --progress=none.
  • --no-ocr — Skip OCR on images and scanned PDFs. Speeds up scans of mostly-text trees. Images encountered under --no-ocr are counted as skipped, not as errors, so the report card does not report them as "unreadable" (see Interactive report card).
  • --exclude <regex> — Skip files and directories whose full path matches the regex. Repeatable; a path matching any one pattern is skipped. A matched directory is pruned, so the walk never descends into it (e.g. --exclude node_modules --exclude '\.git/' skips dependency and VCS trees entirely). An invalid regex fails the scan before any work begins. (PII Crawler's own data directory, ~/.piicrawler, is pruned automatically — see File filters below — so you do not need to exclude it by hand.)
  • --ext <list> — Only scan files whose extension is in this comma-separated list (case-insensitive, leading dot optional, e.g. --ext csv,pdf or --ext .csv,.pdf). Files with no extension are skipped once the list is set. A list with no usable entries (e.g. --ext ,) disables the filter rather than skipping everything.
  • --max-size <MB> — Skip files larger than this many megabytes. The limit is inclusive, so --max-size 50 keeps a file of exactly 50 MB.
  • --only <types> — Run only the comma-separated PII detectors listed (e.g. --only ssn,credit-card). Conflicts with --all and --exclude-type. An unknown slug fails the scan and prints the full list of valid names.
  • --exclude-type <types> — Skip the comma-separated detectors listed. Applied on top of the default-enabled set, or on top of --all when combined with it (e.g. --all --exclude-type name runs every detector except full-name matching).
  • --all — Enable every detector, including the region-specific ones that are off by default: NZ IRD, AU Tax File Number, DE Tax ID (Steuer-ID), and the Australian, German, and New Zealand passport and driver-licence detectors. This is the only way to reach those region detectors from the CLI; the default scan runs the broadly-applicable set only.
  • --regex <label=pattern> — Add a custom regex detector. Repeatable, so you can supply several. The text before the first = is a label and the rest is the pattern, so the pattern may itself contain =. Matches surface under a regex-<label> PII type, where the label is slugified (lowercased, with anything that is not a letter dropped, so emp-id becomes empid). An invalid pattern fails the scan before any work begins. Example: --regex 'empid=EMP-\d{6}' flags strings like EMP-123456 as regex-empid.
  • --terms-file <path> — Match a list of keywords loaded from a text file, one term per line. Blank lines and surrounding whitespace are ignored. Matches surface under a terms-list-<name> PII type, where <name> is the slugified file stem (so medical-terms.txt becomes terms-list-medicalterms). An empty or unreadable file fails the scan. This is the file-based equivalent of the terms lists managed in the TUI and web UI.
  • --detect-forms — Also run document-similarity matching against the bundled US tax-form templates (1040, W-2, W-4, W-9, and others) and emit a us-tax-forms finding per detected form. Off by default because the extra pass adds overhead even when no forms are present.
  • --summary — Print a compact per-PII-type count table instead of the full JSON finding dump, for quick triage and CI logs. Each line is a detector slug and its finding count, ordered by descending count (ties broken alphabetically), closed by a TOTAL line; a clean scan prints No PII findings.. The table is results, not progress, so it is written to stdout and is not suppressed by --quiet. Combine it with --out to still stream the full results to a file while seeing the summary on stdout. Under --save (where stdout is reserved for the scan ID) the same table is written to stderr alongside the saved-scan summary.
  • --json — Emit the full results to stdout as JSON even when stdout is a terminal, and suppress the report card. Without this flag, an interactive scan shows the report card and JSON is written only when stdout is piped or redirected. Use --json when you want machine-readable output while still watching the terminal. Conflicts with --out and --summary, which own stdout themselves.
  • --fail-on-findings — Exit with status 1 if the scan turns up any PII, and 0 when it is clean. Output (stdout JSON and any --out file) is unchanged; only the exit code differs. Designed for CI pipelines that should break the build when PII leaks into a repository or artifact.
  • --fail-on-risk <level> — Exit with status 1 when the scan's overall risk level reaches the given threshold: low, medium, high, or critical. The level is computed from the same composite risk score the HTML report headlines (the sum of every finding's PII-type risk weight), so piicrawler report and this gate agree on the same files. Use it to tolerate a few low-risk hits while still failing on, say, a directory full of Social Security numbers. A FAIL: line naming the level and score is written to stderr when the threshold is met.
  • --save — Persist the scan to the local database instead of only streaming results. This is what makes report, findings, and dsar work on a CLI scan: a saved scan gets a record (with an ID), per-file rows, and stored matches, exactly like a scan started from the TUI or web UI. The scan runs through the same engine those interfaces use, so it also honours stored terms lists and proximity groups attached to the database. The new scan's numeric ID is printed to stdout (so you can capture it in a script); a human-readable summary and the live progress display go to stderr. Conflicts with --out/--format — a saved scan keeps its results in the database, so retrieve them afterwards with piicrawler report <id> or piicrawler findings list --scan <id> --json.
  • --name <label> — Label for the saved scan, shown by piicrawler scans. Implies --save. Defaults to the scan path when omitted.

--save records a single scan against one root, so it accepts exactly one path; passing several paths (or - resolving to more than one) is an error. Save each separately, or drop --save to stream all of them.
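Because --save accepts exactly one path, saving several roots means one invocation per root. The loop below is a minimal sketch of that. save_each is a hypothetical helper name, and the sketch relies only on the behaviour documented above: --save prints the bare scan ID on stdout, and --quiet silences progress.

```shell
# save_each is a hypothetical helper, not part of piicrawler.
# It saves each root as its own scan and prints "<scan id><TAB><root>".
save_each() {
  for root in "$@"; do
    # --save prints only the numeric scan ID on stdout, so $( ... ) captures it.
    id=$(piicrawler scan "$root" --save --quiet) || return 1
    printf '%s\t%s\n' "$id" "$root"
  done
}

# Usage: save_each ~/share ~/Downloads
```

The helper stops at the first scan that fails, so a partial run still leaves the IDs printed so far usable with piicrawler report and piicrawler findings.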

Interactive report card

When stdout is a terminal, a scan ends with a report card on stderr instead of dumping JSON. It summarizes the whole invocation: a risk verdict (the same level the HTML report uses), a per-type breakdown with proportional bars, the files holding the most findings, and the next command to run.

  ~/Documents
  ⚠  HIGH RISK   47 findings · 12 of 1,240 files   18.3s

     ssn           18  ████████████████
     credit-card   12  ███████████
     email          9  ████████
     phone          8  ███████

  Hottest files
     19  budget/2024-payroll.xlsx
     11  hr/onboarding.pdf

  Next  →  piicrawler scan ~/Documents --save

A clean scan collapses to a single line, for example ✓ All clear no PII found across 1,240 files (18.3s). The reassuring headline rotates between runs (All clear, Squeaky clean, Nothing to see here, and so on), so a clean result is a small fresh reward rather than the same line every time. The card is suppressed by --quiet, --summary, and --json, and color follows the --color setting (so a piped or NO_COLOR terminal gets plain text). It never writes to stdout, so it does not interfere with --out files or piped JSON.

Skipped vs. unreadable files. Two counts can follow the file total, and they mean different things:

  • skipped — files PII Crawler deliberately did not read: binary formats it does not extract text from, and images when --no-ocr is set. These are expected and not a problem.
  • unreadable — files it genuinely could not read. When any are present, the card lists them broken down by cause, so a large count is explained rather than mysterious:
  ✓  All clear   no PII found across 595,000 files, 74,000 skipped, 1,132 unreadable  (2m05s)
       permission denied  1,015
       timed out             97
       too large             20

permission denied covers files the operating system refused (run with the right permissions, or --exclude those paths); timed out covers files whose text extraction exceeded its time budget; too large covers files over the size limit; and other collects everything else (corrupt files, decode failures). The same breakdown appears under the verdict line on a scan that also has findings.

The card is a summary: it caps the breakdown at the top 8 PII types and the 5 files with the most findings. An interactive scan that only shows the card does not keep the individual findings (they are tallied as they are scanned and released, so a home-folder-sized tree does not balloon memory). To get every finding, capture the full results with --out <file>, force JSON with --json, pipe stdout (piicrawler scan … | less), or persist with --save and then use piicrawler findings list --scan <id> and piicrawler report <id>.

File filters. --exclude, --ext, and --max-size apply only while walking a directory tree. A single file or container archive named directly on the command line is always scanned, on the assumption that an explicitly named target is intentional. The same filters are available in the TUI and web UI via scan configuration.

Archive entries are bounded independently. --max-size measures a file's size on disk, but a compressed archive entry can be small on disk and enormous once decompressed (a highly compressible file, or a crafted "zip bomb"). To keep one such entry from exhausting memory, each entry inside a .zip, .7z, or .tar.gz is read with a fixed 100 MB decompressed cap. An entry that expands past that cap is reported as too large and skipped rather than read into memory whole, so the rest of the archive and the rest of the scan continue normally.

The app's own data is skipped by default. PII Crawler keeps its database, logs, and downloaded models in ~/.piicrawler (%USERPROFILE%\.piicrawler on Windows). A directory walk prunes that folder automatically, so piicrawler scan ~ does not scan PII Crawler's own database. The exclusion is lifted only when you target the data directory directly — piicrawler scan ~/.piicrawler still works — so an explicit request is always honored. This applies to plain scans, --save scans, and the TUI/web UI; it does not apply to smb shares (which are remote and never contain the local data directory).

Did you mean? A mistyped path that does not exist stops the scan before any work begins, with a suggestion drawn from the parent directory when a close match is found, for example:

  error: no such path to scan:

    testdata/files-to-scan-for-testng
      did you mean testdata/files-to-scan-for-testing?

Only paths named directly on the command line are checked this way; paths read from stdin (piicrawler scan -) stay lenient, so a find … | piicrawler scan - pipeline keeps going if a file disappears mid-run. A mistyped subcommand is caught the same way (piicrawler scna … suggests scan) rather than being treated as a file to scan. Both cases exit with status 2.

Detector slugs. --only and --exclude-type accept these names: address, au-drivers-license, au-passport, au-tfn, aws-credential, city_state_zip, credit-card, de-drivers-license, de-passport, de-steuer-id, dob, drivers-license, ein, email, name, nz-drivers-license, nz-ird, nz-passport, passport, phone, ssn. Region detectors (nz-ird, au-tfn, de-steuer-id, and the AU/DE/NZ passport and driver-licence detectors) are off in a default scan and run only under --all or when named in --only.

Custom detection. --regex, --terms-file, and --detect-forms add your own detectors on top of the built-in ones; they compose with the selection flags above, so a scan can run, say, --only ssn --regex 'empid=EMP-\d{6}' to look for SSNs plus your employee-ID format. These are the CLI equivalents of the custom patterns, terms lists, and tax-form detection configured in the TUI and web UI. They apply to every scan target (single files, directories, and container archives) and work with --save too, in which case the resulting regex-*, terms-list-*, and us-tax-forms findings are stored in the database like any other. All three are validated up front, so a bad pattern or missing terms file fails before the scan starts.
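To predict which PII type a custom detector will report under, the slug rules described for --regex and --terms-file can be sketched in a few lines of shell. This is an illustration of the documented rule (lowercase, then drop anything that is not a letter), not the tool's own code; slugify, regex_type, and terms_type are hypothetical names, and the sketch handles ASCII letters only because the treatment of other alphabets is not documented here.

```shell
# Hypothetical helpers that mirror the documented slug rule; not part of piicrawler.

# Lowercase the input, then drop every character that is not an ASCII letter.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -cd 'a-z'
}

# --regex takes label=pattern; the label is everything before the first "=".
regex_type() {
  label=${1%%=*}
  printf 'regex-%s\n' "$(slugify "$label")"
}

# --terms-file uses the file stem: strip directories, then the final extension.
terms_type() {
  base=${1##*/}
  stem=${base%.*}
  printf 'terms-list-%s\n' "$(slugify "$stem")"
}

regex_type 'emp-id=EMP-\d{6}'        # prints regex-empid
terms_type lists/medical-terms.txt   # prints terms-list-medicalterms
```

Two labels that differ only in digits or punctuation (emp-id and emp_id2, say) collapse to the same slug under this rule, so it is worth choosing labels that stay distinct once reduced to letters.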

Exit codes. A successful scan exits 0. The two --fail-on-* flags above opt into a 1 exit on findings or risk; without them scan always exits 0 even when PII is present. An exit code of 1 from an unflagged scan indicates a runtime error (e.g. missing registration), not findings. Both --fail-on-* flags work the same way under --save — the risk score is read back from the saved scan.
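In a CI job the exit status alone cannot tell a tripped risk gate from a runtime error, since both exit 1. One way to separate them, sketched below, is to look for the FAIL: line that --fail-on-risk writes to stderr. run_gate is a hypothetical wrapper, and the check assumes the text FAIL: appears verbatim in stderr; the exact wording of that line is not specified here, so treat the match as an assumption to verify against your build.

```shell
# run_gate is a hypothetical CI wrapper, not part of piicrawler.
# Usage: run_gate <path> [level] [results-file]
run_gate() (
  target=$1
  level=${2:-high}
  results=${3:-findings.jsonl}
  status=0
  # With --out nothing is written to stdout, so only stderr is captured here.
  err=$(piicrawler scan "$target" --quiet --out "$results" \
        --fail-on-risk "$level" 2>&1) || status=$?
  # Pass the captured stderr through so the CI log still shows it.
  [ -z "$err" ] || printf '%s\n' "$err" >&2
  case $status in
    0) echo "pass: risk below $level" ;;
    1) # Assumption: the documented FAIL: line contains the literal text "FAIL:".
       if printf '%s\n' "$err" | grep -q 'FAIL:'; then
         echo "gate: risk reached $level"
       else
         echo "error: scan failed (see stderr)"
       fi ;;
    2) echo "error: unknown path or subcommand" ;;
    *) echo "error: unexpected exit status $status" ;;
  esac
  exit "$status"
)
```

The wrapper returns the scan's own exit status, so a pipeline step such as run_gate ./artifacts high still fails the build exactly when the bare command would.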

Saving to the database (--save). By default scan is stateless: it streams results to stdout (or --out) and forgets them. Pass --save to persist the scan instead, which unlocks the natural pipeline:

SCAN_ID=$(piicrawler scan ~/share --save --name "Q3 audit" --quiet)
piicrawler findings list --scan "$SCAN_ID" --json   # triage
piicrawler report "$SCAN_ID"                         # HTML risk report
piicrawler dsar "Jane Doe"                            # search across all saved scans

The bare scan ID is the only thing written to stdout, so $( … ) captures it cleanly. A saved scan is identical to one created from the TUI or web UI, so it shows up in piicrawler scans and can be reported on, triaged, exported, or searched by dsar. Because it runs through the database-backed engine, it also applies any terms lists or proximity groups already stored in the database. --save does not stream per-finding output, so it is mutually exclusive with --out/--format.

Compatibility note. Progress is always written to stderr, never stdout, so piicrawler scan ~/share > findings.json and piicrawler scan ~/share --out report.csv keep working unchanged regardless of the progress mode. The --out file content is identical across modes.

CSV format. When CSV is selected, the file starts with a header row and then carries one row per finding:

file_path,pii_type,term,start,end,error
/srv/share/contracts/2026-q1.pdf,ssn,123-45-6789,1024,1035,
/srv/share/contracts/2026-q1.pdf,email,[email protected],2110,2126,
/srv/share/contracts/clean.txt,,,,,
/srv/share/contracts/locked.pdf,,,,,decryption failed

Files with no findings still produce one row (with empty PII columns) so that scanned-but-clean and unreadable files are visible in the report. Values containing commas, quotes, or newlines are quoted per RFC 4180.

Behaviour:

  • File: extracts text, runs PII detection, prints a single ScanResult JSON object (or one JSONL line with --out).
  • Container: extracts each entry and scans it, prints an array of ScanResult objects (or one JSONL line per entry with --out). Containers include .zip, .7z, .tar.gz / .tgz, and .mbox. For .mbox files each message becomes its own entry; the message ordinal (and Message-ID: when present) is appended to file_path, e.g. mail.mbox::message-000042::<[email protected]>. Output streams as each entry completes, so tail -f works on the JSONL file mid-scan.
  • Directory: recursively walks the tree (skipping symlinks) and scans every supported file type or container in parallel, prints an array of ScanResult objects (or one JSONL line per file as it completes with --out).
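Since --out writes one JSONL line per finished file or entry, the line count of the output file doubles as a progress counter while a long scan is still running. The helper below is a small sketch of that idea; results_so_far is a hypothetical name, and it assumes the one-result-per-line layout described above.

```shell
# results_so_far is a hypothetical helper, not part of piicrawler.
# Prints how many results the JSONL file holds right now (0 if it does not exist yet).
results_so_far() {
  if [ -f "$1" ]; then
    # Arithmetic expansion strips the padding some wc implementations add.
    echo $(( $(wc -l < "$1") ))
  else
    echo 0
  fi
}

# Usage, from a second terminal while "piicrawler scan ~/share --out results.jsonl" runs:
#   results_so_far results.jsonl
```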

smb <server> <share> — Scan an SMB network share

piicrawler smb <server> <share> [--subfolder <path>] [-u <user>] [-p <pass>] [--domain <domain>] [--max-concurrent <n>] [--delay-ms <ms>] [--bandwidth-mbps <mbps>] [--workers <n>] [--no-ocr] [--only <types>] [--exclude-type <types>] [--all] [--regex <label=pattern>]... [--terms-file <path>] [--detect-forms] [--exclude <regex>]... [--ext <list>] [--max-size <MB>] [--name <label>] [--progress tui|plain|none] [--quiet] [--summary] [--fail-on-findings] [--fail-on-risk low|medium|high|critical]

Scans an SMB / CIFS network share (the same network-share scanning the TUI and web UI offer) without a browser, which makes it the right choice for servers, containers, and CI. The share is enumerated, each file is downloaded and scanned with the configured throttle, and the results are always saved to the local database — so a network scan behaves like scan --save: the new scan's numeric ID is printed to stdout, while the live progress display and a human summary go to stderr.

piicrawler smb fileserver Finance -u alice            # password from PIICRAWLER_SMB_PASSWORD
piicrawler smb fileserver Finance -u alice -p 'secret' --subfolder HR/2025

Requires a registered installation (same gate as scan). The connection is tested up front, so a bad host, share, or credential fails immediately with a friendly error and no scan record is created.

Credentials. Provide -u/--username (and optionally --domain) for authenticated access, or omit them for an anonymous / guest connection. The password comes from -p/--password, or — when --username is given without --password — from the PIICRAWLER_SMB_PASSWORD environment variable, so it stays out of your shell history. Unlike the TUI and web UI, the CLI does not persist credentials to the database: the connected client is handed straight to the scan engine, so the password is never written to disk. (The scan record, file rows, and findings are saved as usual.)
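For interactive use, the environment-variable route can be wrapped so the password is typed at a prompt and exists only for the duration of one command. The function below is a sketch under that assumption; smb_scan_prompt is a hypothetical helper name, and it uses only the flags and the PIICRAWLER_SMB_PASSWORD variable documented above.

```shell
# smb_scan_prompt is a hypothetical helper, not part of piicrawler.
# Usage: smb_scan_prompt <server> <share> <user>
# The subshell body ( ... ) keeps the password variable out of the calling shell.
smb_scan_prompt() (
  server=$1 share=$2 user=$3
  printf 'Password for %s: ' "$user" >&2
  if [ -t 0 ]; then
    # On a real terminal, hide the typed characters while reading.
    stty -echo; IFS= read -r password; stty echo; printf '\n' >&2
  else
    # When stdin is a pipe or file, just read the first line.
    IFS= read -r password
  fi
  # The variable is set for this one command only; no -p flag is used.
  PIICRAWLER_SMB_PASSWORD=$password piicrawler smb "$server" "$share" -u "$user" --quiet
)

# Usage: SCAN_ID=$(smb_scan_prompt fileserver Finance alice)
```

The prompt goes to stderr, so capturing stdout with $( ... ) still yields only the scan ID.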

Options:

  • --subfolder <path> — Scan only this subfolder within the share. Defaults to the share root. Forward slashes and backslashes both work.
  • --username <user>, -u <user> — Username for authentication. Omit for an anonymous / guest connection.
  • --password <pass>, -p <pass> — Password for authentication. If --username is given without --password, the password is read from PIICRAWLER_SMB_PASSWORD instead.
  • --domain <domain> — Windows domain for NTLM authentication.
  • --max-concurrent <n> — Maximum number of files to read from the share concurrently. Defaults to 2. Raise it for fast, lightly-loaded shares; lower it to be gentle on the server or the network.
  • --delay-ms <ms> — Politeness delay inserted between files, in milliseconds. Defaults to 100. Set to 0 to disable.
  • --bandwidth-mbps <mbps> — Cap download bandwidth at this many megabits per second. No cap when unset (or when set to a non-positive value).
  • --workers <n>, -j <n> — Worker threads for scanning the downloaded files. Defaults to 4.
  • --no-ocr — Skip OCR on images and scanned PDFs.
  • --only, --exclude-type, --all, --regex, --terms-file, --detect-forms — Detector selection and custom detection, identical to the scan command flags of the same name.
  • --exclude <regex>, --ext <list>, --max-size <MB> — File filters, identical to the scan command flags of the same name. They apply while enumerating the share.
  • --name <label> — Label for the saved scan, shown by piicrawler scans. Defaults to the UNC path (e.g. \\fileserver\Finance).
  • --progress tui|plain|none, --quiet, --summary — Progress and summary output, identical to the scan command. The --summary table is written to stderr (stdout is reserved for the scan ID).
  • --fail-on-findings, --fail-on-risk <level> — CI exit gates, identical to the scan command. The risk score is read back from the saved scan.

Once a network scan completes, the natural follow-ups are the same as any saved scan:

SCAN_ID=$(piicrawler smb fileserver Finance -u alice --quiet)
piicrawler report "$SCAN_ID"
piicrawler findings list --scan "$SCAN_ID" --json

To see SMB protocol traces while diagnosing a connection problem, add --log-level debug (or set PIICRAWLER_LOG_FILTER=debug). For the interactive equivalent and how credentials are protected there, see Scan an SMB Network Share.

serve — Web UI

piicrawler serve [port] [--bind <address>] [--open]

Starts an HTTP server on the given port (default 3001) and serves the web UI at http://localhost:<port>. The web UI is where you create and manage scans, register your license, and review findings in a browser.

To start the server and open the web UI in your default browser in one step, add --open (or -o):

piicrawler serve --open          # serve on 3001 and open the browser
piicrawler serve 8080 --open     # serve on 8080 and open the browser

The Windows installer's Start Menu shortcut and the macOS PIICrawler.app bundle launch PII Crawler this way (with no console or Terminal window), starting the server and opening the UI for you. If no browser can be launched (for example on a headless machine) the server still starts and the failure is noted in the logs.

If PII Crawler is already running on that port, --open does not try to start a second server. It detects the running instance and opens the browser to it instead, so running it again simply brings the UI back up.

By default the server binds to 127.0.0.1 (loopback only), so the UI is reachable only from the local machine. To expose it on the LAN for testing, use --bind:

piicrawler serve --bind 0.0.0.0 8080      # all interfaces
piicrawler serve --bind 192.168.1.10      # a specific NIC
PIICRAWLER_BIND=0.0.0.0 piicrawler serve  # via env var

Options:

  • --bind <address>, -b <address> — Network address to bind to for this invocation only. Overrides the persisted Settings → Bind address from the web UI. Reads PIICRAWLER_BIND as a fallback. Resolution order: --bind/PIICRAWLER_BIND > persisted setting > 127.0.0.1.
  • --open, -o — Once the server is listening, open the web UI in your default browser.
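The bind-address resolution order can be pictured as a chain of fallbacks. The sketch below is one reading of the --bind description: it assumes the flag is consulted before PIICRAWLER_BIND (the text calls the variable a fallback), followed by the persisted setting and then the loopback default. resolve_bind is a hypothetical name used only for illustration.

```shell
# resolve_bind is a hypothetical illustration of the documented order; not part of piicrawler.
# Usage: resolve_bind <flag value> <env value> <persisted value>   (pass "" for unset)
resolve_bind() {
  flag=$1 env_value=$2 persisted=$3
  # The first non-empty value wins; 127.0.0.1 is the final default.
  printf '%s\n' "${flag:-${env_value:-${persisted:-127.0.0.1}}}"
}

resolve_bind "" "" ""                      # prints 127.0.0.1
resolve_bind "" "0.0.0.0" "10.0.0.5"       # prints 0.0.0.0
resolve_bind "192.168.1.10" "0.0.0.0" ""   # prints 192.168.1.10
```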

Anything reachable at the bound address can hit the UI. A password set under Settings → Login password gates the API routes for unauthenticated callers, but the UI itself, the static assets, and the login endpoint stay reachable. If no password has been set, exposing 0.0.0.0 leaves the entire UI open — set a password first, or scope the bind address to the smallest network you need.

watch — Real-time monitoring

piicrawler watch <path>... [--webhook <url>] [--policy <file>] [--no-json] [--no-ocr] [--debounce <ms>]

Watches one or more directories for file system changes and scans newly created or modified files for PII as they appear. Results are streamed as JSON to stdout by default and recorded in the local database. Press Ctrl+C to stop the daemon.

Options:

  • --webhook <url> — POST findings to the given webhook URL as they are produced.
  • --policy <file> — Load alert policies from a config file (TOML-style [[policy]] tables). Each policy can match on pii_type and path_pattern, and tags every matching violation with an action and severity label. Loaded policies are appended to the local database for the daemon to consume — see Watch Mode & Policies for the file format, the webhook payload schema, and how to safely reload policies.
  • --no-json — Disable the JSON stdout stream (use this when you only want webhook delivery or database persistence).
  • --no-ocr — Skip OCR on images and scanned PDFs.
  • --debounce <ms> — Debounce window for file events in milliseconds. Defaults to 500. Useful when editors save in bursts.
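
To make the effect of a debounce window concrete, here is a minimal Python sketch of per-path event coalescing. It shows the general technique only. How piicrawler watch debounces internally is not documented here, so the per-path grouping and the "keep the last event of a burst" behaviour are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (path, timestamp in milliseconds)


def debounce(events: Iterable[Event], window_ms: int = 500) -> List[Event]:
    """Collapse bursts of events on the same path into one entry.

    An event is merged into the current burst for its path when it arrives
    within window_ms of that path's previous event. The surviving entry
    carries the timestamp of the last event in the burst.
    """
    last_seen: Dict[str, int] = {}  # path -> timestamp of its latest event
    position: Dict[str, int] = {}   # path -> index of its current burst in result
    result: List[Event] = []
    for path, ts in sorted(events, key=lambda e: e[1]):
        if path in last_seen and ts - last_seen[path] <= window_ms:
            result[position[path]] = (path, ts)  # extend the current burst
        else:
            position[path] = len(result)         # start a new burst
            result.append((path, ts))
        last_seen[path] = ts
    return result
```

With the default 500 ms window, three saves of the same file at 0, 100, and 400 ms would produce one scan trigger under this model, while a save two seconds later would produce a second one.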

dsar — Data Subject Access Request

piicrawler dsar "Person Name" [--assert-clean] [--report <file>] [--json]

Searches every recorded scan in the local database for PII associated with the given person and prints a summary to stderr. Use this to fulfil GDPR/CCPA right-to-know requests or to check whether a specific person's data has leaked into a watched location. For an end-to-end walkthrough (scan → identity → DSAR report), see the DSAR Walkthrough.

Options:

  • --assert-clean — Exit with status 1 if any findings are returned (and 0 with a CLEAN: line if not). Designed for use in CI pipelines.
  • --report <file> — Write a self-contained HTML report to <file>.
  • --json — Print structured findings as JSON to stdout in addition to the stderr summary.

scans — List recorded scans

piicrawler scans [--json]

Lists every scan recorded in the local database, most recent first. Use this to look up the numeric scan ID required by piicrawler report.

The default output is a text table with the following columns:

  • ID — numeric scan ID (pass to piicrawler report)
  • STATUS — pending, enumerating, scanning, completed, stopped, etc.
  • FILES — scanned/total file counts
  • FINDINGS — total PII matches recorded
  • CREATED — timestamp of scan creation
  • NAME — display name (often the path or a TUI-supplied label)
  • PATH — root path that was scanned

On an interactive terminal the table is colored to be scannable at a glance: the status is green when finished, yellow while in flight, and red on failure, and a nonzero finding count is highlighted. Color follows the --color setting and is based on the table's own stream, so piicrawler scans | cat stays plain text while --color always keeps it colored through a pipe.

Options:

  • --json — Print the same listing as a JSON array on stdout for scripting (each entry includes id, name, path, status, files_total, files_scanned, findings_total, scan_type, created_at, updated_at).

Example — find the scan ID and generate its report:

piicrawler scans
piicrawler report 42
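
For scripting, the --json listing can be reduced to a single scan ID. The following Python sketch picks the most recent completed scan; it relies on the documented ordering (most recent first) and on the id and status fields listed above. The helper name latest_completed_scan_id is hypothetical.

```python
import json
from typing import Optional


def latest_completed_scan_id(scans_json: str) -> Optional[int]:
    """Return the ID of the newest scan whose status is "completed".

    scans_json is the text printed by `piicrawler scans --json`,
    documented as a JSON array ordered most recent first.
    """
    for scan in json.loads(scans_json):
        if scan.get("status") == "completed":
            return scan["id"]
    return None  # no completed scan recorded yet
```

The input text could come from subprocess.run(["piicrawler", "scans", "--json"], capture_output=True, text=True, check=True).stdout, and the returned ID can then be passed to piicrawler report.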

report — HTML risk report

piicrawler report <scan_id> [--out <file>] [-o <file>]

Generates a standalone HTML risk report for the scan with the given numeric ID. By default it writes piicrawler-report-<scan_id>.html in the current working directory. Run piicrawler scans to see the available scan IDs (or look them up in the TUI or web UI).

Options:

  • --out <file> (or -o <file>) — Write the report to this path instead of the default. The path is used verbatim, so you choose both the directory and file name (for example piicrawler report 42 --out /tmp/q3-audit.html). The destination directory must already exist.

export <scan_id> — Export findings as CSV

piicrawler export <scan_id> [--out <file>] [-o <file>] [--exclude-fp]

Exports every finding from a saved scan as CSV, one row per match. Run piicrawler scans to look up the scan ID. By default the CSV streams to stdout, so it can be redirected or piped:

piicrawler export 42 --out findings.csv
piicrawler export 42 > findings.csv
piicrawler export 42 | grep ssn

The columns match the TUI and web UI exports, so files from any of the three can be combined or processed identically:

file_path,pii_type,match,start_pos,end_pos,context

Rows stream straight from the database to the output as they are read, so the export stays fast and uses constant memory no matter how large the scan is. This makes it the recommended way to pull findings out of very large scans (hundreds of thousands or millions of files), where downloading the same CSV through a browser can be slow.

Options:

  • --out <file> (or -o <file>) — Write the CSV to this path instead of stdout (truncates if it exists). Prints a confirmation with the number of findings exported.
  • --exclude-fp — Leave out findings that have been marked as false positives (in the TUI, web UI, or with piicrawler findings mark). By default every finding is included.

Example — pull a clean CSV of triaged findings:

piicrawler scans
piicrawler export 42 --exclude-fp --out q3-findings.csv
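
Because the export streams row by row, a consumer can process it the same way. The Python sketch below tallies findings per PII type without loading the whole file. It assumes the CSV begins with a header row naming the columns shown above; the function name count_by_pii_type is hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_pii_type(lines: Iterable[str]) -> Counter:
    """Count exported findings per pii_type, one row at a time.

    lines can be an open file or sys.stdin. Assumption: the first line
    is a header row with the documented column names.
    """
    counts: Counter = Counter()
    for row in csv.DictReader(lines):
        counts[row["pii_type"]] += 1
    return counts
```

One way to use it is to save the block as a script that calls count_by_pii_type(sys.stdin) and pipe piicrawler export 42 --exclude-fp into it.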

textextract <path> — Print extracted text

piicrawler textextract <path> [--no-ocr] [--out <file>] [-o <file>]

Prints the raw text that PII Crawler's extractors would feed into the scanner, without running PII detection. Useful for verifying that a format is parsed correctly, debugging unexpected scan results, or piping cleaned text into another tool.

<path> may be:

  • a single file (PDF, Office document, image, plain text, etc.)
  • a directory (walked recursively, every text-extractable file is emitted)
  • a container archive (.zip, .7z, .tar.gz, .mbox) — each inner entry is emitted with the <archive>!/<inner-path> convention used by scan

Output goes to stdout by default, with one block per file:

===== /path/to/file.txt =====
<extracted text>

===== archive.zip!/inner/notes.docx =====
<extracted text>

===== /path/to/scan.pdf =====
[error: Extraction failed: encrypted PDF]

Files that fail extraction are still listed, with the error written in place of the body so the failure is visible in the output.

Options:

  • --no-ocr — Skip OCR on images and scanned PDFs. Speeds up extraction when you only care about text-bearing files.
  • --out <file>, -o <file> — Write output to <file> instead of stdout (truncates if it exists).
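
The block layout shown above is simple enough to split back into per-file records. This Python sketch is one way to do it, assuming each header line has exactly the "===== <path> =====" form and that failures start with "[error:" as in the sample. A document whose own text contains a line in that header form would be split incorrectly, so treat this as a debugging aid.

```python
import re
from typing import List, Tuple

HEADER = re.compile(r"^===== (.+) =====$")


def split_blocks(output: str) -> List[Tuple[str, str]]:
    """Split textextract output into (path, body) pairs."""
    blocks: List[Tuple[str, str]] = []
    path, body = None, []
    for line in output.splitlines():
        match = HEADER.match(line)
        if match:
            if path is not None:
                blocks.append((path, "\n".join(body).strip()))
            path, body = match.group(1), []
        elif path is not None:
            body.append(line)
    if path is not None:  # flush the final block
        blocks.append((path, "\n".join(body).strip()))
    return blocks


def failed_paths(blocks: List[Tuple[str, str]]) -> List[str]:
    """Return the paths whose body is an extraction error marker."""
    return [path for path, body in blocks if body.startswith("[error:")]
```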

findings — Triage scan findings (LLM-friendly)

piicrawler findings list   --scan <id> [--verdict <unreviewed|fp|tp|all>] [--pii-type <type>...] [--limit <n>] [--offset <n>] [--context <none|surrounding|full>] [--json]
piicrawler findings mark   --scan <id> --verdict <fp|tp> <--match <id>|--text <pii_type> <term>|--file <id>>
piicrawler findings mark   --scan <id> --from-json <path|->
piicrawler findings unmark --scan <id> --verdict <fp|tp> <--match <id>|--text <pii_type> <term>|--file <id>>
piicrawler findings stats  --scan <id> [--json]

Pulls scan findings as JSON and writes triage verdicts back into the local database. Designed for agentic LLM workflows: a model can fetch unreviewed matches with findings list --json, classify each one, and post the verdicts back via findings mark --from-json. The same scan_false_positives table is shared with the TUI's review mode and the web UI, so verdicts written from the CLI show up everywhere. For an end-to-end walkthrough that wires these commands to an LLM, see Agentic Triage.

Get the --scan ID from piicrawler scans.

findings list

Returns matches for a scan. Output is JSON when stdout is not a TTY (or when --json is passed) and a compact table otherwise. The default --verdict filter is unreviewed, which is the natural starting point for a triage loop; pass --verdict all to see everything.

Options:

  • --scan <id> — Required. Scan ID to read from.
  • --verdict <unreviewed|fp|tp|all> — Filter by verdict status. Defaults to unreviewed.
  • --pii-type <type> — Restrict to one or more PII types (e.g. --pii-type ssn --pii-type email). Repeatable.
  • --limit <n> / --offset <n> — Page through large result sets. --offset defaults to 0.
  • --context <none|surrounding|full> — Controls which context fields appear in JSON output. surrounding (default) returns the small snippet around the match (best signal-to-noise for an LLM); full adds the line and paragraph contexts; none omits the context key entirely for cheap counts.
  • --json — Force JSON output even on a TTY.
  • --redact — Apply format-preserving redaction to ssn and dob matches before emission. Replaces every digit in the term with a hash-derived digit (keeping dashes, slashes, and other separators), and rewrites every occurrence of those plaintext terms in context.surrounding / line / paragraph. The substitution table covers every distinct ssn/dob term recorded across the whole scan (not just the paginated subset), and each context string is additionally swept with the SSN/DOB detection regexes to catch dashed SSNs and date shapes that were never recorded as matches (placeholders the original scanner filtered, formats outside the anchored detector). The substitution is deterministic per database (a 32-byte secret is generated on first use and stored in kv_store), so the same input always maps to the same output and an LLM still sees "this value repeats N times" without reading the original digits. Other PII types pass through unchanged; file_path is not redacted. Intended for the case where you must send context to an external LLM endpoint; if you can run a local or private-cloud model, use that instead. See Agentic Triage → Redacting sensitive terms.

JSON shape (one object per match):

{
  "id": 42,
  "scan_id": 1,
  "file_id": 7,
  "file_path": "/data/hr/2025.csv",
  "pii_type": "ssn",
  "term": "123-45-6789",
  "start": 10,
  "end": 21,
  "context": { "surrounding": "...Employee SSN: 123-45-6789 filed on..." },
  "verdict": "unreviewed"
}

verdict is one of unreviewed, false_positive, or true_positive. With --context full the context object also includes line and paragraph. With --context none the context key is omitted.
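
A common first step in triage is spotting terms that repeat, since those are candidates for a single text-scope verdict. The Python sketch below groups match IDs by (pii_type, term) using the fields from the shape above. It assumes the JSON output is a single array of match objects; the helper name repeated_terms is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def repeated_terms(findings_json: str, min_count: int = 2) -> Dict[Tuple[str, str], List[int]]:
    """Map each (pii_type, term) seen at least min_count times to its match IDs."""
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for match in json.loads(findings_json):  # assumption: top-level JSON array
        groups[(match["pii_type"], match["term"])].append(match["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}
```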

findings mark and findings unmark

Writes (or removes) a verdict. Three scope selectors mirror the TUI's review-mode keys (f/t for the current match, F/T for the whole file, plus an implicit text scope when the same term repeats across files):

  • --match <id> — Apply to a single match by its ID (from findings list).
  • --text <pii_type> <term> — Apply to every match in the scan with the given (pii_type, term) pair, across all files. Use this when a term is unambiguously a false positive (e.g. an example SSN like 000-00-0000) regardless of where it appears.
  • --file <id> — Apply to every match in a file by its file ID. The TUI's F/T shortcut writes this scope.

Specificity wins on read: a match-scope verdict overrides text scope, which overrides file scope. So mark --text ssn 000-00-0000 --verdict fp followed by mark --match 42 --verdict tp leaves match 42 as a true positive while every other match of 000-00-0000 stays a false positive.
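
The precedence rule can be expressed as a small lookup. The Python sketch below illustrates the documented ordering (match, then text, then file); it does not describe how the database resolves verdicts internally, and the dictionary-based inputs are an assumption made for the example.

```python
from typing import Dict, Tuple


def effective_verdict(
    match: dict,
    match_verdicts: Dict[int, str],
    text_verdicts: Dict[Tuple[str, str], str],
    file_verdicts: Dict[int, str],
) -> str:
    """Pick the most specific verdict: match scope, then text, then file."""
    if match["id"] in match_verdicts:
        return match_verdicts[match["id"]]
    key = (match["pii_type"], match["term"])
    if key in text_verdicts:
        return text_verdicts[key]
    if match["file_id"] in file_verdicts:
        return file_verdicts[match["file_id"]]
    return "unreviewed"
```

With a text verdict of fp for ("ssn", "000-00-0000") and a match verdict of tp for match 42, match 42 resolves to tp and every other match of that term resolves to fp, mirroring the example in the paragraph above.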

unmark takes the same selectors and the same --verdict <fp|tp> flag — the FP and TP verdicts are stored as separate rows, so you must say which one you're clearing.

Bulk path (--from-json <path|->):

piicrawler findings mark --scan 1 --from-json verdicts.json
piicrawler findings mark --scan 1 --from-json -    # read JSON from stdin

The input is a JSON array of entries; each carries its own verdict and exactly one selector:

[
  { "match_id": 42, "verdict": "fp" },
  { "match_id": 43, "verdict": "tp" },
  { "text": { "pii_type": "ssn", "term": "000-00-0000" }, "verdict": "fp" },
  { "file_id": 7, "verdict": "tp" }
]

The whole batch applies in a single SQLite transaction. If any entry is malformed (or contains more than one selector) the CLI exits non-zero and no verdicts are written. --verdict on the command line is ignored for --from-json; only the per-entry verdict matters.
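
Since one malformed entry rejects the whole batch, it can be worth checking a verdicts file before submitting it. The Python sketch below applies the rules stated above (exactly one selector, verdict of fp or tp). The CLI may enforce further rules that are not documented here, and the per-verdict tally it returns counts input entries, which is not necessarily the "marked" number the CLI reports.

```python
SELECTORS = ("match_id", "text", "file_id")
VERDICTS = ("fp", "tp")


def validate_entries(entries: list) -> dict:
    """Raise ValueError on the first malformed entry; return entry counts per verdict."""
    counts = {"fp": 0, "tp": 0}
    for i, entry in enumerate(entries):
        present = [key for key in SELECTORS if key in entry]
        if len(present) != 1:
            raise ValueError(f"entry {i}: expected exactly one selector, got {present}")
        verdict = entry.get("verdict")
        if verdict not in VERDICTS:
            raise ValueError(f"entry {i}: verdict must be 'fp' or 'tp', got {verdict!r}")
        if present[0] == "text" and not {"pii_type", "term"} <= set(entry["text"]):
            raise ValueError(f"entry {i}: text selector needs pii_type and term")
        counts[verdict] += 1
    return counts
```

A typical use is validate_entries(json.load(open("verdicts.json"))) right before running findings mark --from-json.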

Output (single mark):

{ "marked": 1, "verdict": "fp" }

Output (bulk):

{ "marked": 4, "by_verdict": { "fp": 3, "tp": 1 } }

findings stats

Tallies verdicts for a scan. Useful for an LLM to check "how many unreviewed remain" before/after a triage pass.

Options:

  • --scan <id> — Required. Scan ID.
  • --json — Force JSON output.

JSON shape:

{
  "scan_id": 1,
  "totals": { "unreviewed": 120, "false_positive": 32, "true_positive": 8 },
  "by_pii_type": [
    { "pii_type": "ssn",   "unreviewed": 50, "false_positive": 10, "true_positive": 5 },
    { "pii_type": "email", "unreviewed": 70, "false_positive": 22, "true_positive": 3 }
  ]
}
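
A triage script can turn the totals object into a progress figure. This Python sketch computes the reviewed fraction from the shape above; the helper name review_progress is hypothetical.

```python
import json


def review_progress(stats_json: str) -> float:
    """Return the fraction of findings that already have a verdict (0.0 to 1.0)."""
    totals = json.loads(stats_json)["totals"]
    reviewed = totals["false_positive"] + totals["true_positive"]
    total = reviewed + totals["unreviewed"]
    return reviewed / total if total else 1.0  # an empty scan counts as fully reviewed
```

For the sample above (120 unreviewed, 32 false positives, 8 true positives) the result is 40 / 160 = 0.25.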

Example: agentic triage loop

# 1. Pull unreviewed findings as JSON
piicrawler findings list --scan 42 --json --limit 50 > batch.json

# 2. Have your LLM classify each one and emit verdicts.json
#    (e.g. [{"match_id": 101, "verdict": "fp"}, ...])

# 3. Apply the verdicts atomically
piicrawler findings mark --scan 42 --from-json verdicts.json

# 4. Check progress
piicrawler findings stats --scan 42

register <email> — Register this installation

piicrawler register <email> [--timeout <secs>] [--force]

Registers this machine so the licensed commands (scan, watch, and the web UI's scans) will run. It performs the same two-step flow as the TUI and web UI, but headlessly, which makes it the right choice for servers, containers, and CI where no browser or interactive terminal is available.

  1. A verification link is emailed to <email>.
  2. The command then waits, polling until you click that link, and on success stores the signed license and registered email in the local database (~/.piicrawler/). Nothing is written until verification completes.

piicrawler register <email>

The signed license is tied to this machine's OS and architecture, so run register on each machine you install on (the same email can register multiple machines).

Options:

  • --timeout <secs> — How long to wait for you to click the verification link before giving up. Default: 600 (10 minutes). On timeout the command exits with status 1 and no license is stored; click the link and run register again to resume.
  • --force — Register again even when a valid license is already present. Without it, register detects an existing valid license, prints the masked registered email, and exits 0 without contacting the network.

Progress and prompts are written to stderr; the final success line is written to stdout. An obviously malformed email is rejected immediately, before any network call.

update — In-place binary upgrade

piicrawler update [--yes] [--force]

Checks downloads.eligian.com for the latest build for your platform, compares it to the running binary, and (if newer) downloads the matching archive, verifies its SHA-256, extracts the binary, and atomically swaps it into place. Your database, license, terms lists, and triage verdicts in ~/.piicrawler/ are never touched.

Options:

  • --yes, -y — Skip the interactive [y/N] confirmation. Useful for scripted upgrades.
  • --force, -f — Reinstall even when the local build is already at or newer than the published one.

Behaviour by platform:

  • Linux and macOS swap the binary atomically (rename(2)). Any already-running piicrawler processes keep using the old binary until they exit; new invocations pick up the new build.
  • Windows cannot overwrite a running .exe, so the live binary is renamed to piicrawler.exe.old next to the original and the new bytes are written at the original path. The .old file can be deleted once no piicrawler.exe processes remain. update always pulls the Azure-Trusted-Signing-signed piicrawler-cli-windows-signed.zip, so the binary you end up with after an upgrade is signed by the same publisher as your initial install.

If your platform or architecture is not currently published (e.g. Linux ARM), update exits with a friendly error pointing you to the download page.
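
If you download an archive by hand instead of using update, the same kind of SHA-256 check can be done with a short Python sketch. This shows the generic technique, not the updater's own code, and where the expected digest is published is not covered in this section, so the expected_hex argument is something you supply.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's digest to an expected hex string, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```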

version — Print version

piicrawler version
piicrawler -V
piicrawler --version

Prints the running build's version string (e.g. 26.0507.1432) to stdout and exits. Stdout-only output keeps it pipe-friendly for shell scripts that need to read the version.
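
A script that needs to compare two builds can parse the version string into a tuple. The Python sketch below assumes the string is always dot-separated integers, as in the example above, and that a later build compares greater component by component. Neither point is guaranteed by this page.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string like "26.0507.1432" into (26, 507, 1432) for comparison."""
    return tuple(int(part) for part in text.strip().split("."))
```

Tuples compare element by element, so parse_version("26.0525.0433") > parse_version("26.0507.1432") evaluates to True.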

doctor — Check the installation

piicrawler doctor

Runs a quick health check and prints the result, one line per item. It is the fastest way to answer "is this set up correctly?" before a scan, or to gather facts for a support request. It reports:

  • Version and how long ago the binary was built.
  • Database location and the number of saved scans.
  • Registration status, including the masked email if registered. An unregistered install shows a red ✗ and the exact piicrawler register command to fix it.
  • OCR availability (the text-detection and text-recognition models are built into the binary, so this is always present).

  PII Crawler doctor

  ✓  Version        26.0525.0433  (built 2h ago)
  ✓  Database       ~/.piicrawler/piicrawler.db  (4 saved scans)
  ✓  Registration   registered as ja**@example.com
  ✓  OCR            built in (reads text from images and scanned PDFs)

  Everything looks good. Try:  piicrawler ~/Documents

The closing line is contextual. When everything passes and you have not saved any scans yet, it points you at piicrawler demo; once you have saved scans, it suggests scanning a real folder. If any check fails it tells you to resolve the items marked ✗ and re-run.

doctor is diagnostic, so it always exits 0; problems show as a red ✗ in the output rather than a non-zero status. Color follows the --color setting.
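
Because the exit status is always 0, a script that wants to gate on doctor has to read the output. The Python sketch below collects the names of failing checks by looking for lines that start with ✗. It assumes the line layout matches the sample above and strips ANSI color codes in case color is enabled; the exact wording of a failing line is not shown on this page.

```python
import re
from typing import List

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # SGR color sequences


def failed_checks(doctor_output: str) -> List[str]:
    """Return the first word after each ✗ marker, e.g. ["Registration"]."""
    failures: List[str] = []
    for raw in doctor_output.splitlines():
        line = ANSI.sub("", raw).strip()
        if line.startswith("✗"):
            parts = line[1:].split()
            failures.append(parts[0] if parts else "")
    return failures
```

A CI step could then fail its own job when failed_checks(...) returns a non-empty list.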

completions <shell> — Shell completion script

piicrawler completions bash|zsh|fish|powershell|elvish

Prints a tab-completion script for the named shell to stdout. The script is generated from the live command definition, so it always covers the current subcommands, flags, and value choices. Pipe or redirect it into the location your shell loads completions from. The accepted shells are bash, zsh, fish, powershell, and elvish.

Typical installation:

# Bash (user-local; create the directory once if needed)
piicrawler completions bash > ~/.local/share/bash-completion/completions/piicrawler

# Zsh (anywhere on your $fpath, e.g. a personal completions dir)
piicrawler completions zsh > ~/.zfunc/_piicrawler

# Fish
piicrawler completions fish > ~/.config/fish/completions/piicrawler.fish

# PowerShell (append to your profile)
piicrawler completions powershell >> $PROFILE

After installing, restart the shell (or re-source the file) so completions load. Because the script is regenerated from the binary, re-run the command after upgrading to pick up new flags.

help

piicrawler help
piicrawler -h
piicrawler --help

Prints the built-in usage summary to stdout. Pass a subcommand name (e.g. piicrawler help scan) for command-specific help.

Output

On an interactive terminal, piicrawler <path> ends with the report card on stderr and writes nothing to stdout. When stdout is piped or redirected (or you pass --json), it instead prints a JSON document to stdout with one entry per scanned file. Each entry has the shape:

{
  "file_path": "/absolute/path/to/file.pdf",
  "findings": [ ... ],
  "full_names": [ ... ],
  "char_count": 12345,
  "error": null
}

If extraction fails for a file, error is set to a short message and findings is empty. Container scans return the same shape, one entry per archive member.
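
The per-file entries are easy to post-process. This Python sketch separates files with findings from files that failed extraction, using only the file_path, findings, and error fields shown above. It assumes the document is a top-level JSON array, and the helper name summarize is hypothetical.

```python
import json
from typing import Dict, List, Tuple


def summarize(scan_json: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (paths with at least one finding, {path: error message})."""
    entries = json.loads(scan_json)  # assumption: top-level array of entries
    with_findings = [e["file_path"] for e in entries if e["findings"]]
    errors = {e["file_path"]: e["error"] for e in entries if e["error"]}
    return with_findings, errors
```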

See PII Data Types for the structure of individual findings and Results Storage for the database schema used by serve, watch, and the TUI.

Environment variables

PII Crawler reads a small set of environment variables on startup. None of them are required for normal use; they're escape hatches for daemonized, headless, or noisy deployments.

Logging

  • PIICRAWLER_LOG_FILE — If set to a non-empty path, structured logs are appended to the given file in addition to being shown in the TUI Logs view. Useful when running watch or serve as a long-lived daemon. Failure to open the file is logged and the binary continues without file logging.
  • PIICRAWLER_LOG_FILTER — Accepts any value parseable by log::LevelFilter: off, error, warn, info, debug, trace. Defaults to info. Set to debug or trace to surface SMB protocol traces during a network-share scan. The --log-level global flag overrides this when both are set.

Credential store (headless / CI)

PII Crawler keeps SMB credentials encrypted under a credential password — see Scan an SMB Network Share → How credentials are protected. For headless deployments where you can't type the password interactively, set one of:

  • PIICRAWLER_CRED_PASSWORD — Auto-unlock the credential store with this password the first time anything in the session touches a stored credential. The password is never written back to disk.
  • PIICRAWLER_CRED_KEY_BASE64 — Inject a 32-byte base64-encoded data-encryption key (DEK) directly. Skips the password prompt entirely. Intended for CI tests against an isolated database where the DEK is managed externally; do not use in production.

If both are set, PIICRAWLER_CRED_KEY_BASE64 wins. If neither is set, the credential store stays locked until the user unlocks it via the TUI, the web UI, or POST /api/cred/unlock.
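
For an isolated CI database, a value of the right size for PIICRAWLER_CRED_KEY_BASE64 can be produced with Python's standard library. The sketch below generates 32 random bytes and encodes them. It assumes the standard base64 alphabet with padding is what the binary expects, which this page does not specify. A freshly generated key cannot unlock credentials that were stored under a different key.

```python
import base64
import secrets


def new_cred_key_base64() -> str:
    """Return 32 cryptographically random bytes as a base64 string (44 characters)."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")
```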

Crash reporting

  • SENTRY_DSN — Overrides the built-in Sentry DSN that receives unhandled-exception reports (see Security → Error Reporting). Crash reports are the only way we learn about crashing bugs in the wild, so please leave reporting enabled if you can. If you are in an air-gapped environment or want to disable all outbound calls, set it to an empty string (SENTRY_DSN="") to turn error reporting off entirely. Reports never include scan results, file paths, or PII.

Examples

Scan a directory and save the findings to a file:

piicrawler ~/Downloads --workers 8 > findings.json

Stream a large directory scan to a JSONL file (results are appended as each file finishes, so memory stays flat):

piicrawler /srv/shared --workers 8 --out findings.jsonl
jq -c 'select(.findings | length > 0)' findings.jsonl

Export findings as CSV for spreadsheet review (auto-detected from the .csv extension):

piicrawler scan ~/share --workers 8 --out report.csv

Force CSV format when the output file has a non-standard extension:

piicrawler scan ~/share --out report.txt --format csv

Scan a single archive without OCR and pipe to jq:

piicrawler backups/2026-04.zip --no-ocr --quiet | jq '.[] | select(.findings | length > 0)'

Scan only for card numbers and SSNs, skipping the noisier detectors:

piicrawler scan ~/share --only ssn,credit-card

Get a quick per-type count for triage or a CI log instead of the full JSON dump, and break the build if anything turns up:

piicrawler scan ~/share --summary --fail-on-findings

Enable the New Zealand and Australian region detectors (off by default) by turning on the full detector set:

piicrawler scan ~/share --all

Scan a source tree for spreadsheets and PDFs only, skipping dependency and VCS directories and any file over 25 MB:

piicrawler scan ~/project \
  --exclude node_modules --exclude '\.git/' \
  --ext csv,xlsx,pdf \
  --max-size 25

Add custom detectors: an employee-ID regex, a keyword list, and US tax-form detection:

piicrawler scan ~/share \
  --regex 'empid=EMP-\d{6}' \
  --terms-file ./medical-terms.txt \
  --detect-forms

Save a scan to the database, then generate a report and triage findings from its ID:

SCAN_ID=$(piicrawler scan ~/share --save --name "Q3 audit" --quiet)
piicrawler report "$SCAN_ID"
piicrawler findings list --scan "$SCAN_ID" --json

Scan an authenticated SMB share for spreadsheets and PDFs only, capping bandwidth, and report on it (the password is read from the environment so it stays out of shell history):

export PIICRAWLER_SMB_PASSWORD='…'
SCAN_ID=$(piicrawler smb fileserver Finance -u alice --domain CORP \
  --subfolder HR/2025 --ext csv,xlsx,pdf --bandwidth-mbps 50 --quiet)
piicrawler report "$SCAN_ID"

Watch two directories with a webhook and a policy file:

piicrawler watch /srv/uploads /srv/exports \
  --webhook https://alerts.example.com/piicrawler \
  --policy ./policies.toml \
  --debounce 1000

Fail a CI job if any PII is found for a given person:

piicrawler dsar "Jane Doe" --assert-clean

List the recorded scans to find one to report on:

piicrawler scans

Generate an HTML report for scan ID 42:

piicrawler report 42