Watch Mode & Policies

Last updated April 2026

piicrawler watch is the daemon mode of PII Crawler. It monitors one or more directories for file changes, scans new or modified files for PII, evaluates each finding against a set of policies, and dispatches violations to JSON stdout, a webhook, and the local database.

This page is the complete reference for the policy file format, the violation payload schema, and the operational behaviour of the daemon. For the CLI flags themselves see piicrawler watch.

How the pipeline works

Each cycle the daemon:

  1. Polls every watched directory for file changes (created, modified, removed). Polling — not OS-native inotify/FSEvents — is used so the same code path works identically across Linux, macOS, and Windows.
  2. For each created or modified file, extracts text using the same extraction stack as a one-shot scan (PDF, Office, archive, image OCR, etc.). Symlinks and offline cloud-only files are skipped.
  3. Scans the extracted text for PII.
  4. Evaluates every finding against the loaded policies. A finding becomes a violation when it matches a policy.
  5. Dispatches the violations: writes them to the watch_violations table, emits them on stdout as JSON (unless --no-json), and POSTs them to the webhook (if --webhook was given).

The daemon runs until you interrupt it with Ctrl+C.
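The created/modified/removed diff in step 1 can be sketched as follows. This is an illustration of the semantics only, not the actual implementation; per the Operational notes below, the daemon compares (size, mtime) snapshots of the watched tree.

```python
def diff_snapshots(prev, curr):
    """Diff two {path: (size, mtime)} snapshots into file events."""
    created = [p for p in curr if p not in prev]
    removed = [p for p in prev if p not in curr]
    modified = [p for p in curr if p in prev and curr[p] != prev[p]]
    return created, modified, removed

prev = {"/srv/public/a.txt": (10, 100), "/srv/public/b.txt": (20, 100)}
curr = {"/srv/public/a.txt": (12, 105), "/srv/public/c.txt": (5, 106)}

created, modified, removed = diff_snapshots(prev, curr)
```

Created and modified files proceed to extraction and scanning; removed files are detected but never scanned.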

The policy file

--policy <file> loads alert policies from a TOML-style config file. Each policy describes a condition (which PII types in which paths) plus metadata (an action label and a severity label) that is attached to every matching violation.

Example

# policies.toml — drop in any path you like; the file is just read and parsed.

[[policy]]
name = "no-ssn-in-public-shares"
pii_type = "ssn"
path_pattern = "/srv/public/.*"
action = "deny"
severity = "critical"

[[policy]]
name = "any-credit-card"
pii_type = "credit-card"
action = "alert"
severity = "high"

[[policy]]
# Catch-all: alert on anything PII-shaped landing in /srv/uploads/
name = "uploads-anything"
path_pattern = "/srv/uploads/.*"
severity = "medium"

Load it on startup:

piicrawler watch /srv/public /srv/uploads --policy ./policies.toml

Schema

Each [[policy]] table accepts the following keys:

  • name (string, required). Identifier shown in alerts and stored with each violation row. Tables without a name are silently skipped.
  • pii_type (string, optional; default: any). Restrict the policy to a single PII type slug (see PII type slugs below). Compared case-insensitively. Omit to match every PII type.
  • path_pattern (regex, optional; default: any). Restrict the policy to files whose path matches this Rust regex. The match uses is_match semantics: it succeeds if the regex matches anywhere in the path, so /public/.* matches /srv/public/data.txt. Omit to match every path. An invalid regex is skipped with a warning at load time, and the resulting policy then effectively matches every path.
  • action (string, optional; default: "deny"). Free-form label attached to each violation. Pass-through only; see Actions and severity are labels.
  • severity (string, optional; default: "high"). Free-form severity label attached to each violation. Pass-through only.
  • max_risk (string, optional). Currently parsed but not evaluated; reserved for a future risk-score filter. Setting it has no effect today.

Keys are read line-by-line; values may be quoted (name = "no-ssn") or bare (name = no-ssn). Lines starting with # are comments. Blank lines are allowed.
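Under those rules, a minimal reader for the format might look like this. A sketch only, not the actual loader; it shows how line-by-line parsing, quoted vs. bare values, comments, and blank lines behave:

```python
def parse_policies(text):
    """Parse the line-oriented [[policy]] format: one table per
    [[policy]] header, key = value lines, optional double quotes,
    # comments, blank lines allowed."""
    policies, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line == "[[policy]]":
            current = {}                  # start a new policy table
            policies.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip('"')
    return policies

policies = parse_policies('''
# quoted and bare values are equivalent
[[policy]]
name = "no-ssn-in-public-shares"
pii_type = ssn
severity = "critical"
''')
```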

A finding becomes a violation for a given policy when both of the following hold:

  • The finding's PII type equals pii_type (case-insensitive), or pii_type is omitted.
  • The file's path matches path_pattern, or path_pattern is omitted.

A single finding can violate multiple policies — each match produces its own violation row and its own alert.
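The two conditions can be expressed as a small predicate. This is a sketch of the matching semantics, using Python's re.search as a stand-in for the Rust regex is_match behaviour described above:

```python
import re

def matches(policy, pii_type, path):
    """True when a finding violates a policy: both conditions must
    hold, and an omitted key matches everything."""
    type_ok = ("pii_type" not in policy
               or policy["pii_type"].lower() == pii_type.lower())
    path_ok = ("path_pattern" not in policy
               or re.search(policy["path_pattern"], path) is not None)
    return type_ok and path_ok

# Case-insensitive type match plus match-anywhere path semantics:
violates = matches({"pii_type": "SSN", "path_pattern": "/public/.*"},
                   "ssn", "/srv/public/data.txt")
```

A finding is then checked against every loaded policy, which is why one finding can produce several violations.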

PII type slugs

pii_type accepts any of the built-in slugs:

Slug What it detects
ssn U.S. Social Security Number
credit-card Credit card number (Luhn-checked)
aws-credential AWS access keys / secret keys
passport U.S. passport number
ein Employer ID Number
drivers-license Driver's license number
dob Date of birth
address Street address
phone Phone number
city_state_zip City / state / ZIP cluster
email Email address
name Full name

It also accepts the dynamic slugs PII Crawler generates for user-defined detectors:

The slug is the rule's name, lowercased and dashed, prefixed by the rule kind. For example, a custom regex rule called "Account Number" produces the slug regex-account-number.
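A sketch of that derivation for regex rules. The exact normalization of other punctuation is not specified here, so treat this as illustrative:

```python
def custom_slug(rule_name):
    """Derive the dynamic slug for a custom regex rule:
    lowercase the name, replace spaces with dashes, prefix the kind."""
    return "regex-" + rule_name.lower().replace(" ", "-")

slug = custom_slug("Account Number")
```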

Actions and severity are labels

action and severity are emitted to alerts and stored with each violation, but PII Crawler itself does not act on them. Setting action = "deny" does not block, quarantine, or modify the file — the file is left exactly where it is. Both fields exist so a downstream system (your webhook receiver, SIEM, ticketing pipeline, etc.) can decide what to do based on the label.

If you want the daemon to actually move or delete a file when PII is found, do it on the receiving end of your webhook.
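For instance, a receiver could quarantine every file whose violations carry action = "deny". The sketch below shows only the decision step; the HTTP wiring and the actual move are up to you, and all names here are hypothetical:

```python
import json

def files_to_quarantine(body):
    """Given the JSON body of one webhook POST, return the files whose
    violations carry the action label "deny" (deduplicated, order kept)."""
    payload = json.loads(body)
    files = []
    for v in payload["violations"]:
        if v["action"] == "deny" and v["file"] not in files:
            files.append(v["file"])
    return files

body = ('{"violations":[{"event":"policy_violation",'
        '"policy":"no-ssn-in-public-shares",'
        '"file":"/srv/public/handover/employees.csv",'
        '"pii_type":"ssn","term":"***-**-6789",'
        '"severity":"critical","action":"deny"}]}')
targets = files_to_quarantine(body)
```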

Reloading policies

--policy <file> is upsert-by-name: each [[policy]] table is inserted if its name is new, or its fields are updated in place if a policy with that name already exists. Re-running the daemon with the same file is therefore idempotent — no duplicate alerts.

Policies that you remove from the file are not deleted from the database automatically. If you rename or drop a policy and want the old row gone, delete it explicitly with your SQLite client of choice. The database lives at ~/.piicrawler/piicrawler.db (see Results Storage):

sqlite3 ~/.piicrawler/piicrawler.db "DELETE FROM watch_policies WHERE name = 'old-name';"

Policies are stored in the local database and shared across runs, but the TUI and Web UI do not currently expose a policy editor — --policy <file> is the only way to load or change them.

Webhook payload

When --webhook <url> is set, the daemon sends each batch of violations from a single file event as one HTTP request:

POST <url>
Content-Type: application/json

{"violations":[
  {
    "event": "policy_violation",
    "policy": "no-ssn-in-public-shares",
    "file": "/srv/public/handover/employees.csv",
    "pii_type": "ssn",
    "term": "***-**-6789",
    "severity": "critical",
    "action": "deny"
  }
]}

Field reference

Field Source
event Always the literal string "policy_violation".
policy The policy's name.
file Absolute path of the changed file.
pii_type The detector slug (e.g. ssn, email, regex-account-number).
term The matched text, masked for transport (e.g. ***-**-6789, jo*****e@ex***le.com). The unmasked term is never sent over the wire.
severity The policy's severity label, verbatim.
action The policy's action label, verbatim.
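The exact masking rules are not documented here and may differ per PII type (compare the SSN and email examples above). A sketch that reproduces the SSN example, assuming a keep-the-last-four scheme:

```python
import re

def mask_term(term, keep=4):
    """Star out alphanumerics except the last `keep` characters,
    preserving separators. Illustrative only; the real masking
    rules are per-PII-type and not specified here."""
    head, tail = term[:-keep], term[-keep:]
    return re.sub(r"[0-9A-Za-z]", "*", head) + tail

masked = mask_term("123-45-6789")
```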

Batching

One POST is made per file event. If a single file produces N policy violations (e.g. it contains both an SSN and a credit card and you have policies for both), all N appear in the same request's violations array. Files that produce zero violations result in no request at all.
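The body-building rule can be summarised in a few lines (hypothetical builder, not the daemon's code): all of one file event's violations go into one body, and an empty batch produces no request.

```python
import json

def webhook_body(violations):
    """Build the single POST body for one file event; None means
    no request is made at all."""
    if not violations:
        return None
    return json.dumps({"violations": violations})

body = webhook_body([
    {"event": "policy_violation", "policy": "any-credit-card",
     "file": "/srv/uploads/orders.xlsx", "pii_type": "credit-card",
     "term": "************1111", "severity": "high", "action": "alert"},
])
```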

Delivery semantics

  • The webhook is fire-and-forget. There is no retry on failure: a non-2xx response or a connection error is logged at error level and the violations are dropped from the webhook stream. They are still recorded in watch_violations and (if enabled) printed to stdout.
  • The request is synchronous — the daemon blocks on the POST before processing the next file event. A slow webhook will throttle scanning. Run your receiver behind a fast queue if alert volume could be high.
  • There is no signing header or shared secret. If the receiver is reachable from anywhere other than localhost, terminate it behind a reverse proxy that enforces auth.

JSON stdout stream

Unless you pass --no-json, every violation is also written to stdout as a single line of JSON, in the same shape as one element of the webhook violations array:

{"event":"policy_violation","policy":"no-ssn-in-public-shares","file":"/srv/public/handover/employees.csv","pii_type":"ssn","term":"***-**-6789","severity":"critical","action":"deny"}

This is JSONL (one violation per line) — pipe it through jq -c or into a log shipper like Vector or Filebeat. As with the webhook payload, the term field is masked.

Progress and operational logs are written to stderr, so stdout stays clean for piping:

piicrawler watch /srv/uploads --policy ./policies.toml > violations.jsonl 2> watch.log
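A captured file like violations.jsonl is then trivial to post-process; for example, pulling out the files with critical violations (a sketch, reading the stream as text):

```python
import json

def critical_files(jsonl_text):
    """One JSON object per line; return files whose violation
    carries severity == "critical"."""
    out = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        v = json.loads(line)
        if v["severity"] == "critical":
            out.append(v["file"])
    return out

stream = (
    '{"event":"policy_violation","policy":"p1","file":"/a",'
    '"pii_type":"ssn","term":"***-**-6789","severity":"critical","action":"deny"}\n'
    '{"event":"policy_violation","policy":"p2","file":"/b",'
    '"pii_type":"email","term":"j**e@x.com","severity":"medium","action":"alert"}'
)
hits = critical_files(stream)
```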

Database persistence

Independent of stdout and the webhook, every violation is inserted into the watch_violations table in the local database, with the unmasked term and a foreign key to the policy that fired:

CREATE TABLE watch_violations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    policy_id INTEGER NOT NULL REFERENCES watch_policies(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    severity TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

The unmasked term is kept locally so triage and DSAR workflows have the underlying value, but it is never transmitted off the host. Schema details live in Results Storage.
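Any SQLite client can query the table for triage. A Python sketch against an in-memory copy of the schema above (the parent watch_policies table is simplified here to just id and name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watch_policies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE watch_violations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    policy_id INTEGER NOT NULL REFERENCES watch_policies(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    pii_type TEXT NOT NULL,
    term TEXT NOT NULL,
    severity TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
""")
conn.execute("INSERT INTO watch_policies (id, name) "
             "VALUES (1, 'no-ssn-in-public-shares')")
conn.execute("INSERT INTO watch_violations "
             "(policy_id, file_path, pii_type, term, severity) VALUES "
             "(1, '/srv/public/handover/employees.csv', 'ssn', "
             "'123-45-6789', 'critical')")
# Join back to the policy that fired:
rows = conn.execute(
    "SELECT p.name, v.file_path FROM watch_violations v "
    "JOIN watch_policies p ON p.id = v.policy_id "
    "WHERE v.severity = 'critical'").fetchall()
```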

Operational notes

  • Polling, not OS events. The daemon walks every watched directory each cycle and diffs against the previous snapshot on (size, mtime). This keeps behaviour identical across platforms but means CPU cost grows with the size of the watched tree.
  • Debounce floor. --debounce <ms> sets the polling interval. Values below 500 are clamped to 500ms, so --debounce 100 and --debounce 500 behave identically. Defaults to 500.
  • Symlinks are skipped. This prevents traversal escape from the watched root.
  • Offline cloud files are skipped. Files marked as cloud-only / placeholder by macOS iCloud, OneDrive, Dropbox, etc. are not downloaded by the daemon.
  • Unsupported file types are skipped. Only files matching PII Crawler's supported file types are extracted and scanned; everything else short-circuits before any work is done.
  • Removed files don't trigger scans. Deletions are detected and emitted internally but are not surfaced as violations — there is no PII to find.
  • Each scan uses the same extraction timeout as one-shot mode. A single pathological file can stall its file-event slot but will not block other events.

Examples

Dry-run on stdout, no webhook, no policies

Useful for verifying file events are firing as you expect:

piicrawler watch /srv/uploads

With no --policy, no policies are loaded, so no violations are produced and stdout will be quiet. Operational logs ("Watching directory", "PII found in changed file") still go to stderr.

Single-policy alert to a webhook, no JSON stream

piicrawler watch /srv/uploads \
  --policy ./policies.toml \
  --webhook https://alerts.example.com/piicrawler \
  --no-json

Catch-all policy, all PII types, only in a sensitive subtree

[[policy]]
name = "anything-in-finance"
path_pattern = "/srv/shared/finance/.*"
severity = "high"

CI-style: fail loudly on first violation

watch itself does not exit on a violation — it's a daemon. For "fail the build if PII is present" use piicrawler dsar --assert-clean or run a one-shot scan and check the output.
