Tax Form Detection

Last updated April 2026

Tax form detection is an opt-in scan feature that flags documents containing well-known U.S. tax forms. When enabled, PII Crawler emits a US Tax Form finding (one per detected form) on top of the usual PII findings, so reviewers can spot tax records mixed in with other documents during a sweep of a shared drive or mailbox.

Why this is off by default

The detector extracts a feature set of word bigrams from every scanned file and compares it against the bundled form templates. That extra work is small per file but adds up across a directory of millions of files, and most scans are not tax-related. Form detection is therefore a per-scan toggle that defaults to off. Turn it on for HR drives, accounting share folders, or audits ahead of tax season, and leave it off everywhere else.

What gets detected

The detector ships with the following IRS templates baked into the binary, so the feature works fully offline:

  • Form 1040 (U.S. Individual Income Tax Return)
  • Form 1040 Schedules 1, 2, and 3
  • Form W-4 (Employee's Withholding Certificate)
  • Form W-9 (Request for Taxpayer Identification Number)

Multiple pages of the same form collapse to a single finding, so a scanned multi-page 1040 is reported once as IRS Form 1040.
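
As a rough illustration of the collapsing behavior described above, the sketch below groups per-page matches by document and form and keeps one finding per pair. The PageMatch type, its fields, and the collapse function are hypothetical names chosen for this example; they are not PII Crawler's internal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageMatch:
    """Hypothetical per-page match record (not a PII Crawler type)."""
    path: str   # document the page belongs to
    page: int   # 1-based page number
    form: str   # detected form name, e.g. "IRS Form 1040"


def collapse(page_matches: list[PageMatch]) -> list[tuple[str, str]]:
    """Return one (path, form) finding per document and form, in first-seen order."""
    seen: dict[tuple[str, str], PageMatch] = {}
    for match in page_matches:
        # setdefault keeps the first page that matched and ignores later ones
        seen.setdefault((match.path, match.form), match)
    return list(seen.keys())
```

Under these assumptions, a three-page scan in which two pages match Form 1040 and one matches Form W-9 yields two findings for that file.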

How it works

Each bundled template is reduced to a feature set of lowercased word bigrams plus any colon-suffixed labels, which usually correspond to fillable form fields such as "name:" or "address:". For each document, PII Crawler computes the containment ratio (the fraction of the template's features that are also present in the document) and reports a match when that ratio crosses 0.6.
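
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not PII Crawler's actual implementation: the tokenization rules, the function names (features, containment, is_match), and the choice to treat a ratio of exactly 0.6 as a match are all assumptions made for this example.

```python
import re

THRESHOLD = 0.6  # match threshold from the description; ">=" is an assumption


def features(text: str) -> set[str]:
    """Build a feature set of lowercased word bigrams plus colon-suffixed labels."""
    lowered = text.lower()
    # Colon-suffixed labels such as "name:" (assumed: whitespace-delimited tokens)
    labels = {tok for tok in lowered.split() if len(tok) > 1 and tok.endswith(":")}
    # Word bigrams over alphanumeric runs (assumed tokenization)
    words = re.findall(r"[a-z0-9]+", lowered)
    bigrams = {f"{a} {b}" for a, b in zip(words, words[1:])}
    return bigrams | labels


def containment(template: set[str], document: set[str]) -> float:
    """Fraction of the template's features that also appear in the document."""
    if not template:
        return 0.0
    return len(template & document) / len(template)


def is_match(template_text: str, document_text: str) -> bool:
    """Report a match when the containment ratio reaches the threshold."""
    ratio = containment(features(template_text), features(document_text))
    return ratio >= THRESHOLD
```

Containment is asymmetric by design: it is normalized by the template's size rather than the document's, so a long document that embeds a complete form can still score highly even though most of its own text is unrelated to the template.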

Enabling it

Tax form detection lives in the Detection section of the new-scan form, alongside the PII type picker, terms lists, and proximity groups. Turn on the "Tax form detection is enabled" toggle before starting the scan. Findings show up in the matches view under the US Tax Form type (slug us-tax-forms) and contribute to the scan's risk-weighted summary.
