DSAR Walkthrough

Last updated April 2026

A Data Subject Access Request (DSAR) is the legal mechanism behind GDPR Article 15 and CCPA's Right to Know: a person asks "what data do you hold on me?" and you have to produce an answer, usually within 30–45 days. The hard part isn't the legal text; it's the actual mechanics of finding every file, mailbox, and shared drive that mentions one specific human and reporting it back.

This guide walks through the full workflow end to end. By the end you'll have:

  1. A scan that covers every place the data could be
  2. An identity index that links every finding to the person it belongs to
  3. An HTML report you can hand to legal or attach to the response

The model

PII Crawler treats DSARs as a read over the scan database, not as a separate scan. The flow is:

   one or more         identity            piicrawler         HTML
   regular scans  →    scan(s)        →    dsar           →   report
   (find PII)          (link PII to        (filter by
                       names)              person)

Three reasons to keep these as separate phases:

  • Scans are expensive, queries are cheap. A 5 TB share takes hours; a DSAR query takes seconds. You don't want to re-scan for every request.
  • Identity association is reversible. You can re-run the linker with different settings (different distance, different scope) without rescanning files.
  • One scan satisfies many DSARs. Person A asks today, Person B asks next month — both are answered from the same data.

Step 1 — Scan the data

If you don't already have scan coverage of the systems in scope, run those scans first. Coverage is the most common gap on real DSARs: you remember the file server but forget the inboxes, or the production database export but forget the backups. Common targets:

  • Network shares (piicrawler TUI / web UI → New Network Share Scan)
  • Email exports (piicrawler scan ~/mailbox-takeout)
  • Local file trees (piicrawler scan /srv/finance)
  • Database flat-files (SQLite/Access; see Supported File Types)

You don't need to know whose data is in there yet. The point of this phase is just to surface every piece of PII in scope. See Quickstart for the basics and Scan an SMB Network Share for network coverage.

A practical tip: enable detection for every PII type you might need to report on. Reports later in the workflow surface only the types that the underlying scan was configured to find — turning a type back on after the fact requires a rescan.
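One way to catch coverage gaps early is to keep a written list of in-scope systems and compare it with what has actually been scanned. The sketch below is only an illustration: the system names and the `scanned` set are hypothetical and maintained by hand, not read from PII Crawler.

```python
# Hypothetical inventory: every system that could hold the subject's data.
IN_SCOPE = {
    "hr-file-server",
    "support-mailbox-export",
    "customer-db-dump",
    "backups",
}

def coverage_gaps(in_scope, scanned):
    """Return in-scope systems that have no scan yet, sorted for stable output."""
    return sorted(set(in_scope) - set(scanned))

# Hypothetical record of what has been scanned so far (maintained by hand).
scanned = {"hr-file-server", "customer-db-dump"}

for system in coverage_gaps(IN_SCOPE, scanned):
    print(f"not yet scanned: {system}")
```

Anything the loop prints is a candidate for another scan before you move on to Step 2.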

Step 2 — Run an identity scan

A regular scan tells you "there's an SSN in employees.csv at byte 1024." That's not yet useful for a DSAR — you want to know whose SSN. PII Identity Scan is the phase that links findings to names.

From the TUI, open the scan and press i (or choose PII Identity Scan in the web UI). You configure two things:

  • PII types to detect. Match the types from the regular scan, plus Full Names.
  • Association method. Same Line for spreadsheets and tabular exports, Same Paragraph for narrative documents, Character Distance (default 200) for everything else. You can run association multiple times with different settings without rescanning the source files.

For a full reference, see PII Identity Scan. For a DSAR specifically, the heuristic is:

Source | Recommended method
HR / payroll spreadsheets, customer CSVs | Same Line
Contracts, narrative reports, letters | Same Paragraph
Mixed mailboxes, support tickets | Character Distance (200–400)

The output is an identity run: every PII finding in the scan, tagged with the name (or names) it was nearest to, plus an Unassociated bucket for findings the linker couldn't resolve.

Don't skip the unassociated review on a real DSAR. Anything in there might still belong to the subject and just lacks proximity context — typically database dumps and CSV files where the column header carries the identity instead of a per-row name. Mark them up before you generate the report.
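To see why the distance setting matters and why an Unassociated bucket exists at all, here is a minimal sketch of proximity-based association. It is not PII Crawler's linker: the `associate` function, the SSN regex, and the sample strings are all assumptions made for illustration.

```python
import re

# Simplified SSN-shaped pattern, for illustration only.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def associate(text, names, max_distance=200):
    """Link each SSN-like match to the known names within max_distance characters.

    Returns (ssn, [names]) pairs; an empty list means "unassociated".
    """
    # Record the start offset of every occurrence of every known name.
    name_positions = [
        (m.start(), name)
        for name in names
        for m in re.finditer(re.escape(name), text)
    ]
    results = []
    for match in SSN_RE.finditer(text):
        nearby = sorted(
            {name for pos, name in name_positions
             if abs(pos - match.start()) <= max_distance}
        )
        results.append((match.group(), nearby))
    return results

narrative = "Jane Doe joined in 2024. Her SSN is 123-45-6789."
dump = "ssn\n987-65-4321\n"  # the column header carries the meaning, no name per row

print(associate(narrative, ["Jane Doe"]))
print(associate(dump, ["Jane Doe"]))
```

The narrative sentence yields an association because the name sits a few dozen characters from the number. The dump row yields an empty name list, which is the kind of finding that ends up needing manual review.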

Step 3 — Search for the subject

Once you have one or more identity runs, the search is a single command:

piicrawler dsar "Jane Doe"

This searches every scan in the local database for findings associated with that name and prints a confidence-tagged summary to stderr:

  DSAR Search Results: "Jane Doe"
  ──────────────────────────────────────────────────
  Findings:     17
  Files:        4
  Scans:        2
  PII Types:    address, dob, email, phone, ssn

  /srv/hr/2024-onboarding/jane-doe-i9.pdf
    [  High] ssn              ***-**-6789
    [  High] dob              **/**/****
    [  High] address          *** Main St
  ...

Confidence levels

PII Crawler ranks each result. Always include the confidence column in the response you send to legal — it's the difference between "we found Jane's SSN" and "we found an SSN that might be Jane's."

Confidence | Source | Meaning
High | Identity association, unambiguous (one name in scope) | The SSN is in line with a single occurrence of "Jane Doe." Treat as the subject's data.
Medium | Identity association, ambiguous (two or more names equally close) | The SSN is in line with both "Jane Doe" and another name. Manual review.
Low | No identity association; the subject's name appeared in the finding's surrounding context | A weaker signal: a piece of PII that mentions the name nearby in the text but didn't make it into a formal association. Manual review.

Low-confidence results are the safety net for findings that escaped identity association (older scans without an identity run, names embedded in non-standard places). Don't ignore them, but expect more noise.
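The three levels can be read as a small decision rule. The sketch below restates the table above in code under simplifying assumptions: the `confidence` function and its arguments are hypothetical, and any association with more than one name is treated as ambiguous, whereas the real ranking considers whether the names are equally close.

```python
def confidence(subject, associated_names, context_text=""):
    """Map a finding to High / Medium / Low, or None if it does not match."""
    if subject in associated_names:
        # One name in scope is unambiguous; several names need manual review.
        return "High" if len(associated_names) == 1 else "Medium"
    if subject.lower() in context_text.lower():
        # No formal association, but the name appears in the surrounding text.
        return "Low"
    return None  # not a match for this subject

print(confidence("Jane Doe", ["Jane Doe"]))
print(confidence("Jane Doe", ["Jane Doe", "John Roe"]))
print(confidence("Jane Doe", [], context_text="forwarded by jane doe"))
```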

Useful flags

piicrawler dsar "Jane Doe" --json                  # also print structured JSON to stdout
piicrawler dsar "Jane Doe" --report jane-doe.html  # write a self-contained HTML report
piicrawler dsar "Jane Doe" --assert-clean          # exit 1 if anything found (CI gate)

--assert-clean is meant for the inverse use case: failing a CI build or a release pipeline if any PII is found for an internal test identity that was never supposed to land in the codebase. Pair it with a watch policy if you want continuous protection. See piicrawler dsar.
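If you call the command from a script or pipeline, building the argument list in code avoids shell-quoting mistakes with multi-word names. The sketch below uses only the flags listed above; the helper names are hypothetical, and two points are assumptions rather than documented behavior: that the flags can be combined in one call, and that exit code 0 means nothing was found.

```python
import subprocess

def dsar_command(name, report=None, json_output=False, assert_clean=False):
    """Build the argv list for a `piicrawler dsar` call."""
    argv = ["piicrawler", "dsar", name]  # list form keeps multi-word names intact
    if json_output:
        argv.append("--json")
    if report is not None:
        argv += ["--report", report]
    if assert_clean:
        argv.append("--assert-clean")
    return argv

def subject_is_clean(name):
    """Run the CI-gate form; the documented exit code 1 means something was found.

    Treating exit code 0 as "clean" is an assumption, not documented behavior.
    """
    result = subprocess.run(dsar_command(name, assert_clean=True))
    return result.returncode == 0

print(dsar_command("Jane Doe", report="jane-doe.html"))
```

A pipeline step could call `subject_is_clean` for a test identity and fail the build when it returns False.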

Name matching

The query is split on whitespace and turned into a SQL LIKE pattern. "Jane Doe" matches "Jane Doe", "Jane M. Doe", and "Jane Doe-Smith", but not "Jane Doerr" or "Jane" alone. Quote multi-word names. For people who go by a nickname, run a second query with the nickname.
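To predict which name variants a query will catch before you run it, you can approximate the documented behavior locally. The tool's actual SQL is not shown in this guide, so the regex below is only a stand-in that reproduces the examples above (each query token must appear, in order, as a whole word). The `name_pattern` helper is hypothetical, and real matching may differ at the edges, for example in case handling.

```python
import re

def name_pattern(query):
    """Compile a regex where each query token must appear, in order, as a whole word."""
    tokens = [re.escape(tok) for tok in query.split()]
    return re.compile(r"\b" + r"\b.*\b".join(tokens) + r"\b")

pattern = name_pattern("Jane Doe")
for candidate in ["Jane Doe", "Jane M. Doe", "Jane Doe-Smith", "Jane Doerr", "Jane"]:
    print(f"{candidate!r}: {bool(pattern.search(candidate))}")
```

Variants that come back False here, such as nicknames or alternate spellings, are the ones worth a second query.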

Step 4 — Generate the HTML report

For the actual response you hand to legal, use --report:

piicrawler dsar "Jane Doe" --report jane-doe-2026-05-04.html

The output is a single self-contained HTML file with:

  • An executive summary (total findings, files affected, scans searched, confidence breakdown)
  • A breakdown by PII type and severity
  • A by-file listing of every finding, with the masked value and surrounding context
  • The date range of the underlying scan data, so legal can see exactly what window the response covers

The values in the report are masked the same way they are in the terminal summary (***-**-6789, jo*****e@ex***le.com, etc.). Unmasked values stay in the local database and never appear in the report. This is intentional: the report is meant to be reviewed and shared, the database is not.

Open the HTML, do one final read-through, and attach it to your DSAR response.

Operational notes

  • Each DSAR is a fresh query, so re-running the same command later picks up any scans you've added since. There's no caching to invalidate.
  • The local database is the source of truth. If someone deletes a scan, its findings disappear from future DSAR queries. Back up ~/.piicrawler/piicrawler.db before pruning anything you might still be obligated to report on. See Results Storage → Backups.
  • Identity runs are scoped to a single scan. If your data spans multiple scans, run an identity scan on each one. The DSAR command searches across all of them automatically.
  • Triage marks carry forward but do not filter results. Findings you've marked false positive in a regular scan still appear in DSAR results: the DSAR command surfaces every match by default so you don't accidentally exclude something that should be reported. Use the --json output and your own filter if you need to drop FP-marked findings.
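The backup advice in the notes above can be scripted. The sketch below makes a timestamped copy of the database file; the `backup_database` helper and the backup directory are hypothetical, and a plain file copy is only safe while no scan is writing to the database. Results Storage → Backups remains the reference for the supported procedure.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_database(db_path, backup_dir, now=None):
    """Copy the database file to backup_dir under a timestamped name."""
    db_path = Path(db_path).expanduser()
    backup_dir = Path(backup_dir).expanduser()
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_path.stem}-{stamp}{db_path.suffix}"
    shutil.copy2(db_path, target)  # copy2 also preserves file timestamps
    return target

# Example call (run it while no scan is writing to the database):
# backup_database("~/.piicrawler/piicrawler.db", "~/piicrawler-backups")
```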

Worked example

You receive a GDPR request from Jane Doe on 2026-05-04. Your scan coverage already includes the HR file server, the support-team mailbox export, and a recent SQL dump of the customer database.

# 1. Confirm scans exist for everything in scope
piicrawler        # press Enter to open the scan list, eyeball coverage

# 2. From the TUI, run an identity scan on each scan that doesn't have one yet
#    (press 'i' in scan detail; method = "Same Line" for the SQL dump,
#     "Character Distance" for the support mailbox)

# 3. Run the DSAR query
piicrawler dsar "Jane Doe" --report responses/jane-doe-2026-05-04.html

# 4. Review the HTML, then attach to your DSAR response email

The HTML report is the deliverable. The stderr summary is the sanity check (counts match expectations? confidence skews high? no surprise file showing up?). The local database is the audit trail: if Jane comes back six months later and asks the same question, the DB still has the snapshot of what was true on 2026-05-04 even after the source files have changed.
