Scanning Unstructured Data for PII: PDFs, Word Docs, and Spreadsheets

If you ask a compliance officer where their organization's personally identifiable information lives, they'll point to the databases. Customer tables, CRM records, HR systems. They're not wrong. But they're not telling the whole story, either, because the messiest and most dangerous PII isn't sitting in neat rows and columns. It's buried inside a PDF someone emailed to accounting three years ago. It's in a spreadsheet a contractor put together during onboarding. It's in a Word doc with tracked changes that still contains a client's Social Security number in a deleted paragraph that never actually got deleted.

This is the unstructured data problem, and it's the blind spot in most compliance programs.

Structured vs. Unstructured: Why It Matters

Structured data is the stuff in databases. It has schemas, field names, data types. When you need to find every record that contains an SSN, you query a column. You know where to look because someone designed the table that way.

Unstructured data is everything else. Documents, spreadsheets, slide decks, scanned forms, email attachments, PDFs exported from systems that no longer exist. According to industry estimates, somewhere between 80% and 90% of enterprise data is unstructured. And unlike a database, nobody designed a schema for the folder full of scanned intake forms sitting on a shared drive in the Denver office.

The compliance risk here is real. Regulations like GDPR, CCPA, and HIPAA don't draw a distinction between structured and unstructured data. If a person's PII is in a Word document on a file server, it falls under the same obligations as PII in your production database. You need to know it's there, you need to protect it, and you need to be able to find it if someone submits a data subject access request.

Why Unstructured PII Scanning Is Hard

Scanning a database for PII is relatively straightforward. You sample column values, run pattern matching, and classify fields. The data is already organized for you.

Scanning documents is a different problem entirely. Each file format presents its own extraction challenges, and the PII could be anywhere inside the content. There's no column header that says "SSN" – just free-form text where a Social Security number might appear on page 47 of a 200-page PDF, embedded in a sentence.

Here are the specific challenges by file type:

PDFs: The Most Common and Most Difficult

PDFs are everywhere in business. Contracts, invoices, tax forms, HR paperwork, compliance reports. They're also the hardest file type to scan for PII reliably, because not all PDFs are created equal.

Text-based PDFs are the easier case. These are PDFs generated from digital sources – exported from Word, generated by a web application, created by a print driver. The text content is embedded in the file and can be extracted programmatically with libraries like Apache PDFBox. You pull the text out page by page and run your PII detection against it.

Scanned PDFs are the hard case. These are essentially images wrapped in a PDF container. Someone fed a paper form through a scanner, and the resulting file looks like a document but contains no extractable text. To find PII in a scanned PDF, you need optical character recognition (OCR) to convert the image back into text before you can scan it. OCR adds processing time, introduces accuracy issues (especially with poor scan quality, handwriting, or unusual fonts), and can dramatically slow down a large-scale scan.

Then there are the hybrid PDFs – documents that mix text layers with scanned pages, or PDFs with form fields that contain data separate from the visible text. A thorough scanner needs to handle all of these cases.
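
One way to cover text-based, scanned, and hybrid PDFs in a single pass is to try plain text extraction on every page and send only the pages that come back nearly empty to OCR. The sketch below assumes the per-page text has already been pulled out by whatever PDF library is in use. The function name and the 20-character threshold are illustrative assumptions, not a standard.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is too short.

    page_texts: one string (or None) per page, as produced by a PDF text
    extractor (hypothetical input for this sketch).
    min_chars: assumed cutoff; tune it against real documents.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so stray layout whitespace
        # does not make an image-only page look like a text page.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr
```

Deciding per page rather than per file means a 200-page contract with three scanned signature pages pays the OCR cost three times, not 200.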

There's also the size problem. It's not unusual to encounter PDFs that are hundreds of megabytes – merged document packages, image-heavy reports, architectural plans with embedded data. A scanner that tries to load the entire file into memory can exhaust available RAM and crash, or take the rest of the scan down with it. You need page-by-page processing with proper memory management, ideally spilling to disk for files over a certain size threshold.
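
As a rough illustration of the spill-to-disk idea, Python's standard library has a buffer that stays in memory until it passes a size limit and then moves to a temporary file. The 64 MB threshold and the function name here are assumptions for the sketch, not recommendations.

```python
import shutil
import tempfile

SPILL_THRESHOLD = 64 * 1024 * 1024  # assumed cutoff: 64 MB


def buffer_stream(source, threshold=SPILL_THRESHOLD, chunk_size=1024 * 1024):
    """Copy a binary file-like object into a buffer that spills to disk.

    Small inputs stay in memory. Once more than `threshold` bytes have
    been written, SpooledTemporaryFile rolls over to a temp file on disk.
    """
    buffer = tempfile.SpooledTemporaryFile(max_size=threshold)
    # Copy in fixed-size chunks so the source is never read in one piece.
    shutil.copyfileobj(source, buffer, chunk_size)
    buffer.seek(0)
    return buffer
```

The caller gets back an ordinary seekable file object either way, so downstream extraction code doesn't need to know whether the data ended up in RAM or on disk.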

Word Documents: Hidden PII in Metadata and Revisions

Word documents (DOC and DOCX) seem straightforward on the surface. Open the file, extract the text, scan it. But Word files carry a lot more data than what's visible on the page.

Track changes and revisions are the biggest hidden risk. When someone edits a document with track changes enabled, deleted text stays in the file until the changes are actually accepted. Switching the view to hide markup makes the document look clean, but it doesn't remove anything. A document that appears to have a client's name removed might still contain that name in the tracked deletions stored in the DOCX XML. If you're only scanning the visible text layer, you'll miss it.

Comments and annotations are another source of hidden PII. Reviewers often paste sensitive information into comments – account numbers for reference, personal details to verify against, contact information. These live in a separate XML part within the DOCX package and won't show up in a naive text extraction.

Document metadata – author names, company names, file paths in recent documents lists – can also constitute PII or reveal information about PII that exists elsewhere.

A proper scanner needs to extract text from all of these layers, not just the document body.
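
To make the layers concrete: a DOCX file is a ZIP package of XML parts, so the body text, tracked deletions, and comments can each be read separately. This is a minimal sketch using only the standard library. The function name is made up for illustration, and it deliberately skips headers, footers, footnotes, and metadata parts, which a real scanner would also need to cover.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as w:t and w:delText.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text_layers(path_or_file):
    """Return body text, tracked deletions, and comment text from a DOCX."""
    layers = {"body": [], "deleted": [], "comments": []}
    with zipfile.ZipFile(path_or_file) as package:
        names = set(package.namelist())
        if "word/document.xml" in names:
            root = ET.fromstring(package.read("word/document.xml"))
            # w:t holds run text in the body (including tracked insertions).
            layers["body"] = [el.text for el in root.iter(W + "t") if el.text]
            # w:delText holds text deleted under track changes that is
            # still stored in the file.
            layers["deleted"] = [
                el.text for el in root.iter(W + "delText") if el.text
            ]
        if "word/comments.xml" in names:
            root = ET.fromstring(package.read("word/comments.xml"))
            layers["comments"] = [
                el.text for el in root.iter(W + "t") if el.text
            ]
    return layers
```

Each list can then be fed to the same PII detector, with the layer name kept alongside any finding so a reviewer knows the hit came from a deletion or a comment rather than the visible page.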

Spreadsheets: PII Scattered Across Sheets and Hidden Columns

Spreadsheets are in some ways closer to structured data, but they present their own challenges for PII scanning.

Multiple sheets are the obvious issue. A workbook might have 30 tabs, and the PII could be on any of them, including sheets that someone hid because they "weren't needed anymore" but never deleted.

Hidden rows and columns are a real problem in practice. People hide columns containing sensitive data instead of deleting them. The data is still in the file, still extractable, still a compliance risk – it's just not visible when you open the spreadsheet normally.
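
Hidden sheets and hidden columns are recorded as plain attributes inside an XLSX package, which makes them easy to flag. The sketch below reads the workbook XML directly with the standard library. The function name is an assumption, it reports worksheet part names rather than mapping them back to tab names, and it ignores hidden rows for brevity.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by workbook and worksheet parts.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def find_hidden_spreadsheet_parts(xlsx):
    """Return (hidden sheet names, {worksheet part: [(min_col, max_col)]})."""
    hidden_sheets, hidden_columns = [], {}
    with zipfile.ZipFile(xlsx) as package:
        workbook = ET.fromstring(package.read("xl/workbook.xml"))
        for sheet in workbook.iter(S + "sheet"):
            # A missing state attribute means the sheet is visible;
            # "hidden" and "veryHidden" both mean it is not shown.
            if sheet.get("state", "visible") != "visible":
                hidden_sheets.append(sheet.get("name"))
        for name in package.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                root = ET.fromstring(package.read(name))
                # Each <col> element covers a 1-based column range.
                ranges = [
                    (int(col.get("min")), int(col.get("max")))
                    for col in root.iter(S + "col")
                    if col.get("hidden") in ("1", "true")
                ]
                if ranges:
                    hidden_columns[name] = ranges
    return hidden_sheets, hidden_columns
```

A scanner doesn't need this to extract the data, since hidden cells are stored like any others, but flagging that a finding came from a hidden sheet or column is useful context for whoever has to clean it up.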

Embedded objects – charts, images, OLE objects from other applications – can contain text or data that doesn't appear in the cell grid. A chart title might include a person's name. An embedded PDF object inside a spreadsheet creates a file-within-a-file situation.

Formulas and cell references can obscure PII. A cell might display a masked value while the underlying formula references a cell on another sheet that contains the full unmasked value.

CSV files deserve a mention here too. They look simple, but CSVs exported from databases are one of the most common vectors for PII exposure. Someone runs a query, exports to CSV for "quick analysis," and that file with 50,000 customer records ends up on a shared drive indefinitely. Unlike a spreadsheet with formatting cues, a CSV is just raw data with no visual indicators of sensitivity.
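
Because a CSV has no formatting cues, the only way to know what's in it is to look at the values. A minimal sketch of that idea: sample rows, test each cell against a couple of patterns, and report which columns produced hits. The two regexes and the function name are simplified assumptions for illustration. Real detection needs far more than this.

```python
import csv
import re

# Deliberately simple illustrative patterns, not production-grade detectors.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def csv_columns_with_pii(handle, max_rows=1000):
    """Map column name -> set of PII labels seen in the first max_rows rows."""
    reader = csv.reader(handle)
    header = next(reader, [])
    hits = {}
    for row_number, row in enumerate(reader):
        if row_number >= max_rows:
            break
        for index, value in enumerate(row):
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    # Fall back to a positional name for ragged rows.
                    column = header[index] if index < len(header) else f"column_{index}"
                    hits.setdefault(column, set()).add(label)
    return hits
```

Note that this checks cell contents rather than column names. An export where the SSN sits in a column called "notes" is exactly the case a header-based check would miss.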

Other Formats You Can't Ignore

Beyond the big three, PII hides in plenty of other file types:

  • Email files (EML, MSG): Email bodies and attachments are a goldmine of PII. People email spreadsheets with customer data, paste account numbers into message bodies, and forward sensitive documents without thinking about where those files end up when the email is saved to disk.
  • Presentations (PPT, PPTX): Sales decks with customer case studies, HR presentations with employee data, board decks with financial details tied to individuals.
  • Rich text and legacy formats (RTF, ODT, ODS): Older document formats that are still floating around on file servers from before the organization standardized on Office 365.
  • Archive files (ZIP, 7z, RAR): Compressed archives are containers for other files. A thorough scan needs to enumerate the files inside an archive and scan each one individually without necessarily decompressing the entire archive to disk.
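
For the archive case, here is a sketch of in-memory traversal for ZIP files, including ZIPs nested inside ZIPs. It uses only Python's standard library, which is why it covers ZIP alone (7z and RAR need third-party libraries). The function name, the depth limit, and the per-member size cap are assumptions chosen for illustration.

```python
import io
import zipfile


def iter_zip_members(source, prefix="", max_depth=3,
                     max_member_bytes=50 * 1024 * 1024):
    """Yield (path, bytes) for each file in a ZIP without writing to disk.

    Nested .zip members are opened recursively up to max_depth levels.
    Members whose declared size exceeds max_member_bytes are skipped.
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir() or info.file_size > max_member_bytes:
                continue
            path = prefix + info.filename
            data = archive.read(info)
            if max_depth > 0 and info.filename.lower().endswith(".zip"):
                # Recurse into the nested archive held in memory.
                yield from iter_zip_members(
                    io.BytesIO(data), path + "!/", max_depth - 1, max_member_bytes
                )
            else:
                yield path, data
```

The depth and size limits matter: archives can be crafted (or accidentally built) to expand into far more data than their compressed size suggests, so an unbounded traversal is a risk of its own.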

The Cloud Upload Problem

Many enterprise PII scanning tools operate as cloud services. You point them at your storage, they pull the data up to their infrastructure for analysis, and they send results back. For structured data in databases, this can work acceptably. For unstructured documents, it creates a problem that should make any security-conscious organization uncomfortable.

Uploading documents to a third-party service for PII scanning means sending your most sensitive files – the ones you're scanning precisely because they might contain PII – through the internet to someone else's servers. You're trying to find sensitive data, and the first step is transmitting that sensitive data externally. The irony isn't lost on auditors.

This is especially problematic for organizations subject to data residency requirements, those handling classified or regulated information, or companies that have contractual obligations about where client data can be processed.

Local scanning sidesteps the problem. A tool that runs on your own hardware, processes documents locally, and never transmits file contents externally removes the data-in-transit risk that a cloud upload creates. The scan results stay on your machine. The documents never leave your network.

PII Crawler takes this approach – it runs locally on Windows, macOS, or Linux, processes all 50+ supported file formats on-device, and the only network communication is a one-time license registration. It can even run in fully air-gapped environments. Scan data, file contents, and PII findings are never transmitted anywhere.

What Good Unstructured PII Scanning Looks Like

Based on the challenges above, here's what to look for in a scanner that handles unstructured data effectively:

Broad format support. You need coverage across document types (PDF, DOC/DOCX, ODT), spreadsheets (XLS/XLSX, ODS, CSV), presentations (PPT/PPTX), email (EML, MSG), archives (ZIP, 7z, RAR), images (with OCR), and plain text formats. If your tool only handles PDFs and spreadsheets, you're leaving gaps.

Deep extraction, not surface-level. The scanner should extract text from revision history, comments, hidden sheets, metadata, and embedded objects. Scanning only the visible content of a document misses the PII that's most likely to cause problems – the stuff people thought they removed but didn't.

OCR capabilities for scanned documents. If your organization deals with any paper-to-digital workflows – scanned forms, faxed documents, photographed records – you need OCR in the pipeline. It should be configurable, though, since OCR dramatically increases scan time and isn't needed for born-digital documents.

Handling large and complex files. The scanner shouldn't choke on a 500MB PDF or a workbook with 50 sheets. Look for page-by-page processing, memory management that spills to disk, and timeout handling for files that can't be processed.

Archive traversal. The ability to look inside ZIP files, 7z archives, and other containers without manually extracting them first. PII is often hiding in archives precisely because someone zipped up a folder of sensitive documents and forgot about them.

Low false positive rates. This is where the detection method matters more than the extraction method. Pattern matching with regex alone produces enormous numbers of false positives – every 9-digit number looks like an SSN to a regex. Better tools use named entity recognition (NER), finite state machines, and contextual analysis to distinguish actual PII from similar-looking data.
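
To show why context matters, here is a much simpler stand-in for the NER and contextual analysis mentioned above: a regex finds 9-digit candidates, structural rules discard numbers that can't be valid SSNs (area 000, 666, or 900-999; group 00; serial 0000), and a keyword check near the match raises or lowers confidence. The keyword list, window size, and function name are assumptions for the sketch.

```python
import re

CANDIDATE = re.compile(r"\b(\d{3})[- ]?(\d{2})[- ]?(\d{4})\b")
# Assumed context keywords; a real tool would use a much richer signal.
CONTEXT = re.compile(r"\b(?:ssn|social security)\b", re.IGNORECASE)


def find_likely_ssns(text, window=40):
    """Return (matched_text, confidence) pairs for plausible SSNs."""
    results = []
    for match in CANDIDATE.finditer(text):
        area, group, serial = match.groups()
        # Structural rules: these ranges are never issued as SSNs.
        if area in ("000", "666") or area.startswith("9"):
            continue
        if group == "00" or serial == "0000":
            continue
        # Look for a supporting keyword within `window` characters.
        start = max(0, match.start() - window)
        nearby = text[start:match.end() + window]
        confidence = "high" if CONTEXT.search(nearby) else "low"
        results.append((match.group(0), confidence))
    return results
```

Even this crude version separates "SSN: 123-45-6789" from a 9-digit order number, which is the distinction a bare regex can't make.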

Getting Started

If you haven't scanned your unstructured data for PII, the best place to start is with the locations where documents accumulate: shared drives, departmental folders, desktop directories, and email archives. These are the places where PII-laden documents go to be forgotten.

Pick a scanner that handles the file types you actually have (not just the ones you think you have – you'll be surprised), run it locally to avoid creating new data exposure risks, and prepare yourself for the results. Every organization that does this for the first time finds more PII in documents than they expected.
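
A quick way to find out which file types you actually have is to count extensions under a directory before choosing a tool. This is a small sketch with a made-up function name. It looks only at file extensions, so renamed or extensionless files will be miscounted.

```python
from collections import Counter
from pathlib import Path


def count_file_types(root):
    """Count files under root by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_file_types(...)` on a shared drive and printing `most_common(20)` from the result is usually enough to show whether legacy formats, archives, or email files need to be on the requirements list.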

The databases were never the whole story. The documents are where the real exposure hides.