Reference

Supported File Types

Last updated June 2026

📄 Documents

Common Name	MIME Type
PDF	`application/pdf`
Word (DOC)	`application/msword`
Word (DOCX)	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`
Excel (XLS)	`application/vnd.ms-excel`
Excel (XLSX)	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`
PowerPoint (PPT)	`application/vnd.ms-powerpoint`
PowerPoint (PPTX)	`application/vnd.openxmlformats-officedocument.presentationml.presentation`
Rich Text Format (RTF)	`application/rtf`
Apple Keynote	`application/x-iwork-keynote-sffkey`
Apple Pages	`application/x-iwork-pages-sffpages`
Apple Numbers	`application/x-iwork-numbers-sffnumbers`
OpenDocument Text (ODT)	`application/vnd.oasis.opendocument.text`
OpenDocument Spreadsheet (ODS)	`application/vnd.oasis.opendocument.spreadsheet`
OpenDocument Presentation (ODP)	`application/vnd.oasis.opendocument.presentation`
Hancom Hangul Word Processor	`application/x-hwp`
EPUB	`application/epub+zip`
HTML	`text/html`
Plain text	`text/plain`
Markdown	`text/x-markdown`
CSV	`text/csv`

🗃️ Archives / Packages

Common Name	MIME Type
ZIP	`application/zip`
GZIP	`application/x-gzip`
BZIP2	`application/x-bzip2`
7z	`application/x-7z-compressed`
RAR	`application/x-rar-compressed`
TAR	`application/x-tar`
ISO image	`application/x-iso9660-image`
JAR	`application/java-archive`
Apple DMG	`application/x-apple-diskimage`
CPIO	`application/x-cpio`

.zip archives are flattened during enumeration: each supported file inside becomes its own scan unit, identified by a virtual path like bundle.zip!/reports/people.txt. Workers scan entries in parallel instead of serializing through the whole archive, and progress counters reflect entries rather than the archive as a single file. Encrypted or corrupt zips fall back to a single error entry. .7z, .tar.gz, and .mbox keep their existing per-archive behavior.
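The flattening step can be illustrated with a minimal sketch built on Python's standard `zipfile` module. The function name `virtual_entry_paths` and the wording of the error entry are hypothetical, and the filter for supported file types is left out; only the `archive!/entry` path shape comes from the description above.

```python
import io
import zipfile

def virtual_entry_paths(archive_name, archive_bytes):
    """Return one virtual path per file entry, e.g. 'bundle.zip!/reports/people.txt'."""
    error_entry = [f"{archive_name} (error: unreadable archive)"]  # hypothetical wording
    try:
        with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
            infos = [info for info in zf.infolist() if not info.is_dir()]
    except zipfile.BadZipFile:
        # Corrupt archive: fall back to a single error entry.
        return error_entry
    # Bit 0 of the general-purpose flags marks an encrypted entry.
    if any(info.flag_bits & 0x1 for info in infos):
        return error_entry
    return [f"{archive_name}!/{info.filename}" for info in infos]
```

Each returned path could then be handed to a worker as an independent scan unit, which is what lets entries be scanned in parallel and counted individually.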

In the web UI and TUI, virtual zip-entry paths are first-class: each entry has its own row in the file list, opens to its own findings page, and previews its decoded contents through the "Re-scan this file" action. The "delete from disk" action is disabled for entries inside an archive, since removing one entry would require rewriting the whole .zip.

🖼️ Images (with OCR text extraction)

Common Name	MIME Type
JPEG	`image/jpeg`
PNG	`image/png`
GIF	`image/gif`
TIFF	`image/tiff`
BMP	`image/bmp`
BMP (Windows)	`image/x-ms-bmp`
ICO	`image/x-icon`
Photoshop (PSD)	`image/vnd.adobe.photoshop`
GIMP (XCF)	`image/x-xcf`
AutoCAD (DWG)	`image/vnd.dwg`

👨‍💻 Code / Source Files

Common Name	MIME Type
Java source	`text/x-java-source`
Java class	`application/java-vm`
C source	`text/x-c`
C++ source	`text/x-c++src`
Python source	`text/x-python`
JavaScript	`application/javascript`
Shell script	`text/x-shellscript`
PHP	`application/x-php`
Object code	`application/x-object`

Any other plain-text file type is scanned as well.

✉️ Email / Messaging

Common Name	Extension(s)	MIME Type
Email (EML)	`.eml`	`message/rfc822`
Outlook Message (MSG)	`.msg`	`application/vnd.ms-outlook`
Gmail / Thunderbird mailbox	`.mbox`	`application/mbox`
Outlook data file	`.pst`, `.ost`	`application/vnd.ms-outlook-pst`

.mbox archives are treated as containers: each message inside becomes its own scan unit, just like an entry inside a .zip. Findings stream to disk per message, and the file_path in JSONL output carries the message ordinal (and the Message-ID header when present), for example mail.mbox::message-000042::<[email protected]>. Headers (From, To, Cc, Bcc, Subject, Reply-To), bodies, and decoded attachments are all scanned. The streaming reader handles multi-gigabyte mboxes (such as Gmail Takeout exports) without loading the whole file into memory. See Scan Gmail for PII for a walkthrough.
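As a rough illustration of per-message streaming, the sketch below splits a mailbox on its `From ` separator lines and builds an identifier in the shape shown above. It is not the product's reader: the function name `iter_mbox_units`, the zero-based ordinal, and the example Message-ID values are assumptions, and it relies on body lines that start with `From ` being escaped, as the common mbox variants do.

```python
import io
from email.parser import BytesHeaderParser

def iter_mbox_units(mbox_name, stream):
    """Yield (file_path, raw_message) one message at a time from a binary stream."""
    parser = BytesHeaderParser()

    def unit(raw_lines, ordinal):
        raw = b"".join(raw_lines)
        message_id = parser.parsebytes(raw)["Message-ID"]
        path = f"{mbox_name}::message-{ordinal:06d}"  # zero-based ordinal is an assumption
        if message_id:
            path += f"::{message_id.strip()}"
        return path, raw

    lines, ordinal = [], 0
    for line in stream:  # only the current message is held in memory
        if line.startswith(b"From "):
            if lines:
                yield unit(lines, ordinal)
                ordinal += 1
                lines = []
            continue  # the separator line is not part of the message
        lines.append(line)
    if lines:
        yield unit(lines, ordinal)
```

Because the generator holds one message at a time, memory use tracks the largest single message rather than the size of the mailbox.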

🗄️ Database flat-files

Common Name	Extension(s)
SQLite	`.sqlite`, `.sqlite3`, `.db`, `.db3`
Microsoft Access	`.mdb`, `.accdb`

Database files are opened read-only, every user table is dumped to text, and the resulting text is run through PII detection.
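For SQLite, the read-only open and table dump can be sketched with Python's built-in `sqlite3` module. The function name `dump_user_tables` and the tab-separated text layout are assumptions made for illustration; Microsoft Access files need a different reader and are not covered here.

```python
import sqlite3
from pathlib import Path

def dump_user_tables(db_path):
    """Open a SQLite file read-only and return {table_name: tab-separated text}."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"  # mode=ro refuses writes
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
        dumped = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier safely
            rows = conn.execute(f"SELECT * FROM {quoted}")
            dumped[name] = "\n".join(
                "\t".join("" if value is None else str(value) for value in row)
                for row in rows)
        return dumped
    finally:
        conn.close()
```

The returned text for each table could then be passed to the same detection step used for any other extracted document.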

🪶 Very large plain-text files

Most files are read in full, extracted, and scanned. That is fine for documents and spreadsheets, but a single plain-text file can be enormous: a multi-gigabyte web server log, a database .sql dump, a giant .csv or .json export. Reading one of those into memory all at once can request more memory than the machine has and stop the scan.

To avoid that, plain-text files at or above 32 MB are scanned in bounded, line-aligned windows. The file is read a few megabytes at a time, split on line boundaries, and each window is run through the same PII detection as a normal scan. Peak memory stays at a few megabytes no matter how large the file is, so a 100 GB log scans in the same memory footprint as a 100 MB one. This applies to both local scans and network share scans, where the file is also streamed over SMB rather than downloaded whole.
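The line-aligned reading described above might look roughly like the following sketch. The 4 MB window size and the name `iter_line_windows` are assumptions (the text only says "a few megabytes"), and overlap between windows is not modeled. A production reader would also cap the carry-over buffer so that a file with no newlines cannot grow it without bound.

```python
import io

WINDOW_BYTES = 4 * 1024 * 1024  # assumed size; the docs only say "a few megabytes"

def iter_line_windows(stream, window_bytes=WINDOW_BYTES):
    """Yield byte windows from a binary stream, each ending on a line boundary."""
    carry = b""
    while True:
        chunk = stream.read(window_bytes)
        if not chunk:
            break
        buf = carry + chunk
        cut = buf.rfind(b"\n")
        if cut == -1:
            carry = buf  # no newline yet: keep accumulating
            continue
        yield buf[:cut + 1]     # complete lines only
        carry = buf[cut + 1:]   # partial last line moves to the next window
    if carry:
        yield carry             # final line without a trailing newline
```

Each yielded window can be scanned and discarded before the next read, so peak memory depends on the window size rather than the file size.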

Windows overlap by a wide margin so a value that lands on a window boundary (for example an email address split across the seam) is still detected exactly once. Structured formats that need a full parse (PDF, Office documents, embedded databases, mail archives) are not windowed; .mbox mailboxes get their own per-message streaming described above, and disk images and other binary formats are skipped entirely.

If you want to skip large files altogether instead of streaming them, set a smaller Max file size on the scan; any file above that limit is ignored.

🪶 Spreadsheets with an oversized declared range

An .xlsx spreadsheet records the rectangle of cells it uses. The reader allocates a grid sized to that declared range before reading any data, so a corrupt or maliciously crafted file that declares an enormous range (for example a single stray cell far out at the bottom-right of the sheet) can ask for many gigabytes of memory at once and stop the scan.

To prevent that, each worksheet's declared range is checked before the file is opened. A sheet that declares more than roughly 64 million cells, far beyond any genuine spreadsheet, is skipped and recorded as an extraction error for that file rather than allowed to exhaust memory. The rest of the scan continues normally. Real spreadsheets, including large exports, are unaffected.
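A worksheet's declared range is typically stored as an A1-style reference such as `A1:C3` in the sheet's `dimension` element. The sketch below shows how a cell count could be derived from such a reference and compared with a limit. The helper names and the exact threshold of 64,000,000 are assumptions; the text above only says "roughly 64 million".

```python
import re

MAX_DECLARED_CELLS = 64_000_000  # assumed exact value for "roughly 64 million"

def column_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def declared_cells(ref):
    """Count the cells in a range like 'A1:C3'; a single cell like 'B2' counts as 1."""
    m = re.fullmatch(r"\$?([A-Za-z]+)\$?(\d+)(?::\$?([A-Za-z]+)\$?(\d+))?", ref)
    if not m:
        raise ValueError(f"unrecognized range: {ref!r}")
    c1, r1 = column_index(m.group(1)), int(m.group(2))
    c2 = column_index(m.group(3)) if m.group(3) else c1
    r2 = int(m.group(4)) if m.group(4) else r1
    return (abs(c2 - c1) + 1) * (abs(r2 - r1) + 1)

def is_oversized(ref):
    """True when the declared range exceeds the assumed cell limit."""
    return declared_cells(ref) > MAX_DECLARED_CELLS
```

A sheet declaring the full grid, `A1:XFD1048576`, works out to 16,384 columns times 1,048,576 rows, or about 17 billion cells, which is far above the limit. A 26-column, 1,000-row export declares only 26,000.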

🪶 Crash isolation for document parsing

Reading text out of formats like PDF, Office documents, embedded databases, and images relies on format parsers that can occasionally fail badly on a corrupt or hostile file. Some of those failures, such as a runaway memory request or a stack overflow deep inside a parser, cannot be caught and recovered the way an ordinary error can. Left unchecked, a single bad file could end the whole scan.

To contain that, each of these files is parsed in a short-lived helper process rather than in the main application. If a file makes the parser crash or run out of memory, only that helper process is affected. The main scan records the file as unreadable and moves on, so one malformed document never takes down an entire scan or the web interface. Plain-text files are read directly and do not use a helper process.
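The general pattern, not the product's actual helper, can be sketched with Python's `subprocess` module: the parent launches a child process per file and treats any abnormal exit as "unreadable". The function name `extract_isolated`, the inline child code, and the timeout are all assumptions made for illustration.

```python
import subprocess
import sys

def extract_isolated(child_code, path, timeout_s=60):
    """Run extraction code in a child Python process; return (ok, text_or_reason)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, path],  # path arrives as sys.argv[1]
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "helper timed out"  # the timeout is an assumption, not documented
    if proc.returncode != 0:
        # A crash or out-of-memory kill ends only the child process.
        return False, f"helper exited with code {proc.returncode}"
    return True, proc.stdout
```

A caller would record the file as unreadable whenever `ok` is false and continue with the next file, which mirrors the behavior described above.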

This protection is on by default and needs no configuration. If you ever need to turn it off, for example to compare behavior while diagnosing an issue, set the environment variable PIICRAWLER_DISABLE_EXTRACT_ISOLATION=1 before starting a scan. The built-in safeguards such as the spreadsheet range check above still apply either way.
