Supported File Types

Last updated June 2026

📄 Documents

Common Name                      MIME Type
PDF                              application/pdf
Word (DOC)                       application/msword
Word (DOCX)                      application/vnd.openxmlformats-officedocument.wordprocessingml.document
Excel (XLS)                      application/vnd.ms-excel
Excel (XLSX)                     application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
PowerPoint (PPT)                 application/vnd.ms-powerpoint
PowerPoint (PPTX)                application/vnd.openxmlformats-officedocument.presentationml.presentation
Rich Text Format (RTF)           application/rtf
Apple Keynote                    application/x-iwork-keynote-sffkey
Apple Pages                      application/x-iwork-pages-sffpages
Apple Numbers                    application/x-iwork-numbers-sffnumbers
OpenDocument Text (ODT)          application/vnd.oasis.opendocument.text
OpenDocument Spreadsheet (ODS)   application/vnd.oasis.opendocument.spreadsheet
OpenDocument Presentation (ODP)  application/vnd.oasis.opendocument.presentation
Hancom Hangul Word Processor     application/x-hwp
EPUB                             application/epub+zip
HTML                             text/html
Plain text                       text/plain
Markdown                         text/x-markdown
CSV                              text/csv

🗃️ Archives / Packages

Common Name  MIME Type
ZIP          application/zip
GZIP         application/x-gzip
BZIP2        application/x-bzip2
7z           application/x-7z-compressed
RAR          application/x-rar-compressed
TAR          application/x-tar
ISO image    application/x-iso9660-image
JAR          application/java-archive
Apple DMG    application/x-apple-diskimage
CPIO         application/x-cpio

.zip archives are flattened during enumeration: each supported file inside becomes its own scan unit, identified by a virtual path like bundle.zip!/reports/people.txt. Workers scan entries in parallel instead of serializing through the whole archive, and progress counters reflect entries rather than the archive as a single file. Encrypted or corrupt zips fall back to a single error entry. .7z, .tar.gz, and .mbox keep their existing per-archive behavior.

In the web UI and TUI, virtual zip-entry paths are first-class: each entry has its own row in the file list, opens to its own findings page, and previews its decoded contents through the "Re-scan this file" action. The "delete from disk" action is disabled for entries inside an archive, since removing one entry would require rewriting the whole .zip.
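As a rough illustration of the virtual-path scheme, the sketch below enumerates a zip into one path per entry and splits such a path back into its archive and entry parts. Only the `!/` separator comes from the example path above; the function names, the choice to represent the error fallback as the bare archive name, and the encryption check are assumptions made for this sketch, not the tool's actual code.

```python
import zipfile

SEPARATOR = "!/"  # taken from the example path bundle.zip!/reports/people.txt


def virtual_paths(archive_name, source):
    """Return one virtual path per file entry, or a single entry on failure.

    `source` is a filesystem path or a binary file object (hypothetical API).
    """
    try:
        with zipfile.ZipFile(source) as zf:
            entries = [info for info in zf.infolist() if not info.is_dir()]
            # Bit 0 of the general-purpose flags marks an encrypted entry.
            if any(info.flag_bits & 0x1 for info in entries):
                return [archive_name]
            return [f"{archive_name}{SEPARATOR}{info.filename}" for info in entries]
    except zipfile.BadZipFile:
        # Corrupt archive: report the archive itself as one (error) entry.
        return [archive_name]


def split_virtual_path(path):
    """Split 'bundle.zip!/a/b.txt' into ('bundle.zip', 'a/b.txt').

    For an ordinary file the entry part is None.
    """
    archive, sep, entry = path.partition(SEPARATOR)
    return (archive, entry) if sep else (path, None)


def can_delete_from_disk(path):
    """Entries inside an archive cannot be deleted individually."""
    return split_virtual_path(path)[1] is None
```

A front end could call `can_delete_from_disk` on each row to decide whether to enable the delete action, which mirrors the rule that removing one entry would mean rewriting the whole archive.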


🖼️ Images (with OCR text extraction)

Common Name      MIME Type
JPEG             image/jpeg
PNG              image/png
GIF              image/gif
TIFF             image/tiff
BMP              image/bmp
BMP (Windows)    image/x-ms-bmp
ICO              image/x-icon
Photoshop (PSD)  image/vnd.adobe.photoshop
GIMP (XCF)       image/x-xcf
AutoCAD (DWG)    image/vnd.dwg

👨‍💻 Code / Source Files

Common Name    MIME Type
Java source    text/x-java-source
Java class     application/java-vm
C source       text/x-c
C++ source     text/x-c++src
Python source  text/x-python
JavaScript     application/javascript
Shell script   text/x-shellscript
PHP            application/x-php
Object code    application/x-object

Any other plain-text file type is scanned as well.


✉️ Email / Messaging

Common Name                  Extension(s)  MIME Type
Email (EML)                  .eml          message/rfc822
Outlook Message (MSG)        .msg          application/vnd.ms-outlook
Gmail / Thunderbird mailbox  .mbox         application/mbox
Outlook data file            .pst, .ost    application/vnd.ms-outlook-pst

.mbox archives are treated as containers: each message inside becomes its own scan unit, just like an entry inside a .zip. Findings stream to disk per message, and the file_path in JSONL output carries the message ordinal (and the Message-ID header value when present), for example mail.mbox::message-000042::<[email protected]>. Headers (From, To, Cc, Bcc, Subject, Reply-To), bodies, and decoded attachments are all scanned. The streaming reader handles multi-gigabyte mboxes (such as Gmail Takeout exports) without loading the whole file into memory. See Scan Gmail for PII for a walkthrough.
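If you post-process the JSONL output, a per-message file_path can be split back into its parts. The sketch below assumes the layout shown in the example above (`::message-` followed by a zero-padded ordinal, then an optional `::` and the Message-ID). The function name and the returned dictionary keys are hypothetical, the Message-ID used in the usage lines is made up, and whether ordinals start at 0 or 1 is not stated here.

```python
import re

# Layout inferred from the documented example path; treat it as an assumption.
_MBOX_UNIT = re.compile(
    r"^(?P<mbox>.+?)::message-(?P<ordinal>\d+)(?:::(?P<message_id>.+))?$"
)


def parse_mbox_unit(file_path):
    """Split a per-message file_path into mailbox, ordinal, and Message-ID.

    Returns None when the path does not look like an mbox message unit.
    """
    match = _MBOX_UNIT.match(file_path)
    if match is None:
        return None
    return {
        "mbox": match["mbox"],
        "ordinal": int(match["ordinal"]),
        "message_id": match["message_id"],  # None when the header was absent
    }


# Hypothetical Message-ID, for illustration only.
unit = parse_mbox_unit("mail.mbox::message-000042::<abc123@mail.example>")
```

Grouping findings by the `mbox` and `ordinal` keys is one way to get a per-message summary out of a large export.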


🗄️ Database flat-files

Common Name       Extension(s)
SQLite            .sqlite, .sqlite3, .db, .db3
Microsoft Access  .mdb, .accdb

Database files are opened read-only, every user table is dumped to text, and the resulting text is run through PII detection.
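For SQLite, the read-only dump described above can be approximated with Python's standard sqlite3 module. This is a minimal sketch under stated assumptions: the function name and the tab-separated output format are invented for illustration, and the scanner's real text layout may differ. It shows the two key steps: opening the file with `mode=ro` and skipping SQLite's internal `sqlite_` tables.

```python
import sqlite3
from pathlib import Path


def dump_sqlite_text(db_path):
    """Open a SQLite file read-only and return every user-table row as text."""
    # mode=ro makes SQLite refuse any write to the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        lines = []
        for table in tables:
            # Quote the identifier so unusual table names cannot break the query.
            quoted = '"' + table.replace('"', '""') + '"'
            for row in conn.execute(f"SELECT * FROM {quoted}"):
                lines.append("\t".join("" if v is None else str(v) for v in row))
        return "\n".join(lines)
    finally:
        conn.close()
```

The returned string can then be handed to whatever PII detector you use, the same way extracted document text would be.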


🪶 Very large plain-text files

Most files are read in full, extracted, and scanned. That is fine for documents and spreadsheets, but a single plain-text file can be enormous: a multi-gigabyte web server log, a database .sql dump, a giant .csv or .json export. Reading one of those into memory all at once can request more memory than the machine has and stop the scan.

To avoid that, plain-text files at or above 32 MB are scanned in bounded, line-aligned windows. The file is read a few megabytes at a time, split on line boundaries, and each window is run through the same PII detection as a normal scan. Peak memory stays at a few megabytes no matter how large the file is, so a 100 GB log scans in the same memory footprint as a 100 MB one. This applies to both local scans and network share scans, where the file is also streamed over SMB rather than downloaded whole.

Windows overlap by a wide margin so a value that lands on a window boundary (for example an email address split across the seam) is still detected exactly once. Structured formats that need a full parse (PDF, Office documents, embedded databases, mail archives) are not windowed; .mbox mailboxes get their own per-message streaming described above, and disk images and other binary formats are skipped entirely.
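The windowing idea can be sketched in a few lines. The version below is an illustration under assumptions, not the scanner's implementation: the function names, the simplified email pattern, and the default sizes (4 MB windows, 64 KB overlap) are all invented here. Each window starts and ends on a line boundary, the next window restarts a little before the previous cut, and a match is reported only if it extends past the bytes the previous window already covered, so nothing is reported twice.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def iter_windows(stream, window_size, overlap):
    """Yield (base, fresh_from, data) for line-aligned, overlapping windows.

    base is the absolute offset of data[0]; data[:fresh_from] was already
    part of the previous window.
    """
    base, fresh_from, buf = 0, 0, b""
    while True:
        chunk = stream.read(window_size)
        buf += chunk
        # End the window after the last complete line, or at end of file.
        cut = buf.rfind(b"\n") + 1 if chunk else len(buf)
        if cut <= fresh_from:
            if not chunk:
                return
            # No new complete line yet. Note: this sketch lets buf grow on a
            # file with one enormous line; a real reader would cap that.
            continue
        yield base, fresh_from, buf[:cut]
        if not chunk:
            return
        # Restart at a line start at least `overlap` bytes before the cut.
        keep = buf.rfind(b"\n", 0, max(0, cut - overlap)) + 1
        buf = buf[keep:]
        base += keep
        fresh_from = cut - keep


def scan_stream(stream, window_size=4 * 1024 * 1024, overlap=64 * 1024):
    """Return (absolute_offset, value) for each email, reported exactly once."""
    findings = []
    for base, fresh_from, data in iter_windows(stream, window_size, overlap):
        for match in EMAIL.finditer(data):
            # Matches lying wholly in the overlap were seen by the previous window.
            if match.end() > fresh_from:
                findings.append((base + match.start(), match.group()))
    return findings
```

Called as `scan_stream(open(path, "rb"))`, memory use is governed by the window size rather than the file size, as long as lines are reasonably short.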

If you would rather skip large files altogether instead of streaming them, set a smaller Max file size on the scan; any file above that limit is ignored.


🪶 Spreadsheets with an oversized declared range

An .xlsx spreadsheet records the rectangle of cells it uses. The reader allocates a grid sized to that declared range before reading any data, so a corrupt or maliciously crafted file that declares an enormous range (for example a single stray cell far out at the bottom-right of the sheet) can ask for many gigabytes of memory at once and stop the scan.

To prevent that, each worksheet's declared range is checked before the file is opened. A sheet that declares more than roughly 64 million cells, far beyond any genuine spreadsheet, is skipped and recorded as an extraction error for that file rather than allowed to exhaust memory. The rest of the scan continues normally. Real spreadsheets, including large exports, are unaffected.
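To make the check concrete: an .xlsx file is a zip archive, and each worksheet part normally records its used range in a `<dimension ref="A1:C10"/>` element near the top of the XML. The sketch below counts the cells such a range declares and flags sheets above a limit. It is only an illustration: the function names are hypothetical, the exact 64,000,000 threshold is an assumption standing in for "roughly 64 million", and it assumes worksheet parts live under `xl/worksheets/` and use unprefixed element names, which is common but not guaranteed.

```python
import re
import zipfile

MAX_CELLS = 64_000_000  # assumed value for "roughly 64 million cells"

_REF = re.compile(r"^\$?([A-Z]+)\$?(\d+)(?::\$?([A-Z]+)\$?(\d+))?$")
_DIMENSION = re.compile(rb'<dimension\s+ref="([^"]+)"')


def column_number(letters):
    """Convert a column label to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    number = 0
    for ch in letters:
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def declared_cells(ref):
    """Number of cells in a range such as 'A1:C10' (a lone 'A1' counts as 1)."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"unrecognized range: {ref!r}")
    c1, r1, c2, r2 = match.groups()
    if c2 is None:
        c2, r2 = c1, r1
    width = abs(column_number(c2) - column_number(c1)) + 1
    height = abs(int(r2) - int(r1)) + 1
    return width * height


def oversized_sheets(xlsx, max_cells=MAX_CELLS):
    """Return the worksheet parts whose declared range exceeds max_cells."""
    flagged = []
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            with zf.open(name) as sheet:
                head = sheet.read(8192)  # the dimension element sits near the top
            match = _DIMENSION.search(head)
            if match and declared_cells(match.group(1).decode("ascii")) > max_cells:
                flagged.append(name)
    return flagged
```

For scale, a sheet declaring the full grid `A1:XFD1048576` claims 16,384 columns by 1,048,576 rows, about 17 billion cells, while a 26-column, 1,000-row export declares only 26,000.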


🪶 Crash isolation for document parsing

Reading text out of formats like PDF, Office documents, embedded databases, and images relies on format parsers that can occasionally fail badly on a corrupt or hostile file. Some of those failures, such as a runaway memory request or a stack overflow deep inside a parser, cannot be caught and recovered from the way an ordinary error can. Left unchecked, a single bad file could end the whole scan.

To contain that, each of these files is parsed in a short-lived helper process rather than in the main application. If a file makes the parser crash or run out of memory, only that helper process is affected. The main scan records the file as unreadable and moves on, so one malformed document never takes down an entire scan or the web interface. Plain-text files are read directly and do not use a helper process.

This protection is on by default and needs no configuration. If you ever need to turn it off, for example to compare behavior while diagnosing an issue, set the environment variable PIICRAWLER_DISABLE_EXTRACT_ISOLATION=1 before starting a scan. The built-in safeguards such as the spreadsheet range check above still apply either way.
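The general pattern of parsing in a throwaway process looks roughly like the sketch below. It is not the application's code: the function names, the helper command line, and the timeout are assumptions for illustration. Only the environment variable name comes from this page, and treating exactly the value 1 as "disabled" is this sketch's reading of the instruction above.

```python
import os
import subprocess
import sys

DISABLE_ENV = "PIICRAWLER_DISABLE_EXTRACT_ISOLATION"


def isolation_enabled(environ=None):
    """Isolation stays on unless the variable is set to exactly '1'."""
    environ = os.environ if environ is None else environ
    return environ.get(DISABLE_ENV) != "1"


def extract_isolated(helper_argv, path, timeout=60):
    """Run a (hypothetical) extraction helper on `path` in a child process.

    Returns (True, extracted_text) on success or (False, reason) when the
    helper crashes, exits with an error, or runs too long.
    """
    try:
        proc = subprocess.run(
            [*helper_argv, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "helper timed out"
    if proc.returncode != 0:
        # A crash or out-of-memory kill only ends the child; the caller
        # records the file as unreadable and continues with the next one.
        return False, f"helper exited with status {proc.returncode}"
    return True, proc.stdout


# Stand-in helper that "extracts" by upper-casing its argument.
demo_helper = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
ok, text = extract_isolated(demo_helper, "report.pdf")
```

The cost of this design is one process start per file, which is why it makes sense to reserve it for formats with complex parsers and read plain text directly.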
