Results Storage

PII Crawler stores scan results in an SQLite database. SQLite is a single-file database, and most programming languages have built-in support for connecting to it. By default, this file (piicrawler.db) is created in the directory you run PII Crawler from.
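As a minimal sketch of connecting to the results file, the following Python uses the standard library's sqlite3 module. The helper names open_results and count_files are hypothetical and not part of PII Crawler. Opening the file read-only is a cautious choice for viewing data, not something PII Crawler requires.

```python
import sqlite3
from pathlib import Path


def open_results(db_path="piicrawler.db"):
    """Open the results database read-only (hypothetical helper).

    mode=ro makes SQLite reject writes, so an exploratory query
    cannot change scan data by accident.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row  # allows access by column name
    return conn


def count_files(conn):
    """Return the number of rows in the files table."""
    return conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

Calling count_files(open_results()) from the directory that holds piicrawler.db returns the number of files recorded so far. Pass a different path if the database lives elsewhere.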

PII Crawler intentionally exposes the application's database to the user so the user can manually or programmatically:

  • View the data that is being collected
  • Run custom queries on the data
  • Create alerts from findings
  • Customize which files are scanned
  • Customize how files are scanned
  • Customize the behavior of PII Crawler

You can think of this database as an API to PII Crawler.
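Two of the uses listed above, creating alerts from findings and customizing which files are scanned, might look like the sketch below. The function names are hypothetical. The column names come from the files table in the Schema section. That PII Crawler honors a skip value written by the user is inferred from that column's description, so treat it as an assumption and try it on a copy of the database first.

```python
import sqlite3


def files_needing_alert(conn, min_ssns=1):
    """Return (path, count) pairs for files with at least min_ssns
    potential SSNs or tax IDs, highest count first."""
    rows = conn.execute(
        "SELECT path, potential_tax_ids_or_ssns FROM files "
        "WHERE potential_tax_ids_or_ssns >= ? "
        "ORDER BY potential_tax_ids_or_ssns DESC, path",
        (min_ssns,),
    )
    return [(path, count) for path, count in rows]


def skip_extension(conn, extension):
    """Set skip = 1 for every recorded file with this extension
    (for example ".pdf") and return the number of rows changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE files SET skip = 1 WHERE extension = ?",
            (extension,),
        )
    return cur.rowcount
```

The list returned by files_needing_alert can be handed to whatever notification mechanism is already in place, such as email or a ticketing system. skip_extension needs a writable connection, for example sqlite3.connect("piicrawler.db").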

Schema

CREATE TABLE IF NOT EXISTS "files" (
		path TEXT primary key,
		scan_started_at INTEGER,
		scan_finished_at INTEGER,
		size INTEGER,
		extension TEXT,
		mime_type TEXT,
		csz_clusters INTEGER,
		unique_csz_clusters INTEGER,
		unique_common_first_names INTEGER,
		unique_common_last_names INTEGER,
		potential_tax_ids_or_ssns INTEGER,
		text_extracted BOOLEAN default 0 NOT NULL,
		unique_common_email_domain_suffixes INTEGER,
		unique_emails INTEGER,
		unique_addresses INTEGER,
		results TEXT,
		skip BOOLEAN default 0 NOT NULL
	);
Column                                Description
path                                  absolute path to the file
scan_started_at                       Unix timestamp of when the scan started
scan_finished_at                      Unix timestamp of when the scan finished
size                                  size of the file in bytes
extension                             file extension (ex: .pdf, .csv)
mime_type                             detected file MIME type (ex: application/json, image/jpeg)
csz_clusters                          city, state, zip combination matches
unique_csz_clusters                   unique city, state, zip combination matches
unique_common_first_names             unique common first names
unique_common_last_names              unique common last names
potential_tax_ids_or_ssns             SSNs or Tax IDs
text_extracted                        boolean; true if file parsing, text extraction, or OCR was used
unique_common_email_domain_suffixes   count of common email suffixes found (supplemental to unique_emails)
unique_emails                         unique full email addresses
unique_addresses                      unique street addresses with a matching city, state, and zip
results                               not yet used
skip                                  boolean; if true, the file will not be scanned
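To illustrate how these columns combine in a custom query, here is a hedged sketch that lists the scanned files with the most unique email addresses, along with each scan's start time and duration. The function name scan_summary is hypothetical. The code assumes the Unix timestamps are in seconds, which the descriptions above do not state.

```python
import sqlite3
from datetime import datetime, timezone


def scan_summary(conn, limit=10):
    """Summarize finished, non-skipped files, most unique emails first."""
    rows = conn.execute(
        "SELECT path, size, unique_emails, scan_started_at, scan_finished_at "
        "FROM files "
        "WHERE skip = 0 "
        "AND scan_started_at IS NOT NULL "
        "AND scan_finished_at IS NOT NULL "
        "ORDER BY unique_emails DESC, path LIMIT ?",
        (limit,),
    ).fetchall()
    summary = []
    for path, size, emails, started, finished in rows:
        summary.append({
            "path": path,
            "size_bytes": size,
            "unique_emails": emails,
            # Assumption: timestamps are seconds since the Unix epoch.
            "started_utc": datetime.fromtimestamp(started, tz=timezone.utc).isoformat(),
            "duration_s": finished - started,
        })
    return summary
```

Files that are skipped or whose scan has not finished are left out, so every row in the summary has a meaningful duration.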