PII Data Types
PII Crawler currently supports the following PII data types. We plan to support all PII data types defined by CPPA.
U.S. Social Security Number (SSN)
A nine-digit number, usually written in the format NNN-NN-NNNN. The area-number prefix used to carry geographic meaning, but that ended with the randomization process introduced on June 25, 2011, which also opened previously unassigned area numbers for assignment, excluding area numbers 000, 666, and 900-999. An SSN has three parts:
Area Number (NNN) - Initially assigned based on geographical regions, it indicated the state of application. Since 2011, it has been assigned randomly.
Group Number (NN) - This two-digit number ranges from 01 to 99 and is not assigned consecutively. It follows a specific issuing pattern for administrative purposes.
Serial Number (NNNN) - This four-digit number is assigned sequentially and can be consecutive.
Valid format (though this specific number is not a valid SSN): 078-05-1120
Not valid (area number 666 is excluded): 666-12-1234
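The structural rules above can be sketched as a small check. This is a minimal illustration, not PII Crawler's actual implementation: the function name `is_plausible_ssn` is hypothetical, only the hyphenated format is handled, and the rejection of serial 0000 is an assumption taken from general SSA numbering rules rather than from this page.

```python
import re

# Matches the hyphenated layout NNN-NN-NNNN only; real data may also use
# spaces or no separators at all.
SSN_PATTERN = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")


def is_plausible_ssn(candidate: str) -> bool:
    """Return True if the string passes the structural SSN rules."""
    match = SSN_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # Area numbers 000, 666, and 900-999 are excluded from assignment.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Group numbers run from 01 to 99; serial 0000 is assumed to be unused.
    if group == 0 or serial == 0:
        return False
    return True
```

A structural pass only means the number could have been issued. As the 078-05-1120 example shows, a well-formed number can still be invalid.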
U.S. City, State, Zip Cluster (CSZ)
A cluster is a set of distinct pieces of data that by themselves don't represent much but when found or linked together can produce something meaningful.
90210 by itself doesn't mean much, but when 90210 is found near the words Beverly Hills, we know we have a city and ZIP code. PII Crawler uses this clustering method to find city, state, and ZIP code combinations. We call this a city, state, zip cluster, or CSZ.
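The clustering idea can be illustrated with a proximity check. The sketch below is an assumption about how such a check might look, not PII Crawler's method: `ZIP_TO_CITY` is a tiny hypothetical gazetteer, and the 40-character window is an arbitrary choice.

```python
import re

# Hypothetical gazetteer; a real scanner would load a full ZIP code dataset.
ZIP_TO_CITY = {
    "90210": ("Beverly Hills", "CA"),
    "10001": ("New York", "NY"),
}

ZIP_PATTERN = re.compile(r"\b[0-9]{5}\b")


def find_csz_clusters(text: str, window: int = 40) -> list[tuple[str, str, str]]:
    """Return (city, state, zip) for each known ZIP found near its city name."""
    clusters = []
    for match in ZIP_PATTERN.finditer(text):
        zip_code = match.group()
        entry = ZIP_TO_CITY.get(zip_code)
        if entry is None:
            continue
        city, state = entry
        # Inspect only the text within `window` characters of the ZIP code.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if city.lower() in text[start:end].lower():
            clusters.append((city, state, zip_code))
    return clusters
```

With this sketch, "Ship to Beverly Hills, CA 90210" yields one cluster, while "Invoice 90210 total" yields none because the city name is absent.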
Street Address
Meaningful street addresses are often found near CSZ clusters.
First Name
PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.
Last Name
PII Crawler uses common name lists and NER techniques to find names.
Date of Birth
PII Crawler detects date of birth using term lists and date recognition.
Email Address
PII Crawler uses a finite-state machine (FSM) to find email addresses.
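To show what state-machine matching of an email address can look like, here is a deliberately simplified sketch. The states, the allowed character sets, and the function name `looks_like_email` are all assumptions for illustration; PII Crawler's actual machine is not documented here and is likely stricter.

```python
import string

LOCAL_CHARS = set(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-")


def looks_like_email(token: str) -> bool:
    """Run a small state machine over the token, one character at a time."""
    state = "start"
    dots_in_domain = 0
    for ch in token:
        if state == "start":
            # The local part must begin with an allowed character other than a dot.
            if ch in LOCAL_CHARS and ch != ".":
                state = "local"
            else:
                return False
        elif state == "local":
            if ch == "@":
                state = "label_start"
            elif ch not in LOCAL_CHARS:
                return False
        elif state == "label_start":
            # A domain label may not be empty or begin with a hyphen.
            if ch in DOMAIN_CHARS and ch != "-":
                state = "label"
            else:
                return False
        elif state == "label":
            if ch == ".":
                dots_in_domain += 1
                state = "label_start"
            elif ch not in DOMAIN_CHARS:
                return False
    # Accept only when the input ends inside a label and the domain has a dot.
    return state == "label" and dots_in_domain >= 1
```

Each character causes exactly one transition, so the check runs in a single pass without backtracking.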
US Passport
Begins with a letter followed by eight digits.
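The format described above can be expressed as a pattern. This is a sketch of the stated rule only; the name `find_passport_candidates` is hypothetical, and a match is merely a candidate, since many unrelated identifiers share this shape.

```python
import re

# One letter followed by exactly eight digits, not embedded in a longer
# run of letters or digits.
PASSPORT_PATTERN = re.compile(r"(?<![A-Za-z0-9])[A-Za-z][0-9]{8}(?![A-Za-z0-9])")


def find_passport_candidates(text: str) -> list[str]:
    """Return every substring shaped like a letter plus eight digits."""
    return PASSPORT_PATTERN.findall(text)
```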
Credit Card
PII Crawler uses a custom FSM to find credit card numbers. It checks for valid IIN prefixes, lengths, and a checksum digit.
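The three checks named above (IIN prefix, length, checksum) can be sketched as follows. The checksum shown is the Luhn algorithm, the standard for payment cards. The `IIN_RULES` table is an illustrative subset, not a complete list, and the function names are hypothetical rather than PII Crawler's own.

```python
from typing import Optional


def luhn_valid(digits: str) -> bool:
    """Luhn check: double every second digit from the right, then sum mod 10."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


# Illustrative subset of IIN rules: brand -> (prefixes, allowed lengths).
IIN_RULES = {
    "visa": (("4",), (13, 16, 19)),
    "mastercard": (("51", "52", "53", "54", "55"), (16,)),
    "amex": (("34", "37"), (15,)),
}


def classify_card(candidate: str) -> Optional[str]:
    """Return the brand name if prefix, length, and checksum all pass."""
    digits = candidate.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit()):
        return None
    for brand, (prefixes, lengths) in IIN_RULES.items():
        if digits.startswith(prefixes) and len(digits) in lengths:
            return brand if luhn_valid(digits) else None
    return None
```

Checking the prefix and length before the checksum filters out most random digit runs cheaply, and the Luhn step then rejects nine out of ten of the remainder.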
Driver's License
PII Crawler combines a custom FSM with Aho-Corasick multi-pattern matching to find driver's license numbers (DLNs). The FSM checks for valid DLN formats, and the Aho-Corasick matcher then finds the candidates that appear near terms like "drivers license" or "driver's license".
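The two-stage idea, format match plus nearby context terms, can be sketched as below. Several simplifications are assumed: a regex alternation stands in for the Aho-Corasick matcher, the only format checked is one letter followed by seven digits (the California layout, used purely as an example), and the 50-character window is arbitrary.

```python
import re

# Illustrative format only: one letter followed by seven digits.
DLN_PATTERN = re.compile(r"\b[A-Za-z][0-9]{7}\b")

# Context terms. A regex alternation stands in for Aho-Corasick here.
KEYWORDS = ("drivers license", "driver's license", "driver license")
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(term) for term in KEYWORDS), re.IGNORECASE
)


def find_dln_candidates(text: str, window: int = 50) -> list[str]:
    """Return format matches that start within `window` characters of a keyword."""
    keyword_positions = [m.start() for m in KEYWORD_PATTERN.finditer(text)]
    hits = []
    for match in DLN_PATTERN.finditer(text):
        if any(abs(match.start() - pos) <= window for pos in keyword_positions):
            hits.append(match.group())
    return hits
```

Requiring a nearby context term matters because license formats are short and generic; without it, part numbers and order IDs would match constantly.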
AWS Credentials
PII Crawler uses a custom FSM built around AWS unique ID prefixes to find AWS credentials.
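As an illustration of prefix-based detection, the sketch below looks for access key IDs using two of the prefixes AWS documents: AKIA for long-term access keys and ASIA for temporary ones. A regular expression stands in for the custom FSM, the 20-character length is an assumption based on the usual key ID shape, and other credential types (such as secret access keys) are not covered.

```python
import re

# Prefix plus 16 uppercase letters or digits, not embedded in a longer run.
# Assumption: access key IDs are 20 characters long in total.
AWS_KEY_ID_PATTERN = re.compile(
    r"(?<![A-Z0-9])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Z0-9])"
)


def find_aws_key_ids(text: str) -> list[str]:
    """Return substrings shaped like AWS access key IDs."""
    return AWS_KEY_ID_PATTERN.findall(text)
```

For example, the placeholder key from AWS's own documentation, AKIAIOSFODNN7EXAMPLE, matches this pattern.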
Custom Regex
You can specify your own custom regex rules to match your specific data types. Add them in the scan options when creating a new scan.
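Before adding a rule, it can help to try the pattern against sample text locally. The snippet below is only a sketch: the employee ID format is hypothetical, and it uses Python's `re` syntax, which may differ in details from the regex dialect PII Crawler accepts.

```python
import re

# Hypothetical custom data type: employee IDs such as "EMP-004217".
EMPLOYEE_ID = re.compile(r"\bEMP-[0-9]{6}\b")

sample = "Badge EMP-004217 was issued; EMP-42 is too short to match."
matches = EMPLOYEE_ID.findall(sample)
print(matches)
```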
