PII Data Types
PII Crawler currently supports the following PII data types. We plan to support all PII data types defined by CPPA.
U.S. Social Security Number (SSN)
A nine-digit number, usually written in the format NNN-NN-NNNN. The prefix used to carry geographic meaning, but that ended with the randomization process introduced on June 25, 2011, which opened previously unassigned area numbers for assignment, excluding area numbers 000, 666, and 900-999. There are three parts:
Area Number (NNN) - Initially assigned based on geographical regions, it indicated the state of application. Since 2011, it has been assigned randomly.
Group Number (NN) - This two-digit number ranges from 01 to 99 and is not assigned consecutively. It follows a specific issuing pattern for administrative purposes.
Serial Number (NNNN) - This four-digit number is assigned sequentially and can be consecutive.
Valid format (although this specific number is not a valid SSN): 078-05-1120
Not Valid: 666-12-1234
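The area and group rules above can be sketched as a structural check. This is a minimal illustration, not PII Crawler's implementation: the name `looks_like_ssn` is hypothetical, and rejecting serial 0000 is an assumption based on SSA numbering rules rather than something stated above.

```python
import re

# Hypothetical helper: checks only the structure described above,
# not whether the number was ever actually issued.
SSN_RE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

def looks_like_ssn(text: str) -> bool:
    match = SSN_RE.fullmatch(text)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Group numbers run from 01 to 99.
    if group == 0:
        return False
    # Assumption: serial 0000 is also unassigned.
    return serial != 0
```

With this sketch, `looks_like_ssn("078-05-1120")` passes the structural check while `looks_like_ssn("666-12-1234")` is rejected because of its area number.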
U.S. City, State, Zip Cluster (CSZ)
A cluster is a set of distinct pieces of data that by themselves don't represent much but when found or linked together can produce something meaningful.
90210 by itself doesn't mean much but when 90210 is found near the words Beverly Hills we know we have a city and zip code. PII Crawler uses this clustering method to find City, State, and Zip codes. We call this a city, state, zip cluster or CSZ.
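The clustering idea can be illustrated with a small proximity check. This is a sketch under stated assumptions: the `KNOWN_CITIES` table, the `find_csz_clusters` name, and the 40-character window are all hypothetical, and a real detector would rely on a full gazetteer.

```python
import re

# Hypothetical lookup table standing in for a full ZIP-to-city gazetteer.
KNOWN_CITIES = {"90210": "Beverly Hills", "10001": "New York"}
ZIP_RE = re.compile(r"\b\d{5}\b")

def find_csz_clusters(text: str, window: int = 40) -> list[tuple[str, str]]:
    """Return (city, zip) pairs where the city name appears near the ZIP."""
    clusters = []
    for match in ZIP_RE.finditer(text):
        city = KNOWN_CITIES.get(match.group())
        if city is None:
            continue
        # Look at the text surrounding the ZIP code (window is an assumption).
        start = max(0, match.start() - window)
        nearby = text[start:match.end() + window]
        if city.lower() in nearby.lower():
            clusters.append((city, match.group()))
    return clusters
```

A bare `90210` in unrelated text produces no cluster; the same digits next to "Beverly Hills" do.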
Street Address
Meaningful street addresses are often found near CSZ clusters.
First Name
PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.
Last Name
PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.
Date of Birth
PII Crawler detects date of birth using term lists and date recognition.
Email Address
PII Crawler uses a finite-state machine (FSM) to find email addresses.
US Passport
Begins with a letter followed by eight digits.
Credit Card
PII Crawler uses a custom FSM to find credit card numbers. It checks for valid IIN prefixes, lengths, and a checksum digit.
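The checksum step can be sketched as follows, assuming it is the Luhn algorithm that payment card numbers use. The function name `luhn_valid` is hypothetical, and the IIN prefix and length checks mentioned above are left out.

```python
def luhn_valid(number: str) -> bool:
    # Ignore spaces and hyphens; keep only the digits.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A checksum like this is what lets a detector discard most random 16-digit runs before reporting them.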
Driver's License
PII Crawler uses a combination of FSM and Aho-Corasick multi-pattern matching to find driver's license numbers. It uses a custom FSM to check for valid DLN formats and then uses an Aho-Corasick multi-pattern matcher to find the numbers near terms like "drivers license" or "driver's license".
NZ Inland Revenue Department Number (IRD)
Off by default. Enable in scan options if you need to detect New Zealand IRD Numbers.
NZ IRD Numbers are 8 or 9 digit identifiers issued by Inland Revenue and used as both personal and business tax IDs. They appear formatted with hyphens (XX-XXX-XXX or XXX-XXX-XXX) or as a flat run of digits. PII Crawler validates the official IRD modulo-11 checksum using the two-pass primary and secondary weight scheme.
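A two-pass modulo-11 check of this kind might look like the sketch below. The weight sequences are given as I understand them from Inland Revenue's published specification and should be verified against it; the function names are hypothetical, and the numeric range check from the specification is omitted.

```python
# Assumed weight sequences; verify against the Inland Revenue specification.
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)

def _check_digit(base: str, weights: tuple[int, ...]) -> int:
    total = sum(int(d) * w for d, w in zip(base, weights))
    remainder = total % 11
    return 0 if remainder == 0 else 11 - remainder

def ird_valid(text: str) -> bool:
    digits = text.replace("-", "")
    if not digits.isdecimal() or len(digits) not in (8, 9):
        return False
    # Split off the trailing check digit and left-pad the base to 8 digits.
    base, expected = digits[:-1].zfill(8), int(digits[-1])
    check = _check_digit(base, PRIMARY_WEIGHTS)
    if check == 10:
        # First pass is inconclusive; retry with the secondary weights.
        check = _check_digit(base, SECONDARY_WEIGHTS)
        if check == 10:
            return False
    return check == expected
```

The second pass only runs when the primary weights yield a check value of 10, which cannot be written as a single digit.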
AU Tax File Number (TFN)
Off by default. Enable in scan options if you need to detect Australian Tax File Numbers.
AU TFNs are 8 or 9 digit identifiers issued by the Australian Taxation Office. They are typically formatted as space-separated triplets (NNN NNN NNN or NNN NNN NN), or as a flat run of digits. PII Crawler validates the official ATO mod-11 checksum using the published weight sequences for both lengths.
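As an illustration of the mod-11 check, the sketch below covers only the nine-digit form. The weight sequence is the commonly cited one and is an assumption to verify against ATO documentation; the eight-digit sequence is left out here rather than guessed, and `tfn9_valid` is a hypothetical name.

```python
# Assumed weights for nine-digit TFNs; verify against ATO documentation.
TFN_WEIGHTS_9 = (1, 4, 3, 7, 5, 8, 6, 9, 10)

def tfn9_valid(text: str) -> bool:
    digits = text.replace(" ", "")
    if len(digits) != 9 or not digits.isdecimal():
        return False
    # The weighted sum of all nine digits must be divisible by 11.
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS_9))
    return total % 11 == 0
```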
DE Tax Identification Number (Steuer-ID)
Off by default. Enable in scan options if you need to detect German Tax Identification Numbers (Steuerliche Identifikationsnummer, "Steuer-ID" or "IdNr.").
DE Steuer-IDs are 11-digit identifiers issued by the Bundeszentralamt für Steuern. They appear as digit runs, space-separated groups (NN NNN NNN NNN), or hyphen-separated groups. PII Crawler validates the ISO 7064 MOD 11,10 check digit and enforces the official repetition rule on the first 10 digits (exactly one digit may appear two or three times; every other digit at most once).
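The two checks described above can be sketched together. This is an illustrative reading of the rules, not PII Crawler's code: `steuer_id_valid` is a hypothetical name, and additional official constraints (for example on the leading digit) are not modelled.

```python
from collections import Counter

def steuer_id_valid(text: str) -> bool:
    digits = text.replace(" ", "").replace("-", "")
    if len(digits) != 11 or not digits.isdecimal():
        return False
    # Repetition rule: in the first ten digits, exactly one digit occurs
    # two or three times and every other digit occurs at most once.
    repeats = [n for n in Counter(digits[:10]).values() if n > 1]
    if len(repeats) != 1 or repeats[0] not in (2, 3):
        return False
    # ISO 7064 MOD 11,10 over the first ten digits.
    product = 10
    for ch in digits[:10]:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    # A computed value of 10 maps to check digit 0.
    check = (11 - product) % 10
    return check == int(digits[10])
```

Running the repetition rule first is a cheap way to reject most arbitrary 11-digit runs before the checksum is computed.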
NZ Passport Number
Off by default. Enable in scan options if you need to detect New Zealand passport numbers.
NZ passports issued from 2005 onwards (EA, LA, and RA biometric series) carry a number consisting of two uppercase letters followed by six digits (for example, LH615098). Pre-2005 passports used a one-letter, seven-digit format and have all expired given the ten-year adult validity, so PII Crawler matches the modern two-letter, six-digit form only. Detection requires a New Zealand passport context term such as "passport", "New Zealand passport", or "DIA" within 150 characters.
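Context-gated detection of this kind can be sketched with a pattern plus a keyword search in a 150-character window. How PII Crawler measures the window is not stated, so measuring from the edges of the match is an assumption, and `find_nz_passports` is a hypothetical name.

```python
import re

NZ_PASSPORT_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")
# "passport" also covers "New Zealand passport" as a substring.
CONTEXT_RE = re.compile(r"passport|\bDIA\b", re.IGNORECASE)
WINDOW = 150  # characters on either side of the match (assumed measurement)

def find_nz_passports(text: str) -> list[str]:
    hits = []
    for match in NZ_PASSPORT_RE.finditer(text):
        start = max(0, match.start() - WINDOW)
        nearby = text[start:match.end() + WINDOW]
        # Report the candidate only if a context term is close by.
        if CONTEXT_RE.search(nearby):
            hits.append(match.group())
    return hits
```

The same windowing approach applies to the other context-gated detectors on this page; only the pattern and the term list change.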
AU Passport Number
Off by default. Enable in scan options if you need to detect Australian passport numbers.
AU passports use either a single-letter prefix (N, E, D, F, A, C, U, or X) followed by seven digits, or a two-letter prefix beginning with P (PA, PB, PC, PD, PE, PF, PU, PW, PX, PZ) followed by seven digits. There is no published checksum, so detection requires a context term such as "passport", "Australian passport", or "DFAT" within 150 characters.
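The two prefix families above translate into a compact pattern. This sketch checks shape only; the context-term requirement still applies on top of it, and `au_passport_shape` is a hypothetical name.

```python
import re

# Single-letter prefixes, or a two-letter prefix beginning with P,
# followed by seven digits.
AU_PASSPORT_RE = re.compile(r"(?:[NEDFACUX]|P[ABCDEFUWXZ])\d{7}")

def au_passport_shape(candidate: str) -> bool:
    return AU_PASSPORT_RE.fullmatch(candidate) is not None
```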
DE Passport (Reisepass) Number
Off by default. Enable in scan options if you need to detect German passport (Reisepass) numbers.
German passports issued from 2021-11-01 onwards use a nine-character alphanumeric document number. The first character is a letter and the remaining eight characters are drawn from the digits 1-9 and the letters C F G H J K L M N P R T V W X Y Z. The digit 0 and the letter O are deliberately excluded to avoid visual ambiguity (see Wikipedia: German passport). Detection requires a context term such as "Reisepass", "Passnummer", or "passport number" within 150 characters.
NZ Driver Licence Number
Off by default. Enable in scan options if you need to detect New Zealand driver licence numbers.
NZ driver licence numbers are eight characters: two uppercase letters followed by six digits (for example, BQ739482). The card-version code printed on the physical card is a separate three-digit reissue counter and is not part of the licence number. Detection requires a context term such as "driver licence", "drivers licence", "driver's licence", or "Waka Kotahi" (the NZ Transport Agency) within 150 characters.
The licence number shares its shape with the modern NZ passport pattern, so the same string can be reported as either type depending on the surrounding keywords. PII Crawler keeps the two detectors separate by matching only on licence-related terms here. Use the American spelling ("license") if you want the generic Driver's License detector to fire instead.
AU Driver Licence Number
Off by default. Enable in scan options if you need to detect Australian driver licence numbers.
Australia has no national format. Each state and territory issues its own scheme; PII Crawler matches the union of formats:
| Variant | States that use it |
|---|---|
| 6 to 10 digits | NSW (8), VIC (9), QLD (9), SA / WA / TAS (7), ACT (8 or 9), NT (6 or 7) |
| 1 letter + 5 digits | older licences |
| 2 letters + 4 digits | older licences |
| 4 digits + 2 letters | older NSW (pre-1990s) |
No state publishes a checksum, so detection relies on context. A licence-related term such as "driver licence", "drivers licence", "driver's licence", or "Australian Automobile Association" must appear within 150 characters.
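The union of formats in the table can be expressed as a single alternation. This is a shape-only sketch with a hypothetical name, `au_licence_shape`; because bare digit runs are so common, the context requirement described above carries most of the precision.

```python
import re

AU_LICENCE_RE = re.compile(
    r"(?:\d{6,10}"      # 6 to 10 digits (current state formats)
    r"|[A-Z]\d{5}"      # 1 letter + 5 digits
    r"|[A-Z]{2}\d{4}"   # 2 letters + 4 digits
    r"|\d{4}[A-Z]{2})"  # 4 digits + 2 letters
)

def au_licence_shape(candidate: str) -> bool:
    return AU_LICENCE_RE.fullmatch(candidate) is not None
```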
DE Driver Licence Number (Führerscheinnummer)
Off by default. Enable in scan options if you need to detect German driver licence numbers (Führerscheinnummer).
German driver licence numbers are exactly 11 alphanumeric characters with a fixed positional structure:
- Position 1: state authority code (a letter from A for Baden-Württemberg through P for Thuringia, or a digit).
- Positions 2 to 3: district code (two digits).
- Positions 4 to 9: sequential number (six digits or letters; letters appear once a district passes one million issued licences).
- Position 10: check digit (0 to 9, or X when the modulo-11 remainder is 10).
- Position 11: issue number (digit, then A to Z after the tenth re-issue).
For example, B072RRE2I55 is a Bavarian licence in district 07 with sequential number 2RRE2I, check digit 5, and issue number 5. PII Crawler validates the published modulo-11 checksum using weights 9 down to 1 over positions 1 through 9, with letters scored as alphabet position plus nine (so A is 10, B is 11, and Z is 35). A context term such as "Führerschein", "Führerscheinnummer", "Fahrerlaubnis", or the English "driver licence" must appear within 150 characters.
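The checksum described above can be sketched directly from the stated weights and letter scoring. The function names are hypothetical, and the sketch assumes the check character is the modulo-11 remainder itself, which matches the worked example: for B072RRE2I55 the weighted sum is 467, and 467 mod 11 is 5.

```python
def _char_value(ch: str) -> int:
    # Digits score as themselves; letters as alphabet position plus nine.
    if ch.isdigit():
        return int(ch)
    return ord(ch) - ord("A") + 10

def de_licence_check_char(number: str) -> str:
    # Weights 9 down to 1 over positions 1 through 9.
    total = sum(
        _char_value(ch) * weight
        for ch, weight in zip(number[:9], range(9, 0, -1))
    )
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def de_licence_valid(number: str) -> bool:
    number = number.upper()
    if len(number) != 11 or not (number.isascii() and number.isalnum()):
        return False
    # Position 10 (index 9) holds the check character.
    return de_licence_check_char(number) == number[9]
```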
AWS Credentials
PII Crawler uses a custom FSM built around AWS unique ID prefixes to find AWS credentials.
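As a rough illustration of prefix-based matching, the sketch below uses a regular expression rather than a hand-built FSM. It assumes access key IDs are 20 characters long and start with one of two documented prefixes, `AKIA` (long-term keys) or `ASIA` (temporary keys); the function name is hypothetical and other credential types are not covered.

```python
import re

# Assumption: a 4-character prefix followed by 16 uppercase letters or digits.
AWS_KEY_ID_RE = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")

def find_aws_key_ids(text: str) -> list[str]:
    return AWS_KEY_ID_RE.findall(text)
```

For example, scanning a config line such as `aws_access_key_id = AKIAIOSFODNN7EXAMPLE` (the placeholder key used in AWS documentation) returns that key ID.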
Custom Regex
You can specify your own custom regex rules to match your specific data types. Add them in the scan options when creating a new scan.
