Reference

PII Data Types

Last updated June 2026

PII Crawler currently supports the following PII data types. We plan to support all PII data types defined by CPPA.

U.S. Social Security Number (SSN)

A nine-digit number, usually written in the format NNN-NN-NNNN. The area-number prefix used to carry geographic meaning, but that ended with the randomization introduced on June 25, 2011, which opened previously unassigned area numbers for assignment, excluding 000, 666, and 900-999. There are three parts:

  • Area Number (NNN) - Initially assigned based on geographical regions, it indicated the state of application. Since 2011, it has been assigned randomly.

  • Group Number (NN) - This two-digit number ranges from 01 to 99 and is not assigned consecutively. It follows a specific issuing pattern for administrative purposes.

  • Serial Number (NNNN) - This four-digit number is assigned sequentially and can be consecutive.

  • Valid format (this specific number is widely known and is not itself a valid SSN): 078-05-1120

  • Not valid (area number 666 is never assigned): 666-12-1234
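The structural rules above can be sketched as a small format check. This is a minimal illustration and not PII Crawler's implementation: the function name `looks_like_ssn` is hypothetical, and rejecting serial number 0000 is an assumption that the text above does not state.

```python
import re

# Hyphenated form only: NNN-NN-NNNN
SSN_RE = re.compile(r"([0-9]{3})-([0-9]{2})-([0-9]{4})")

def looks_like_ssn(text: str) -> bool:
    """Return True if text has the SSN shape and passes the range rules."""
    m = SSN_RE.fullmatch(text)
    if m is None:
        return False
    area, group, serial = (int(part) for part in m.groups())
    # Area numbers 000, 666, and 900-999 are excluded from assignment.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Group numbers range from 01 to 99.
    if group == 0:
        return False
    # Assumption: serial 0000 is treated as never issued.
    return serial != 0

print(looks_like_ssn("078-05-1120"))  # True (valid shape only)
print(looks_like_ssn("666-12-1234"))  # False
```

A check like this confirms only that a string could be an SSN; it cannot tell whether the number was ever issued.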

U.S. City, State, Zip Cluster (CSZ)

A cluster is a set of distinct pieces of data that by themselves don't represent much but when found or linked together can produce something meaningful.

On its own, 90210 doesn't mean much, but when 90210 is found near the words Beverly Hills, we know we have a city and ZIP code. PII Crawler uses this clustering method to find cities, states, and ZIP codes. We call this a city, state, zip cluster, or CSZ.
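The clustering idea can be sketched as a proximity search. Everything here is illustrative: the one-entry `KNOWN_ZIPS` table, the 40-character window, and the function name `find_csz_clusters` are assumptions, not PII Crawler's actual data or parameters.

```python
import re

# Hypothetical lookup table; a real one would cover every ZIP code.
KNOWN_ZIPS = {"90210": ("Beverly Hills", "CA")}

ZIP_RE = re.compile(r"\b[0-9]{5}\b")

def find_csz_clusters(text: str, window: int = 40):
    """Report a ZIP code only when its city name appears nearby."""
    hits = []
    for m in ZIP_RE.finditer(text):
        entry = KNOWN_ZIPS.get(m.group())
        if entry is None:
            continue
        city, state = entry
        nearby = text[max(0, m.start() - window): m.end() + window]
        if city.lower() in nearby.lower():
            hits.append((city, state, m.group()))
    return hits

print(find_csz_clusters("Ship to Beverly Hills, CA 90210"))
print(find_csz_clusters("Order 90210 has shipped"))
```

The first call reports the cluster because the city name sits next to the ZIP code; the second reports nothing because 90210 appears alone.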

Street Address

Meaningful street addresses are often found near CSZ clusters.

First Name

PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.

Last Name

PII Crawler uses common name lists and named-entity recognition (NER) techniques to find names.

Date of Birth

PII Crawler detects date of birth using term lists and date recognition.

Email Address

PII Crawler uses a finite-state machine (FSM) to find email addresses.

US Passport

A US passport number begins with a letter followed by eight digits.
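As a rough sketch, that shape can be expressed as a regular expression. The name `find_us_passport_candidates` is hypothetical, and treating the leading letter as uppercase A to Z is an assumption.

```python
import re

# One uppercase letter followed by exactly eight digits.
US_PASSPORT_RE = re.compile(r"\b[A-Z][0-9]{8}\b")

def find_us_passport_candidates(text: str):
    """Return every substring that has the letter-plus-eight-digits shape."""
    return US_PASSPORT_RE.findall(text)

print(find_us_passport_candidates("Passport A12345678 on file"))
```

A shape this short matches many unrelated identifiers, so a pattern alone should be treated as producing candidates rather than confirmed findings.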

Credit Card

PII Crawler uses a custom FSM to find credit card numbers. It checks for valid IIN prefixes, lengths, and a checksum digit.
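The checksum step can be illustrated with the Luhn algorithm, which is the standard check digit scheme for payment card numbers. That PII Crawler uses Luhn specifically is an assumption, and this sketch skips the IIN prefix check; `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number: str) -> bool:
    """Check length and the Luhn checksum; separators are ignored."""
    digits = [int(c) for c in number if c in "0123456789"]
    # Assumption: accept the usual 13 to 19 digit card lengths.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```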

Driver's License

PII Crawler combines a custom FSM with Aho-Corasick multi-pattern matching to find driver's license numbers. The FSM checks for valid driver's license number (DLN) formats, and the Aho-Corasick matcher then looks for those numbers near terms like "drivers license" or "driver's license".

NZ Inland Revenue Department Number (IRD)

Off by default. Enable in scan options if you need to detect New Zealand IRD Numbers.

NZ IRD Numbers are 8 or 9 digit identifiers issued by Inland Revenue and used as both personal and business tax IDs. They appear formatted with hyphens (XX-XXX-XXX or XXX-XXX-XXX) or as a flat run of digits. PII Crawler validates the official IRD modulo-11 checksum using the two-pass primary and secondary weight scheme.
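The two-pass checksum can be sketched as follows. The weight sequences are the ones commonly cited from Inland Revenue's published specification and should be verified against it; `ird_valid` is a hypothetical name, and the specification's numeric range check is omitted here.

```python
import re

PRIMARY = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY = (7, 4, 3, 2, 5, 2, 7, 6)

def _check_digit(base: str, weights) -> int:
    """Return the computed check value (10 means 'try the next pass')."""
    remainder = sum(int(d) * w for d, w in zip(base, weights)) % 11
    return 0 if remainder == 0 else 11 - remainder

def ird_valid(value: str) -> bool:
    digits = value.replace("-", "")
    if not re.fullmatch(r"[0-9]{8,9}", digits):
        return False
    # The last digit is the check digit; pad the rest to eight digits.
    base, check = digits[:-1].zfill(8), int(digits[-1])
    expected = _check_digit(base, PRIMARY)
    if expected == 10:
        # Second pass with the secondary weights.
        expected = _check_digit(base, SECONDARY)
    return expected != 10 and expected == check

print(ird_valid("49-091-850"))   # True
print(ird_valid("136-410-132"))  # True (needs the secondary pass)
print(ird_valid("136-410-133"))  # False
```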

AU Tax File Number (TFN)

Off by default. Enable in scan options if you need to detect Australian Tax File Numbers.

AU TFNs are 8 or 9 digit identifiers issued by the Australian Taxation Office. They are typically formatted as space-separated triplets (NNN NNN NNN or NNN NNN NN), or as a flat run of digits. PII Crawler validates the official ATO mod-11 checksum using the published weight sequences for both lengths.
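For the nine-digit form, the mod-11 check can be sketched as a weighted sum that must divide evenly by 11. The weights below are the commonly cited sequence for nine-digit TFNs; the eight-digit sequence is left out of this sketch, and `tfn9_valid` is a hypothetical name.

```python
import re

# Commonly cited weights for nine-digit TFNs.
TFN9_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)

def tfn9_valid(value: str) -> bool:
    """Validate a nine-digit TFN; whitespace separators are ignored."""
    digits = re.sub(r"\s+", "", value)
    if not re.fullmatch(r"[0-9]{9}", digits):
        return False
    total = sum(int(d) * w for d, w in zip(digits, TFN9_WEIGHTS))
    return total % 11 == 0

print(tfn9_valid("123 456 782"))  # True (weighted sum is 253 = 11 * 23)
print(tfn9_valid("123 456 789"))  # False
```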

DE Tax Identification Number (Steuer-ID)

Off by default. Enable in scan options if you need to detect German Tax Identification Numbers (Steuerliche Identifikationsnummer, "Steuer-ID" or "IdNr.").

DE Steuer-IDs are 11-digit identifiers issued by the Bundeszentralamt für Steuern. They appear as digit runs, space-separated groups (NN NNN NNN NNN), or hyphen-separated groups. PII Crawler validates the ISO 7064 MOD 11,10 check digit and enforces the official repetition rule on the first 10 digits (exactly one digit may appear two or three times; every other digit at most once).
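Both checks can be sketched in a few lines. This is an illustration under assumptions, not PII Crawler's code: `steuer_id_valid` is a hypothetical name, and the sketch applies only the two rules described above.

```python
import re
from collections import Counter

def steuer_id_valid(value: str) -> bool:
    digits = re.sub(r"[\s-]", "", value)
    if not re.fullmatch(r"[0-9]{11}", digits):
        return False
    # Repetition rule on the first ten digits: exactly one digit
    # appears two or three times, every other digit at most once.
    repeats = [n for n in Counter(digits[:10]).values() if n > 1]
    if len(repeats) != 1 or repeats[0] > 3:
        return False
    # ISO 7064 MOD 11,10 check digit over the first ten digits.
    product = 10
    for ch in digits[:10]:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    check = (11 - product) % 10  # a result of 10 maps to 0
    return check == int(digits[-1])

print(steuer_id_valid("86 095 742 719"))  # True
print(steuer_id_valid("86 095 742 718"))  # False (wrong check digit)
```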

NZ Passport Number

Off by default. Enable in scan options if you need to detect New Zealand passport numbers.

NZ passports issued from 2005 onwards (EA, LA, and RA biometric series) carry a number consisting of two uppercase letters followed by six digits (for example, LH615098). Pre-2005 passports used a one-letter, seven-digit format and have all expired given the ten-year adult validity, so PII Crawler matches the modern two-letter, six-digit form only. Detection requires a New Zealand passport context term such as "passport", "New Zealand passport", or "DIA" within 150 characters.
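Pattern matching with a context window can be sketched like this. The name `find_nz_passports` is hypothetical, and measuring the 150 characters from each end of the match is an assumption about how the distance is counted.

```python
import re

# Two uppercase letters followed by six digits.
NZ_PASSPORT_RE = re.compile(r"\b[A-Z]{2}[0-9]{6}\b")
# "New Zealand passport" is covered by the shorter term "passport".
CONTEXT_TERMS = ("passport", "dia")

def find_nz_passports(text: str, window: int = 150):
    """Keep a candidate only if a context term appears nearby."""
    hits = []
    for m in NZ_PASSPORT_RE.finditer(text):
        nearby = text[max(0, m.start() - window): m.end() + window].lower()
        # Whole-word match so that "media" does not count as "dia".
        if any(re.search(rf"\b{re.escape(t)}\b", nearby) for t in CONTEXT_TERMS):
            hits.append(m.group())
    return hits

print(find_nz_passports("New Zealand passport LH615098"))
print(find_nz_passports("Tracking code LH615098"))
```

The second call returns an empty list: the string has the right shape, but no context term is within range.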

AU Passport Number

Off by default. Enable in scan options if you need to detect Australian passport numbers.

AU passports use either a single-letter prefix (N, E, D, F, A, C, U, or X) followed by seven digits, or a two-letter prefix beginning with P (PA, PB, PC, PD, PE, PF, PU, PW, PX, PZ) followed by seven digits. There is no published checksum, so detection requires a context term such as "passport", "Australian passport", or "DFAT" within 150 characters.
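The two prefix families can be written as one pattern. This sketch covers the shape only and leaves out the context-term requirement; `find_au_passport_candidates` is a hypothetical name.

```python
import re

# Single-letter prefix, or a two-letter prefix beginning with P,
# followed by seven digits.
AU_PASSPORT_RE = re.compile(
    r"\b(?:[NEDFACUX]|P[ABCDEFUWXZ])[0-9]{7}\b"
)

def find_au_passport_candidates(text: str):
    """Return substrings with an Australian passport number shape."""
    return AU_PASSPORT_RE.findall(text)

print(find_au_passport_candidates("passport N1234567 and PA7654321"))
```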

DE Passport (Reisepass) Number

Off by default. Enable in scan options if you need to detect German passport (Reisepass) numbers.

German passports issued from 2021-11-01 onwards use a nine-character alphanumeric document number. The first character is a letter and the remaining eight characters are drawn from the digits 1-9 and the letters C F G H J K L M N P R T V W X Y Z. The digit 0 and the letter O are deliberately excluded to avoid visual ambiguity (see Wikipedia: German passport). Detection requires a context term such as "Reisepass", "Passnummer", or "passport number" within 150 characters.
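The restricted alphabet translates directly into a character class. This sketch covers the shape only and leaves out the context-term requirement; `find_de_passport_candidates` is a hypothetical name, and allowing any uppercase letter A to Z in the first position is an assumption.

```python
import re

# First character: a letter. Remaining eight: digits 1-9 plus the
# permitted letters (no 0, no O, no vowels).
DE_PASSPORT_RE = re.compile(r"\b[A-Z][1-9CFGHJKLMNPRTVWXYZ]{8}\b")

def find_de_passport_candidates(text: str):
    """Return substrings with the nine-character Reisepass shape."""
    return DE_PASSPORT_RE.findall(text)

print(find_de_passport_candidates("Passnummer: C1FG2HJ3K"))
```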

NZ Driver Licence Number

Off by default. Enable in scan options if you need to detect New Zealand driver licence numbers.

NZ driver licence numbers are eight characters: two uppercase letters followed by six digits (for example, BQ739482). The card-version code printed on the physical card is a separate three-digit reissue counter and is not part of the licence number. Detection requires a context term such as "driver licence", "drivers licence", "driver's licence", or "Waka Kotahi" (the NZ Transport Agency) within 150 characters.

The licence number shares its shape with the modern NZ passport pattern, so the same string can be reported as either type depending on the surrounding keywords. PII Crawler keeps the two detectors separate by matching only on licence-related terms here. Use the American spelling ("license") if you want the generic Driver's License detector to fire instead.

AU Driver Licence Number

Off by default. Enable in scan options if you need to detect Australian driver licence numbers.

Australia has no national format. Each state and territory issues its own scheme; PII Crawler matches the union of formats:

Variant               States that use it
6 to 10 digits        NSW (8), VIC (9), QLD (9), SA / WA / TAS (7), ACT (8 or 9), NT (6 or 7)
1 letter + 5 digits   older licences
2 letters + 4 digits  older licences
4 digits + 2 letters  older NSW (pre-1990s)

No state publishes a checksum, so detection relies on context. A licence-related term such as "driver licence", "drivers licence", "driver's licence", or "Australian Automobile Association" must appear within 150 characters.
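The union of formats plus the context requirement can be sketched as below. The name `find_au_licences` is hypothetical, and measuring the 150 characters from each end of the match is an assumption.

```python
import re

# Union of the four variants listed above.
AU_LICENCE_RE = re.compile(
    r"\b(?:[0-9]{6,10}|[A-Z][0-9]{5}|[A-Z]{2}[0-9]{4}|[0-9]{4}[A-Z]{2})\b"
)
CONTEXT_TERMS = (
    "driver licence",
    "drivers licence",
    "driver's licence",
    "australian automobile association",
)

def find_au_licences(text: str, window: int = 150):
    """Keep a candidate only if a licence-related term appears nearby."""
    hits = []
    for m in AU_LICENCE_RE.finditer(text):
        nearby = text[max(0, m.start() - window): m.end() + window].lower()
        if any(term in nearby for term in CONTEXT_TERMS):
            hits.append(m.group())
    return hits

print(find_au_licences("Driver licence: 12345678"))
print(find_au_licences("Invoice 12345678"))
```

Because a bare run of 6 to 10 digits is so common, the context check does most of the work here; without it, nearly every invoice or phone number would match.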

DE Driver Licence Number (Führerscheinnummer)

Off by default. Enable in scan options if you need to detect German driver licence numbers (Führerscheinnummer).

German driver licence numbers are exactly 11 alphanumeric characters with a fixed positional structure:

  • Position 1: state authority code (a letter from A for Baden-Württemberg through P for Thuringia, or a digit).
  • Positions 2 to 3: district code (two digits).
  • Positions 4 to 9: sequential number (six digits or letters; letters appear once a district passes one million issued licences).
  • Position 10: check digit (0 to 9, or X when the modulo-11 remainder is 10).
  • Position 11: issue number (digit, then A to Z after the tenth re-issue).

For example, B072RRE2I55 is a Bavarian licence in district 07 with sequential number 2RRE2I, check digit 5, and issue number 5. PII Crawler validates the published modulo-11 checksum using weights 9 down to 1 over positions 1 through 9, with letters scored as alphabet position plus nine (so A is 10, B is 11, and Z is 35). A context term such as "Führerschein", "Führerscheinnummer", "Fahrerlaubnis", or the English "driver licence" must appear within 150 characters.
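The checksum can be worked through for the example number. With B = 11, R = 27, E = 14, and I = 18, the weighted sum is 11·9 + 0·8 + 7·7 + 2·6 + 27·5 + 27·4 + 14·3 + 2·2 + 18·1 = 467, and 467 mod 11 = 5, which matches the check digit in position 10. A minimal sketch of that calculation follows; the function names are hypothetical, and the shape pattern is an approximation of the positional rules listed above.

```python
import re

# Approximate shape: authority code, two-digit district, six-character
# sequence, check character, issue number.
DE_LICENCE_RE = re.compile(r"[A-Z0-9][0-9]{2}[A-Z0-9]{6}[0-9X][A-Z0-9]")

def char_value(ch: str) -> int:
    """Digits score as themselves; letters as alphabet position plus nine."""
    return int(ch) if ch in "0123456789" else ord(ch) - ord("A") + 10

def de_licence_check_char(number: str) -> str:
    """Compute the check character from positions 1 through 9."""
    weights = range(9, 0, -1)  # 9, 8, ..., 1
    total = sum(char_value(c) * w for c, w in zip(number[:9], weights))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def de_licence_valid(number: str) -> bool:
    number = number.upper()
    if not DE_LICENCE_RE.fullmatch(number):
        return False
    return de_licence_check_char(number) == number[9]

print(de_licence_check_char("B072RRE2I55"))  # 5
print(de_licence_valid("B072RRE2I55"))       # True
```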

AWS Credentials

PII Crawler uses a custom FSM built around AWS unique ID prefixes to find AWS credentials.
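As a rough illustration of prefix-based matching, the sketch below looks for access key IDs, which AWS documents as starting with AKIA (long-term keys) or ASIA (temporary keys). It uses a regular expression rather than an FSM, handles only those two prefixes, and assumes 16 trailing uppercase alphanumeric characters; `find_aws_access_key_ids` is a hypothetical name.

```python
import re

# Prefix plus 16 characters gives the 20-character access key ID.
AWS_KEY_ID_RE = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")

def find_aws_access_key_ids(text: str):
    """Return substrings shaped like AWS access key IDs."""
    return AWS_KEY_ID_RE.findall(text)

# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
print(find_aws_access_key_ids("aws_access_key_id = AKIAIOSFODNN7EXAMPLE"))
```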

Custom Regex

You can specify your own custom regex rules to match your specific data types. Simply add them in the scan options when creating a new scan:

[Image: custom-regex]
