Regex Proximity Groups

Regex Proximity Groups let you detect when multiple regex patterns appear within a specified distance of each other in your scanned files. This is useful for finding pieces of information that are only meaningful when they appear close together, such as:

Social Security Numbers appearing near email addresses
Names appearing near phone numbers
Credit card numbers appearing near CVV codes
Custom patterns that indicate sensitive data when found together

How It Works

A Regex Proximity Group consists of:

Name - A descriptive name for the group (e.g., "SSN + Email Proximity")
Description - Optional details about what the group detects
Distance - Maximum character distance within which ALL patterns must appear
Patterns - A collection of regex patterns (all must match within the distance)

When scanning files, PII Crawler checks if all patterns in a group appear within the specified character distance. Only when ALL patterns are found within that window is a match recorded.
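
The check described above can be sketched in Java. This is a minimal illustration under stated assumptions, not PII Crawler's actual implementation: the class and method names are hypothetical, and it assumes the distance is measured from the start of the earliest match to the end of the latest one, which this page does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an "all patterns within N characters" check.
class ProximityWindowSketch {

    // Returns true when some span of at most `distance` characters contains
    // at least one match of every pattern. Assumption: the span runs from the
    // start of the earliest hit to the end of the latest hit.
    static boolean allWithinDistance(String text, List<String> regexes, int distance) {
        // Collect every hit as {patternIndex, start, end}.
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < regexes.size(); i++) {
            Matcher m = Pattern.compile(regexes.get(i)).matcher(text);
            while (m.find()) {
                hits.add(new int[] {i, m.start(), m.end()});
            }
        }
        // Try each hit as the left edge of the window.
        for (int[] anchor : hits) {
            int windowEnd = anchor[1] + distance;
            boolean[] seen = new boolean[regexes.size()];
            int found = 0;
            for (int[] h : hits) {
                boolean inside = h[1] >= anchor[1] && h[2] <= windowEnd;
                if (inside && !seen[h[0]]) {
                    seen[h[0]] = true;
                    found++;
                }
            }
            if (found == regexes.size()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> group = List.of(
            "\\d{3}-\\d{2}-\\d{4}",
            "[\\w.\\-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");
        String near = "SSN 123-45-6789, contact jane@example.com";
        String far = "SSN 123-45-6789" + " ".repeat(700) + "contact jane@example.com";
        System.out.println(allWithinDistance(near, group, 600));
        System.out.println(allWithinDistance(far, group, 600));
    }
}
```

The quadratic scan keeps the sketch short; a production scanner would more likely sort the hits and slide a window over them once.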

Creating a Proximity Group

You can create proximity groups through the web interface:

Navigate to Proximity Groups from the home page
Click New Proximity Group
Enter a name and optional description
Set the distance (in characters) - default is 600
Add regex patterns (one per line)
Click Create Proximity Group

Example: SSN + Email Detection

Name: SSN + Email Proximity
Description: Finds SSNs within 600 characters of an email address
Distance: 600

Patterns:
\b(?!(?:000|666|9\d{2}))\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b
[\w.\-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

This group will only match when both an SSN and an email address are found within 600 characters of each other.
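
To try the two patterns outside PII Crawler, a small Java program like the following can help. It is only a sketch: the class name, helper method, and sample text are made up for illustration, and the backslashes are doubled because the patterns are written as Java string literals (in the web form you enter them with single backslashes, as shown above).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying out the SSN and email patterns.
class SsnEmailExample {

    // The two patterns from the example group, escaped for Java string literals.
    static final Pattern SSN = Pattern.compile(
        "\\b(?!(?:000|666|9\\d{2}))\\d{3}[-\\s]?(?!00)\\d{2}[-\\s]?(?!0000)\\d{4}\\b");
    static final Pattern EMAIL = Pattern.compile(
        "[\\w.\\-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

    // Returns the start offset of the first match, or -1 if there is none.
    static int firstMatchStart(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "Employee SSN: 123-45-6789. Send the form to jane.doe@example.com.";
        int ssnAt = firstMatchStart(SSN, text);
        int emailAt = firstMatchStart(EMAIL, text);
        System.out.println("SSN starts at offset " + ssnAt);
        System.out.println("Email starts at offset " + emailAt);
        System.out.println("Offsets differ by " + Math.abs(emailAt - ssnAt) + " characters");
    }
}
```

The lookaheads in the SSN pattern reject area numbers 000, 666, and 900-999, group number 00, and serial number 0000, so a string such as 000-12-3456 produces no match.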

Using Proximity Groups in Scans

When creating or configuring a scan:

Navigate to the scan configuration page
In the Proximity Groups section, select which groups to include
The selected groups will be active during the scan
Matches will appear in the results with a kind based on the group's slug

Understanding Distance

The distance parameter sets the size, in characters, of a sliding window:

Smaller distances (e.g., 100-300 chars) - More restrictive, patterns must be very close
Medium distances (e.g., 600-1000 chars) - Good for general proximity detection
Larger distances (e.g., 2000+ chars) - Patterns can be far apart; this yields more matches, but each match is less likely to be meaningful

The distance is measured in characters, not words or lines. Whitespace, punctuation, and all other characters count toward the distance.
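
The effect of whitespace on distance can be illustrated with a short Java example. The class and method names are hypothetical, and the sketch counts Java `String` positions (UTF-16 code units) from the start of the first value to the end of the second; how PII Crawler counts characters internally is an assumption here.

```java
// Hypothetical illustration of how whitespace adds to the character distance.
class DistanceCountingExample {

    // Number of characters spanned by two literal values in the text, measured
    // from the start of whichever comes first to the end of whichever comes last.
    // Returns -1 if either value is absent.
    static int spanLength(String text, String first, String second) {
        int firstStart = text.indexOf(first);
        int secondStart = text.indexOf(second);
        if (firstStart < 0 || secondStart < 0) {
            return -1;
        }
        int from = Math.min(firstStart, secondStart);
        int to = Math.max(firstStart + first.length(), secondStart + second.length());
        return to - from;
    }

    public static void main(String[] args) {
        String compact = "123-45-6789 a@b.io";
        String padded = "123-45-6789\n\n    \t a@b.io";
        // Same two values, but the padded text spans more characters
        // because newlines, spaces, and tabs all count.
        System.out.println(spanLength(compact, "123-45-6789", "a@b.io"));
        System.out.println(spanLength(padded, "123-45-6789", "a@b.io"));
    }
}
```

Files with heavy indentation or many blank lines (formatted JSON, padded fixed-width exports) may therefore need a larger distance than compact text holding the same data.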

Pattern Validation

When creating or updating proximity groups, PII Crawler validates all regex patterns:

Invalid regex syntax will be rejected with an error message
All patterns must be valid Java regex patterns
Patterns are case-sensitive by default (use the (?i) flag for case-insensitive matching)
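
You can pre-check a pattern the same way any Java program would, by compiling it with `java.util.regex.Pattern`. The sketch below is illustrative only: the class and method names are hypothetical, and the error text PII Crawler shows may differ from what `PatternSyntaxException` reports.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical pre-check for patterns before adding them to a group.
class PatternValidationExample {

    // Returns null when the regex compiles, otherwise a description of the syntax error.
    static String validate(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("\\d{3}-\\d{4}")); // compiles
        System.out.println(validate("(unclosed"));     // syntax error: missing ')'

        // Case sensitivity: the plain pattern misses "SECRET", the (?i) version finds it.
        System.out.println(Pattern.compile("secret").matcher("SECRET").find());
        System.out.println(Pattern.compile("(?i)secret").matcher("SECRET").find());
    }
}
```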

Examples

Name + Phone Number

Name: Name + Phone Proximity
Description: Detects names near phone numbers
Distance: 400

Patterns:
\b[A-Z][a-z]+\s+[A-Z][a-z]+\b
\b\d{3}[-.]?\d{3}[-.]?\d{4}\b

Address + SSN

Name: Address + SSN Proximity
Description: Finds street addresses near SSNs
Distance: 800

Patterns:
\b\d{1,5}\s+[A-Za-z\s]+(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr|Court|Ct|Circle|Cir)\b
\b(?!(?:000|666|9\d{2}))\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b

Multiple Keywords Cluster

Name: Sensitive Terms Cluster
Description: Detects when multiple sensitive keywords appear together
Distance: 500

Patterns:
(?i)\b(confidential|secret|private)\b
(?i)\b(password|credential|token)\b
(?i)\b(api[_-]?key|access[_-]?key)\b
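
Because every pattern in a group must match, two of the three keyword families are not enough to produce a result. The following Java sketch makes that concrete; it uses hypothetical names, puts the (?i) flag at the start of each pattern (which behaves the same in Java), and assumes the sample snippets are shorter than the group's distance so the whole snippet fits in one window.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical check of how many patterns in the cluster occur in a snippet.
class KeywordClusterExample {

    static final List<Pattern> CLUSTER = List.of(
        Pattern.compile("(?i)\\b(confidential|secret|private)\\b"),
        Pattern.compile("(?i)\\b(password|credential|token)\\b"),
        Pattern.compile("(?i)\\b(api[_-]?key|access[_-]?key)\\b"));

    // Counts how many of the group's patterns occur at least once in the snippet.
    static long patternsPresent(String snippet) {
        return CLUSTER.stream().filter(p -> p.matcher(snippet).find()).count();
    }

    public static void main(String[] args) {
        String full = "CONFIDENTIAL: the admin Password and API_KEY are stored below";
        String partial = "The secret token was rotated";
        // All three patterns present: the group would match.
        System.out.println(patternsPresent(full) == CLUSTER.size());
        // Only two of three present: no match is recorded.
        System.out.println(patternsPresent(partial) == CLUSTER.size());
    }
}
```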

Viewing Groups

Navigate to Proximity Groups to see all available groups. For each group, the list shows:

Name and description
Number of patterns in each group
Distance configuration
Actions (View, Edit, Delete)

Editing Groups

Click Edit on any proximity group
Modify name, description, distance, or patterns
Changes apply to future scans only (existing scan results are not affected)

Deleting Groups

Deleting a proximity group:

Removes the group and all its patterns
Does not affect historical scan results
Cannot be undone

Tips and Best Practices

Start with reasonable distances - A distance of 600-1000 characters works well for most use cases
Test your patterns - Use the regex tester to validate patterns before adding them
Be specific - More specific patterns reduce false positives
Consider context - Think about how far apart related data typically appears
Name descriptively - Use clear names that explain what the group detects
Document patterns - Use the description field to explain complex pattern combinations