Limit Rules
Limit rules prevent over-merging by constraining the size and connectivity of identity clusters. Without limits, a single shared identifier — like a generic email address or a household device — can link thousands of unrelated records into one massive cluster (a “super-node”). Limit rules cap cluster growth to keep resolution results meaningful.
Why Limit Rules Matter
Consider a scenario where noreply@company.com appears as the email address on 50,000 records. Without a limit rule, the email merge rule would link all 50,000 records into a single cluster — clearly incorrect, as they don’t all represent the same person.
Limit rules detect when a cluster would grow too large or accumulate too many identifiers, and block the edges that would add more records. Records blocked this way stay out of that cluster, preserving the integrity of the clusters that do form.
Types of Limit Rules
Max Cluster Size
Sets the maximum number of source records that can belong to a single cluster.
| Setting | Description | Recommended Range |
|---|---|---|
| Max records | Maximum total records in one cluster | 20-200 |
How it works: During graph construction, when an edge would add a record to a cluster that has already reached the maximum size, the edge is not created. The record either remains in its own cluster or joins a smaller one through a different edge.
Example: With a max cluster size of 50, a cluster cannot contain more than 50 source records. If Record #51 would link to the cluster via a shared email, that edge is dropped.
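The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name and parameters are hypothetical, and it assumes an edge joins two existing clusters whose sizes are already known.

```python
def would_exceed_max_size(size_a: int, size_b: int, max_records: int) -> bool:
    """Return True if merging two clusters would exceed the max cluster size.

    Hypothetical helper: size_a and size_b are the record counts of the two
    clusters that the candidate edge would join.
    """
    return size_a + size_b > max_records


# A full 50-record cluster plus Record #51 (a single-record cluster)
blocked = would_exceed_max_size(50, 1, max_records=50)

# A 49-record cluster still has room for one more record
allowed = would_exceed_max_size(49, 1, max_records=50)
```

Here `blocked` is `True` (the edge is dropped) and `allowed` is `False` (the edge is created).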
Max Identifiers per Family
Sets the maximum number of unique identifier values within a single identifier family that a cluster can contain.
| Setting | Description | Recommended Range |
|---|---|---|
| Family | Which identifier family this limit applies to | — |
| Max unique values | Maximum distinct identifier values in that family | 3-20 |
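How it works: When adding a record to a cluster would push the number of unique identifier values in the family past the limit, the edge is not created. A record that only carries values the cluster already contains does not count against the limit.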
Example: With a limit of 10 emails per cluster, a cluster cannot contain more than 10 unique email addresses. This prevents a shared device (which links records with many different email addresses) from merging all those records together.
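One way to picture the per-family check is as a set union. The sketch below is an assumption about how such a check could be written; the function name and the sample email addresses are hypothetical.

```python
def would_exceed_family_limit(cluster_values: set, incoming_values: set, max_unique: int) -> bool:
    """Return True if the merged cluster would hold too many unique values.

    cluster_values: identifier values of one family already in the cluster.
    incoming_values: values of the same family on the record being added.
    """
    return len(cluster_values | incoming_values) > max_unique


# A cluster that already holds 10 unique emails, with a limit of 10
cluster_emails = {f"user{i}@example.com" for i in range(10)}

# A record carrying an email the cluster already has introduces nothing new
known_email = would_exceed_family_limit(cluster_emails, {"user3@example.com"}, max_unique=10)

# A record carrying an 11th distinct email would exceed the limit
new_email = would_exceed_family_limit(cluster_emails, {"other@example.com"}, max_unique=10)
```

In this sketch `known_email` is `False` and `new_email` is `True`, so only the second edge would be dropped.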
Max Records per Entity Type
Sets the maximum number of records from a single entity type that can belong to a cluster.
| Setting | Description | Recommended Range |
|---|---|---|
| Entity type | Which entity type this limit applies to | — |
| Max records | Maximum records from this entity type in one cluster | 3-20 |
How it works: When adding a record of a specific entity type would exceed the per-entity-type limit, the edge is not created.
Example: With a limit of 5 CRM Contacts per cluster, a cluster cannot contain more than 5 records from the CRM Contacts entity type. This prevents a single loyalty ID from merging dozens of CRM contacts (which might indicate data quality issues in the CRM).
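The per-entity-type check can be illustrated with per-cluster record counts. This is a hedged sketch: the function name is hypothetical, and it assumes each cluster tracks how many records it holds per entity type.

```python
from collections import Counter


def would_exceed_entity_limit(cluster_counts: Counter, incoming_counts: Counter,
                              entity_type: str, max_records: int) -> bool:
    """Return True if the merged cluster would hold too many records of entity_type."""
    return cluster_counts[entity_type] + incoming_counts[entity_type] > max_records


# A cluster that already contains 5 CRM Contacts and 1 Website User
cluster = Counter({"CRM Contacts": 5, "Website Users": 1})

# A sixth CRM contact would exceed a limit of 5
sixth_contact = would_exceed_entity_limit(cluster, Counter({"CRM Contacts": 1}), "CRM Contacts", 5)

# A website user does not count against the CRM Contacts limit
website_user = would_exceed_entity_limit(cluster, Counter({"Website Users": 1}), "CRM Contacts", 5)
```

Here `sixth_contact` is `True` and `website_user` is `False`.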
How Limits Are Enforced
Limit rules are enforced during the graph construction phase, before the final connected-components computation:
- Potential edges (pairs of records sharing an identifier) are processed in order of merge rule priority (highest priority first)
- For each potential edge, the algorithm checks all applicable limit rules
- If creating the edge would cause any limit to be exceeded in the resulting cluster, the edge is dropped
- If the edge passes all limit checks, it is added to the graph
- After all edges are processed, the connected-components algorithm finds the final clusters
Because limits are enforced in merge rule priority order, higher-priority edges are preserved over lower-priority edges when a limit is reached. This is why setting merge rule priorities correctly matters.
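The enforcement steps above can be sketched as a greedy pass over candidate edges. This is a minimal illustration under stated assumptions, not the product's actual code: the `Edge` and `Clusters` classes and `build_graph` are hypothetical, a larger `priority` number is assumed to mean a higher-priority merge rule, and only the max cluster size limit is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str         # record ID
    b: str         # record ID
    priority: int  # assumption: larger number = higher-priority merge rule


class Clusters:
    """Union-find structure that tracks the size of each cluster."""

    def __init__(self, record_ids):
        self.parent = {r: r for r in record_ids}
        self.size = {r: 1 for r in record_ids}

    def find(self, r):
        while self.parent[r] != r:
            self.parent[r] = self.parent[self.parent[r]]  # path halving
            r = self.parent[r]
        return r

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def build_graph(record_ids, edges, max_cluster_size):
    """Keep edges in priority order, dropping any that would break the size limit."""
    clusters = Clusters(record_ids)
    kept, dropped = [], []
    # sorted() is stable, so edges with equal priority keep their input order
    for edge in sorted(edges, key=lambda e: e.priority, reverse=True):
        ra, rb = clusters.find(edge.a), clusters.find(edge.b)
        if ra != rb and clusters.size[ra] + clusters.size[rb] > max_cluster_size:
            dropped.append(edge)  # limit would be exceeded: edge is not created
            continue
        clusters.union(edge.a, edge.b)
        kept.append(edge)
    return clusters, kept, dropped


edges = [
    Edge("r3", "r4", priority=1),
    Edge("r1", "r2", priority=2),
    Edge("r2", "r3", priority=2),
]
clusters, kept, dropped = build_graph(["r1", "r2", "r3", "r4"], edges, max_cluster_size=3)
```

With a limit of 3, the two priority-2 edges are kept and the priority-1 edge between `r3` and `r4` is dropped, so `r4` stays in its own cluster. Without the limit, all four records would end up in one cluster.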
Edge Dropping vs. Cluster Splitting
Limit rules work by preventing edges from being created, not by splitting clusters after they are formed. This means:
- The algorithm is greedy: edges are added in priority order until a limit is hit
- Once an edge is added, it is not removed (even if dropping it in favor of a later edge would have produced a better cluster)
- The order of processing can affect the final clusters
For most datasets, this produces high-quality results. If you notice problematic clusters, adjust the limits or merge rule priorities.
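The order dependence can be seen in a small example. The sketch below is a simplified stand-in for the real enforcement logic: `greedy_clusters` is a hypothetical function that applies only a max cluster size limit and processes edges exactly in the order given.

```python
def greedy_clusters(record_ids, edges, max_size):
    """Add edges in the given order; drop any edge that would exceed max_size."""
    cluster_of = {r: frozenset([r]) for r in record_ids}
    for a, b in edges:
        merged = cluster_of[a] | cluster_of[b]
        if len(merged) > max_size:
            continue  # edge dropped and never revisited
        for r in merged:
            cluster_of[r] = merged
    return set(cluster_of.values())


records = ["A", "B", "C"]

# Same two edges, same limit, different processing order
first = greedy_clusters(records, [("A", "B"), ("B", "C")], max_size=2)
second = greedy_clusters(records, [("B", "C"), ("A", "B")], max_size=2)
```

In the first run A and B are clustered and C is left alone; in the second run B and C are clustered and A is left alone. Both results respect the limit, but they differ, which is why merge rule priority matters when edges compete for the same cluster.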
Example Configuration
A typical limit rule configuration for an e-commerce company:
| Limit Rule | Value | Rationale |
|---|---|---|
| Max cluster size | 100 | A real person rarely has more than 100 associated records across all systems |
| Max emails per cluster | 10 | A person rarely has more than 10 email addresses |
| Max phones per cluster | 5 | A person rarely has more than 5 phone numbers |
| Max device IDs per cluster | 20 | Shared household devices and device resets inflate counts |
| Max cookies per cluster | 50 | Cookies churn frequently, so this limit is set above the usual 3-20 range |
| Max CRM Contacts per cluster | 5 | One person should not have more than 5 CRM contact records |
| Max Website Users per cluster | 3 | One person should not have more than 3 website accounts |
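For illustration, the table above could be expressed as a plain Python dictionary. The key names and structure here are hypothetical and do not reflect the product's actual configuration format. The small check that follows flags per-entity-type limits that could never be reached because they exceed the overall cluster size cap.

```python
# Hypothetical representation of the example configuration above
limit_rules = {
    "max_cluster_size": 100,
    "max_identifiers_per_family": {
        "email": 10,
        "phone": 5,
        "device_id": 20,
        "cookie": 50,
    },
    "max_records_per_entity_type": {
        "CRM Contacts": 5,
        "Website Users": 3,
    },
}


def unreachable_entity_limits(rules):
    """Return entity types whose limit is larger than the max cluster size."""
    cap = rules["max_cluster_size"]
    return [
        entity_type
        for entity_type, limit in rules["max_records_per_entity_type"].items()
        if limit > cap
    ]


problems = unreachable_entity_limits(limit_rules)
```

For this configuration `problems` is empty, since both per-entity-type limits are below the cluster size cap of 100.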
Exceeded Limits: What Happens
When a limit prevents an edge from being created:
- The edge is dropped — The two records are not linked
- The blocked record stays separate — It may join a different cluster through a different edge, or remain in its own single-record cluster
- An event is logged — The limit violation is recorded with the cluster ID, the blocked record, the identifier that would have linked them, and the limit that was exceeded
- The cluster proceeds — The cluster that hit the limit continues to exist with its current members
Reviewing Limit Violations
After running resolution, you can view limit violation statistics on the identity graph detail page:
| Metric | Description |
|---|---|
| Total edges dropped | How many potential edges were blocked by limit rules |
| Edges dropped by rule | Breakdown by which limit rule caused the drop |
| Records left unresolved | How many records remained in single-record clusters because limit rules blocked their edges |
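Violation counts are a diagnostic signal. High counts may indicate the first two issues below; very low counts may indicate the third:
- Data quality issues: shared or generic identifiers that should be added to a deny list
- Limits too tight: legitimate clusters are being broken apart
- Limits too loose: few or no violations while large, suspicious clusters exist means the limits aren’t providing protection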
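If the logged violation events are exported, the first two metrics can be recomputed with a few lines of Python. This is a sketch under assumptions: the `LimitViolation` fields mirror the logged details described earlier (cluster ID, blocked record, identifier, limit), but the field names, rule names, and sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitViolation:
    cluster_id: str      # cluster that hit the limit
    blocked_record: str  # record that was not added
    identifier: str      # identifier that would have linked them
    limit_rule: str      # limit that was exceeded


# Hypothetical exported events
events = [
    LimitViolation("c1", "r51", "email:noreply@company.com", "max_cluster_size"),
    LimitViolation("c1", "r52", "email:noreply@company.com", "max_cluster_size"),
    LimitViolation("c2", "r90", "device:abc123", "max_emails_per_cluster"),
]

total_dropped = len(events)
dropped_by_rule = Counter(event.limit_rule for event in events)
dropped_by_identifier = Counter(event.identifier for event in events)
```

Grouping by identifier, as in `dropped_by_identifier`, is a quick way to spot generic values such as `noreply@company.com` that are candidates for a deny list.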
Tuning Limits
Start Conservative
Begin with tight limits and loosen them based on results:
- Set initial limits at the lower end of the recommended ranges
- Run resolution
- Review violation statistics and sample clusters
- If legitimate clusters are being broken: loosen the relevant limit
- If large, suspicious clusters exist: tighten the relevant limit
- Repeat
Analyze Your Data
Before setting limits, analyze your identifier data to understand the distribution:
```sql
-- How many unique emails per person (based on a known key)?
SELECT user_id, COUNT(DISTINCT email) AS email_count
FROM user_identifiers
GROUP BY user_id
ORDER BY email_count DESC
LIMIT 100;
```
This query lists the 100 users with the most email addresses, which helps spot outliers. The 99th percentile of the full email_count distribution is a good starting point for the per-family limit.
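Given the per-user `email_count` values for all users (not only the top 100), the 99th percentile can be computed as follows. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; the function name and the sample counts are made up for illustration.

```python
def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n, in integer math
    return ordered[max(rank, 1) - 1]


# Hypothetical distribution: 90 users with 1 email, 8 with 2, 1 with 4,
# and 1 outlier with 250 (likely a shared or generic account)
email_counts = [1] * 90 + [2] * 8 + [4] + [250]

p99 = percentile_nearest_rank(email_counts, 99)
largest = max(email_counts)
```

Here `p99` is 4 while `largest` is 250, so a limit based on the 99th percentile ignores the outlier instead of being stretched to accommodate it.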
Monitor Over Time
As your data grows and new sources are added, limit rules may need adjustment. Periodically review:
- Cluster size distribution (are clusters getting larger over time?)
- Violation counts (are they increasing?)
- Sample clusters at the size limit (do they look correct?)
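A periodic review like this can be scripted against an export of record-to-cluster assignments. The sketch below assumes such an export is available as a simple mapping; the function name and data shape are hypothetical.

```python
from collections import Counter


def cluster_size_report(cluster_of_record, max_cluster_size):
    """Summarize cluster sizes and list the clusters sitting at the size limit.

    cluster_of_record: mapping of record ID -> cluster ID.
    """
    sizes = Counter(cluster_of_record.values())  # cluster ID -> record count
    histogram = Counter(sizes.values())          # record count -> number of clusters
    at_limit = sorted(c for c, n in sizes.items() if n >= max_cluster_size)
    return histogram, at_limit


# Hypothetical assignments with a max cluster size of 3
assignments = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2", "r5": "c3"}
histogram, at_limit = cluster_size_report(assignments, max_cluster_size=3)
```

In this example `histogram` shows one cluster of size 3 and two of size 1, and `at_limit` contains only `c1`. Comparing the histogram between runs shows whether clusters are growing, and the `at_limit` list gives the clusters worth sampling by hand.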
Setting limits too loose defeats their purpose — super-nodes will form. Setting them too tight breaks legitimate merges. Finding the right balance requires iteration and data analysis.
Best Practices
- Always set a max cluster size — This is the most important limit. Without it, a single bad identifier can merge thousands of records.
- Set per-family limits for weak identifiers — Device IDs and cookies are most likely to cause over-merging. Set tighter limits for these families.
- Review violations regularly — Limit violations are a signal about data quality and resolution accuracy. Don’t ignore them.
- Use deny lists alongside limits — Known-bad identifiers (test@test.com, 000-000-0000) should be excluded entirely, not just limited. Limits handle the edge cases that deny lists don’t catch.
- Coordinate with merge rule priorities — Higher-priority merge rules create edges first, so when limits kick in, the lower-priority edges are the ones dropped. Make sure this aligns with your confidence ordering.
Next Steps
- Merge Rules — Define the rules that limits constrain
- Running Resolution — Execute identity resolution and see limit enforcement in action
- Creating an Identity Graph — Full setup wizard including limit configuration