Limit Rules

Limit rules prevent over-merging by constraining the size and connectivity of identity clusters. Without limits, a single shared identifier — like a generic email address or a household device — can link thousands of unrelated records into one massive cluster (a “super-node”). Limit rules cap cluster growth to keep resolution results meaningful.

Why Limit Rules Matter

Consider a scenario where noreply@company.com appears as the email address on 50,000 records. Without a limit rule, the email merge rule would link all 50,000 records into a single cluster — clearly incorrect, as they don’t all represent the same person.

Limit rules detect when a cluster would grow too large or accumulate too many identifiers, and stop adding records to it. The excess records stay outside that cluster, either joining another cluster through a different edge or remaining on their own, which preserves the integrity of the clusters that are formed.

Types of Limit Rules

Max Cluster Size

Sets the maximum number of source records that can belong to a single cluster.

Setting	Description	Recommended Range
Max records	Maximum total records in one cluster	20-200

How it works: When an edge would produce a cluster larger than the maximum size, the edge is not created. The record either remains in its own cluster or joins a smaller one through a different edge.

Example: With a max cluster size of 50, a cluster cannot contain more than 50 source records. If Record #51 would link to the cluster via a shared email, that edge is dropped.
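The size check can be sketched with a small union-find structure. This is only an illustration under stated assumptions, not the product's implementation: the class name SizeCappedClusters, the method try_link, and the record IDs are all hypothetical.

```python
class SizeCappedClusters:
    """Union-find that refuses merges above max_size (hypothetical helper)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.parent = {}
        self.size = {}

    def find(self, record):
        # Unseen records start as single-record clusters.
        if record not in self.parent:
            self.parent[record] = record
            self.size[record] = 1
        # Walk to the cluster root, shortening the path along the way.
        while self.parent[record] != record:
            self.parent[record] = self.parent[self.parent[record]]
            record = self.parent[record]
        return record

    def try_link(self, a, b):
        """Return True if the edge is kept, False if it is dropped."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return True  # already in the same cluster
        if self.size[root_a] + self.size[root_b] > self.max_size:
            return False  # merged cluster would exceed the limit
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return True


clusters = SizeCappedClusters(max_size=50)
# Records r1..r51 all share one email; link each of them to r1 in turn.
kept = [clusters.try_link("r1", f"r{i}") for i in range(2, 52)]
print(kept.count(True), kept[-1])  # 49 edges kept, the edge to r51 dropped
```

Here the cluster stops at 50 records and r51 stays in its own single-record cluster, matching the Record #51 example above.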

Max Identifiers per Family

Sets the maximum number of unique identifier values within a single identifier family that a cluster can contain.

Setting	Description	Recommended Range
Family	Which identifier family this limit applies to	—
Max unique values	Maximum distinct identifier values in that family	3-20

How it works: When adding a record to a cluster would push the number of unique identifier values in that family above the limit, the edge is not created.

Example: With a limit of 10 emails per cluster, a cluster cannot contain more than 10 unique email addresses. This prevents a shared device (which links records with many different email addresses) from merging all those records together.
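As a rough sketch of this check, assume each cluster is represented as a mapping from identifier family to the set of values it contains. That representation and the function name would_exceed_family_limit are assumptions made for illustration.

```python
def would_exceed_family_limit(cluster_a, cluster_b, family, max_unique):
    """True if merging the two clusters yields more than max_unique
    distinct values in the given identifier family."""
    combined = cluster_a.get(family, set()) | cluster_b.get(family, set())
    return len(combined) > max_unique


# A cluster built around one shared device that already holds 10 emails.
cluster = {
    "email": {f"user{i}@example.com" for i in range(10)},
    "device_id": {"device-1"},
}
# A record with an 11th email on the same device: the edge would be dropped.
new_email = {"email": {"user10@example.com"}, "device_id": {"device-1"}}
# A record repeating an email already in the cluster: no new value is added.
repeat_email = {"email": {"user3@example.com"}, "device_id": {"device-1"}}

blocked = would_exceed_family_limit(cluster, new_email, "email", 10)
allowed = not would_exceed_family_limit(cluster, repeat_email, "email", 10)
print(blocked, allowed)
```

The limit counts distinct values, so a record that repeats an existing email does not count against it.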

Max Records per Entity Type

Sets the maximum number of records from a single entity type that can belong to a cluster.

Setting	Description	Recommended Range
Entity type	Which entity type this limit applies to	—
Max records	Maximum records from this entity type in one cluster	3-20

How it works: When adding a record of a specific entity type would exceed the per-entity-type limit, the edge is not created.

Example: With a limit of 5 CRM Contacts per cluster, a cluster cannot contain more than 5 records from the CRM Contacts entity type. This prevents a single loyalty ID from merging dozens of CRM contacts (which might indicate data quality issues in the CRM).
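A minimal sketch of the per-entity-type check, assuming each cluster keeps a count of records per entity type. The function name would_exceed_entity_limit and the Counter representation are illustrative assumptions.

```python
from collections import Counter


def would_exceed_entity_limit(counts_a, counts_b, entity_type, max_records):
    """True if the merged cluster would hold more than max_records
    records of the given entity type."""
    merged = counts_a + counts_b
    return merged[entity_type] > max_records


# A cluster that already contains 5 CRM Contacts and 1 Website User.
cluster_counts = Counter({"CRM Contacts": 5, "Website Users": 1})
sixth_crm_contact = Counter({"CRM Contacts": 1})
second_web_user = Counter({"Website Users": 1})

crm_blocked = would_exceed_entity_limit(
    cluster_counts, sixth_crm_contact, "CRM Contacts", 5
)
web_blocked = would_exceed_entity_limit(
    cluster_counts, second_web_user, "CRM Contacts", 5
)
print(crm_blocked, web_blocked)
```

Unlike the per-family limit, this one counts records rather than distinct identifier values, so the sixth CRM Contact is blocked while a record of a different entity type does not count against the CRM limit.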

How Limits Are Enforced

Limit rules are enforced during the graph construction phase, before the final connected-components computation:

Candidate edges are processed in order of merge rule priority (highest priority first)
For each potential edge (two records sharing an identifier), the algorithm checks all applicable limit rules
If creating the edge would cause any limit to be exceeded in the resulting cluster, the edge is dropped
If the edge passes all limit checks, it is added to the graph
After all edges are processed, the connected-components algorithm finds the final clusters

Because limits are enforced in merge rule priority order, higher-priority edges are preserved over lower-priority edges when a limit is reached. This is why setting merge rule priorities correctly matters.
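The enforcement steps above can be sketched end to end as follows. This is a simplified illustration, not the actual engine: it checks only a max cluster size, tracks clusters incrementally with union-find instead of building a separate graph, and assumes that priority 1 is the highest priority. The function name resolve and the edge tuple layout are hypothetical.

```python
from collections import defaultdict


def resolve(records, edges, max_cluster_size):
    """Greedy sketch. edges is a list of (priority, record_a, record_b);
    a lower priority number is assumed to mean higher priority."""
    parent = {r: r for r in records}
    size = {r: 1 for r in records}
    dropped = []

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Highest-priority edges first; sorted() is stable, so ties keep input order.
    for priority, a, b in sorted(edges, key=lambda e: e[0]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both records are already in one cluster
        # Check the limit against the cluster this edge would produce.
        if size[root_a] + size[root_b] > max_cluster_size:
            dropped.append((priority, a, b))
            continue
        # The edge passes the check, so it is added.
        parent[root_b] = root_a
        size[root_a] += size[root_b]

    # Read off the final clusters.
    clusters = defaultdict(set)
    for r in records:
        clusters[find(r)].add(r)
    return sorted(map(sorted, clusters.values())), dropped


records = ["a", "b", "c", "d"]
edges = [(2, "a", "c"), (1, "a", "b"), (2, "b", "d"), (1, "c", "d")]
final_clusters, dropped_edges = resolve(records, edges, max_cluster_size=3)
print(final_clusters)  # [['a', 'b'], ['c', 'd']]
print(dropped_edges)   # [(2, 'a', 'c'), (2, 'b', 'd')]
```

Both priority-1 edges are kept, and the priority-2 edges are the ones dropped once merging would exceed the limit of 3.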

Edge Dropping vs. Cluster Splitting

Limit rules work by preventing edges from being created, not by splitting clusters after they are formed. This means:

The algorithm is greedy: edges are added in priority order until a limit is hit
Once an edge is added, it is not removed (even if dropping it would have let a later edge form a better cluster)
The order of processing can affect the final clusters

For most datasets, this produces high-quality results. If you notice problematic clusters, adjust the limits or merge rule priorities.
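The order sensitivity of a greedy approach is easy to reproduce in a toy model. The sketch below uses a hypothetical greedy_clusters function with a size limit only, and feeds it the same two equal-priority edges in two different orders.

```python
def greedy_clusters(edges, max_cluster_size):
    """Add edges in the given order, skipping any edge whose merged
    cluster would exceed max_cluster_size. Kept edges are never undone."""
    cluster_of = {}
    for a, b in edges:
        cluster_a = cluster_of.setdefault(a, frozenset([a]))
        cluster_b = cluster_of.setdefault(b, frozenset([b]))
        if cluster_a == cluster_b:
            continue
        merged = cluster_a | cluster_b
        if len(merged) > max_cluster_size:
            continue  # edge dropped
        for record in merged:
            cluster_of[record] = merged
    return set(cluster_of.values())


first = greedy_clusters([("a", "b"), ("b", "c")], max_cluster_size=2)
second = greedy_clusters([("b", "c"), ("a", "b")], max_cluster_size=2)
print(first)   # a and b together, c alone
print(second)  # b and c together, a alone
```

With a limit of 2, whichever edge is processed first wins. This is why priorities should reflect which links you trust most.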

Example Configuration

A typical limit rule configuration for an e-commerce company:

Limit Rule	Value	Rationale
Max cluster size	100	No real person has more than 100 associated records across all systems
Max emails per cluster	10	A person rarely has more than 10 email addresses
Max phones per cluster	5	A person rarely has more than 5 phone numbers
Max device IDs per cluster	20	Shared household devices and device resets inflate counts
Max cookies per cluster	50	Cookies churn frequently; allow a higher limit
Max CRM Contacts per cluster	5	One person should not have more than 5 CRM contact records
Max Website Users per cluster	3	One person should not have more than 3 website accounts

Exceeded Limits: What Happens

When a limit prevents an edge from being created:

The edge is dropped — The two records are not linked
The blocked record stays separate — It may join a different cluster through a different edge, or remain in its own single-record cluster
An event is logged — The limit violation is recorded with the cluster ID, the blocked record, the identifier that would have linked them, and the limit that was exceeded
The cluster proceeds — The cluster that hit the limit continues to exist with its current members

Reviewing Limit Violations

After running resolution, you can view limit violation statistics on the identity graph detail page:

Metric	Description
Total edges dropped	How many potential edges were blocked by limit rules
Edges dropped by rule	Breakdown by which limit rule caused the drop
Records left unresolved	How many records were not linked to any cluster due to limit violations
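If you export violation events for your own analysis, the three metrics above can be recomputed from them. The sketch below is an assumption-heavy illustration: the LimitViolation fields mirror the logged details described earlier (cluster ID, blocked record, linking identifier, exceeded limit), but the field names, rule names, and the cluster_sizes mapping are hypothetical and do not reflect a documented export format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitViolation:
    cluster_id: str      # cluster that hit the limit
    blocked_record: str  # record that was not linked
    identifier: str      # identifier that would have linked them
    limit_rule: str      # limit that was exceeded


def summarize(violations, cluster_sizes):
    """cluster_sizes maps each blocked record to the size of the
    cluster it ended up in after resolution."""
    blocked = {v.blocked_record for v in violations}
    return {
        "total_edges_dropped": len(violations),
        "edges_dropped_by_rule": dict(Counter(v.limit_rule for v in violations)),
        "records_left_unresolved": sum(1 for r in blocked if cluster_sizes[r] == 1),
    }


violations = [
    LimitViolation("c1", "r10", "email:noreply@company.com", "max_cluster_size"),
    LimitViolation("c1", "r11", "email:noreply@company.com", "max_cluster_size"),
    LimitViolation("c2", "r12", "device_id:tablet-7", "max_emails_per_cluster"),
]
# r11 later joined a 3-record cluster through a different edge.
cluster_sizes = {"r10": 1, "r11": 3, "r12": 1}
summary = summarize(violations, cluster_sizes)
print(summary)
```

A blocked record counts as unresolved here only if it ended up in a single-record cluster, since a blocked record may still join another cluster through a different edge.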

Violation counts can point to different problems:

Data quality issues: a high count often comes from shared or generic identifiers that should be on a deny list
Limits too tight: a high count combined with legitimate clusters being broken apart
Limits too loose: a very low count alongside large, suspicious clusters suggests the limits aren’t providing protection

Tuning Limits

Start Conservative

Begin with tight limits and loosen them based on results:

Set initial limits at the lower end of the recommended ranges
Run resolution
Review violation statistics and sample clusters
If legitimate clusters are being broken: loosen the relevant limit
If large, suspicious clusters exist: tighten the relevant limit
Repeat

Analyze Your Data

Before setting limits, analyze your identifier data to understand the distribution:

-- How many unique emails per person (based on a known key)?
SELECT user_id, COUNT(DISTINCT email) AS email_count
FROM user_identifiers
GROUP BY user_id
ORDER BY email_count DESC
LIMIT 100;

The 99th percentile of this distribution is a good starting point for the per-family limit.
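The query above lists only the largest counts. To get the percentile itself, one option is to export the per-user counts and compute it directly. This sketch uses the nearest-rank definition of a percentile; the function name and the sample distribution are made up for illustration.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical email counts for 200 users, including two extreme outliers.
email_counts = [1] * 150 + [2] * 40 + [3] * 8 + [40] * 2

p99 = percentile_nearest_rank(email_counts, 99)
print(p99, max(email_counts))  # 3 40
```

In this made-up distribution the 99th percentile is 3 while the maximum is 40, so a limit based on the percentile is not inflated by the outliers that limits are meant to contain.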

Monitor Over Time

As your data grows and new sources are added, limit rules may need adjustment. Periodically review:

Cluster size distribution (are clusters getting larger over time?)
Violation counts (are they increasing?)
Sample clusters at the size limit (do they look correct?)

⚠️ Setting limits too loose defeats their purpose: super-nodes will form. Setting them too tight breaks legitimate merges. Finding the right balance requires iteration and data analysis.

Best Practices

Always set a max cluster size — This is the most important limit. Without it, a single bad identifier can merge thousands of records.
Set per-family limits for weak identifiers — Device IDs and cookies are most likely to cause over-merging. Set tighter limits for these families.
Review violations regularly — Limit violations are a signal about data quality and resolution accuracy. Don’t ignore them.
Use deny lists alongside limits — Known-bad identifiers (test@test.com, 000-000-0000) should be excluded entirely, not just limited. Limits handle the edge cases that deny lists don’t catch.
Coordinate with merge rule priorities — Higher-priority merge rules create edges first, so when limits kick in, the lower-priority edges are the ones dropped. Make sure this aligns with your confidence ordering.

Next Steps

Merge Rules — Define the rules that limits constrain
Running Resolution — Execute identity resolution and see limit enforcement in action
Creating an Identity Graph — Full setup wizard including limit configuration
