Limit Rules

Limit rules prevent over-merging by constraining the size and connectivity of identity clusters. Without limits, a single shared identifier — like a generic email address or a household device — can link thousands of unrelated records into one massive cluster (a “super-node”). Limit rules cap cluster growth to keep resolution results meaningful.

Why Limit Rules Matter

Consider a scenario where noreply@company.com appears as the email address on 50,000 records. Without a limit rule, the email merge rule would link all 50,000 records into a single cluster — clearly incorrect, as they don’t all represent the same person.

Limit rules detect when adding a record would make a cluster too large or give it too many identifiers, and they block that link. Blocked records stay out of the cluster that hit its limit, which preserves the integrity of the clusters that do form.

Types of Limit Rules

Max Cluster Size

Sets the maximum number of source records that can belong to a single cluster.

Setting | Description | Recommended Range
Max records | Maximum total records in one cluster | 20-200

How it works: When the connected-components algorithm would add a record to a cluster that has already reached the maximum size, the edge is not created. The record either remains in its own cluster or joins a smaller one through a different edge.

Example: With a max cluster size of 50, a cluster cannot contain more than 50 source records. If Record #51 would link to the cluster via a shared email, that edge is dropped.
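
The size check can be illustrated with a short Python sketch. It assumes cluster membership is tracked with a union-find structure; the class name SizeCappedClusters and the method try_link are hypothetical and are not part of any product API.

```python
class SizeCappedClusters:
    """Illustrative union-find that refuses merges above max_size (hypothetical)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.parent = {}
        self.size = {}

    def find(self, record):
        # Register unseen records as their own single-record cluster.
        if record not in self.parent:
            self.parent[record] = record
            self.size[record] = 1
        # Walk up to the cluster root, shortening the path along the way.
        while self.parent[record] != record:
            self.parent[record] = self.parent[self.parent[record]]
            record = self.parent[record]
        return record

    def try_link(self, a, b):
        """Return True if the edge is kept, False if it is dropped."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return True  # already in the same cluster
        if self.size[root_a] + self.size[root_b] > self.max_size:
            return False  # the merged cluster would exceed the cap
        # Attach the smaller cluster to the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return True


clusters = SizeCappedClusters(max_size=50)
# Records 1..50 link together through a shared email.
kept = [clusters.try_link(1, n) for n in range(2, 51)]
# Record 51 would push the cluster to 51 members, so its edge is dropped.
blocked = clusters.try_link(1, 51)
```

In this sketch the first 49 edges are kept, the cluster reaches 50 records, and the edge to Record #51 is rejected, leaving that record in its own cluster.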

Max Identifiers per Family

Sets the maximum number of unique identifier values within a single identifier family that a cluster can contain.

Setting | Description | Recommended Range
Family | Which identifier family this limit applies to | n/a
Max unique values | Maximum distinct identifier values in that family | 3-20

How it works: When adding a record to a cluster would introduce a new unique identifier value that exceeds the family limit, the edge is not created.

Example: With a limit of 10 emails per cluster, a cluster cannot contain more than 10 unique email addresses. This prevents a shared device (which links records with many different email addresses) from merging all those records together.
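
A minimal Python sketch of this check follows. It assumes each cluster is summarized as a mapping from family name to the set of identifier values it contains; the function name edge_allowed and that data shape are illustrative assumptions, not a documented interface.

```python
def edge_allowed(cluster_a, cluster_b, family_limits):
    """Hypothetical check: would merging two clusters exceed a per-family limit?

    cluster_a, cluster_b: dict mapping family name -> set of identifier values.
    family_limits: dict mapping family name -> maximum unique values allowed.
    """
    for family, limit in family_limits.items():
        merged = cluster_a.get(family, set()) | cluster_b.get(family, set())
        if len(merged) > limit:
            return False  # the edge would be dropped
    return True


limits = {"email": 10}

# A cluster formed around a shared device that already holds 10 distinct emails.
shared_device_cluster = {
    "email": {f"user{i}@example.com" for i in range(10)},
    "device_id": {"device-1"},
}
# A record whose email is already in the cluster adds no new unique value.
known_email_record = {"email": {"user3@example.com"}, "device_id": {"device-1"}}
# A record with an 11th distinct email would exceed the limit.
new_email_record = {"email": {"someone.else@example.com"}, "device_id": {"device-1"}}

allowed_known = edge_allowed(shared_device_cluster, known_email_record, limits)
allowed_new = edge_allowed(shared_device_cluster, new_email_record, limits)
```

The limit counts distinct values, so a record that repeats an email already in the cluster is still allowed to join.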

Max Records per Entity Type

Sets the maximum number of records from a single entity type that can belong to a cluster.

Setting | Description | Recommended Range
Entity type | Which entity type this limit applies to | n/a
Max records | Maximum records from this entity type in one cluster | 3-20

How it works: When adding a record of a specific entity type would exceed the per-entity-type limit, the edge is not created.

Example: With a limit of 5 CRM Contacts per cluster, a cluster cannot contain more than 5 records from the CRM Contacts entity type. This prevents a single loyalty ID from merging dozens of CRM contacts (which might indicate data quality issues in the CRM).
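
The per-entity-type check can be sketched the same way, this time counting records rather than distinct values. The function name entity_limits_ok and the use of a Counter per cluster are assumptions made for illustration.

```python
from collections import Counter


def entity_limits_ok(counts_a, counts_b, entity_limits):
    """Hypothetical check on per-entity-type record counts.

    counts_a, counts_b: Counter mapping entity type -> number of records.
    entity_limits: dict mapping entity type -> maximum records per cluster.
    """
    merged = counts_a + counts_b
    return all(merged[etype] <= limit for etype, limit in entity_limits.items())


limits = {"CRM Contacts": 5, "Website Users": 3}
cluster = Counter({"CRM Contacts": 5, "Website Users": 2})

# A sixth CRM Contact would exceed the limit of 5, so the edge is dropped.
crm_ok = entity_limits_ok(cluster, Counter({"CRM Contacts": 1}), limits)
# A third Website User still fits within the limit of 3.
web_ok = entity_limits_ok(cluster, Counter({"Website Users": 1}), limits)
```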

How Limits Are Enforced

Limit rules are enforced during the graph construction phase, before the final connected-components computation:

  1. Potential edges are processed in order of merge rule priority (highest priority first)
  2. For each potential edge (two records sharing an identifier), the algorithm checks all applicable limit rules
  3. If creating the edge would cause any limit to be exceeded in the resulting cluster, the edge is dropped
  4. If the edge passes all limit checks, it is added to the graph
  5. After all edges are processed, the connected-components algorithm finds the final clusters

Because limits are enforced in merge rule priority order, higher-priority edges are preserved over lower-priority edges when a limit is reached. This is why setting merge rule priorities correctly matters.
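
The five steps above can be sketched in Python. This is an illustration only, not the product's implementation: the function name resolve, the (priority, record_a, record_b) edge tuples, the convention that a larger number means higher priority, and the use of a single max-cluster-size limit are all assumptions made for the example.

```python
from collections import defaultdict


def resolve(records, edges, max_cluster_size):
    """Greedy edge filtering followed by a connected-components pass (sketch).

    edges: list of (priority, record_a, record_b); larger priority is
    processed first (an assumption for this example).
    Returns (clusters, dropped_edges).
    """
    parent = {r: r for r in records}
    size = {r: 1 for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    dropped = []
    # Step 1: process edges from highest to lowest merge rule priority.
    for priority, a, b in sorted(edges, key=lambda e: -e[0]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both records are already in the same cluster
        # Steps 2-3: check the limit against the cluster that would result.
        if size[root_a] + size[root_b] > max_cluster_size:
            dropped.append((priority, a, b))
            continue
        # Step 4: the edge passes, so it joins the graph.
        parent[root_b] = root_a
        size[root_a] += size[root_b]

    # Step 5: group records by root to obtain the final clusters.
    clusters = defaultdict(set)
    for r in parent:
        clusters[find(r)].add(r)
    return sorted(sorted(c) for c in clusters.values()), dropped


edges = [
    (1, "B", "C"),  # low-priority rule, e.g. a shared cookie
    (2, "A", "B"),  # high-priority rule, e.g. a shared email
    (2, "C", "D"),
]
clusters, dropped = resolve(["A", "B", "C", "D"], edges, max_cluster_size=3)
```

With a limit of 3, both high-priority edges are kept and the low-priority edge between B and C is the one dropped, because it would create a four-record cluster.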

Edge Dropping vs. Cluster Splitting

Limit rules work by preventing edges from being created, not by splitting clusters after they are formed. This means:

  • The algorithm is greedy: edges are added in priority order until a limit is hit
  • Once an edge is added, it is not removed (even if an edge processed later would have produced a better cluster)
  • The order of processing can affect the final clusters

For most datasets, this produces high-quality results. If you notice problematic clusters, adjust the limits or merge rule priorities.
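
To see why processing order matters, consider the toy Python sketch below. It assumes a max cluster size of 2 and two edges of equal priority; the function greedy_clusters is a hypothetical stand-in for the greedy behavior described above.

```python
def greedy_clusters(records, edges, max_size):
    """Add edges in the given order and skip any that would exceed max_size."""
    cluster_of = {r: {r} for r in records}
    for a, b in edges:
        cluster_a, cluster_b = cluster_of[a], cluster_of[b]
        if cluster_a is cluster_b:
            continue
        if len(cluster_a) + len(cluster_b) > max_size:
            continue  # dropped, and never revisited
        merged = cluster_a | cluster_b
        for r in merged:
            cluster_of[r] = merged
    return sorted({tuple(sorted(c)) for c in cluster_of.values()})


# Same records, same edges, different processing order.
first = greedy_clusters("ABC", [("A", "B"), ("B", "C")], max_size=2)
second = greedy_clusters("ABC", [("B", "C"), ("A", "B")], max_size=2)
```

The first run pairs A with B and leaves C alone, while the second pairs B with C and leaves A alone. Neither result is revised afterwards, which is the trade-off of a greedy approach.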

Example Configuration

A typical limit rule configuration for an e-commerce company:

Limit Rule | Value | Rationale
Max cluster size | 100 | No real person has more than 100 associated records across all systems
Max emails per cluster | 10 | A person rarely has more than 10 email addresses
Max phones per cluster | 5 | A person rarely has more than 5 phone numbers
Max device IDs per cluster | 20 | Shared household devices and device resets inflate counts
Max cookies per cluster | 50 | Cookies churn frequently; allow a higher limit
Max CRM Contacts per cluster | 5 | One person should not have more than 5 CRM contact records
Max Website Users per cluster | 3 | One person should not have more than 3 website accounts

Exceeded Limits: What Happens

When a limit prevents an edge from being created:

  1. The edge is dropped — The two records are not linked
  2. The blocked record stays separate — It may join a different cluster through a different edge, or remain in its own single-record cluster
  3. An event is logged — The limit violation is recorded with the cluster ID, the blocked record, the identifier that would have linked them, and the limit that was exceeded
  4. The cluster proceeds — The cluster that hit the limit continues to exist with its current members

Reviewing Limit Violations

After running resolution, you can view limit violation statistics on the identity graph detail page:

Metric | Description
Total edges dropped | How many potential edges were blocked by limit rules
Edges dropped by rule | Breakdown by which limit rule caused the drop
Records left unresolved | How many records were not linked to any cluster due to limit violations
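
If you export the logged violation events, the same three metrics can be recomputed offline. The Python sketch below assumes a hypothetical event shape (the LimitViolation fields mirror the logged details described earlier, but the field names and the rule names are invented for illustration) and a lookup of each record's final cluster size.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitViolation:
    # Hypothetical event shape; field names are assumptions.
    cluster_id: str
    blocked_record: str
    identifier: str
    rule: str


def summarize(violations, cluster_size_of):
    """cluster_size_of maps a record ID to the size of its final cluster."""
    blocked = {v.blocked_record for v in violations}
    return {
        "total_edges_dropped": len(violations),
        "edges_dropped_by_rule": dict(Counter(v.rule for v in violations)),
        # A blocked record counts as unresolved if it ended up alone.
        "records_left_unresolved": sum(
            1 for r in blocked if cluster_size_of[r] == 1
        ),
    }


violations = [
    LimitViolation("c1", "r51", "noreply@company.com", "max_cluster_size"),
    LimitViolation("c1", "r52", "noreply@company.com", "max_cluster_size"),
    LimitViolation("c2", "r60", "device-9", "max_emails_per_cluster"),
]
# r52 was blocked from c1 but joined a different three-record cluster.
summary = summarize(violations, {"r51": 1, "r52": 3, "r60": 1})
```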

High violation counts may indicate:

  • Data quality issues — Shared or generic identifiers that should be denied
  • Limits too tight — Legitimate clusters are being broken apart

Conversely, a very low violation count can mean the limits are too loose to provide any protection.

Tuning Limits

Start Conservative

Begin with tight limits and loosen them based on results:

  1. Set initial limits at the lower end of the recommended ranges
  2. Run resolution
  3. Review violation statistics and sample clusters
  4. If legitimate clusters are being broken: loosen the relevant limit
  5. If large, suspicious clusters exist: tighten the relevant limit
  6. Repeat

Analyze Your Data

Before setting limits, analyze your identifier data to understand the distribution:

-- How many unique emails per person (based on a known key)?
SELECT user_id, COUNT(DISTINCT email) AS email_count
FROM user_identifiers
GROUP BY user_id
ORDER BY email_count DESC
LIMIT 100;

The 99th percentile of this distribution is a good starting point for the per-family limit.
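
As a rough illustration, the 99th percentile can be computed in Python once the per-user counts are exported. The sketch assumes the counts are available as a plain list and uses the nearest-rank definition of a percentile; the sample data is invented.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented sample: most users have 1 email, a few have more, one is an outlier.
email_counts = [1] * 970 + [2] * 20 + [4] * 9 + [40]

p99 = nearest_rank_percentile(email_counts, 99)
largest = max(email_counts)
```

In this sample the maximum is driven by a single outlier with 40 emails, while the 99th percentile is 2. Starting from the percentile rather than the maximum keeps one anomalous user from inflating the limit.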

Monitor Over Time

As your data grows and new sources are added, limit rules may need adjustment. Periodically review:

  • Cluster size distribution (are clusters getting larger over time?)
  • Violation counts (are they increasing?)
  • Sample clusters at the size limit (do they look correct?)

Warning: Setting limits too loose defeats their purpose, because super-nodes will still form. Setting them too tight breaks legitimate merges. Finding the right balance requires iteration and data analysis.

Best Practices

  • Always set a max cluster size — This is the most important limit. Without it, a single bad identifier can merge thousands of records.
  • Set per-family limits for weak identifiers — Device IDs and cookies are most likely to cause over-merging. Set tighter limits for these families.
  • Review violations regularly — Limit violations are a signal about data quality and resolution accuracy. Don’t ignore them.
  • Use deny lists alongside limits — Known-bad identifiers (test@test.com, 000-000-0000) should be excluded entirely, not just limited. Limits handle the edge cases that deny lists don’t catch.
  • Coordinate with merge rule priorities — Higher-priority merge rules create edges first, so when limits kick in, the lower-priority edges are the ones dropped. Make sure this aligns with your confidence ordering.

Next Steps