Merge Rules

Merge rules define the conditions under which two customer records are linked together in the identity graph. A merge rule specifies which identifier matches constitute sufficient evidence that two records represent the same person.

How Merge Rules Work

When building the identity graph, SignalSmith evaluates each merge rule against every pair of records that share at least one identifier value. If a pair of records satisfies any merge rule, an edge is created between them in the identity graph.

After all edges are created, the connected-components algorithm finds clusters of transitively connected records — groups of records that are all linked directly or indirectly.

Merge Rule: "Email Match"
  Condition: Records share an email address (any variant)

Record A: email=alice@co.com, phone=555-0100
Record B: email=alice@co.com, device=abc123
Record C: phone=555-0100

Result:
  A ←→ B  (shared email: alice@co.com)
  A  ✗  C  (not linked by this rule: no shared email)

If a second merge rule exists for phone matching, then A-C would also be linked.
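The edge-creation step in the example above can be sketched in Python as follows. This is a minimal illustration, not SignalSmith's implementation: the record layout (a dict of identifier families to sets of values) and the function name `edges_for_family` are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record layout: record id -> {family: set of values}
records = {
    "A": {"email": {"alice@co.com"}, "phone": {"555-0100"}},
    "B": {"email": {"alice@co.com"}, "device": {"abc123"}},
    "C": {"phone": {"555-0100"}},
}

def edges_for_family(records, family):
    """Return the set of record pairs that share a value in `family`."""
    # Group record ids by identifier value so only records that share
    # a value are ever compared with each other.
    by_value = defaultdict(set)
    for rec_id, identifiers in records.items():
        for value in identifiers.get(family, ()):
            by_value[value].add(rec_id)

    edges = set()
    for rec_ids in by_value.values():
        for a, b in combinations(sorted(rec_ids), 2):
            edges.add((a, b))
    return edges

email_edges = edges_for_family(records, "email")
phone_edges = edges_for_family(records, "phone")

print(email_edges)                # {('A', 'B')}
print(email_edges | phone_edges)  # adds ('A', 'C') once a phone rule exists
```

Grouping by value first avoids comparing every record with every other record; only records that already share an identifier value become candidate pairs.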

Rule Types

Single Identifier Match

The simplest merge rule: two records are linked if they share a value in a specified identifier family.

Configuration:

  • Identifier family — Which family to match on (e.g., Email)
  • Variant (optional) — Restrict to a specific variant (e.g., only “Personal Email”), or match any variant in the family

Examples:

Rule Name             | Family      | Variant  | Meaning
Email Match           | Email       | Any      | Link if records share any email address
Personal Email Match  | Email       | Personal | Link only if records share a personal email
Mobile Phone Match    | Phone       | Mobile   | Link only if records share a mobile phone number
CRM ID Match          | Customer ID | CRM ID   | Link if records share a CRM ID
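As a rough sketch, a single-identifier rule with an optional variant restriction could be modeled like this in Python. The class names `Identifier` and `SingleIdentifierRule` are hypothetical, and the sketch assumes that a variant-restricted rule requires the shared value to carry that variant on both records; the source does not specify this detail.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Identifier:
    family: str   # e.g. "Email"
    variant: str  # e.g. "Personal"
    value: str    # normalized identifier value

@dataclass(frozen=True)
class SingleIdentifierRule:
    name: str
    family: str
    variant: Optional[str] = None  # None means "any variant in the family"

    def _values(self, record):
        # Collect the values this rule is allowed to compare.
        return {
            ident.value
            for ident in record
            if ident.family == self.family
            and (self.variant is None or ident.variant == self.variant)
        }

    def matches(self, rec_a, rec_b):
        """True if the two records share at least one eligible value."""
        return bool(self._values(rec_a) & self._values(rec_b))

rec_a = [Identifier("Email", "Personal", "alice@co.com")]
rec_b = [Identifier("Email", "Work", "alice@co.com")]

email_match = SingleIdentifierRule("Email Match", "Email")
personal_match = SingleIdentifierRule("Personal Email Match", "Email", "Personal")

print(email_match.matches(rec_a, rec_b))     # True
print(personal_match.matches(rec_a, rec_b))  # False
```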

Multi-Identifier Match

A more restrictive rule: two records are linked only if they share values in multiple identifier families simultaneously. This reduces false positives at the cost of missing some true matches.

Configuration:

  • Identifier families — Two or more families that must all match
  • Variants (optional) — Restrict to specific variants within each family
  • Logic — AND (all must match)

Examples:

Rule Name             | Families                | Meaning
Phone + Name          | Phone AND Name          | Link if records share a phone number and have the same name
Device + Email Domain | Device AND Email Domain | Link if records share a device ID and email from the same domain
Cookie + IP           | Cookie AND IP Address   | Link if records share a cookie and IP address

Because of this trade-off, reserve multi-identifier rules for weaker identifier types (device IDs, cookies) where a single match is not conclusive.
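The AND logic can be illustrated with a short Python sketch. The record layout and the helper names `shares_value` and `multi_identifier_match` are assumptions for this example only.

```python
def shares_value(rec_a, rec_b, family):
    """True if both records hold at least one common value in `family`."""
    return bool(rec_a.get(family, set()) & rec_b.get(family, set()))

def multi_identifier_match(rec_a, rec_b, families):
    """AND logic: every listed family must have a shared value."""
    if len(families) < 2:
        raise ValueError("a multi-identifier rule needs two or more families")
    return all(shares_value(rec_a, rec_b, family) for family in families)

# Hypothetical records: family -> set of values
rec_a = {"cookie": {"ck-1"}, "ip_address": {"203.0.113.7"}}
rec_b = {"cookie": {"ck-1"}, "ip_address": {"203.0.113.7"}}
rec_c = {"cookie": {"ck-1"}, "ip_address": {"198.51.100.9"}}

print(multi_identifier_match(rec_a, rec_b, ["cookie", "ip_address"]))  # True
print(multi_identifier_match(rec_a, rec_c, ["cookie", "ip_address"]))  # False
```

Records A and C share a cookie, so a single-identifier cookie rule would link them; the multi-identifier rule does not, because the IP addresses differ.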

Deterministic vs. Probabilistic Matching

Deterministic Matching

Deterministic rules require an exact match on the identifier value (after normalization). Two records are linked if and only if they have the same identifier value.

Most merge rules in SignalSmith are deterministic:

  • Email: alice@company.com = alice@company.com (match), alice@company.com ≠ alice.smith@company.com (no match)
  • Phone: +15550100 = +15550100 (match)
  • Customer ID: CRM-123 = CRM-123 (match)

Deterministic matching has high precision (few false positives) but can miss matches when identifiers differ slightly (typos, or formatting variations that normalization does not cover).
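A minimal Python sketch of "exact match after normalization" follows. The normalization rules shown (trim and lowercase for email, strip non-digits for phone) are assumptions for illustration; the source does not describe SignalSmith's actual normalization.

```python
import re

def normalize_email(raw):
    # Assumed rule: trim surrounding whitespace and lowercase.
    return raw.strip().lower()

def normalize_phone(raw):
    # Assumed rule: keep digits only and prefix "+".
    # This presumes the input already includes a country code.
    digits = re.sub(r"\D", "", raw)
    return "+" + digits

def deterministic_match(a, b, normalize):
    """Exact equality on the normalized values."""
    return normalize(a) == normalize(b)

print(deterministic_match(" Alice@Company.com", "alice@company.com", normalize_email))       # True
print(deterministic_match("alice@company.com", "alice.smith@company.com", normalize_email))  # False
print(deterministic_match("+1 (555) 0100", "+15550100", normalize_phone))                    # True
```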

Probabilistic Matching

Probabilistic rules use similarity scoring rather than exact matching. Two records are linked if their identifier values are sufficiently similar according to a scoring function.

SignalSmith supports three scoring functions:

Function   | Identifier Family                | Method
Fuzzy Name | first_name, last_name, full_name | SOUNDEX blocking + Jaro-Winkler / Levenshtein
IP Cluster | ip_address                       | Exact IP match + time window proximity
Household  | postal_code + name               | Postal code match + name similarity

Each probabilistic edge carries a confidence score (0.0–1.0) that propagates through the graph and can be filtered at query time without re-running resolution.
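To make similarity scoring and query-time filtering concrete, here is a small Python sketch that uses Levenshtein distance only. It omits SOUNDEX blocking and Jaro-Winkler, and the mapping from edit distance to a 0.0–1.0 score is an assumption for this example; how SignalSmith actually derives confidence is not described here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def name_similarity(a, b):
    """Assumed scoring: 1 minus edit distance over the longer length."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical scored edges: (record_a, record_b, confidence)
edges = [
    ("A", "B", name_similarity("Jonathan", "Jonathon")),
    ("A", "C", name_similarity("Jonathan", "Joanna")),
]

def filter_edges(edges, min_confidence):
    """Keep only edges at or above the threshold, without rescoring."""
    return [(a, b) for a, b, score in edges if score >= min_confidence]

print(edges[0][2])               # 0.875
print(filter_edges(edges, 0.8))  # [('A', 'B')]
```

Because the score is stored on the edge, changing the threshold only re-filters the existing edges, which mirrors the idea of filtering at query time without re-running resolution.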

Warning: Probabilistic rules are only available in probabilistic identity graphs. See Probabilistic Matching for full configuration details, confidence scoring, propagation rules, and threshold management.

Transitive Closure

A critical concept in identity resolution: merge rules create direct edges, but the connected-components algorithm follows transitive closure to find all indirectly linked records.

Merge Rule: Email Match
Merge Rule: Phone Match

Record A: email=alice@co.com
Record B: email=alice@co.com, phone=555-0100
Record C: phone=555-0100

Direct edges:
  A ←→ B  (shared email)
  B ←→ C  (shared phone)

Transitive closure:
  Cluster: {A, B, C}  — all three are the same person

Record A and Record C share no identifier directly, but they are linked through Record B. This is transitive closure at work.
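One common way to compute connected components is a union-find (disjoint-set) structure. The Python sketch below shows the idea on the example above; it is an illustration of the general technique, and the source does not say which algorithm SignalSmith uses internally.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters of transitively connected records."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in edges:
        union(a, b)

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Direct edges from the example, plus an unrelated record D.
clusters = connected_components(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C")],
)
print(sorted(sorted(c) for c in clusters))  # [['A', 'B', 'C'], ['D']]
```

A and C end up in the same cluster even though no edge connects them directly.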

Benefits

Transitive closure is essential for cross-system resolution. System A might share emails with System B, and System B might share phone numbers with System C. Without transitive closure, Systems A and C would never be linked.

Risks

Transitive closure can also cause over-merging. A single bad identifier match (e.g., a shared generic email like “test@test.com”) can link many unrelated records. This is why limit rules are critical.

Rule Priority

When multiple merge rules are defined, they are evaluated in priority order. Higher-priority rules are evaluated first, and their matches are processed before lower-priority rules.

Priority matters when rules conflict or when limit rules constrain the graph:

  • High-priority rule matches are preserved when a limit rule forces a cluster to be split
  • Lower-priority edges are the ones removed when cluster size limits are exceeded

A common priority order, from highest to lowest:
  1. Customer ID matches (highest) — Deterministic, system-assigned identifiers
  2. Email matches — High-confidence, widely available
  3. Phone matches — Strong signal, some sharing risk (households)
  4. Multi-identifier matches — Require multiple signals, good for weaker identifiers
  5. Device/cookie matches (lowest) — Weakest signal, highest sharing risk
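The interaction between priority and cluster size limits can be approximated with the Python sketch below. It is only one possible strategy, assumed for illustration: edges are accepted in priority order, and an edge is dropped if accepting it would push a cluster past the limit. The function name `resolve_with_limit` and the edge format are hypothetical, and SignalSmith's actual limit-rule behavior may differ.

```python
def resolve_with_limit(nodes, edges, max_cluster_size):
    """edges: list of (priority, a, b); a lower number means higher priority.

    Returns (kept_edges, dropped_edges).
    """
    parent = {node: node for node in nodes}
    size = {node: 1 for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept, dropped = [], []
    # Process high-priority edges first so they are the ones preserved.
    for priority, a, b in sorted(edges, key=lambda edge: edge[0]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            kept.append((priority, a, b))  # already in the same cluster
        elif size[root_a] + size[root_b] > max_cluster_size:
            dropped.append((priority, a, b))  # would exceed the limit
        else:
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            kept.append((priority, a, b))
    return kept, dropped

edges = [
    (5, "C", "D"),  # device match (lowest priority)
    (1, "A", "B"),  # customer ID match (highest priority)
    (2, "B", "C"),  # email match
]
kept, dropped = resolve_with_limit(["A", "B", "C", "D"], edges, max_cluster_size=3)
print(kept)     # [(1, 'A', 'B'), (2, 'B', 'C')]
print(dropped)  # [(5, 'C', 'D')]
```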

Example Configuration

A typical identity resolution setup for an e-commerce company:

Priority | Rule Name          | Type                  | Condition                    | Notes
1        | CRM ID Match       | Single, Deterministic | CRM ID family, any variant   | Strongest signal
2        | Email Match        | Single, Deterministic | Email family, any variant    | High confidence
3        | Mobile Phone Match | Single, Deterministic | Phone family, Mobile variant | Good signal, mobile only
4        | Phone + Name Match | Multi, Deterministic  | Phone AND Name families      | For home/work phones
5        | MAID Match         | Single, Deterministic | Device family, MAID variant  | Cross-app linking
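For teams that keep rule definitions in code or version control, the setup above could be written as plain data. The schema below (keys `priority`, `name`, `families`, `variant`) and the `validate_rules` helper are hypothetical and are not SignalSmith's configuration format.

```python
# Hypothetical schema: the keys are illustrative, not SignalSmith's format.
RULES = [
    {"priority": 3, "name": "Mobile Phone Match", "families": ["Phone"], "variant": "Mobile"},
    {"priority": 1, "name": "CRM ID Match", "families": ["CRM ID"], "variant": None},
    {"priority": 5, "name": "MAID Match", "families": ["Device"], "variant": "MAID"},
    {"priority": 2, "name": "Email Match", "families": ["Email"], "variant": None},
    {"priority": 4, "name": "Phone + Name Match", "families": ["Phone", "Name"], "variant": None},
]

def validate_rules(rules):
    """Check basic consistency and return the rules in priority order."""
    priorities = [rule["priority"] for rule in rules]
    if len(set(priorities)) != len(priorities):
        raise ValueError("each rule needs a unique priority")
    for rule in rules:
        if not rule["families"]:
            raise ValueError(f"rule {rule['name']!r} has no identifier family")
    return sorted(rules, key=lambda rule: rule["priority"])

ordered = validate_rules(RULES)
print([rule["name"] for rule in ordered])
# ['CRM ID Match', 'Email Match', 'Mobile Phone Match', 'Phone + Name Match', 'MAID Match']
```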

Best Practices

  • Start conservative — Begin with high-confidence rules (email, customer ID) and add weaker rules only after reviewing the initial results
  • Use multi-identifier rules for weak identifiers — Device IDs and cookies are frequently shared. Require a second signal before merging.
  • Set priorities thoughtfully — Higher-priority matches are preserved when limit rules force cluster splits
  • Monitor merge quality — After running resolution, review sample clusters to check for false merges. Adjust rules based on findings.
  • Exclude known-bad identifiers — Add common test/placeholder values (test@test.com, 000-000-0000) to a deny list to prevent false merges
  • Document your rules — Include the business rationale for each rule so team members understand why certain merges are allowed or restricted
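
The deny-list practice above can be sketched as a filtering step applied before matching. The `DENY_LIST` structure and the `drop_denied` helper are assumptions for this Python example; the source does not describe how SignalSmith stores or applies deny lists.

```python
# Hypothetical deny list: family -> placeholder values to ignore.
DENY_LIST = {
    "email": {"test@test.com"},
    "phone": {"000-000-0000"},
}

def drop_denied(record, deny_list):
    """Return a copy of the record without deny-listed identifier values."""
    return {
        family: {value for value in values if value not in deny_list.get(family, set())}
        for family, values in record.items()
    }

record = {
    "email": {"test@test.com", "alice@co.com"},
    "phone": {"000-000-0000"},
    "device": {"abc123"},
}

cleaned = drop_denied(record, DENY_LIST)
print(cleaned["email"])   # {'alice@co.com'}
print(cleaned["phone"])   # set()
print(cleaned["device"])  # {'abc123'}
```

Removing placeholder values before edges are built means they can never act as the single bad match that links many unrelated records.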

Next Steps