Merge Rules
Merge rules define the conditions under which two customer records are linked together in the identity graph. A merge rule specifies which identifier matches constitute sufficient evidence that two records represent the same person.
How Merge Rules Work
When building the identity graph, SignalSmith evaluates each merge rule against every pair of records that share at least one identifier value. If a pair of records satisfies any merge rule, an edge is created between them in the identity graph.
After all edges are created, the connected-components algorithm finds clusters of transitively connected records — groups of records that are all linked directly or indirectly.
Merge Rule: "Email Match"
Condition: Records share an email address (any variant)
Record A: email=alice@co.com, phone=555-0100
Record B: email=alice@co.com, device=abc123
Record C: phone=555-0100
Result:
A ←→ B (shared email: alice@co.com)
A, C: not linked by this rule (no shared email)
If a second merge rule exists for phone matching, A and C would also be linked.
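The edge-creation step in the example above can be sketched as follows. This is a minimal illustration, not SignalSmith's implementation: the record layout, the field names, and the `edges_for_rule` helper are assumptions made for the example. Grouping records by identifier value avoids comparing every pair of records directly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records mirroring the example above; field names are illustrative.
records = {
    "A": {"email": "alice@co.com", "phone": "555-0100"},
    "B": {"email": "alice@co.com", "device": "abc123"},
    "C": {"phone": "555-0100"},
}

def edges_for_rule(records, family):
    """Return the set of record-ID pairs that share a value in `family`."""
    by_value = defaultdict(list)
    for record_id, identifiers in records.items():
        value = identifiers.get(family)
        if value is not None:
            by_value[value].append(record_id)

    edges = set()
    for ids in by_value.values():
        # Every pair of records holding the same value gets one edge.
        for left, right in combinations(sorted(ids), 2):
            edges.add((left, right))
    return edges

email_edges = edges_for_rule(records, "email")  # {('A', 'B')}
phone_edges = edges_for_rule(records, "phone")  # {('A', 'C')}
```

With only the email rule, A and B are linked and C stays on its own. Adding the phone rule contributes the A-C edge.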
Rule Types
Single Identifier Match
The simplest merge rule: two records are linked if they share a value in a specified identifier family.
Configuration:
- Identifier family — Which family to match on (e.g., Email)
- Variant (optional) — Restrict to a specific variant (e.g., only “Personal Email”), or match any variant in the family
Examples:
| Rule Name | Family | Variant | Meaning |
|---|---|---|---|
| Email Match | Email | Any | Link if records share any email address |
| Personal Email Match | Email | Personal | Link only if records share a personal email |
| Mobile Phone Match | Phone | Mobile | Link only if records share a mobile phone number |
| CRM ID Match | Customer ID | CRM ID | Link if records share a CRM ID |
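As a rough sketch of how a single-identifier rule with an optional variant might be evaluated, consider the following. The `Identifier` and `SingleIdentifierRule` classes and the "Work" variant are hypothetical names introduced for illustration. The sketch also assumes that "any variant" lets values match across different variants of the same family, which the rule descriptions above suggest but do not spell out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Identifier:
    family: str
    variant: str
    value: str

@dataclass(frozen=True)
class SingleIdentifierRule:
    name: str
    family: str
    variant: Optional[str] = None  # None means "any variant in the family"

    def values(self, identifiers):
        """Collect the values this rule is allowed to compare."""
        return {
            i.value
            for i in identifiers
            if i.family == self.family
            and (self.variant is None or i.variant == self.variant)
        }

    def matches(self, left, right):
        """True if both records share at least one eligible value."""
        return bool(self.values(left) & self.values(right))

rec_1 = [Identifier("Email", "Personal", "alice@co.com")]
rec_2 = [Identifier("Email", "Work", "alice@co.com")]

any_email = SingleIdentifierRule("Email Match", "Email")
personal_only = SingleIdentifierRule("Personal Email Match", "Email", "Personal")

any_result = any_email.matches(rec_1, rec_2)           # True
personal_result = personal_only.matches(rec_1, rec_2)  # False
```

The same address links the two records under the family-wide rule but not under the variant-restricted one, because the second record holds it only as a work email.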
Multi-Identifier Match
A more restrictive rule: two records are linked only if they share values in multiple identifier families simultaneously. This reduces false positives at the cost of missing some true matches.
Configuration:
- Identifier families — Two or more families that must all match
- Variants (optional) — Restrict to specific variants within each family
- Logic — AND (all must match)
Examples:
| Rule Name | Families | Meaning |
|---|---|---|
| Phone + Name | Phone AND Name | Link if records share a phone number and have the same name |
| Device + Email Domain | Device AND Email Domain | Link if records share a device ID and have email addresses from the same domain |
| Cookie + IP | Cookie AND IP Address | Link if records share a cookie and IP address |
Use multi-identifier rules for weaker identifier types (device IDs, cookies), where a single match is not conclusive on its own.
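The AND logic can be illustrated with a short sketch. The record layout (a dictionary mapping each family to a set of values) and the function names are assumptions for this example only.

```python
def shares_value(left, right, family):
    """True if both records hold at least one common value in `family`."""
    return bool(left.get(family, set()) & right.get(family, set()))

def multi_identifier_match(left, right, families):
    """AND logic: every listed family must have a shared value."""
    if len(families) < 2:
        raise ValueError("a multi-identifier rule needs two or more families")
    return all(shares_value(left, right, family) for family in families)

# Illustrative records; the IP addresses come from documentation-reserved ranges.
rec_1 = {"Cookie": {"ck-1"}, "IP Address": {"203.0.113.7"}}
rec_2 = {"Cookie": {"ck-1"}, "IP Address": {"198.51.100.4"}}
rec_3 = {"Cookie": {"ck-1"}, "IP Address": {"203.0.113.7"}}

rule_families = ["Cookie", "IP Address"]
cookie_only = multi_identifier_match(rec_1, rec_2, rule_families)    # False
cookie_and_ip = multi_identifier_match(rec_1, rec_3, rule_families)  # True
```

A shared cookie alone is not enough here: the first pair is rejected because the IP addresses differ.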
Deterministic vs. Probabilistic Matching
Deterministic Matching
Deterministic rules require an exact match on the identifier value (after normalization). Two records are linked if and only if they have the same identifier value.
Most merge rules in SignalSmith are deterministic:
- Email: alice@company.com = alice@company.com (match), alice@company.com ≠ alice.smith@company.com (no match)
- Phone: +15550100 = +15550100 (match)
- Customer ID: CRM-123 = CRM-123 (match)
Deterministic matching is high precision (few false positives) but can miss matches when identifiers are slightly different (typos, formatting variations).
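A minimal sketch of exact matching after normalization is shown below. The specific normalization steps (trimming and lowercasing emails, stripping phone punctuation) are assumptions chosen for illustration. This page does not specify which normalization SignalSmith applies.

```python
import re

def normalize_email(raw):
    """Assumed normalization: trim whitespace and lowercase."""
    return raw.strip().lower()

def normalize_phone(raw):
    """Assumed normalization: keep digits only, then prefix with '+'."""
    digits = re.sub(r"\D", "", raw)
    return "+" + digits

def deterministic_match(left, right, normalize):
    """Exact equality on the normalized values."""
    return normalize(left) == normalize(right)

same_email = deterministic_match(
    " Alice@Company.com", "alice@company.com", normalize_email
)  # True
different_email = deterministic_match(
    "alice@company.com", "alice.smith@company.com", normalize_email
)  # False
same_phone = deterministic_match(
    "+1 (555) 0100", "+15550100", normalize_phone
)  # True
```

Normalization absorbs formatting variations such as case and punctuation, but a typo or a genuinely different address still fails the equality check.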
Probabilistic Matching
Probabilistic rules use similarity scoring rather than exact matching. Two records are linked if their identifier values are sufficiently similar according to a scoring function.
SignalSmith supports three scoring functions:
| Function | Identifier Family | Method |
|---|---|---|
| Fuzzy Name | first_name, last_name, full_name | SOUNDEX blocking + Jaro-Winkler / Levenshtein |
| IP Cluster | ip_address | Exact IP match + time window proximity |
| Household | postal_code + name | Postal code match + name similarity |
Each probabilistic edge carries a confidence score (0.0–1.0) that propagates through the graph and can be filtered at query time without re-running resolution.
Probabilistic rules are only available in probabilistic identity graphs. See Probabilistic Matching for full configuration details, confidence scoring, propagation rules, and threshold management.
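To make similarity scoring and query-time filtering concrete, here is a simplified sketch. It uses a plain normalized Levenshtein similarity. It is not the Fuzzy Name function described above, which also involves SOUNDEX blocking and Jaro-Winkler. The names, the 0.85 threshold, and the edge layout are illustrative assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Score in [0.0, 1.0]; 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

# Each edge carries its score so it can be filtered later without re-scoring.
scored_edges = [
    ("A", "B", similarity("Jonathan", "Jonathon")),  # 0.875
    ("A", "C", similarity("Jonathan", "Maria")),
]

THRESHOLD = 0.85  # hypothetical confidence cutoff applied at query time
kept = [(left, right) for left, right, score in scored_edges if score >= THRESHOLD]
# kept == [('A', 'B')]
```

Storing the score on the edge is what allows a stricter or looser threshold to be applied later without recomputing the comparison.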
Transitive Closure
Transitive closure is a critical concept in identity resolution: merge rules create only direct edges, and the connected-components algorithm then groups every record that can be reached through a chain of those edges, including records that are linked only indirectly.
Merge Rule: Email Match
Merge Rule: Phone Match
Record A: email=alice@co.com
Record B: email=alice@co.com, phone=555-0100
Record C: phone=555-0100
Direct edges:
A ←→ B (shared email)
B ←→ C (shared phone)
Transitive closure:
Cluster: {A, B, C} (all three are the same person)
Record A and Record C share no identifier directly, but they are linked through Record B. This is transitive closure at work.
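One common way to compute connected components is a union-find (disjoint-set) structure, sketched below. This page does not say which algorithm SignalSmith uses internally, so treat this as an illustration of the concept. Record D is a hypothetical unlinked record added to show a singleton cluster.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters of directly or indirectly linked records."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

direct_edges = [("A", "B"), ("B", "C")]  # shared email, shared phone
clusters = connected_components(["A", "B", "C", "D"], direct_edges)
# clusters == [{'A', 'B', 'C'}, {'D'}]
```

A and C end up in the same cluster even though no edge connects them directly.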
Benefits
Transitive closure is essential for cross-system resolution. System A might share emails with System B, and System B might share phone numbers with System C. Without transitive closure, Systems A and C would never be linked.
Risks
Transitive closure can also cause over-merging. A single bad identifier match (e.g., a shared generic email like “test@test.com”) can link many unrelated records. This is why limit rules are critical.
Rule Priority
When multiple merge rules are defined, they are evaluated in priority order. Higher-priority rules are evaluated first, and their matches are processed before those of lower-priority rules.
Priority matters when rules conflict or when limit rules constrain the graph:
- High-priority rule matches are preserved when a limit rule forces a cluster to be split
- Lower-priority edges are the ones removed when cluster size limits are exceeded
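One way to picture the interaction between priority and a cluster size limit is a greedy pass that applies edges from highest to lowest priority and skips any edge that would push a cluster past the limit. This is only a sketch under that assumption. The page does not describe how limit rules actually choose which edges to remove, and `max_cluster_size` and the edge tuple layout are hypothetical.

```python
def resolve_with_limit(nodes, edges, max_cluster_size):
    """Apply edges in priority order, dropping any that would exceed the size limit.

    `edges` holds (priority, left, right) tuples; a lower number means a
    higher priority, matching the numbering used for rules on this page.
    """
    parent = {node: node for node in nodes}
    size = {node: 1 for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept, dropped = [], []
    for priority, left, right in sorted(edges):
        root_left, root_right = find(left), find(right)
        if root_left == root_right:
            kept.append((priority, left, right))  # already in the same cluster
        elif size[root_left] + size[root_right] <= max_cluster_size:
            parent[root_right] = root_left
            size[root_left] += size[root_right]
            kept.append((priority, left, right))
        else:
            dropped.append((priority, left, right))  # would exceed the limit
    return kept, dropped

edges = [
    (5, "C", "D"),  # device match (lowest priority)
    (1, "A", "B"),  # CRM ID match (highest priority)
    (2, "B", "C"),  # email match
]
kept, dropped = resolve_with_limit(["A", "B", "C", "D"], edges, max_cluster_size=3)
# kept == [(1, 'A', 'B'), (2, 'B', 'C')]
# dropped == [(5, 'C', 'D')]
```

Because the high-priority edges are applied first, the device match is the one left out when the cluster reaches three records.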
Recommended Priority Order
- Customer ID matches (highest) — Deterministic, system-assigned identifiers
- Email matches — High-confidence, widely available
- Phone matches — Strong signal, some sharing risk (households)
- Multi-identifier matches — Require multiple signals, good for weaker identifiers
- Device/cookie matches (lowest) — Weakest signal, highest sharing risk
Example Configuration
A typical identity resolution setup for an e-commerce company:
| Priority | Rule Name | Type | Condition | Notes |
|---|---|---|---|---|
| 1 | CRM ID Match | Single, Deterministic | CRM ID family, any variant | Strongest signal |
| 2 | Email Match | Single, Deterministic | Email family, any variant | High confidence |
| 3 | Mobile Phone Match | Single, Deterministic | Phone family, Mobile variant | Good signal, mobile only |
| 4 | Phone + Name Match | Multi, Deterministic | Phone AND Name families | For home/work phones |
| 5 | MAID Match | Single, Deterministic | Device family, MAID variant | Cross-app linking |
Best Practices
- Start conservative — Begin with high-confidence rules (email, customer ID) and add weaker rules only after reviewing the initial results
- Use multi-identifier rules for weak identifiers — Device IDs and cookies are frequently shared. Require a second signal before merging.
- Set priorities thoughtfully — Higher-priority matches are preserved when limit rules force cluster splits
- Monitor merge quality — After running resolution, review sample clusters to check for false merges. Adjust rules based on findings.
- Exclude known-bad identifiers — Add common test/placeholder values (test@test.com, 000-000-0000) to a deny list to prevent false merges
- Document your rules — Include the business rationale for each rule so team members understand why certain merges are allowed or restricted
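The deny-list practice above can be sketched as a filter applied before merge rules are evaluated. The `DENY_LIST` structure and the `usable_identifiers` helper are hypothetical. This page does not describe how SignalSmith stores or applies its deny list.

```python
# Hypothetical deny list keyed by identifier family; values are stored lowercase.
DENY_LIST = {
    "email": {"test@test.com"},
    "phone": {"000-000-0000"},
}

def usable_identifiers(identifiers):
    """Drop placeholder values so they can never produce a merge edge."""
    return {
        family: value
        for family, value in identifiers.items()
        if value.strip().lower() not in DENY_LIST.get(family, set())
    }

record = {"email": "Test@Test.com", "phone": "555-0100"}
cleaned = usable_identifiers(record)
# cleaned == {'phone': '555-0100'}
```

Filtering before edge creation keeps a placeholder value from linking unrelated records through transitive closure.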
Next Steps
- Limit Rules — Prevent over-merging
- Identifier Families — Define the identifiers that merge rules operate on
- Running Resolution — Execute identity resolution with your rules