Merge Rules

Merge rules define the conditions under which two customer records are linked together in the identity graph. A merge rule specifies which identifier matches constitute sufficient evidence that two records represent the same person.

How Merge Rules Work

When building the identity graph, SignalSmith evaluates each merge rule against every pair of records that share at least one identifier value. If a pair of records satisfies any merge rule, an edge is created between them in the identity graph.

After all edges are created, the connected-components algorithm finds clusters of transitively connected records — groups of records that are all linked directly or indirectly.

Merge Rule: "Email Match"
  Condition: Records share an email address (any variant)

Record A: email=alice@co.com, phone=555-0100
Record B: email=alice@co.com, device=abc123
Record C: phone=555-0100

Result:
  A ←→ B  (shared email: alice@co.com)
  A  ✗  C  (no edge from this rule: no shared email)

If a second merge rule for phone matching exists, A and C are also linked, because they share the phone number 555-0100.
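The edge-building step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not SignalSmith's implementation: the record layout (a dict of family name to value) and the helper name `edges_for_family` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record layout: record id -> {identifier family: value}
records = {
    "A": {"email": "alice@co.com", "phone": "555-0100"},
    "B": {"email": "alice@co.com", "device": "abc123"},
    "C": {"phone": "555-0100"},
}

def edges_for_family(records, family):
    """Return sorted record-id pairs that share a value in `family`."""
    # Group record ids by identifier value so only candidate pairs are compared.
    by_value = defaultdict(list)
    for rec_id, identifiers in records.items():
        if family in identifiers:
            by_value[identifiers[family]].append(rec_id)
    edges = set()
    for rec_ids in by_value.values():
        for a, b in combinations(sorted(rec_ids), 2):
            edges.add((a, b))
    return sorted(edges)

email_edges = edges_for_family(records, "email")  # "Email Match" rule
phone_edges = edges_for_family(records, "phone")  # a second, phone-based rule
```

Grouping by value first avoids comparing every record with every other record; only records that already share a value are paired.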

Rule Types

Single Identifier Match

The simplest merge rule: two records are linked if they share a value in a specified identifier family.

Configuration:

Identifier family — Which family to match on (e.g., Email)
Variant (optional) — Restrict to a specific variant (e.g., only “Personal Email”), or match any variant in the family

Examples:

Rule Name	Family	Variant	Meaning
Email Match	Email	Any	Link if records share any email address
Personal Email Match	Email	Personal	Link only if records share a personal email
Mobile Phone Match	Phone	Mobile	Link only if records share a mobile phone number
CRM ID Match	Customer ID	CRM ID	Link if records share a CRM ID
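As a rough sketch of how the optional variant restriction could behave, the snippet below models a single-identifier rule in Python. The `Identifier` and `SingleRule` types and the `rule_matches` function are hypothetical names, and the sketch assumes that a variant-restricted rule requires both records to carry the shared value under that variant.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Identifier:
    family: str
    variant: str
    value: str

@dataclass(frozen=True)
class SingleRule:
    name: str
    family: str
    variant: Optional[str] = None  # None means "match any variant"

def rule_matches(rule, ids_a, ids_b):
    """True if both records share a value in the rule's family (and variant)."""
    def values(ids):
        return {
            i.value for i in ids
            if i.family == rule.family
            and (rule.variant is None or i.variant == rule.variant)
        }
    return bool(values(ids_a) & values(ids_b))

# The same address stored as a work email on one record, personal on the other.
rec_a = [Identifier("Email", "Work", "alice@co.com")]
rec_b = [Identifier("Email", "Personal", "alice@co.com")]

any_email = rule_matches(SingleRule("Email Match", "Email"), rec_a, rec_b)
personal_only = rule_matches(
    SingleRule("Personal Email Match", "Email", "Personal"), rec_a, rec_b
)
```

Under these assumptions the unrestricted rule links the two records, while the variant-restricted rule does not.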

Multi-Identifier Match

A more restrictive rule: two records are linked only if they share values in multiple identifier families simultaneously. This reduces false positives at the cost of missing some true matches.

Configuration:

Identifier families — Two or more families that must all match
Variants (optional) — Restrict to specific variants within each family
Logic — AND (all must match)

Examples:

Rule Name	Families	Meaning
Phone + Name	Phone AND Name	Link if records share a phone number and have the same name
Device + Email Domain	Device AND Email Domain	Link if records share a device ID and have email addresses from the same domain
Cookie + IP	Cookie AND IP Address	Link if records share a cookie and IP address

Because of this trade-off, multi-identifier rules are best suited to weaker identifier types (device IDs, cookies), where a single match is not conclusive.
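The AND logic can be illustrated with a short Python sketch. The record layout (family name mapped to a set of values) and the function names are assumptions made for this example only.

```python
def shares_family(rec_a, rec_b, family):
    """True if the two records have at least one common value in `family`."""
    return bool(rec_a.get(family, set()) & rec_b.get(family, set()))

def multi_rule_matches(families, rec_a, rec_b):
    """AND logic: every listed family must have a shared value."""
    if len(families) < 2:
        raise ValueError("a multi-identifier rule needs two or more families")
    return all(shares_family(rec_a, rec_b, f) for f in families)

# Hypothetical records: family -> set of values
r1 = {"cookie": {"ck-1"}, "ip": {"203.0.113.7"}}
r2 = {"cookie": {"ck-1"}, "ip": {"198.51.100.2"}}
r3 = {"cookie": {"ck-1"}, "ip": {"203.0.113.7"}}

cookie_only = multi_rule_matches(["cookie", "ip"], r1, r2)    # cookie matches, IP does not
cookie_and_ip = multi_rule_matches(["cookie", "ip"], r1, r3)  # both match
```

A shared cookie alone is not enough here; the pair is linked only when the IP address matches as well.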

Deterministic vs. Probabilistic Matching

Deterministic Matching

Deterministic rules require an exact match on the identifier value (after normalization). Two records are linked if and only if they have the same identifier value.

Most merge rules in SignalSmith are deterministic:

Email: alice@company.com = alice@company.com (match), alice@company.com ≠ alice.smith@company.com (no match)
Phone: +15550100 = +15550100 (match)
Customer ID: CRM-123 = CRM-123 (match)

Deterministic matching has high precision (few false positives), but it can miss matches when identifier values differ slightly, for example through typos or formatting variations that normalization does not cover.
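To make "exact match after normalization" concrete, here is a minimal Python sketch. The normalization steps shown (trimming and lowercasing emails, reducing phone numbers to digits with a default country code) are assumptions for illustration; the source does not describe the normalization SignalSmith actually applies.

```python
import re

def normalize_email(value):
    # Assumed normalization: trim whitespace and lowercase.
    return value.strip().lower()

def normalize_phone(value, default_country_code="1"):
    # Assumed normalization: keep digits only and prefix a country code
    # when the input does not already start with "+".
    digits = re.sub(r"\D", "", value)
    if value.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits

def deterministic_match(a, b, normalize):
    """Exact equality on the normalized values."""
    return normalize(a) == normalize(b)

same_email = deterministic_match(
    "Alice@Company.com ", "alice@company.com", normalize_email
)
different_email = deterministic_match(
    "alice@company.com", "alice.smith@company.com", normalize_email
)
same_phone = deterministic_match("+1 555-0100", "555-0100", normalize_phone)
```

Formatting differences disappear after normalization, but a genuinely different value such as alice.smith@company.com still fails to match.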

Probabilistic Matching

Probabilistic rules use similarity scoring rather than exact matching. Two records are linked if their identifier values are sufficiently similar according to a scoring function.

SignalSmith supports three scoring functions:

Function	Identifier Family	Method
Fuzzy Name	first_name, last_name, full_name	SOUNDEX blocking + Jaro-Winkler / Levenshtein
IP Cluster	ip_address	Exact IP match + time window proximity
Household	postal_code + name	Postal code match + name similarity

Each probabilistic edge carries a confidence score (0.0–1.0) that propagates through the graph and can be filtered at query time without re-running resolution.

⚠️ Probabilistic rules are only available in probabilistic identity graphs. See Probabilistic Matching for full configuration details, confidence scoring, propagation rules, and threshold management.
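The idea of similarity scoring with a confidence threshold can be sketched as follows. This simplified Python example uses only a normalized Levenshtein distance; it omits the SOUNDEX blocking and Jaro-Winkler steps listed in the table, and the function names and the 0.8 threshold are assumptions rather than SignalSmith defaults.

```python
def levenshtein(a, b):
    """Classic edit distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def name_similarity(a, b):
    """Similarity in [0.0, 1.0]; 1.0 means identical after lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

def probabilistic_edge(a, b, threshold=0.8):
    """Return the confidence score if it clears the threshold, else None."""
    score = name_similarity(a, b)
    return score if score >= threshold else None

close_names = probabilistic_edge("Katherine", "Catherine")
unrelated_names = probabilistic_edge("Alice", "Bob")
```

Keeping the score on the edge, rather than a yes/no decision, is what allows a stricter threshold to be applied later without recomputing similarities.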

Transitive Closure

A critical concept in identity resolution: merge rules create direct edges, but the connected-components algorithm follows transitive closure to find all indirectly linked records.

Merge Rules: Email Match, Phone Match

Record A: email=alice@co.com
Record B: email=alice@co.com, phone=555-0100
Record C: phone=555-0100

Direct edges:
  A ←→ B  (shared email)
  B ←→ C  (shared phone)

Transitive closure:
  Cluster: {A, B, C}  (all three are treated as the same person)

Record A and Record C share no identifier directly, but they are linked through Record B. This is transitive closure at work.
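Connected components over the direct edges can be computed with a union-find (disjoint-set) structure. The Python sketch below shows one standard way to do this and is not necessarily how SignalSmith implements it; record D is an extra, hypothetical unlinked record added to show that unrelated records stay in their own cluster.

```python
def connected_components(record_ids, edges):
    """Group records into clusters of transitively connected ids."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())

direct_edges = [("A", "B"), ("B", "C")]  # shared email, shared phone
clusters = connected_components(["A", "B", "C", "D"], direct_edges)
```

A and C end up in the same cluster even though no edge connects them directly.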

Benefits

Transitive closure is essential for cross-system resolution. System A might share emails with System B, and System B might share phone numbers with System C. Without transitive closure, Systems A and C would never be linked.

Risks

Transitive closure can also cause over-merging. A single bad identifier match (e.g., a shared generic email like “test@test.com”) can link many unrelated records. This is why limit rules are critical.

Rule Priority

When multiple merge rules are defined, they are evaluated in priority order: higher-priority rules are evaluated first, so their matches are processed before those of lower-priority rules.

Priority matters when rules conflict or when limit rules constrain the graph:

High-priority rule matches are preserved when a limit rule forces a cluster to be split
Lower-priority edges are the ones removed when cluster size limits are exceeded
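One way to picture how priority interacts with a cluster size limit is a greedy merge that processes edges from highest to lowest priority and drops any edge that would push a cluster past the limit. The Python sketch below is an illustrative assumption, not SignalSmith's documented splitting algorithm; the function name, the `(priority, a, b)` edge tuples, and the `max_cluster_size` parameter are all hypothetical.

```python
def merge_with_size_limit(record_ids, prioritized_edges, max_cluster_size):
    """Apply edges in priority order (1 = highest); drop edges that overflow."""
    parent = {r: r for r in record_ids}
    size = {r: 1 for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept, dropped = [], []
    for priority, a, b in sorted(prioritized_edges):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            kept.append((priority, a, b))  # already in the same cluster
        elif size[root_a] + size[root_b] <= max_cluster_size:
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            kept.append((priority, a, b))
        else:
            dropped.append((priority, a, b))  # merge would exceed the limit
    return kept, dropped

edges = [(5, "C", "D"), (1, "A", "B"), (2, "B", "C")]
kept, dropped = merge_with_size_limit(["A", "B", "C", "D"], edges, max_cluster_size=3)
```

In this example the priority-1 and priority-2 edges are kept, and the priority-5 edge is the one sacrificed to respect the limit.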

Recommended Priority Order

Customer ID matches (highest) — Deterministic, system-assigned identifiers
Email matches — High-confidence, widely available
Phone matches — Strong signal, some sharing risk (households)
Multi-identifier matches — Require multiple signals, good for weaker identifiers
Device/cookie matches (lowest) — Weakest signal, highest sharing risk

Example Configuration

A typical identity resolution setup for an e-commerce company:

Priority	Rule Name	Type	Condition	Notes
1	CRM ID Match	Single, Deterministic	Customer ID family, CRM ID variant	Strongest signal
2	Email Match	Single, Deterministic	Email family, any variant	High confidence
3	Mobile Phone Match	Single, Deterministic	Phone family, Mobile variant	Good signal, mobile only
4	Phone + Name Match	Multi, Deterministic	Phone AND Name families	For home/work phones
5	MAID Match	Single, Deterministic	Device family, MAID (mobile advertising ID) variant	Cross-app linking

Best Practices

Start conservative — Begin with high-confidence rules (email, customer ID) and add weaker rules only after reviewing the initial results
Use multi-identifier rules for weak identifiers — Device IDs and cookies are frequently shared. Require a second signal before merging.
Set priorities thoughtfully — Higher-priority matches are preserved when limit rules force cluster splits
Monitor merge quality — After running resolution, review sample clusters to check for false merges. Adjust rules based on findings.
Exclude known-bad identifiers — Add common test/placeholder values (test@test.com, 000-000-0000) to a deny list to prevent false merges
Document your rules — Include the business rationale for each rule so team members understand why certain merges are allowed or restricted
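The deny-list practice above can be sketched as a small pre-processing filter in Python. The `DENY_LIST` structure and the `drop_denied` helper are hypothetical; how deny lists are actually configured in SignalSmith is not covered here.

```python
# Hypothetical deny list: family -> placeholder values that should never link records
DENY_LIST = {
    "email": {"test@test.com"},
    "phone": {"000-000-0000"},
}

def drop_denied(identifiers, deny_list=DENY_LIST):
    """Remove identifiers whose value is a known test/placeholder value."""
    return {
        family: value
        for family, value in identifiers.items()
        if value.strip().lower() not in deny_list.get(family, set())
    }

raw = {"email": "Test@Test.com", "phone": "555-0100"}
cleaned = drop_denied(raw)
```

Filtering before merge rules run means a placeholder value never produces an edge in the first place, which also keeps transitive closure from spreading the bad link.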

Next Steps

Limit Rules — Prevent over-merging
Identifier Families — Define the identifiers that merge rules operate on
Running Resolution — Execute identity resolution with your rules
