Setting Up Identity Resolution

This guide walks you through configuring SignalSmith’s identity resolution engine to unify customer records from multiple data sources into golden records (single, authoritative customer profiles). By the end, you’ll have identifier families, merge rules, limit rules, and a scheduled resolution pipeline producing unified profiles you can explore and activate.

When You Need Identity Resolution

Identity resolution becomes essential when:

Customers interact across channels — A customer signs up with their work email, makes a purchase with their personal email, and browses on their phone. Without resolution, this looks like three separate people.
Data lives in multiple systems — Your CRM has one view, your product database has another, and your marketing platform has a third. Each system uses different identifiers.
You’re building a 360-degree customer view — Traits, audiences, and journeys are only as good as the underlying profiles. Fragmented identities lead to duplicate audiences and inconsistent experiences.
You’re activating across destinations — Sending the same customer to a CRM as three separate leads wastes sales time and damages trust.

How SignalSmith Resolves Identities

SignalSmith uses a connected-components graph algorithm to merge records:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│              │     │              │     │              │
│  Identifier  │────▶│   Identity   │────▶│   Golden     │
│  Collection  │     │   Graph      │     │   Records    │
│              │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

Collect identifiers — SignalSmith gathers all identifiers (emails, phones, device IDs, account IDs) from your mapped schema attributes.
Build the graph — Records and their identifier values form the graph: two records that share an identifier value are connected. Identifier families define which identifiers can form connections.
Find connected components — The algorithm finds all groups of records that are transitively connected through shared identifiers. Each connected component becomes a unified profile.
Apply merge rules — When multiple records contribute conflicting values for the same attribute, merge rules determine which value wins.
Enforce limit rules — Limit rules prevent unreasonably large clusters (super-clusters) that can form from shared device IDs or generic email addresses.
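The graph and connected-components steps above can be sketched with a small union-find structure. This is a minimal illustration, not SignalSmith's implementation: the `resolve` function, the `record_id` key, and the sample field names are all hypothetical, and the sketch matches values per field rather than per identifier family and applies no limit rules.

```python
from collections import defaultdict


def resolve(records, identifier_fields):
    """Group record IDs into clusters that share any identifier value."""
    # Every record starts as its own cluster root.
    parent = {rec["record_id"]: rec["record_id"] for rec in records}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first record seen for each (field, value) pair and
    # union every later record that carries the same value with it.
    first_seen = {}
    for rec in records:
        for field in identifier_fields:
            value = rec.get(field)
            if not value:
                continue
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], rec["record_id"])
            else:
                first_seen[key] = rec["record_id"]

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec["record_id"])].append(rec["record_id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical sample data: three records for one person, one for another.
records = [
    {"record_id": "crm-1", "email": "ana@work.example", "phone": "+15550100"},
    {"record_id": "shop-7", "email": "ana@home.example", "phone": "+15550100"},
    {"record_id": "app-3", "email": "ana@home.example"},
    {"record_id": "crm-2", "email": "bo@work.example"},
]
clusters = resolve(records, ["email", "phone"])
print(clusters)
```

Here `crm-1` and `app-3` share no identifier directly, but both connect to `shop-7` (by phone and by email respectively), so all three land in one cluster. That transitive behavior is also why a single shared identifier can pull unrelated records together.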

Prerequisites

A configured warehouse with customer data
A schema with at least one entity type that has identifier attributes mapped
Multiple data sources or tables contributing to the same entity type (otherwise there’s nothing to resolve)

Step 1: Audit Your Identifier Landscape

Before configuring identity resolution, understand what identifiers exist across your data.

Survey Your Data Sources

For each table or data source that contributes to your customer entity type, catalog the available identifiers:

Data Source	Identifiers Available	Coverage
CRM (Salesforce)	Email, Phone, Account ID	95% email, 60% phone
Product Database	User ID, Email	100% user ID, 90% email
Website Analytics	Cookie ID, Email (when logged in)	100% cookie, 30% email
Mobile App	Device ID, Email, Push Token	100% device ID, 70% email
Support System	Email, Phone, Ticket ID	100% email, 40% phone
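If you want to measure coverage rather than estimate it, a short script over an export of each source is usually enough. The sketch below assumes rows are available as dictionaries; `identifier_coverage` and the column names are hypothetical.

```python
def identifier_coverage(rows, identifier_columns):
    """Return the share of rows with a non-empty value for each column."""
    total = len(rows)
    coverage = {}
    for column in identifier_columns:
        # Treat None, empty strings, and whitespace-only strings as missing.
        filled = sum(1 for row in rows if str(row.get(column) or "").strip())
        coverage[column] = filled / total if total else 0.0
    return coverage


# Hypothetical sample of CRM rows.
crm_rows = [
    {"email": "ana@example.com", "phone": "+15550100"},
    {"email": "bo@example.com", "phone": None},
    {"email": "", "phone": "+15550101"},
    {"email": "cy@example.com", "phone": ""},
]
coverage = identifier_coverage(crm_rows, ["email", "phone"])
for column, share in coverage.items():
    print(f"{column}: {share:.0%}")
```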

Identify Linking Opportunities

Look for identifiers that bridge across sources:

Email bridges CRM, Product, Website (logged-in), Mobile, and Support
Phone bridges CRM and Support
User ID is internal to Product but can link to CRM via an integration table

Flag Problematic Identifiers

Some identifiers create false matches:

Shared device IDs — Family members or shared computers
Generic email addresses — info@company.com, noreply@example.com
Recycled phone numbers — Carriers reassign phone numbers
Test accounts — Internal testing with dummy data

You’ll address these with limit rules in Step 4.

Step 2: Define Identifier Families

Identifier families group identifiers by type and define how they participate in resolution.

Navigate to Identity Resolution in the left sidebar
Click Identifier Families
Click Create Family for each identifier type:

Recommended Identifier Families

Family Name	Identifier Columns	Priority	Notes
Email	`email`, `work_email`, `personal_email`	High	Most reliable cross-source linker
Phone	`phone`, `mobile_phone`, `work_phone`	Medium	Normalize to E.164 format
Device	`cookie_id`, `device_id`, `idfa`, `gaid`	Low	High volume but lower trust
Account	`user_id`, `account_id`, `crm_id`	High	Deterministic, system-assigned

Configuring a Family

For each family:

Name — Descriptive name (e.g., “Email”)
Columns — Select the schema attributes that belong to this family. All columns in a family are treated as the same type of identifier.
Priority — Determines preference when resolving conflicts. Higher-priority families produce more trusted links.
Normalization — SignalSmith normalizes identifiers before matching:
- Emails: convert to lowercase
- Phones: strip formatting, apply E.164 when possible
- All identifiers: trim leading and trailing whitespace
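To check how your own data will look after normalization, you can approximate these rules locally. The functions below are a rough sketch, not SignalSmith's normalizer. In particular, the phone logic assumes that a 10-digit number without a leading `+` is a national number for `default_country_code`, and it rejects anything outside 8 to 15 digits; both are simplifying assumptions.

```python
import re


def normalize_email(value):
    """Trim whitespace and lowercase an email; return None if empty."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def normalize_phone(value, default_country_code="1"):
    """Strip formatting and return an E.164-style string, or None."""
    if value is None:
        return None
    raw = value.strip()
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if not digits:
        return None
    # Assumption: bare 10-digit numbers belong to the default country.
    if not raw.startswith("+") and len(digits) == 10:
        digits = default_country_code + digits
    # E.164 allows at most 15 digits; the lower bound is an assumption.
    if not 8 <= len(digits) <= 15:
        return None
    return "+" + digits


print(normalize_email("  John@Example.com "))
print(normalize_phone("(415) 555-0100"))
print(normalize_phone("+44 20 7946 0958"))
```

Running both functions over a sample of each source before the first resolution run makes it easier to spot values that will not match across systems.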

Step 3: Configure Merge Rules

Merge rules determine which attribute value wins when multiple source records contribute conflicting values for the same attribute in a golden record.

Navigate to Identity Resolution → Merge Rules
Click Create Merge Rule

Merge Strategies

Strategy	Behavior	Best For
Most Recent	Uses the value from the record with the latest update timestamp	Attributes that change over time (address, subscription tier)
Source Priority	Uses the value from a preferred data source	Attributes where one source is authoritative (CRM for name, Product for plan)
Most Frequent	Uses the most common value across all contributing records	Attributes with noisy data (country, language)
Longest	Uses the longest non-null value	Attributes where more detail is better (full name vs. first initial)
Custom SQL	User-defined SQL expression for resolution	Complex merge logic specific to your domain
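The first four strategies can be illustrated in a few lines each. The sketch below assumes every contributing value arrives as a dictionary with `source`, `value`, and `updated_at` keys; that shape and the function names are hypothetical and only meant to show how the strategies differ on the same input.

```python
from collections import Counter


def non_null(candidates):
    """Drop candidates whose value is missing or empty."""
    return [c for c in candidates if c["value"] not in (None, "")]


def most_recent(candidates):
    """Value from the record with the latest update timestamp."""
    return max(non_null(candidates), key=lambda c: c["updated_at"])["value"]


def source_priority(candidates, priority):
    """Value from the highest-ranked source that supplied one."""
    rank = {source: i for i, source in enumerate(priority)}
    best = min(non_null(candidates), key=lambda c: rank.get(c["source"], len(rank)))
    return best["value"]


def most_frequent(candidates):
    """Most common value across all contributing records."""
    counts = Counter(c["value"] for c in non_null(candidates))
    return counts.most_common(1)[0][0]


def longest(candidates):
    """Longest non-null value."""
    return max((c["value"] for c in non_null(candidates)), key=len)


# Hypothetical contributing values for one attribute (ISO dates sort as text).
candidates = [
    {"source": "Support", "value": "J. Rivera", "updated_at": "2024-03-01"},
    {"source": "CRM", "value": "Jordan Rivera", "updated_at": "2023-11-15"},
    {"source": "Product", "value": "J. Rivera", "updated_at": "2024-01-20"},
    {"source": "Website", "value": None, "updated_at": "2024-04-01"},
]
print(most_recent(candidates))
print(source_priority(candidates, ["CRM", "Product", "Support"]))
print(most_frequent(candidates))
print(longest(candidates))
```

On this input, Most Recent and Most Frequent both pick "J. Rivera", while Source Priority and Longest pick "Jordan Rivera". Comparing strategies on real conflicting values like this is a quick way to choose one per attribute.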

Example Merge Rules Configuration

Attribute	Strategy	Details
`first_name`	Source Priority	CRM > Product > Support
`last_name`	Source Priority	CRM > Product > Support
`email`	Most Recent	Primary email from latest interaction
`phone`	Most Recent	Latest known phone number
`address`	Source Priority	CRM (manually verified) > Product (self-reported)
`subscription_tier`	Source Priority	Product database is authoritative
`lifetime_value`	Custom SQL	`SUM(ltv)` across all contributing records

Setting Up a Merge Rule

For each attribute you want in your golden record:

Select the attribute name
Choose the merge strategy
If using Source Priority, drag sources into preference order
If using Custom SQL, write the aggregation expression
Click Save

Attributes without explicit merge rules default to Most Recent.

Step 4: Set Limit Rules to Prevent Super-Clusters

Without limit rules, a single shared identifier (like a device ID from a public computer or a generic email) can merge thousands of unrelated records into one giant cluster. Limit rules prevent this.

Navigate to Identity Resolution → Limit Rules
Click Create Limit Rule

Types of Limit Rules

Maximum Cluster Size — Cap the number of records that can merge into a single golden record.

Maximum cluster size: 50 records
Action when exceeded: Freeze cluster (stop merging new records)

Per-Family Link Limit — Cap how many links a single identifier value can create.

Family: Device
Maximum links per identifier: 10
Action when exceeded: Ignore the identifier for future merges
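To estimate which identifier values a link limit would affect, you can count how many distinct records carry each value. The sketch below is a batch approximation under assumptions: `apply_link_limit` and `record_id` are hypothetical names, "links" are counted as distinct records per value, and a value over the limit is dropped entirely rather than ignored only for future merges.

```python
from collections import defaultdict


def apply_link_limit(records, field, max_links):
    """Split identifier values into (usable, ignored) sets for one field.

    A value is ignored when more than max_links distinct records carry it.
    """
    carriers = defaultdict(set)
    for rec in records:
        value = rec.get(field)
        if value:
            carriers[value].add(rec["record_id"])
    usable = {v for v, ids in carriers.items() if len(ids) <= max_links}
    ignored = set(carriers) - usable
    return usable, ignored


# Hypothetical data: a kiosk device shared by four records, a phone by two.
records = [{"record_id": f"r{i}", "device_id": "kiosk-01"} for i in range(4)]
records += [
    {"record_id": "r4", "device_id": "phone-aa"},
    {"record_id": "r5", "device_id": "phone-aa"},
]
usable, ignored = apply_link_limit(records, "device_id", max_links=3)
print(usable, ignored)
```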

Blocklist — Exclude specific identifier values from resolution entirely.

Family: Email
Blocked values:
  - noreply@*
  - info@*
  - test@*
  - *@example.com
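Before saving a blocklist, it helps to test the patterns against a few real addresses. The sketch below assumes `*` behaves like a shell-style wildcard matching any run of characters, which is how Python's `fnmatch` module treats it; how SignalSmith itself evaluates patterns is not confirmed here, and `is_blocked` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

BLOCKED_EMAIL_PATTERNS = ["noreply@*", "info@*", "test@*", "*@example.com"]


def is_blocked(value, patterns=BLOCKED_EMAIL_PATTERNS):
    """Return True if the normalized value matches any blocklist pattern."""
    normalized = value.strip().lower()
    return any(fnmatchcase(normalized, pattern) for pattern in patterns)


for address in ["info@acme.io", "Jane@Example.com", "jane.info@acme.io"]:
    print(address, is_blocked(address))
```

Note that `info@*` only matches addresses whose local part is exactly `info`, so `jane.info@acme.io` is not blocked.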

Recommended Limit Rules

Rule	Setting	Rationale
Max cluster size	50	No real customer should have more than 50 contributing records
Device link limit	10	Shared devices link too many unrelated users
Email blocklist	Common generic patterns	Prevents mass-merging from shared inboxes
Phone link limit	5	Phone number recycling can create false links

Step 5: Run Initial Full Resolution

With identifier families, merge rules, and limit rules configured, run the first full resolution pass.

Navigate to Identity Resolution → Resolution Runs
Click Run Full Resolution
SignalSmith will:
- Collect all identifier values from your mapped schema attributes
- Build the identity graph
- Find connected components
- Apply merge rules to produce golden records
- Enforce limit rules to break super-clusters
Monitor the run progress — the first full run may take 10–60 minutes depending on data volume

Understanding the Resolution Summary

After the run completes, review the summary:

Metric	Description	Healthy Range
Input Records	Total source records processed	Matches your data size
Golden Records	Resulting unified profiles	60–90% of input records (some merging expected)
Merge Rate	`1 - (Golden Records / Input Records)`	10–40% for typical datasets
Clusters > 10	Number of golden records with more than 10 contributing records	Low (< 1% of golden records)
Blocked Merges	Merges prevented by limit rules	Non-zero if limit rules are working

A merge rate above 50% often indicates overly aggressive linking — review your identifier families and limit rules.
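As a worked example of the merge rate formula and the thresholds above, the sketch below computes the rate from the two summary counts. The function names and the wording of the returned labels are hypothetical.

```python
def merge_rate(input_records, golden_records):
    """Merge rate = 1 - (golden records / input records)."""
    if input_records <= 0:
        raise ValueError("input_records must be positive")
    return 1 - golden_records / input_records


def assess(rate):
    """Classify a merge rate using the ranges from the summary table."""
    if rate > 0.50:
        return "review: possible over-merging"
    if 0.10 <= rate <= 0.40:
        return "typical"
    return "outside typical range"


# 1,000,000 input records resolved into 750,000 golden records.
rate = merge_rate(1_000_000, 750_000)
print(f"{rate:.0%}", assess(rate))
```

With 750,000 golden records from 1,000,000 inputs, golden records are 75% of input and the merge rate is 25%, both inside the healthy ranges.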

Step 6: Verify Golden Records in Profile Explorer

Inspect the resolved profiles to confirm the resolution quality.

Navigate to Identity Resolution → Profile Explorer
Search for a known customer by email, phone, or ID
Review the golden record:
- Merged attributes — Check that the right values were selected by merge rules
- Contributing records — See which source records were merged together
- Identity graph — Visualize the identifier connections that caused the merge
Spot-check several profiles:
- A customer you know has multiple accounts
- A customer who should NOT be merged with anyone
- A high-value customer where accuracy is critical

What to Look For

Good resolution:

Known duplicate accounts are correctly merged
Attribute values come from the expected authoritative source
The identity graph shows logical connections (same email across systems)

Problematic resolution:

Unrelated people merged into one profile (over-merging) → Tighten limit rules or remove problematic identifier families
Known duplicates not merged (under-merging) → Check that the linking identifiers are mapped and in the right family
Wrong attribute values in the golden record → Adjust merge rule priorities

Step 7: Schedule Incremental Resolution

Once you’re satisfied with the initial resolution, set up incremental runs to keep golden records current as new data arrives.

Navigate to Identity Resolution → Settings
Under Schedule, select the resolution frequency:

Frequency	When to Use
Hourly	High-velocity data environments with real-time activation needs
Every 6 hours	Balanced freshness and compute cost for most use cases
Daily	Stable data environments where freshness is less critical
Weekly	Low-volume or batch-oriented data pipelines

Configure the run window — choose off-peak hours to minimize warehouse compute impact
Enable incremental mode, which processes only records that are new or changed since the last run and is significantly faster than full resolution
Click Save

Full vs. Incremental Resolution

Mode	Behavior	Duration	When to Use
Full	Reprocesses all records from scratch	10–60 min	Initial setup, after rule changes, periodic refresh
Incremental	Processes only new/changed records	1–10 min	Scheduled ongoing runs

Schedule a full resolution run weekly or monthly to catch edge cases that incremental processing might miss (e.g., retroactive data corrections).
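Conceptually, incremental mode narrows the input to records that changed after the previous run. The sketch below illustrates that idea only; the `updated_at` field and `select_incremental` function are assumptions, not a description of how SignalSmith detects changes.

```python
from datetime import datetime


def select_incremental(records, last_run_at):
    """Return records created or updated after the previous run."""
    return [r for r in records if r["updated_at"] > last_run_at]


# Hypothetical records and last-run timestamp.
last_run_at = datetime(2024, 5, 1, 2, 0)
records = [
    {"record_id": "crm-1", "updated_at": datetime(2024, 4, 28, 9, 30)},
    {"record_id": "crm-2", "updated_at": datetime(2024, 5, 1, 8, 15)},
    {"record_id": "app-9", "updated_at": datetime(2024, 5, 2, 17, 45)},
]
changed_ids = [r["record_id"] for r in select_incremental(records, last_run_at)]
print(changed_ids)
```

Under this model, a correction applied to `crm-1` that does not update its timestamp would never be selected, which is one reason a periodic full run is still worthwhile.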

Common Pitfalls and Troubleshooting

Over-Merging (Clusters Too Large)

Symptom: Golden records contain records from clearly different people.

Diagnosis:

Open the problematic golden record in Profile Explorer
Examine the identity graph to find the linking identifier
Check if the identifier is a shared device, generic email, or recycled phone number

Solutions:

Add the problematic identifier value to a blocklist
Lower the link limit for the relevant identifier family
Reduce the maximum cluster size
Remove low-trust identifier families (e.g., cookie IDs) from resolution

Under-Merging (Too Many Duplicates)

Symptom: Known duplicate customers remain as separate golden records.

Diagnosis:

Search for the duplicates in Profile Explorer
Check what identifiers each record has
Verify that the linking identifier is mapped in the schema and assigned to an identifier family

Solutions:

Ensure all relevant identifier columns are mapped in your schema
Check that identifier normalization is working (e.g., John@example.com and john@example.com should match after lowercasing)
Add additional identifier families if new linking opportunities exist

Resolution Runs Timing Out

Symptom: Full resolution runs fail or take excessively long.

Solutions:

Check warehouse compute resources — resolution is CPU/memory-intensive
For Snowflake, use a larger warehouse size (at least MEDIUM for datasets > 10M records)
For BigQuery, ensure you have sufficient slot capacity
Break the run into phases: resolve high-priority identifier families first, then add lower-priority ones

Merge Rule Producing Wrong Values

Symptom: Golden record attributes have unexpected values.

Diagnosis:

Open the golden record in Profile Explorer
Check the “Contributing Records” panel to see all source values
Verify which merge rule is applied to the attribute

Solutions:

Adjust the source priority order in your merge rule
Switch to a different merge strategy (e.g., Most Recent instead of Source Priority)
Add a Custom SQL merge rule for complex resolution logic

Next Steps

Identity Resolution concepts — Detailed documentation on the resolution engine
Profile Explorer — Navigate and inspect golden records
Building a Warehouse-First CDP — End-to-end CDP setup guide
Audiences — Build segments on top of resolved profiles
