Setting Up Identity Resolution
This guide walks you through configuring SignalSmith’s identity resolution engine to unify customer records across multiple data sources into golden records — single, authoritative customer profiles. By the end, you’ll have merge rules, limit rules, and a resolution pipeline producing unified profiles you can explore and activate.
When You Need Identity Resolution
Identity resolution becomes essential when:
- Customers interact across channels — A customer signs up with their work email, makes a purchase with their personal email, and browses on their phone. Without resolution, this looks like three separate people.
- Data lives in multiple systems — Your CRM has one view, your product database has another, and your marketing platform has a third. Each system uses different identifiers.
- You’re building a 360-degree customer view — Traits, audiences, and journeys are only as good as the underlying profiles. Fragmented identities lead to duplicate audiences and inconsistent experiences.
- You’re activating across destinations — Sending the same customer to a CRM as three separate leads wastes sales time and damages trust.
How SignalSmith Resolves Identities
SignalSmith uses a connected-components graph algorithm to merge records:
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│              │     │              │     │              │
│  Identifier  │────▶│   Identity   │────▶│    Golden    │
│  Collection  │     │    Graph     │     │   Records    │
│              │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘
- Collect identifiers: SignalSmith gathers all identifiers (emails, phones, device IDs, account IDs) from your mapped schema attributes.
- Build the graph: Each identifier becomes a node. Two records sharing an identifier value are connected by an edge. Identifier families define which identifiers can form connections.
- Find connected components: The algorithm finds all groups of records that are transitively connected through shared identifiers. Each connected component becomes a unified profile.
- Apply merge rules: When multiple records contribute conflicting values for the same attribute, merge rules determine which value wins.
- Enforce limit rules: Limit rules prevent unreasonably large clusters (super-clusters) that can form from shared device IDs or generic email addresses.
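The connected-components step can be illustrated with a small union-find sketch. This is not SignalSmith's implementation; the `resolve` function, the record IDs, and the `(family, value)` pair format are assumptions made for illustration only.

```python
from collections import defaultdict


def resolve(records):
    """Group record IDs into connected components via shared identifiers.

    `records` maps a record ID to a set of (family, value) identifier pairs
    (a hypothetical input shape, chosen for this sketch).
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Path halving keeps lookup chains short.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The first record seen with an identifier becomes the link target
    # for every later record that carries the same identifier.
    first_seen = {}
    for rid, identifiers in records.items():
        for ident in identifiers:
            if ident in first_seen:
                union(first_seen[ident], rid)
            else:
                first_seen[ident] = rid

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(cluster) for cluster in clusters.values())


records = {
    "crm:1": {("email", "ana@work.example"), ("phone", "+15550100")},
    "product:7": {("email", "ana@home.example"), ("phone", "+15550100")},
    "web:42": {("email", "ana@home.example")},
    "support:9": {("email", "bo@example.org")},
}
print(resolve(records))
# [['crm:1', 'product:7', 'web:42'], ['support:9']]
```

Here `crm:1` and `web:42` share no identifier directly, but both connect to `product:7`, so all three land in one component. This transitive behavior is also why a single bad identifier can create a super-cluster.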
Prerequisites
- A configured warehouse with customer data
- A schema with at least one entity type that has identifier attributes mapped
- Multiple data sources or tables contributing to the same entity type (otherwise there’s nothing to resolve)
Step 1: Audit Your Identifier Landscape
Before configuring identity resolution, understand what identifiers exist across your data.
Survey Your Data Sources
For each table or data source that contributes to your customer entity type, catalog the available identifiers:
| Data Source | Identifiers Available | Coverage |
|---|---|---|
| CRM (Salesforce) | Email, Phone, Account ID | 95% email, 60% phone |
| Product Database | User ID, Email | 100% user ID, 90% email |
| Website Analytics | Cookie ID, Email (when logged in) | 100% cookie, 30% email |
| Mobile App | Device ID, Email, Push Token | 100% device ID, 70% email |
| Support System | Email, Phone, Ticket ID | 100% email, 40% phone |
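Coverage figures like those in the table can be estimated from a sample of each source. The following is a minimal sketch, assuming the sample is available as a list of dictionaries; the `identifier_coverage` helper and the column names are hypothetical.

```python
def identifier_coverage(rows, columns):
    """Return the share of rows with a non-empty value for each column."""
    total = len(rows)
    coverage = {}
    for column in columns:
        # Missing keys, None, and empty strings all count as "not populated".
        populated = sum(1 for row in rows if row.get(column))
        coverage[column] = populated / total if total else 0.0
    return coverage


crm_sample = [
    {"email": "ana@work.example", "phone": "+15550100"},
    {"email": "bo@example.org", "phone": None},
    {"email": "", "phone": "+15550101"},
    {"email": "cy@example.org"},
]
print(identifier_coverage(crm_sample, ["email", "phone"]))
# {'email': 0.75, 'phone': 0.5}
```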
Identify Linking Opportunities
Look for identifiers that bridge across sources:
- Email bridges CRM, Product, Website (logged-in), Mobile, and Support
- Phone bridges CRM and Support
- User ID is internal to Product but can link to CRM via an integration table
Flag Problematic Identifiers
Some identifiers create false matches:
- Shared device IDs: Family members or shared computers
- Generic email addresses: info@company.com, noreply@example.com
- Recycled phone numbers: Carriers reassign phone numbers
- Test accounts: Internal testing with dummy data
You’ll address these with limit rules in Step 4.
Step 2: Define Identifier Families
Identifier families group identifiers by type and define how they participate in resolution.
- Navigate to Identity Resolution in the left sidebar
- Click Identifier Families
- Click Create Family for each identifier type:
Recommended Identifier Families
| Family Name | Identifier Columns | Priority | Notes |
|---|---|---|---|
| Email | email, work_email, personal_email | High | Most reliable cross-source linker |
| Phone | phone, mobile_phone, work_phone | Medium | Normalize to E.164 format |
| Device | cookie_id, device_id, idfa, gaid | Low | High volume but lower trust |
| Account | user_id, account_id, crm_id | High | Deterministic, system-assigned |
Configuring a Family
For each family:
- Name — Descriptive name (e.g., “Email”)
- Columns — Select the schema attributes that belong to this family. All columns in a family are treated as the same type of identifier.
- Priority — Determines preference when resolving conflicts. Higher-priority families produce more trusted links.
- Normalization — SignalSmith normalizes identifiers before matching:
  - Emails: lowercase, trim whitespace
  - Phones: strip formatting, apply E.164 when possible
  - All: remove leading/trailing spaces
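To preview how your data behaves under rules like these, a rough approximation can be run locally. This sketch is not SignalSmith's normalizer: the function names, the default country code, and the length checks are assumptions, and real E.164 conversion needs country-aware parsing.

```python
import re


def normalize_email(raw):
    """Trim surrounding whitespace and lowercase the address."""
    return raw.strip().lower()


def normalize_phone(raw, default_country_code="1"):
    """Strip formatting and return an E.164-style string, or None.

    Assumption: a 10-digit number without a leading '+' is a national
    number belonging to `default_country_code`.
    """
    raw = raw.strip()
    has_plus = raw.startswith("+")
    digits = re.sub(r"\D", "", raw)
    if not has_plus and len(digits) == 10:
        digits = default_country_code + digits
    # E.164 allows at most 15 digits; the lower bound of 8 is a
    # heuristic chosen for this sketch to reject obvious fragments.
    if not 8 <= len(digits) <= 15:
        return None
    return "+" + digits


print(normalize_email("  John@Example.com "))  # john@example.com
print(normalize_phone("(555) 010-0199"))       # +15550100199
print(normalize_phone("+44 20 7946 0958"))     # +442079460958
print(normalize_phone("12345"))                # None
```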
Step 3: Configure Merge Rules
Merge rules determine which attribute value wins when multiple source records contribute conflicting values for the same attribute in a golden record.
- Navigate to Identity Resolution → Merge Rules
- Click Create Merge Rule
Merge Strategies
| Strategy | Behavior | Best For |
|---|---|---|
| Most Recent | Uses the value from the record with the latest update timestamp | Attributes that change over time (address, subscription tier) |
| Source Priority | Uses the value from a preferred data source | Attributes where one source is authoritative (CRM for name, Product for plan) |
| Most Frequent | Uses the most common value across all contributing records | Attributes with noisy data (country, language) |
| Longest | Uses the longest non-null value | Attributes where more detail is better (full name vs. first initial) |
| Custom SQL | User-defined SQL expression for resolution | Complex merge logic specific to your domain |
Example Merge Rules Configuration
| Attribute | Strategy | Details |
|---|---|---|
| first_name | Source Priority | CRM > Product > Support |
| last_name | Source Priority | CRM > Product > Support |
| email | Most Recent | Primary email from latest interaction |
| phone | Most Recent | Latest known phone number |
| address | Source Priority | CRM (manually verified) > Product (self-reported) |
| subscription_tier | Source Priority | Product database is authoritative |
| lifetime_value | Custom SQL | SUM(ltv) across all contributing records |
Setting Up a Merge Rule
For each attribute you want in your golden record:
- Select the attribute name
- Choose the merge strategy
- If using Source Priority, drag sources into preference order
- If using Custom SQL, write the aggregation expression
- Click Save
Attributes without explicit merge rules default to Most Recent.
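The built-in strategies can be pictured as small selection functions over the candidate values for one attribute. The sketch below is illustrative only; the candidate dictionary shape (`source`, `value`, `updated_at`) and the function names are assumptions, not SignalSmith's internal representation.

```python
from collections import Counter
from datetime import date


def _non_null(candidates):
    """Drop candidates whose value is missing."""
    return [c for c in candidates if c["value"] is not None]


def most_recent(candidates):
    """Value from the candidate with the latest update timestamp."""
    return max(_non_null(candidates), key=lambda c: c["updated_at"])["value"]


def source_priority(candidates, order):
    """Value from the first source in `order` that supplied one, else None."""
    rank = {source: i for i, source in enumerate(order)}
    ranked = [c for c in _non_null(candidates) if c["source"] in rank]
    if not ranked:
        return None
    return min(ranked, key=lambda c: rank[c["source"]])["value"]


def most_frequent(candidates):
    """Most common value across all candidates."""
    counts = Counter(c["value"] for c in _non_null(candidates))
    return counts.most_common(1)[0][0]


def longest(candidates):
    """Longest non-null value."""
    return max((c["value"] for c in _non_null(candidates)), key=len)


names = [
    {"source": "Support", "value": "J. Smith", "updated_at": date(2024, 5, 1)},
    {"source": "CRM", "value": "Jane Smith", "updated_at": date(2023, 1, 15)},
    {"source": "Product", "value": "Jane S.", "updated_at": date(2024, 2, 10)},
]
print(most_recent(names))                                    # J. Smith
print(source_priority(names, ["CRM", "Product", "Support"]))  # Jane Smith
print(longest(names))                                        # Jane Smith
```

The same three candidates yield different winners depending on the strategy, which is why choosing a strategy per attribute matters more than choosing one globally.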
Step 4: Set Limit Rules to Prevent Super-Clusters
Without limit rules, a single shared identifier (like a device ID from a public computer or a generic email) can merge thousands of unrelated records into one giant cluster. Limit rules prevent this.
- Navigate to Identity Resolution → Limit Rules
- Click Create Limit Rule
Types of Limit Rules
Maximum Cluster Size: Cap the number of records that can merge into a single golden record.
Maximum cluster size: 50 records
Action when exceeded: Freeze cluster (stop merging new records)
Per-Family Link Limit: Cap how many links a single identifier value can create.
Family: Device
Maximum links per identifier: 10
Action when exceeded: Ignore the identifier for future merges
Blocklist: Exclude specific identifier values from resolution entirely.
Family: Email
Blocked values:
- noreply@*
- info@*
- test@*
- *@example.com
Recommended Limit Rules
| Rule | Setting | Rationale |
|---|---|---|
| Max cluster size | 50 | No real customer should have more than 50 contributing records |
| Device link limit | 10 | Shared devices link too many unrelated users |
| Email blocklist | Common generic patterns | Prevents mass-merging from shared inboxes |
| Phone link limit | 5 | Phone number recycling can create false links |
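To get a feel for what a blocklist and a link limit would filter out of your own data, the two checks can be approximated locally. This sketch assumes the `*` in blocked values behaves like a shell-style wildcard, which the patterns above suggest but do not confirm; `is_blocked` and `usable_identifiers` are hypothetical helpers. It also simplifies the link limit by dropping an over-limit value entirely rather than ignoring it only for future merges.

```python
from collections import Counter
from fnmatch import fnmatchcase

BLOCKED_EMAIL_PATTERNS = ["noreply@*", "info@*", "test@*", "*@example.com"]


def is_blocked(value, patterns=BLOCKED_EMAIL_PATTERNS):
    """True if a normalized identifier value matches any blocked pattern."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def usable_identifiers(pairs, max_links):
    """Keep only (record_id, value) pairs whose value links few records.

    A value that appears on more than `max_links` records is dropped
    for every record (a simplification of the per-family link limit).
    """
    pairs = list(pairs)
    links = Counter(value for _, value in pairs)
    return [(rid, value) for rid, value in pairs if links[value] <= max_links]


print(is_blocked("info@company.com"))  # True
print(is_blocked("jane@company.com"))  # False

device_pairs = [("r1", "d1"), ("r2", "d1"), ("r3", "d1"), ("r4", "d2")]
print(usable_identifiers(device_pairs, max_links=2))
# [('r4', 'd2')]
```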
Step 5: Run Initial Full Resolution
With identifier families, merge rules, and limit rules configured, run the first full resolution pass.
- Navigate to Identity Resolution → Resolution Runs
- Click Run Full Resolution
- SignalSmith will:
  - Collect all identifier values from your mapped schema attributes
  - Build the identity graph
  - Find connected components
  - Apply merge rules to produce golden records
  - Enforce limit rules to break super-clusters
- Monitor the run progress — the first full run may take 10–60 minutes depending on data volume
Understanding the Resolution Summary
After the run completes, review the summary:
| Metric | Description | Healthy Range |
|---|---|---|
| Input Records | Total source records processed | Matches your data size |
| Golden Records | Resulting unified profiles | 60–90% of input records (some merging expected) |
| Merge Rate | 1 - (Golden Records / Input Records) | 10–40% for typical datasets |
| Clusters > 10 | Number of golden records with more than 10 contributing records | Low (< 1% of golden records) |
| Blocked Merges | Merges prevented by limit rules | Non-zero if limit rules are working |
A merge rate above 50% often indicates overly aggressive linking — review your identifier families and limit rules.
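As a worked example of the merge rate formula, the sketch below computes the rate from two summary counts and compares it with the ranges in the table. The function names and the sample counts are made up for illustration.

```python
def merge_rate(input_records, golden_records):
    """Merge rate = 1 - (golden records / input records)."""
    if input_records <= 0:
        raise ValueError("input_records must be positive")
    return 1 - golden_records / input_records


def assess(rate):
    """Classify a merge rate against the ranges described above."""
    if rate > 0.5:
        return "review linking (possible over-merging)"
    if 0.1 <= rate <= 0.4:
        return "within typical range"
    return "outside typical range"


rate = merge_rate(input_records=1_000_000, golden_records=750_000)
print(rate)          # 0.25
print(assess(rate))  # within typical range
```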
Step 6: Verify Golden Records in Profile Explorer
Inspect the resolved profiles to confirm the resolution quality.
- Navigate to Identity Resolution → Profile Explorer
- Search for a known customer by email, phone, or ID
- Review the golden record:
  - Merged attributes: Check that the right values were selected by merge rules
  - Contributing records: See which source records were merged together
  - Identity graph: Visualize the identifier connections that caused the merge
- Spot-check several profiles:
  - A customer you know has multiple accounts
  - A customer who should NOT be merged with anyone
  - A high-value customer where accuracy is critical
What to Look For
Good resolution:
- Known duplicate accounts are correctly merged
- Attribute values come from the expected authoritative source
- The identity graph shows logical connections (same email across systems)
Problematic resolution:
- Unrelated people merged into one profile (over-merging) → Tighten limit rules or remove problematic identifier families
- Known duplicates not merged (under-merging) → Check that the linking identifiers are mapped and in the right family
- Wrong attribute values in the golden record → Adjust merge rule priorities
Step 7: Schedule Incremental Resolution
Once you’re satisfied with the initial resolution, set up incremental runs to keep golden records current as new data arrives.
- Navigate to Identity Resolution → Settings
- Under Schedule, select the resolution frequency:
| Frequency | When to Use |
|---|---|
| Hourly | High-velocity data environments with real-time activation needs |
| Every 6 hours | Balanced freshness and compute cost for most use cases |
| Daily | Stable data environments where freshness is less critical |
| Weekly | Low-volume or batch-oriented data pipelines |
- Configure the run window — choose off-peak hours to minimize warehouse compute impact
- Enable incremental mode — processes only new and changed records since the last run, significantly faster than full resolution
- Click Save
Full vs. Incremental Resolution
| Mode | Behavior | Duration | When to Use |
|---|---|---|---|
| Full | Reprocesses all records from scratch | 10–60 min | Initial setup, after rule changes, periodic refresh |
| Incremental | Processes only new/changed records | 1–10 min | Scheduled ongoing runs |
Schedule a full resolution run weekly or monthly to catch edge cases that incremental processing might miss (e.g., retroactive data corrections).
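One common way to select "new and changed" records is a timestamp watermark. Whether SignalSmith works this way is not stated here, so treat the following as a sketch under that assumption; `records_to_process` and the `updated_at` field are hypothetical.

```python
from datetime import datetime


def records_to_process(records, last_run_at):
    """Return records modified strictly after the previous run started."""
    return [r for r in records if r["updated_at"] > last_run_at]


last_run_at = datetime(2024, 6, 1, 2, 0)
records = [
    {"id": "crm:1", "updated_at": datetime(2024, 5, 30, 9, 0)},
    {"id": "crm:2", "updated_at": datetime(2024, 6, 1, 8, 30)},
    {"id": "web:9", "updated_at": datetime(2024, 6, 2, 14, 5)},
]
print([r["id"] for r in records_to_process(records, last_run_at)])
# ['crm:2', 'web:9']
```

Under this model, a correction that rewrites an old row without updating its timestamp would never be picked up, which illustrates why a periodic full run is still useful.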
Common Pitfalls and Troubleshooting
Over-Merging (Clusters Too Large)
Symptom: Golden records contain records from clearly different people.
Diagnosis:
- Open the problematic golden record in Profile Explorer
- Examine the identity graph to find the linking identifier
- Check if the identifier is a shared device, generic email, or recycled phone number
Solutions:
- Add the problematic identifier value to a blocklist
- Lower the link limit for the relevant identifier family
- Reduce the maximum cluster size
- Remove low-trust identifier families (e.g., cookie IDs) from resolution
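If you export the contributing records of a suspicious cluster, the identifier responsible for the over-merge is usually the one shared by the most records. The sketch below ranks identifiers that way; the `hub_identifiers` helper and the export shape (record ID mapped to a set of `(family, value)` pairs) are assumptions for illustration.

```python
from collections import defaultdict


def hub_identifiers(cluster, top_n=3):
    """Rank identifiers in one cluster by how many records share them."""
    records_by_identifier = defaultdict(set)
    for rid, identifiers in cluster.items():
        for ident in identifiers:
            records_by_identifier[ident].add(rid)
    # Only identifiers present on two or more records can create links.
    shared = [
        (ident, len(rids))
        for ident, rids in records_by_identifier.items()
        if len(rids) > 1
    ]
    # Most-shared first; ties broken by identifier for stable output.
    return sorted(shared, key=lambda item: (-item[1], item[0]))[:top_n]


cluster = {
    "app:1": {("device", "d-kiosk"), ("email", "ana@home.example")},
    "app:2": {("device", "d-kiosk"), ("email", "bo@example.org")},
    "app:3": {("device", "d-kiosk")},
    "crm:8": {("email", "ana@home.example")},
}
print(hub_identifiers(cluster))
# [(('device', 'd-kiosk'), 3), (('email', 'ana@home.example'), 2)]
```

In this example the shared device ID is the likely culprit, making it a candidate for a blocklist entry or a lower Device link limit.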
Under-Merging (Too Many Duplicates)
Symptom: Known duplicate customers remain as separate golden records.
Diagnosis:
- Search for the duplicates in Profile Explorer
- Check what identifiers each record has
- Verify that the linking identifier is mapped in the schema and assigned to an identifier family
Solutions:
- Ensure all relevant identifier columns are mapped in your schema
- Check that identifier normalization is working (e.g., John@example.com and john@example.com should match after lowercasing)
- Add additional identifier families if new linking opportunities exist
Resolution Runs Timing Out
Symptom: Full resolution runs fail or take excessively long.
Solutions:
- Check warehouse compute resources — resolution is CPU/memory-intensive
- For Snowflake, use a larger warehouse size (at least MEDIUM for datasets > 10M records)
- For BigQuery, ensure you have sufficient slot capacity
- Break the run into phases: resolve high-priority identifier families first, then add lower-priority ones
Merge Rule Producing Wrong Values
Symptom: Golden record attributes have unexpected values.
Diagnosis:
- Open the golden record in Profile Explorer
- Check the “Contributing Records” panel to see all source values
- Verify which merge rule is applied to the attribute
Solutions:
- Adjust the source priority order in your merge rule
- Switch to a different merge strategy (e.g., Most Recent instead of Source Priority)
- Add a Custom SQL merge rule for complex resolution logic
Next Steps
- Identity Resolution concepts — Detailed documentation on the resolution engine
- Profile Explorer — Navigate and inspect golden records
- Building a Warehouse-First CDP — End-to-end CDP setup guide
- Audiences — Build segments on top of resolved profiles