
Setting Up Identity Resolution

This guide walks you through configuring SignalSmith’s identity resolution engine to unify customer records across multiple data sources into golden records — single, authoritative customer profiles. By the end, you’ll have merge rules, limit rules, and a resolution pipeline producing unified profiles you can explore and activate.

When You Need Identity Resolution

Identity resolution becomes essential when:

  • Customers interact across channels — A customer signs up with their work email, makes a purchase with their personal email, and browses on their phone. Without resolution, this looks like three separate people.
  • Data lives in multiple systems — Your CRM has one view, your product database has another, and your marketing platform has a third. Each system uses different identifiers.
  • You’re building a 360-degree customer view — Traits, audiences, and journeys are only as good as the underlying profiles. Fragmented identities lead to duplicate audiences and inconsistent experiences.
  • You’re activating across destinations — Sending the same customer to a CRM as three separate leads wastes sales time and damages trust.

How SignalSmith Resolves Identities

SignalSmith uses a connected-components graph algorithm to merge records:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│              │     │              │     │              │
│  Identifier  │────▶│   Identity   │────▶│   Golden     │
│  Collection  │     │   Graph      │     │   Records    │
│              │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

  1. Collect identifiers — SignalSmith gathers all identifiers (emails, phones, device IDs, account IDs) from your mapped schema attributes.
  2. Build the graph — Each identifier becomes a node. Two records sharing an identifier value are connected by an edge. Identifier families define which identifiers can form connections.
  3. Find connected components — The algorithm finds all groups of records that are transitively connected through shared identifiers. Each connected component becomes a unified profile.
  4. Apply merge rules — When multiple records contribute conflicting values for the same attribute, merge rules determine which value wins.
  5. Enforce limit rules — Limit rules prevent unreasonably large clusters (super-clusters) that can form from shared device IDs or generic email addresses.
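Steps 2 and 3 above can be sketched with a union-find structure. This is a minimal illustration of connected-components grouping in general, not SignalSmith's actual implementation; the `resolve` function, the record IDs, and the `(family, value)` tuples are all hypothetical.

```python
from collections import defaultdict

def resolve(records):
    """Group record IDs into clusters that transitively share an identifier value.

    `records` maps a record ID to a set of (family, value) identifier tuples.
    """
    parent = {}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # identifier -> first record ID that carried it
    for rec_id, identifiers in records.items():
        parent.setdefault(rec_id, rec_id)
        for key in identifiers:
            if key in first_seen:
                union(first_seen[key], rec_id)  # shared identifier: add an edge
            else:
                first_seen[key] = rec_id

    clusters = defaultdict(set)
    for rec_id in records:
        clusters[find(rec_id)].add(rec_id)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical sample data: three records linked by email and phone, one isolated.
records = {
    "crm:1": {("email", "ana@work.example")},
    "shop:7": {("email", "ana@work.example"), ("phone", "+15550100")},
    "support:3": {("phone", "+15550100")},
    "app:9": {("device", "d-42")},
}
clusters = resolve(records)
```

Note that `crm:1` and `support:3` share no identifier directly; they end up in the same cluster only because `shop:7` bridges them. This transitivity is also why one bad identifier can create a super-cluster.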

Prerequisites

  • A configured warehouse with customer data
  • A schema with at least one entity type that has identifier attributes mapped
  • Multiple data sources or tables contributing to the same entity type (otherwise there’s nothing to resolve)

Step 1: Audit Your Identifier Landscape

Before configuring identity resolution, understand what identifiers exist across your data.

Survey Your Data Sources

For each table or data source that contributes to your customer entity type, catalog the available identifiers:

| Data Source | Identifiers Available | Coverage |
| --- | --- | --- |
| CRM (Salesforce) | Email, Phone, Account ID | 95% email, 60% phone |
| Product Database | User ID, Email | 100% user ID, 90% email |
| Website Analytics | Cookie ID, Email (when logged in) | 100% cookie, 30% email |
| Mobile App | Device ID, Email, Push Token | 100% device ID, 70% email |
| Support System | Email, Phone, Ticket ID | 100% email, 40% phone |
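If you have a sample of rows from a source, the coverage column can be computed rather than estimated. The sketch below assumes rows arrive as dictionaries of string values; the `coverage` helper and the sample rows are hypothetical, and in practice you would more likely run an equivalent query in your warehouse.

```python
def coverage(rows, columns):
    """Return the share of rows with a non-empty value for each column."""
    total = len(rows)
    result = {}
    for col in columns:
        # Treat None, empty strings, and whitespace-only strings as missing.
        filled = sum(1 for row in rows if (row.get(col) or "").strip())
        result[col] = filled / total if total else 0.0
    return result

# Hypothetical CRM sample: 3 of 4 rows have an email, 2 of 4 have a phone.
crm_rows = [
    {"email": "a@x.example", "phone": "+15550100"},
    {"email": "b@x.example", "phone": ""},
    {"email": None, "phone": None},
    {"email": "d@x.example", "phone": "+15550101"},
]
crm_coverage = coverage(crm_rows, ["email", "phone"])
```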

Identify Linking Opportunities

Look for identifiers that bridge across sources:

  • Email bridges CRM, Product, Website (logged-in), Mobile, and Support
  • Phone bridges CRM and Support
  • User ID is internal to Product but can link to CRM via an integration table

Flag Problematic Identifiers

Some identifiers create false matches:

  • Shared device IDs — Family members or shared computers
  • Generic email addresses — info@company.com, noreply@example.com
  • Recycled phone numbers — Carriers reassign phone numbers
  • Test accounts — Internal testing with dummy data

You’ll address these with limit rules in Step 4.
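One way to find candidates before writing any limit rules is to count how many distinct records carry each identifier value. The following is a rough sketch; the `shared_identifiers` helper, the tuple layout, and the threshold are assumptions chosen for illustration.

```python
from collections import defaultdict

def shared_identifiers(pairs, threshold):
    """Return identifier values carried by more than `threshold` distinct records.

    `pairs` is an iterable of (record_id, family, value) tuples.
    """
    carriers = defaultdict(set)
    for record_id, family, value in pairs:
        carriers[(family, value)].add(record_id)
    return {key: len(ids) for key, ids in carriers.items() if len(ids) > threshold}

# Hypothetical sample: a generic inbox appears on three unrelated records.
pairs = [
    ("crm:1", "email", "info@company.com"),
    ("crm:2", "email", "info@company.com"),
    ("support:5", "email", "info@company.com"),
    ("crm:1", "phone", "+15550100"),
    ("support:9", "phone", "+15550100"),
    ("app:4", "device", "d-42"),
]
flagged = shared_identifiers(pairs, threshold=2)
```

Values that show up here are candidates for a blocklist or a per-family link limit, but review them by hand: a phone number on two records is usually a genuine match.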

Step 2: Define Identifier Families

Identifier families group identifiers by type and define how they participate in resolution.

  1. Navigate to Identity Resolution in the left sidebar
  2. Click Identifier Families
  3. Click Create Family for each identifier type:

| Family Name | Identifier Columns | Priority | Notes |
| --- | --- | --- | --- |
| Email | email, work_email, personal_email | High | Most reliable cross-source linker |
| Phone | phone, mobile_phone, work_phone | Medium | Normalize to E.164 format |
| Device | cookie_id, device_id, idfa, gaid | Low | High volume but lower trust |
| Account | user_id, account_id, crm_id | High | Deterministic, system-assigned |

Configuring a Family

For each family:

  1. Name — Descriptive name (e.g., “Email”)
  2. Columns — Select the schema attributes that belong to this family. All columns in a family are treated as the same type of identifier.
  3. Priority — Determines preference when resolving conflicts. Higher-priority families produce more trusted links.
  4. Normalization — SignalSmith normalizes identifiers before matching:
    • Emails: lowercase, trim whitespace
    • Phones: strip formatting, apply E.164 when possible
    • All: remove leading/trailing spaces
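To reason about why two values do or do not match, it can help to reproduce the normalization locally. The sketch below approximates the rules listed above and is not SignalSmith's code: the function names are hypothetical, and the phone logic assumes a default country code of 1 for 10-digit numbers, which is a simplification of real E.164 handling.

```python
import re

def normalize_email(value):
    """Trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()

def normalize_phone(value, default_country_code="1"):
    """Best-effort E.164 formatting.

    Strips formatting characters. If the input already starts with '+', the
    country code is trusted. A bare 10-digit number is ASSUMED to belong to
    `default_country_code`. Anything else is returned as digits only, since
    the country cannot be inferred.
    """
    value = value.strip()
    digits = re.sub(r"\D", "", value)
    if value.startswith("+"):
        return "+" + digits
    if len(digits) == 10:
        return "+" + default_country_code + digits
    return digits

email = normalize_email("  John@Example.com ")
us_phone = normalize_phone("(555) 010-0199")
uk_phone = normalize_phone("+44 20 7946 0958")
```

For production-grade phone parsing, a dedicated library is a safer choice than a regular expression, because valid number lengths vary by country.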

Step 3: Configure Merge Rules

Merge rules determine which attribute value wins when multiple source records contribute conflicting values for the same attribute in a golden record.

  1. Navigate to Identity Resolution → Merge Rules
  2. Click Create Merge Rule

Merge Strategies

| Strategy | Behavior | Best For |
| --- | --- | --- |
| Most Recent | Uses the value from the record with the latest update timestamp | Attributes that change over time (address, subscription tier) |
| Source Priority | Uses the value from a preferred data source | Attributes where one source is authoritative (CRM for name, Product for plan) |
| Most Frequent | Uses the most common value across all contributing records | Attributes with noisy data (country, language) |
| Longest | Uses the longest non-null value | Attributes where more detail is better (full name vs. first initial) |
| Custom SQL | User-defined SQL expression for resolution | Complex merge logic specific to your domain |

Example Merge Rules Configuration

| Attribute | Strategy | Details |
| --- | --- | --- |
| first_name | Source Priority | CRM > Product > Support |
| last_name | Source Priority | CRM > Product > Support |
| email | Most Recent | Primary email from latest interaction |
| phone | Most Recent | Latest known phone number |
| address | Source Priority | CRM (manually verified) > Product (self-reported) |
| subscription_tier | Source Priority | Product database is authoritative |
| lifetime_value | Custom SQL | SUM(ltv) across all contributing records |

Setting Up a Merge Rule

For each attribute you want in your golden record:

  1. Select the attribute name
  2. Choose the merge strategy
  3. If using Source Priority, drag sources into preference order
  4. If using Custom SQL, write the aggregation expression
  5. Click Save

Attributes without explicit merge rules default to Most Recent.
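To see how the built-in strategies can disagree on the same input, here is a small sketch of four of them. It illustrates the behavior described in the strategy table under stated assumptions: the function names and the shape of each contributing value (a dict with `source`, `value`, and `updated_at` keys) are hypothetical, and tie-breaking in SignalSmith itself may differ.

```python
from collections import Counter

def most_recent(values):
    """Value from the contribution with the latest update timestamp."""
    candidates = [v for v in values if v["value"] is not None]
    return max(candidates, key=lambda v: v["updated_at"])["value"] if candidates else None

def source_priority(values, order):
    """Value from the first source in `order` that supplies a non-null value."""
    rank = {source: i for i, source in enumerate(order)}
    candidates = [v for v in values if v["value"] is not None and v["source"] in rank]
    return min(candidates, key=lambda v: rank[v["source"]])["value"] if candidates else None

def most_frequent(values):
    """Most common non-null value across all contributions."""
    counts = Counter(v["value"] for v in values if v["value"] is not None)
    return counts.most_common(1)[0][0] if counts else None

def longest(values):
    """Longest non-null value."""
    candidates = [v["value"] for v in values if v["value"] is not None]
    return max(candidates, key=len) if candidates else None

# Hypothetical contributions for one attribute; ISO dates sort correctly as strings.
names = [
    {"source": "support", "value": "J. Rivera", "updated_at": "2024-03-01"},
    {"source": "crm", "value": "Jordan Rivera", "updated_at": "2023-11-15"},
    {"source": "product", "value": "J. Rivera", "updated_at": "2024-01-20"},
]
by_recency = most_recent(names)
by_source = source_priority(names, ["crm", "product", "support"])
by_frequency = most_frequent(names)
by_length = longest(names)
```

Here Most Recent and Most Frequent both pick the abbreviated name, while Source Priority and Longest pick the full name from the CRM, which is why name attributes in the example configuration use Source Priority.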

Step 4: Set Limit Rules to Prevent Super-Clusters

Without limit rules, a single shared identifier (like a device ID from a public computer or a generic email) can merge thousands of unrelated records into one giant cluster. Limit rules prevent this.

  1. Navigate to Identity Resolution → Limit Rules
  2. Click Create Limit Rule

Types of Limit Rules

Maximum Cluster Size — Cap the number of records that can merge into a single golden record.

Maximum cluster size: 50 records
Action when exceeded: Freeze cluster (stop merging new records)

Per-Family Link Limit — Cap how many links a single identifier value can create.

Family: Device
Maximum links per identifier: 10
Action when exceeded: Ignore the identifier for future merges

Blocklist — Exclude specific identifier values from resolution entirely.

Family: Email
Blocked values:
  - noreply@*
  - info@*
  - test@*
  - *@example.com

| Rule | Setting | Rationale |
| --- | --- | --- |
| Max cluster size | 50 | No real customer should have more than 50 contributing records |
| Device link limit | 10 | Shared devices link too many unrelated users |
| Email blocklist | Common generic patterns | Prevents mass-merging from shared inboxes |
| Phone link limit | 5 | Phone number recycling can create false links |
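The blocklist and per-family link limit above can be approximated in a few lines, which is useful for testing patterns against sample data before you save a rule. This sketch assumes the `*` in blocklist entries behaves like a shell-style wildcard and that emails are already lowercased; both are assumptions, and `is_blocked` and `over_link_limit` are hypothetical helpers.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Patterns copied from the blocklist example; wildcard semantics are assumed.
BLOCKED_EMAIL_PATTERNS = ["noreply@*", "info@*", "test@*", "*@example.com"]

def is_blocked(email, patterns=BLOCKED_EMAIL_PATTERNS):
    """True if the (already normalized) email matches any blocklist pattern."""
    return any(fnmatchcase(email, pattern) for pattern in patterns)

def over_link_limit(pairs, max_links):
    """Return identifier values attached to more than `max_links` distinct records.

    `pairs` is an iterable of (record_id, value) tuples for a single family.
    """
    counts = Counter(value for _, value in set(pairs))  # dedupe repeated pairs
    return {value for value, n in counts.items() if n > max_links}

# Hypothetical device pairs: a kiosk device links three records.
device_pairs = [
    ("web:1", "kiosk-1"),
    ("web:2", "kiosk-1"),
    ("web:3", "kiosk-1"),
    ("app:4", "d-42"),
    ("app:4", "d-42"),
]
ignored_devices = over_link_limit(device_pairs, max_links=2)
```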

Step 5: Run Initial Full Resolution

With identifier families, merge rules, and limit rules configured, run the first full resolution pass.

  1. Navigate to Identity Resolution → Resolution Runs
  2. Click Run Full Resolution
  3. SignalSmith will:
    • Collect all identifier values from your mapped schema attributes
    • Build the identity graph
    • Find connected components
    • Apply merge rules to produce golden records
    • Enforce limit rules to break super-clusters
  4. Monitor the run progress — the first full run may take 10–60 minutes depending on data volume

Understanding the Resolution Summary

After the run completes, review the summary:

| Metric | Description | Healthy Range |
| --- | --- | --- |
| Input Records | Total source records processed | Matches your data size |
| Golden Records | Resulting unified profiles | 60–90% of input records (some merging expected) |
| Merge Rate | 1 - (Golden Records / Input Records) | 10–40% for typical datasets |
| Clusters > 10 | Number of golden records with more than 10 contributing records | Low (< 1% of golden records) |
| Blocked Merges | Merges prevented by limit rules | Non-zero if limit rules are working |

A merge rate above 50% often indicates overly aggressive linking — review your identifier families and limit rules.
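As a worked example of the merge rate formula from the summary table, the sketch below computes the rate and applies the 50% review threshold. The `merge_rate` function and the record counts are hypothetical.

```python
def merge_rate(input_records, golden_records):
    """Merge rate as defined in the summary: 1 - (golden / input)."""
    if input_records <= 0:
        raise ValueError("input_records must be positive")
    return 1 - golden_records / input_records

# Hypothetical run: 1,000,000 source records resolve to 750,000 golden records.
rate = merge_rate(1_000_000, 750_000)
needs_review = rate > 0.5  # threshold taken from the guidance above
```

A rate of 0.25 falls inside the 10–40% range the table calls typical, so this run would not need review on merge rate alone.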

Step 6: Verify Golden Records in Profile Explorer

Inspect the resolved profiles to confirm the resolution quality.

  1. Navigate to Identity Resolution → Profile Explorer
  2. Search for a known customer by email, phone, or ID
  3. Review the golden record:
    • Merged attributes — Check that the right values were selected by merge rules
    • Contributing records — See which source records were merged together
    • Identity graph — Visualize the identifier connections that caused the merge
  4. Spot-check several profiles:
    • A customer you know has multiple accounts
    • A customer who should NOT be merged with anyone
    • A high-value customer where accuracy is critical

What to Look For

Good resolution:

  • Known duplicate accounts are correctly merged
  • Attribute values come from the expected authoritative source
  • The identity graph shows logical connections (same email across systems)

Problematic resolution:

  • Unrelated people merged into one profile (over-merging) → Tighten limit rules or remove problematic identifier families
  • Known duplicates not merged (under-merging) → Check that the linking identifiers are mapped and in the right family
  • Wrong attribute values in the golden record → Adjust merge rule priorities

Step 7: Schedule Incremental Resolution

Once you’re satisfied with the initial resolution, set up incremental runs to keep golden records current as new data arrives.

  1. Navigate to Identity Resolution → Settings
  2. Under Schedule, select the resolution frequency:

| Frequency | When to Use |
| --- | --- |
| Hourly | High-velocity data environments with real-time activation needs |
| Every 6 hours | Balanced freshness and compute cost for most use cases |
| Daily | Stable data environments where freshness is less critical |
| Weekly | Low-volume or batch-oriented data pipelines |

  3. Configure the run window — choose off-peak hours to minimize warehouse compute impact
  4. Enable incremental mode — processes only records that are new or changed since the last run, which is significantly faster than full resolution
  5. Click Save

Full vs. Incremental Resolution

| Mode | Behavior | Duration | When to Use |
| --- | --- | --- | --- |
| Full | Reprocesses all records from scratch | 10–60 min | Initial setup, after rule changes, periodic refresh |
| Incremental | Processes only new/changed records | 1–10 min | Scheduled ongoing runs |

Schedule a full resolution run weekly or monthly to catch edge cases that incremental processing might miss (e.g., retroactive data corrections).
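Conceptually, incremental mode selects records against a watermark from the previous run. The sketch below shows that idea only; how SignalSmith actually detects changes is not documented here, and the `changed_since` helper, the `updated_at` field, and the sample timestamps are assumptions.

```python
from datetime import datetime, timezone

def changed_since(records, last_run_at):
    """Return records created or updated after the previous run's watermark."""
    return [r for r in records if r["updated_at"] > last_run_at]

# Hypothetical watermark and records, all in UTC.
last_run_at = datetime(2024, 5, 1, 6, 0, tzinfo=timezone.utc)
records = [
    {"id": "crm:1", "updated_at": datetime(2024, 4, 30, 22, 15, tzinfo=timezone.utc)},
    {"id": "shop:7", "updated_at": datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc)},
    {"id": "app:9", "updated_at": datetime(2024, 5, 1, 11, 5, tzinfo=timezone.utc)},
]
to_process = [r["id"] for r in changed_since(records, last_run_at)]
```

Under this model, a correction that rewrites an old row without bumping its timestamp would never be selected, which is one way a periodic full run can catch what incremental runs miss.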

Common Pitfalls and Troubleshooting

Over-Merging (Clusters Too Large)

Symptom: Golden records contain records from clearly different people.

Diagnosis:

  1. Open the problematic golden record in Profile Explorer
  2. Examine the identity graph to find the linking identifier
  3. Check if the identifier is a shared device, generic email, or recycled phone number
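If you export a cluster's contributing records, you can also find the linking identifier programmatically: remove one identifier value at a time and check whether the cluster falls apart. This is a brute-force sketch suited to a single cluster, not a description of what Profile Explorer does; the helper names and sample data are hypothetical.

```python
def components(records, ignore=None):
    """Count connected components among records, skipping one identifier value.

    `records` maps a record ID to the set of identifier values it carries.
    """
    remaining = set(records)
    count = 0
    while remaining:
        count += 1
        stack = [remaining.pop()]
        while stack:
            current_ids = records[stack.pop()] - {ignore}
            # Records still unvisited that share at least one identifier value.
            linked = {r for r in remaining if current_ids & (records[r] - {ignore})}
            remaining -= linked
            stack.extend(linked)
    return count

def bridging_identifiers(records):
    """Identifier values whose removal splits the cluster into several parts."""
    all_values = set().union(*records.values())
    return sorted(v for v in all_values if components(records, ignore=v) > 1)

# Hypothetical over-merged cluster: two people joined only by a kiosk device.
cluster = {
    "crm:1": {"ana@x.example", "+15550100"},
    "web:2": {"ana@x.example", "+15550100", "kiosk-device"},
    "web:3": {"ben@y.example", "+15550101", "kiosk-device"},
    "crm:4": {"ben@y.example", "+15550101"},
}
bridges = bridging_identifiers(cluster)
```

In this sample, each person's email and phone reinforce each other, so only the shared device is reported as a bridge.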

Solutions:

  • Add the problematic identifier value to a blocklist
  • Lower the link limit for the relevant identifier family
  • Reduce the maximum cluster size
  • Remove low-trust identifier families (e.g., cookie IDs) from resolution

Under-Merging (Too Many Duplicates)

Symptom: Known duplicate customers remain as separate golden records.

Diagnosis:

  1. Search for the duplicates in Profile Explorer
  2. Check what identifiers each record has
  3. Verify that the linking identifier is mapped in the schema and assigned to an identifier family

Solutions:

  • Ensure all relevant identifier columns are mapped in your schema
  • Check that identifier normalization is working (e.g., John@example.com and john@example.com should match after lowercasing)
  • Add additional identifier families if new linking opportunities exist

Resolution Runs Timing Out

Symptom: Full resolution runs fail or take excessively long.

Solutions:

  • Check warehouse compute resources — resolution is CPU/memory-intensive
  • For Snowflake, use a larger warehouse size (at least MEDIUM for datasets > 10M records)
  • For BigQuery, ensure you have sufficient slot capacity
  • Break the run into phases: resolve high-priority identifier families first, then add lower-priority ones

Merge Rule Producing Wrong Values

Symptom: Golden record attributes have unexpected values.

Diagnosis:

  1. Open the golden record in Profile Explorer
  2. Check the “Contributing Records” panel to see all source values
  3. Verify which merge rule is applied to the attribute

Solutions:

  • Adjust the source priority order in your merge rule
  • Switch to a different merge strategy (e.g., Most Recent instead of Source Priority)
  • Add a Custom SQL merge rule for complex resolution logic

Next Steps