
Running Resolution

After configuring your identity graph, you can run identity resolution to link customer records and produce golden records. SignalSmith supports two resolution modes: full resolution (recompute from scratch) and incremental resolution (process only changes since the last run).

Full Resolution

Full resolution reprocesses all records from all entity types, rebuilds the entire identity graph, and recomputes all clusters and golden records from scratch.

When to Use Full Resolution

  • First run — Always run full resolution the first time after creating an identity graph
  • After configuration changes — When you modify identifier families, merge rules, or limit rules
  • After major data changes — When a large data migration or backfill significantly changes the source data
  • Periodic refresh — As a scheduled full recomputation to correct any drift from incremental resolution

How It Works

  1. Extract identifiers — Query all source tables to extract entity keys and identifier values
  2. Normalize — Apply normalization rules to all identifier values (lowercase emails, strip phone formatting, etc.)
  3. Build edges — For each pair of records sharing a normalized identifier value, create a candidate edge
  4. Apply merge rules — Filter edges by merge rule conditions (single identifier match, multi-identifier match)
  5. Enforce limit rules — Drop edges that would cause clusters to exceed configured limits
  6. Find connected components — Run the graph algorithm to identify clusters of linked records
  7. Produce golden records — Apply survivorship rules to each cluster to create unified profiles
  8. Materialize — Write cluster assignments and golden records to the warehouse
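Steps 3 and 6 above can be sketched as a hash join on normalized identifier values followed by a union-find over the resulting edges. This is a minimal illustration of the technique, not SignalSmith's internal implementation; the record IDs, identifier families, and values are made up:

```python
from collections import defaultdict

# Illustrative input: record_id -> {identifier_family: normalized_value}
records = {
    "crm:1": {"email": "ada@example.com"},
    "orders:7": {"email": "ada@example.com", "phone": "15551234567"},
    "web:42": {"phone": "15551234567"},
    "crm:2": {"email": "bob@example.com"},
}

# Step 3: every pair of records sharing a normalized value is an edge candidate
by_value = defaultdict(list)
for rid, idents in records.items():
    for family, value in idents.items():
        by_value[(family, value)].append(rid)
edges = [(ids[i], ids[j]) for ids in by_value.values()
         for i in range(len(ids)) for j in range(i + 1, len(ids))]

# Step 6: union-find yields the connected components (clusters)
parent = {rid: rid for rid in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x
for a, b in edges:
    parent[find(a)] = find(b)

clusters = defaultdict(set)
for rid in records:
    clusters[find(rid)].add(rid)
print(sorted(map(sorted, clusters.values())))
# -> [['crm:1', 'orders:7', 'web:42'], ['crm:2']]
```

Note that `crm:1` and `web:42` end up in the same cluster despite sharing no identifier directly: they are linked transitively through `orders:7`, which is exactly what the connected-components step provides.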

Triggering Full Resolution

From the UI:

  1. Navigate to Identity Resolution and select your identity graph
  2. Click Run Resolution
  3. Select Full Resolution
  4. Click Start

From the API:

POST /api/v1/identity-graphs/{id}/resolve
Content-Type: application/json
 
{
  "mode": "full"
}
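From a script, the same call can be assembled with Python's standard library. The base URL, graph ID, and the `build_resolve_request` helper below are illustrative placeholders; authentication is deployment-specific and omitted here:

```python
import json
from urllib.request import Request

# Hypothetical base URL -- substitute your deployment's host.
BASE_URL = "https://app.signalsmith.example/api/v1"

def build_resolve_request(graph_id: str, mode: str) -> Request:
    """Build (but do not send) a resolve call; mode is 'full' or 'incremental'."""
    if mode not in ("full", "incremental"):
        raise ValueError(f"unsupported mode: {mode}")
    body = json.dumps({"mode": mode}).encode()
    return Request(
        f"{BASE_URL}/identity-graphs/{graph_id}/resolve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_resolve_request("ig_123", "full")
# urllib.request.urlopen(req) would send it; add your Authorization header first.
```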

Performance

Full resolution performance depends on:

Factor | Impact
Total record count | More records = more edges to evaluate
Number of identifier families | More families = more edge candidates
Merge rule complexity | Multi-identifier rules require more computation
Warehouse compute | Larger warehouse instances process faster

Typical performance benchmarks:

Record Count | Approximate Duration
100,000 | 2-5 minutes
1,000,000 | 10-30 minutes
10,000,000 | 1-3 hours
100,000,000 | 4-12 hours

These are approximate benchmarks. Actual performance varies based on warehouse type, compute size, data distribution, and merge rule complexity.

Incremental Resolution

Incremental resolution processes only records that have been added, modified, or deleted since the last resolution run. This is significantly faster than full resolution for ongoing operations.

When to Use Incremental Resolution

  • Regular scheduled runs — After the initial full resolution, use incremental for daily/hourly updates
  • Real-time data ingestion — When new records arrive continuously from loaders or event pipelines
  • Cost optimization — Incremental runs consume far less warehouse compute than full runs

How It Works

  1. Detect changes — Identify records added, modified, or deleted since the last run (using change tracking or timestamp comparison)
  2. Extract new identifiers — Query only the changed records for their identifier values
  3. Evaluate edges — Create candidate edges between new/modified records and existing records
  4. Apply merge and limit rules — Same rules as full resolution, applied only to new edges
  5. Update clusters — Merge new records into existing clusters or create new clusters
  6. Update golden records — Recompute golden records for affected clusters only
  7. Materialize — Write updated cluster assignments and golden records to the warehouse
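Step 5 above can be sketched as a union-find seeded with the existing cluster IDs, so that clusters untouched by new edges are never revisited. The record IDs, cluster IDs, and edges below are illustrative:

```python
# Existing assignments from the last run: record_id -> cluster_id (illustrative)
assignments = {"crm:1": "c1", "orders:7": "c1", "crm:2": "c2"}

# New edges between changed records and existing records (illustrative)
new_edges = [("web:42", "orders:7"), ("web:43", "crm:2")]

# Union-find over cluster IDs and brand-new record IDs
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def node(rid):
    # Known records stand in for their existing cluster; new records for themselves
    return assignments.get(rid, rid)

for a, b in new_edges:
    parent[find(node(a))] = find(node(b))

# Reassign only records touched by new edges plus their existing cluster members
affected = set(assignments) | {r for edge in new_edges for r in edge}
updated = {rid: find(node(rid)) for rid in affected}
```

Here `web:42` joins cluster `c1` and `web:43` joins `c2`, while the two existing clusters remain distinct; only golden records for those two clusters would need recomputation.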

Change Detection

SignalSmith detects changes using one of two methods:

Method | Description | Requirements
Timestamp-based | Compares record timestamps to the last run timestamp | Source tables must have a created_at or updated_at column
Change tracking | Uses warehouse-native change tracking | Snowflake: Streams; BigQuery: change history; Databricks: Delta Lake change data feed

If your warehouse supports change tracking, it is preferred because it captures deletes and is more efficient.
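Timestamp-based detection reduces to filtering on the previous run's high-water mark. A minimal sketch, with illustrative column names and rows, showing why it misses hard deletes:

```python
from datetime import datetime, timezone

# High-water mark stored per identity graph after each run (illustrative)
last_run_at = datetime(2024, 6, 1, tzinfo=timezone.utc)

rows = [
    {"id": 1, "updated_at": datetime(2024, 5, 30, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 6, 2, tzinfo=timezone.utc)},
]

# Only rows touched since the last run are re-extracted. A row hard-deleted
# from the source simply vanishes from `rows` and is never detected -- which
# is why warehouse-native change tracking is preferred when available.
changed = [r for r in rows if r["updated_at"] > last_run_at]
```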

Triggering Incremental Resolution

From the UI:

  1. Navigate to Identity Resolution and select your identity graph
  2. Click Run Resolution
  3. Select Incremental Resolution
  4. Click Start

From the API:

POST /api/v1/identity-graphs/{id}/resolve
Content-Type: application/json
 
{
  "mode": "incremental"
}

Limitations of Incremental Resolution

  • Drift accumulation — Over many incremental runs, small inaccuracies can accumulate. Schedule periodic full runs to correct this.
  • Deletes may be missed — If using timestamp-based change detection, deleted records are not detected. Use change tracking or periodic full runs.
  • Configuration changes — If you modify merge rules, limit rules, or identifier families, you must run full resolution for the changes to take effect across all records.

Scheduling

Identity resolution can be scheduled to run automatically:

Schedule | Mode | Use Case
Manual | Either | Run on demand after configuration changes
Hourly | Incremental | Near-real-time identity unification
Daily | Incremental | Standard cadence for most use cases
Weekly | Full | Periodic full refresh to correct drift
Custom Cron | Either | Specific timing needs

A common pattern is to combine both modes:

  • Daily at 2 AM: Incremental resolution (process new/changed records)
  • Weekly on Sunday at midnight: Full resolution (recompute everything to correct drift)

This provides daily freshness with weekly accuracy correction.
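Expressed as cron schedules (an illustration of the pattern, not a SignalSmith configuration format):

```
0 2 * * *     incremental   # daily at 2 AM: process new/changed records
0 0 * * SUN   full          # Sunday at midnight: full recompute to correct drift
```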

Monitoring

Resolution Run Status

Each resolution run has a status:

Status | Description
Queued | The run is scheduled but has not started
Running | The resolution is in progress
Completed | The resolution finished successfully
Failed | The resolution encountered an error
Cancelled | The run was manually cancelled

Run Metrics

After a successful run, the following metrics are available:

Metric | Description
Records processed | Total number of source records included in resolution
Clusters formed | Number of distinct identity clusters
Merges performed | Number of records linked to a cluster with more than one record
Singletons | Number of clusters with only one record (unmatched)
Edges created | Total number of edges in the identity graph
Edges dropped (limits) | Number of edges dropped due to limit rules
Duration | Total wall-clock time for the run
Golden records produced | Number of unified profiles created

Cluster Size Distribution

The run results include a cluster size distribution showing how many clusters contain N records:

Cluster Size | Count | Percentage
1 (singleton) | 450,000 | 45%
2 | 200,000 | 20%
3-5 | 150,000 | 15%
6-10 | 80,000 | 8%
11-50 | 50,000 | 5%
51-100 | 10,000 | 1%
100+ | 0 | 0% (capped by limit rules)

A healthy distribution typically shows a long tail: many singletons, a significant number of 2-record clusters, and decreasing counts for larger clusters.
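Given materialized cluster assignments, this distribution can be recomputed with two counters; the assignment map below is illustrative:

```python
from collections import Counter

# assignments: record_id -> cluster_id, as materialized by a resolution run
assignments = {"r1": "c1", "r2": "c1", "r3": "c2",
               "r4": "c3", "r5": "c3", "r6": "c3"}

cluster_sizes = Counter(assignments.values())   # cluster_id -> record count
distribution = Counter(cluster_sizes.values())  # cluster size -> cluster count
total = sum(distribution.values())
for size, count in sorted(distribution.items()):
    print(f"{size} record(s): {count} clusters ({count / total:.0%})")
```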

⚠️

If you see many clusters at or near your max cluster size limit, it may indicate over-merging. Review sample clusters at the limit boundary to check for false merges.

Error Handling

When a resolution run fails, SignalSmith provides:

  • Error message — The specific error that caused the failure
  • Error phase — Which phase failed (extraction, edge building, clustering, golden records, materialization)
  • Partial results — Whether any intermediate results were produced before the failure

Common failure causes:

Cause | Resolution
Warehouse timeout | Increase warehouse timeout settings or scale up compute
Permission denied | Check source connection credentials
Out of memory | Reduce the number of concurrent edges being processed
Schema change | A source table was altered; update identifier family column mappings

Next Steps