Creating an Identity Graph

The identity graph wizard guides you through a 5-step process to configure how SignalSmith resolves customer identities across your data sources. Each step builds on the previous one, and you can go back to adjust earlier steps before finalizing.

Prerequisites

Before starting, ensure you have:

  • At least one connected warehouse with accessible tables
  • At least one entity type defined in your schema
  • An understanding of which identifiers are shared across your data sources (email, phone, device ID, etc.)

Graph Type

Before starting the wizard, choose the graph type. The graph type determines whether confidence scoring infrastructure is enabled and cannot be changed after creation.

Type | When to Use
Deterministic (default) | You only need exact-match rules (email, phone, customer ID). Simpler, no confidence concepts to manage.
Probabilistic | You want fuzzy matching (name similarity, IP clustering, household) alongside exact-match rules. Enables confidence scores and thresholds at the graph, audience, and sync level.

Graph type is set at creation and cannot be changed later. To switch from deterministic to probabilistic, create a new identity graph and re-run resolution. See Probabilistic Matching for full details on scoring functions, confidence propagation, and threshold management.

Step 1: Select Entity Types

Choose which entity types to include in identity resolution. These are the entity types whose records will be linked together.

What to Select

Select entity types that represent the same real-world person across different data sources. Common patterns:

Scenario | Entity Types to Include
Single user table | User (identity resolution links duplicate records within the same table)
Multiple user tables | Website User, App User, Support Contact
Cross-system unification | CRM Contact, Marketing Lead, Product User

Configuration

For each selected entity type, specify:

  • Source table — The warehouse table containing records for this entity type
  • Entity key column — The primary key column in the source table
  • Include in resolution — Toggle to include or exclude (useful for temporarily removing an entity type)

You can include multiple entity types from the same source or from different sources. Identity resolution works across sources as long as the entity types share common identifier families.
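As a rough illustration of the three settings above, the sketch below models one entity type per object. The class name, field names, and table names are hypothetical and chosen for this example only; they are not SignalSmith's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityTypeConfig:
    """One selected entity type (hypothetical field names)."""
    name: str
    source_table: str            # warehouse table holding the records
    entity_key_column: str       # primary key column in that table
    include_in_resolution: bool = True


def active_entity_types(configs):
    """Return only the entity types toggled on for resolution."""
    return [c for c in configs if c.include_in_resolution]


# Hypothetical tables; Support Contact is temporarily excluded.
configs = [
    EntityTypeConfig("Website User", "analytics.web_users", "user_id"),
    EntityTypeConfig("App User", "analytics.app_users", "user_id"),
    EntityTypeConfig("Support Contact", "support.contacts", "contact_id",
                     include_in_resolution=False),
]
```

Turning the toggle off keeps the entity type's settings in place, so it can be re-enabled later without re-entering the table and key column.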

Step 2: Define Identifier Families

Identifier families are the types of identifiers used to link records together. Each family represents a category of identifier (email, phone, device ID), and each family can have multiple variants.

Adding an Identifier Family

For each identifier family:

  1. Name — A descriptive name (e.g., “Email”, “Phone”, “Device ID”)
  2. Variants — The specific subtypes within the family (e.g., “Personal Email”, “Work Email”)
  3. Column mappings — For each entity type, which column contains this identifier

Example Configuration

Family | Variant | User Table Column | App Table Column | CRM Column
Email | Personal | email | user_email | contact_email
Email | Work | work_email | business_email
Phone | Mobile | mobile_phone | phone | mobile
Phone | Home | home_phone
Device | Mobile Advertising ID | maid
Device | Cookie | browser_cookie
Customer ID | Internal | user_id | user_id | external_id

Not every entity type needs to have every identifier. Leave the column mapping empty for entity types that don’t have a particular identifier variant.
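One way to picture column mappings is as a lookup from family and variant to a column name per entity type. The sketch below is a minimal in-memory model under that assumption; the dictionary layout, the entity type keys ("User", "App", "CRM"), and the function name are illustrative, and only two fully mapped variants are included.

```python
# Hypothetical layout: family -> variant -> {entity type: column name}
IDENTIFIER_FAMILIES = {
    "Email": {
        "Personal": {"User": "email", "App": "user_email", "CRM": "contact_email"},
    },
    "Customer ID": {
        "Internal": {"User": "user_id", "App": "user_id", "CRM": "external_id"},
    },
}


def extract_identifiers(entity_type, record):
    """Return (family, variant, value) triples found on one record."""
    found = []
    for family, variants in IDENTIFIER_FAMILIES.items():
        for variant, columns in variants.items():
            column = columns.get(entity_type)  # None when the mapping is empty
            if column is None:
                continue
            value = record.get(column)
            if value is not None and value != "":
                found.append((family, variant, value))
    return found
```

An entity type with no mapping for a variant contributes nothing for that variant, which mirrors leaving the column mapping empty.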

Identifier Properties

For each identifier family, you can configure:

Property | Description | Default
Case sensitive | Whether matching should be case-sensitive | No (emails are compared case-insensitively)
Normalize | Whether to apply normalization (trim whitespace, lowercase for emails) | Yes
Minimum length | Reject identifiers shorter than this (filters out garbage data) | 1
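The effect of these three properties can be sketched as a small cleaning function. This is an assumption about how the settings might combine, not SignalSmith's implementation: the function name and parameters are hypothetical, trimming is tied to the Normalize setting, and lowercasing is tied to the Case sensitive setting.

```python
def normalize_identifier(value, case_sensitive=False, normalize=True, min_length=1):
    """Return the cleaned identifier, or None if it should be rejected."""
    if value is None:
        return None
    if normalize:
        value = value.strip()      # trim surrounding whitespace
    if not case_sensitive:
        value = value.lower()      # compare case-insensitively
    if len(value) < min_length:
        return None                # too short: treat as garbage data
    return value
```

With the defaults, "  Jane@Example.COM " and "jane@example.com" clean to the same value and would therefore match, while a whitespace-only value is rejected.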

Step 3: Configure Merge Rules

Merge rules define which identifier matches are allowed to link two records. Not every shared identifier should cause a merge: two records that share a cookie ID might be only a weak signal, while a shared email plus a shared phone number is a strong one.

Rule Types

Rule Type | Description
Single identifier match | Two records are linked if they share any value in the specified identifier family
Multi-identifier match | Two records are linked only if they share values in multiple specified identifier families
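The difference between the two rule types can be sketched as follows. The record shape (a dict mapping each family name to a set of values) and the function names are assumptions made for this example.

```python
def shares_value(a, b, family):
    """True if two records have at least one common value in a family."""
    return bool(a.get(family, set()) & b.get(family, set()))


def rule_matches(rule_families, a, b):
    """One family = single identifier match; several families =
    multi-identifier match, where every listed family must share a value."""
    if not rule_families:
        return False  # a rule with no families never links anything
    return all(shares_value(a, b, family) for family in rule_families)


a = {"Email": {"x@example.com"}, "Phone": {"555-0100"}}
b = {"Email": {"x@example.com"}, "Phone": {"555-0199"}}
```

Here `rule_matches(["Email"], a, b)` links the two records, while `rule_matches(["Email", "Phone"], a, b)` does not, because the phone numbers differ.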

Example Rules

Rule 1: Email match (deterministic)

  • If two records share an email address (any variant), link them
  • Confidence: High

Rule 2: Phone + Name match (probabilistic)

  • If two records share a phone number AND have a similar name, link them (name similarity is fuzzy matching, so this rule requires a probabilistic graph)
  • Confidence: Medium

Rule 3: Device ID match (deterministic)

  • If two records share a mobile advertising ID, link them
  • Confidence: High

Priority and Conflict Resolution

When multiple merge rules produce conflicting results, rules are applied in priority order. Higher-priority rules take precedence.
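A minimal sketch of priority ordering, under two loud assumptions that this page does not confirm: a lower number means a higher priority, and the first rule that matches decides the outcome. The rule dictionaries and the function name are hypothetical.

```python
def first_matching_rule(rules, a, b):
    """Evaluate rules from highest to lowest priority and return the name
    of the first one that matches, or None if no rule matches."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        families = rule["families"]
        # Every family in the rule must share at least one value.
        if families and all(a.get(f, set()) & b.get(f, set()) for f in families):
            return rule["name"]
    return None


rules = [
    {"name": "Phone + Name match", "priority": 2, "families": ["Phone", "Name"]},
    {"name": "Email match", "priority": 1, "families": ["Email"]},
]
```

Because rules are sorted before evaluation, the order in which they are listed does not matter, only their priority values.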

See Merge Rules for detailed guidance on designing effective merge rules.

Step 4: Set Limit Rules

Limit rules prevent over-merging by capping the size and connectivity of clusters. Without limits, a single shared identifier (like a generic email address or a shared device) could merge thousands of unrelated records into one giant cluster.

Available Limits

Limit | Description | Example
Max cluster size | Maximum number of records in a single cluster | 50 records
Max identifiers per family | Maximum unique identifier values per family in a cluster | 10 emails per cluster
Max records per entity type | Maximum records from a single entity type in a cluster | 5 CRM contacts per cluster

Example Configuration

Limit | Value | Rationale
Max cluster size | 100 | No person should have more than 100 associated records
Max emails per cluster | 10 | A person rarely has more than 10 email addresses
Max phones per cluster | 5 | A person rarely has more than 5 phone numbers
Max device IDs per cluster | 20 | Shared household devices can inflate this count

When a limit is reached, the algorithm stops adding new records to the cluster. The excess records remain unresolved and are flagged for manual review.
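To illustrate how a cap stops a cluster from growing, the sketch below applies a max cluster size while merging records with a union-find structure. It is a simplified model, not SignalSmith's algorithm: the function name is hypothetical, only the cluster size limit is enforced, and a link that would exceed the cap is skipped and returned so it could be reviewed.

```python
def cluster_with_limit(links, max_cluster_size):
    """Merge linked records, skipping any link that would push a
    cluster past max_cluster_size. Returns (clusters, skipped_links)."""
    parent, size = {}, {}
    skipped = []

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same cluster
        if size[root_a] + size[root_b] > max_cluster_size:
            skipped.append((a, b))  # limit reached: leave unmerged
            continue
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a
        size[root_a] += size[root_b]

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values()), skipped
```

With links a-b, b-c, c-d and a cap of 3, records a, b, and c form one cluster, and the c-d link is skipped, leaving d on its own.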

See Limit Rules for detailed guidance on setting effective limits.

Step 5: Review and Create

The final step presents a summary of your complete identity graph configuration:

Review Checklist

  • Entity types — All selected entity types with their source tables and key columns
  • Identifier families — All families, variants, and column mappings
  • Merge rules — All rules with their conditions and priorities
  • Limit rules — All constraints to prevent over-merging

Validation

SignalSmith validates the configuration before creation:

Validation | Description
Column existence | All mapped columns exist in the source tables
Column types | Identifier columns have compatible data types
Merge rule coverage | At least one merge rule is defined
Limit rule sanity | Limits are not set so low that no merging would occur
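Three of these checks can be sketched as a function that returns a list of error messages. The configuration shape, the function name, and the specific sanity threshold (a max cluster size below 2 prevents any merge) are assumptions for this example; the column type check is omitted because it depends on warehouse metadata.

```python
def validate_config(config, table_columns):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    # Column existence: every mapped column must be present in its table.
    for table, column in config["column_mappings"]:
        if column not in table_columns.get(table, set()):
            errors.append(f"Column {table}.{column} does not exist")
    # Merge rule coverage: at least one rule is required.
    if not config["merge_rules"]:
        errors.append("At least one merge rule is required")
    # Limit rule sanity: a cluster of one record can never merge.
    max_size = config.get("max_cluster_size")
    if max_size is not None and max_size < 2:
        errors.append("Max cluster size below 2 prevents any merging")
    return errors
```

Collecting all errors before returning, instead of stopping at the first one, lets the whole configuration be corrected in a single pass.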

Create

Click Create Identity Graph to save the configuration. This does not run resolution — it only saves the configuration. To run resolution, proceed to Running Resolution.

After Creation

Once the identity graph is created, you can:

  1. Run resolution — Execute the connected-components algorithm to link records
  2. View the graph — Explore the identity graph structure and cluster statistics
  3. Configure golden records — Set survivorship rules for the unified profiles
  4. Iterate — Adjust identifier families, merge rules, and limit rules as you learn from the results
  5. Configure probabilistic rules — Add fuzzy matching strategies for expanded reach (probabilistic graphs only)
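For readers unfamiliar with connected components, the sketch below shows the general idea: records joined by any chain of links end up in the same cluster. It is a generic textbook version with hypothetical record IDs, and it ignores limit rules and confidence scores.

```python
from collections import defaultdict


def connected_components(links):
    """Group records into clusters from a list of (record, record) links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, clusters = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:  # depth-first walk over everything reachable
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            cluster.add(node)
            stack.extend(neighbours[node] - seen)
        clusters.append(cluster)
    return clusters
```

Given links u1-a1 and a1-c1, the records u1 and c1 land in the same cluster even though they share no identifier directly, which is why limit rules matter.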

Identity resolution configuration is iterative. Start with conservative merge rules and tight limit rules, run resolution, review the results, then loosen rules as you gain confidence in the matching quality.

Next Steps