Creating an Identity Graph

The identity graph wizard guides you through a five-step process to configure how SignalSmith resolves customer identities across your data sources. Each step builds on the previous one, and you can return to earlier steps to adjust them before finalizing.

Prerequisites

Before starting, ensure you have:

At least one warehouse connected with accessible warehouse tables
At least one entity type defined in your schema
An understanding of which identifiers are shared across your data sources (email, phone, device ID, etc.)

Graph Type

Before starting the wizard, choose the graph type. The graph type determines whether confidence scoring infrastructure is enabled and cannot be changed after creation.

Type	When to Use
Deterministic (default)	You only need exact-match rules (email, phone, customer ID). Simpler, no confidence concepts to manage.
Probabilistic	You want fuzzy matching (name similarity, IP clustering, household) alongside exact-match rules. Enables confidence scores and thresholds at the graph, audience, and sync level.

To switch from deterministic to probabilistic (or back), create a new identity graph with the desired type and re-run resolution. See Probabilistic Matching for full details on scoring functions, confidence propagation, and threshold management.

Step 1: Select Entity Types

Choose which entity types to include in identity resolution. These are the entity types whose records will be linked together.

What to Select

Select entity types that represent the same real-world person across different data sources. Common patterns:

Scenario	Entity Types to Include
Single user table	`User` (identity resolution links duplicate records within the same table)
Multiple user tables	`Website User`, `App User`, `Support Contact`
Cross-system unification	`CRM Contact`, `Marketing Lead`, `Product User`

Configuration

For each selected entity type, specify:

Source table — The warehouse table containing records for this entity type
Entity key column — The primary key column in the source table
Include in resolution — Toggle to include or exclude (useful for temporarily removing an entity type)

You can include multiple entity types from the same source or from different sources. Identity resolution works across sources as long as the entity types share common identifier families.
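As a rough illustration of the three per-entity settings above, the sketch below models them as a small Python data structure. The class name, field names, and table names are hypothetical and are not part of any SignalSmith API; the wizard collects the same information through its UI.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityTypeConfig:
    """Hypothetical stand-in for one row of the Step 1 form."""
    name: str
    source_table: str            # warehouse table holding the records
    entity_key_column: str       # primary key column in that table
    include_in_resolution: bool = True


def active_entity_types(configs: List[EntityTypeConfig]) -> List[EntityTypeConfig]:
    """Return only the entity types currently toggled into resolution."""
    return [c for c in configs if c.include_in_resolution]


# Illustrative values only; the table names are made up.
configs = [
    EntityTypeConfig("Website User", "analytics.web_users", "user_id"),
    EntityTypeConfig("App User", "analytics.app_users", "user_id"),
    EntityTypeConfig("Support Contact", "support.contacts", "contact_id",
                     include_in_resolution=False),
]
```

Here the Support Contact entity type stays configured but is excluded from resolution, which mirrors the temporary-removal use of the toggle.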

Step 2: Define Identifier Families

Identifier families are the types of identifiers used to link records together. Each family represents a category of identifier (email, phone, device ID), and each family can have multiple variants.

Adding an Identifier Family

For each identifier family:

Name — A descriptive name (e.g., “Email”, “Phone”, “Device ID”)
Variants — The specific subtypes within the family (e.g., “Personal Email”, “Work Email”)
Column mappings — For each entity type, which column contains this identifier

Example Configuration

Family	Variant	User Table Column	App Table Column	CRM Column
Email	Personal	`email`	`user_email`	`contact_email`
Email	Work	`work_email`	—	`business_email`
Phone	Mobile	`mobile_phone`	`phone`	`mobile`
Phone	Home	—	—	`home_phone`
Device	Mobile Advertising ID	—	`maid`	—
Device	Cookie	`browser_cookie`	—	—
Customer ID	Internal	`user_id`	`user_id`	`external_id`

Not every entity type needs to have every identifier. Leave the column mapping empty for entity types that don’t have a particular identifier variant.
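To make the mapping table concrete, here is a minimal sketch of how column mappings could be used to pull identifiers out of a single source row. The dictionary layout, the short entity keys ("user", "app", "crm"), and the function name are assumptions made for illustration; only the column names come from the example table above.

```python
from typing import Dict, List, Optional, Tuple

# (family, variant) -> column name per entity type; None means "not mapped".
# The keys "user", "app", and "crm" are hypothetical labels for the three tables.
COLUMN_MAPPINGS: Dict[Tuple[str, str], Dict[str, Optional[str]]] = {
    ("Email", "Personal"): {"user": "email", "app": "user_email", "crm": "contact_email"},
    ("Email", "Work"): {"user": "work_email", "app": None, "crm": "business_email"},
    ("Phone", "Mobile"): {"user": "mobile_phone", "app": "phone", "crm": "mobile"},
    ("Device", "Mobile Advertising ID"): {"user": None, "app": "maid", "crm": None},
}


def extract_identifiers(entity_type: str,
                        row: Dict[str, Optional[str]]) -> List[Tuple[str, str, str]]:
    """Collect (family, variant, value) triples present in one source row."""
    found = []
    for (family, variant), columns in COLUMN_MAPPINGS.items():
        column = columns.get(entity_type)
        if column is None:
            continue  # this entity type has no column for the variant
        value = row.get(column)
        if value:  # skip NULLs and empty strings
            found.append((family, variant, value))
    return found
```

Unmapped variants and empty values are simply skipped, which matches the guidance that not every entity type needs every identifier.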

Identifier Properties

For each identifier family, you can configure:

Property	Description	Default
Case sensitive	Whether matching should be case-sensitive	No (emails are compared case-insensitively)
Normalize	Whether to apply normalization (trim whitespace, lowercase for emails)	Yes
Minimum length	Reject identifiers shorter than this (filters out garbage data)	1
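The three properties above can be pictured as a small cleaning function applied to each identifier value before matching. This is a sketch under assumptions: the function name and parameters are hypothetical, and lowercasing is used here as one simple way to get case-insensitive comparison. The normalization SignalSmith actually applies may differ.

```python
from typing import Optional


def normalize_identifier(value: Optional[str], *,
                         case_sensitive: bool = False,
                         normalize: bool = True,
                         min_length: int = 1) -> Optional[str]:
    """Return the cleaned identifier, or None if it should be rejected."""
    if value is None:
        return None
    if normalize:
        value = value.strip()      # trim surrounding whitespace
    if not case_sensitive:
        value = value.lower()      # compare case-insensitively
    if len(value) < min_length:
        return None                # too short: treat as garbage data
    return value
```

With the defaults, a value such as "  Ada@Example.COM " is cleaned to "ada@example.com", and a whitespace-only value is rejected because it falls below the minimum length of 1.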

Step 3: Configure Merge Rules

Merge rules define which identifier matches are allowed to link two records. Not every shared identifier should cause a merge. For example, a shared cookie ID on its own may be a weak signal, while a shared email and phone number together are a strong signal.

Rule Types

Rule Type	Description
Single identifier match	Two records are linked if they share any value in the specified identifier family
Multi-identifier match	Two records are linked only if they share values in multiple specified identifier families

Example Rules

Rule 1: Email match (deterministic)

If two records share an email address (any variant), link them
Confidence: High

Rule 2: Phone + Name match (probabilistic, requires a probabilistic graph)

If two records share a phone number AND have a similar name, link them
Confidence: Medium

Rule 3: Device ID match (deterministic)

If two records share a mobile advertising ID, link them
Confidence: High

Priority and Conflict Resolution

When multiple merge rules produce conflicting results, rules are applied in priority order. Higher-priority rules take precedence.
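One plausible reading of priority ordering is that rules are tried from highest to lowest priority and the first rule that matches decides the outcome for a pair of records. The sketch below implements that reading only. The MergeRule class, the convention that a lower number means higher priority, and the function name are all assumptions; see Merge Rules for the actual conflict-resolution behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Identifiers = Dict[str, Set[str]]  # hypothetical: family -> set of values


@dataclass(frozen=True)
class MergeRule:
    name: str
    priority: int                  # assumption: 1 is the highest priority
    families: Tuple[str, ...]      # all listed families must share a value


def first_matching_rule(rules: List[MergeRule],
                        a: Identifiers, b: Identifiers) -> Optional[MergeRule]:
    """Return the highest-priority rule that links the two records, if any."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.families and all(
            a.get(f, set()) & b.get(f, set()) for f in rule.families
        ):
            return rule
    return None
```

Returning the winning rule, rather than a plain boolean, makes it possible to record which rule produced each link when reviewing results.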

See Merge Rules for detailed guidance on designing effective merge rules.

Step 4: Set Limit Rules

Limit rules prevent over-merging by capping the size and connectivity of clusters. Without limits, a single shared identifier (like a generic email address or a shared device) could merge thousands of unrelated records into one giant cluster.

Available Limits

Limit	Description	Example
Max cluster size	Maximum number of records in a single cluster	50 records
Max identifiers per family	Maximum unique identifier values per family in a cluster	10 emails per cluster
Max records per entity type	Maximum records from a single entity type in a cluster	5 CRM contacts per cluster

Example Configuration

Limit	Value	Rationale
Max cluster size	100	No person should have more than 100 associated records
Max emails per cluster	10	A person rarely has more than 10 email addresses
Max phones per cluster	5	A person rarely has more than 5 phone numbers
Max device IDs per cluster	20	Shared household devices can inflate this count

When a limit is reached, the algorithm stops adding new records to the cluster. The excess records remain unresolved and are flagged for manual review.
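The effect of a max cluster size limit can be illustrated with a small union-find sketch: a link is applied only if the merged cluster would stay within the limit, and rejected links are collected for review. This is an illustrative model, not SignalSmith's implementation; the function name, the input shapes, and the choice to flag the rejected link (rather than individual records) are assumptions.

```python
from typing import Dict, List, Tuple


def resolve_with_size_limit(records: List[str],
                            links: List[Tuple[str, str]],
                            max_cluster_size: int
                            ) -> Tuple[List[List[str]], List[Tuple[str, str]]]:
    """Merge linked records, skipping merges that would exceed the size limit."""
    parent: Dict[str, str] = {r: r for r in records}
    size: Dict[str, int] = {r: 1 for r in records}
    flagged: List[Tuple[str, str]] = []

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same cluster
        if size[root_a] + size[root_b] > max_cluster_size:
            flagged.append((a, b))  # limit reached: leave for manual review
            continue
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a     # attach the smaller cluster to the larger
        size[root_a] += size[root_b]

    clusters: Dict[str, List[str]] = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values()), flagged
```

With four records chained a-b, b-c, c-d and a limit of 3, the first two links are applied and the third is flagged, leaving d in its own cluster.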

See Limit Rules for detailed guidance on setting effective limits.

Step 5: Review and Create

The final step presents a summary of your complete identity graph configuration:

Review Checklist

Entity types — All selected entity types with their source tables and key columns
Identifier families — All families, variants, and column mappings
Merge rules — All rules with their conditions and priorities
Limit rules — All constraints to prevent over-merging

Validation

SignalSmith validates the configuration before creation:

Validation	Description
Column existence	All mapped columns exist in the source tables
Column types	Identifier columns have compatible data types
Merge rule coverage	At least one merge rule is defined
Limit rule sanity	Limits are not set so low that no merging would occur
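The four checks in the table can be approximated with a short pre-flight function, which may be useful for sanity-checking a planned configuration against your warehouse schema before opening the wizard. Everything here is hypothetical: the configuration layout, the set of compatible types, and the threshold used for the limit check are assumptions, not SignalSmith's actual validation logic.

```python
from typing import Dict, List

# Assumption: which warehouse column types count as "compatible" identifiers.
COMPATIBLE_TYPES = {"string", "integer"}


def validate_config(config: dict,
                    table_columns: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []

    # Column existence and column types.
    for m in config.get("column_mappings", []):
        columns = table_columns.get(m["table"], {})
        col_type = columns.get(m["column"])
        if col_type is None:
            errors.append(f"missing column: {m['table']}.{m['column']}")
        elif col_type not in COMPATIBLE_TYPES:
            errors.append(
                f"incompatible type: {m['table']}.{m['column']} is {col_type}"
            )

    # Merge rule coverage.
    if not config.get("merge_rules"):
        errors.append("no merge rules defined")

    # Limit rule sanity: a cluster capped below 2 records can never merge.
    max_size = config.get("limits", {}).get("max_cluster_size")
    if max_size is not None and max_size < 2:
        errors.append("max_cluster_size below 2 prevents any merging")

    return errors
```

Collecting all problems in one pass, instead of stopping at the first, lets you fix a configuration in a single round.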

Create

Click Create Identity Graph to save the configuration. This does not run resolution — it only saves the configuration. To run resolution, proceed to Running Resolution.

After Creation

Once the identity graph is created, you can:

Run resolution — Execute the connected-components algorithm to link records
View the graph — Explore the identity graph structure and cluster statistics
Configure golden records — Set survivorship rules for the unified profiles
Iterate — Adjust identifier families, merge rules, and limit rules as you learn from the results
Configure probabilistic rules — Add fuzzy matching strategies for expanded reach (probabilistic graphs only)
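For intuition about what a connected-components pass does, the sketch below groups records that share any identifier value and then walks the resulting graph with a breadth-first search. It treats every shared identifier as a link, ignores merge rule conditions and limit rules, and uses hypothetical record IDs and input shapes, so it is a simplified model rather than SignalSmith's resolution engine.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

Identifier = Tuple[str, str]  # hypothetical: (family, normalized value)


def connected_components(record_identifiers: Dict[str, Set[Identifier]]
                         ) -> List[List[str]]:
    """Cluster records that are transitively linked by shared identifiers."""
    # Index records by identifier so sharing can be found without pair scans.
    by_identifier: Dict[Identifier, List[str]] = defaultdict(list)
    for record, identifiers in record_identifiers.items():
        for identifier in identifiers:
            by_identifier[identifier].append(record)

    # Link each record to the first record seen with the same identifier.
    # A star per identifier is enough to preserve connectivity.
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for records in by_identifier.values():
        first = records[0]
        for other in records[1:]:
            adjacency[first].add(other)
            adjacency[other].add(first)

    seen: Set[str] = set()
    clusters: List[List[str]] = []
    for start in record_identifiers:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        cluster: List[str] = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(cluster))
    return clusters
```

Note the transitive effect: a record sharing only an email with one record and only a phone with another pulls all three into one cluster, which is why the limit rules in Step 4 matter.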

Identity resolution configuration is iterative. Start with conservative merge rules and tight limit rules, run resolution, review the results, then loosen rules as you gain confidence in the matching quality.

Next Steps

Identifier Families — Deep dive into family and variant configuration
Merge Rules — Design effective merge rules
Limit Rules — Prevent over-merging
Running Resolution — Execute identity resolution
