Creating an Identity Graph
The identity graph wizard guides you through a five-step process to configure how SignalSmith resolves customer identities across your data sources. Each step builds on the previous one, and you can return to earlier steps to adjust them before finalizing.
Prerequisites
Before starting, ensure you have:
- At least one connected warehouse with accessible tables
- At least one entity type defined in your schema
- An understanding of which identifiers are shared across your data sources (email, phone, device ID, etc.)
Graph Type
Before starting the wizard, choose the graph type. It determines whether the confidence scoring infrastructure is enabled, and it cannot be changed after creation.
| Type | When to Use |
|---|---|
| Deterministic (default) | You only need exact-match rules (email, phone, customer ID). Simpler, no confidence concepts to manage. |
| Probabilistic | You want fuzzy matching (name similarity, IP clustering, household) alongside exact-match rules. Enables confidence scores and thresholds at the graph, audience, and sync level. |
Because the graph type is fixed at creation, switching from deterministic to probabilistic requires creating a new identity graph and re-running resolution. See Probabilistic Matching for full details on scoring functions, confidence propagation, and threshold management.
Step 1: Select Entity Types
Choose which entity types to include in identity resolution. These are the entity types whose records will be linked together.
What to Select
Select entity types that represent the same real-world person across different data sources. Common patterns:
| Scenario | Entity Types to Include |
|---|---|
| Single user table | User (identity resolution links duplicate records within the same table) |
| Multiple user tables | Website User, App User, Support Contact |
| Cross-system unification | CRM Contact, Marketing Lead, Product User |
Configuration
For each selected entity type, specify:
- Source table — The warehouse table containing records for this entity type
- Entity key column — The primary key column in the source table
- Include in resolution — Toggle to include or exclude (useful for temporarily removing an entity type)
You can include multiple entity types from the same source or from different sources. Identity resolution works across sources as long as the entity types share common identifier families.
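The per-entity-type settings above could be captured in a small structure like the one below. This is an illustrative sketch only: the `EntityTypeConfig` class, its field names, and the table names are hypothetical and are not SignalSmith's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityTypeConfig:
    """Hypothetical container for the three Step 1 settings."""
    name: str
    source_table: str        # warehouse table holding the records
    entity_key_column: str   # primary key column in that table
    include_in_resolution: bool = True


def active_entity_types(configs):
    """Return the names of entity types that take part in resolution."""
    return [c.name for c in configs if c.include_in_resolution]


# Example values are made up for illustration.
entity_types = [
    EntityTypeConfig("Website User", "analytics.web_users", "user_id"),
    EntityTypeConfig("App User", "analytics.app_users", "user_id"),
    # Temporarily excluded without deleting its configuration.
    EntityTypeConfig("Support Contact", "support.contacts", "contact_id",
                     include_in_resolution=False),
]

print(active_entity_types(entity_types))
```

Modeling the toggle as a flag rather than removing the entry reflects the stated use case of temporarily excluding an entity type.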
Step 2: Define Identifier Families
Identifier families are the types of identifiers used to link records together. Each family represents a category of identifier (email, phone, device ID), and each family can have multiple variants.
Adding an Identifier Family
For each identifier family:
- Name — A descriptive name (e.g., “Email”, “Phone”, “Device ID”)
- Variants — The specific subtypes within the family (e.g., “Personal Email”, “Work Email”)
- Column mappings — For each entity type, which column contains this identifier
Example Configuration
| Family | Variant | User Table Column | App Table Column | CRM Column |
|---|---|---|---|---|
| Email | Personal | email | user_email | contact_email |
| Email | Work | work_email | — | business_email |
| Phone | Mobile | mobile_phone | phone | mobile |
| Phone | Home | — | — | home_phone |
| Device | Mobile Advertising ID | — | maid | — |
| Device | Cookie | browser_cookie | — | — |
| Customer ID | Internal | user_id | user_id | external_id |
Not every entity type needs to have every identifier. Leave the column mapping empty for entity types that don’t have a particular identifier variant.
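One way to reason about the example mappings is to treat them as a lookup keyed by family and variant. The sketch below assumes a plain dictionary and the short entity type keys `user`, `app`, and `crm`; these names and both helper functions are hypothetical, chosen only to mirror the columns of the example table.

```python
# Hypothetical in-memory form of the example mapping table.
# Keys are (family, variant); values map an entity type to its column.
# An entity type that is absent has no column for that variant.
COLUMN_MAPPINGS = {
    ("Email", "Personal"): {"user": "email", "app": "user_email", "crm": "contact_email"},
    ("Email", "Work"): {"user": "work_email", "crm": "business_email"},
    ("Phone", "Mobile"): {"user": "mobile_phone", "app": "phone", "crm": "mobile"},
    ("Phone", "Home"): {"crm": "home_phone"},
    ("Device", "Mobile Advertising ID"): {"app": "maid"},
    ("Device", "Cookie"): {"user": "browser_cookie"},
    ("Customer ID", "Internal"): {"user": "user_id", "app": "user_id", "crm": "external_id"},
}


def columns_for(entity_type):
    """List (family, variant, column) triples mapped for one entity type."""
    return [
        (family, variant, columns[entity_type])
        for (family, variant), columns in COLUMN_MAPPINGS.items()
        if entity_type in columns
    ]


def shared_families(entity_a, entity_b):
    """Families in which both entity types have at least one mapped column."""
    def families(entity_type):
        return {family for family, _, _ in columns_for(entity_type)}
    return families(entity_a) & families(entity_b)


print(sorted(shared_families("app", "crm")))
```

A check like `shared_families` shows which families can link two entity types at all: if the result is empty, no merge rule can connect their records.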
Identifier Properties
For each identifier family, you can configure:
| Property | Description | Default |
|---|---|---|
| Case sensitive | Whether matching should be case-sensitive | No (emails are compared case-insensitively) |
| Normalize | Whether to apply normalization (trim whitespace, lowercase for emails) | Yes |
| Minimum length | Reject identifiers shorter than this (filters out garbage data) | 1 |
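To make the three properties concrete, here is a minimal sketch of how they might combine when cleaning a single identifier value. The function name, its parameters, and the order of operations are assumptions made for illustration; the actual normalization SignalSmith applies may differ.

```python
def normalize_identifier(value, *, case_sensitive=False, normalize=True, min_length=1):
    """Return a cleaned identifier, or None if it should be rejected.

    Assumed behavior: normalization trims whitespace, case-insensitive
    families are lowercased, and the length check runs last.
    """
    if value is None:
        return None
    if normalize:
        value = value.strip()
    if not case_sensitive:
        value = value.lower()
    if len(value) < min_length:
        return None
    return value


print(normalize_identifier("  Jane@Example.COM "))
print(normalize_identifier("   "))
```

Running the length check after trimming means a whitespace-only value is rejected even with the default minimum length of 1.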
Step 3: Configure Merge Rules
Merge rules define which identifier matches can actually link two records together. Not every shared identifier should cause a merge — for example, two records sharing a common cookie ID might be a weak signal, while shared email + phone is a strong signal.
Rule Types
| Rule Type | Description |
|---|---|
| Single identifier match | Two records are linked if they share any value in the specified identifier family |
| Multi-identifier match | Two records are linked only if they share values in multiple specified identifier families |
Example Rules
Rule 1: Email match (deterministic)
- If two records share an email address (any variant), link them
- Confidence: High
Rule 2: Phone + Name match (probabilistic; requires a probabilistic graph)
- If two records share a phone number AND have a similar name, link them
- Confidence: Medium
Rule 3: Device ID match (deterministic)
- If two records share a mobile advertising ID, link them
- Confidence: High
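The difference between the two rule types can be sketched as a check over shared identifier values. The code below is an illustration under stated assumptions: records are plain dictionaries mapping a family name to a set of values, a rule is a name plus a list of required families, and list order stands in for priority. It covers exact matches only, so a fuzzy condition such as name similarity is out of scope here.

```python
def share_value(record_a, record_b, family):
    """True if the two records have at least one common value in `family`."""
    return bool(record_a.get(family, set()) & record_b.get(family, set()))


def rule_matches(record_a, record_b, required_families):
    """A rule links two records only if every required family has a shared value."""
    # An empty rule never matches; otherwise all() would be vacuously true.
    return bool(required_families) and all(
        share_value(record_a, record_b, family) for family in required_families
    )


def first_matching_rule(record_a, record_b, rules):
    """Return the name of the first rule, in list order, that links the records."""
    for name, required_families in rules:
        if rule_matches(record_a, record_b, required_families):
            return name
    return None


# Hypothetical rules: one multi-identifier rule, then one single-identifier rule.
rules = [
    ("email_and_phone", ["Email", "Phone"]),
    ("email", ["Email"]),
]
a = {"Email": {"jane@example.com"}, "Phone": {"+15550100"}}
b = {"Email": {"jane@example.com"}, "Phone": {"+15550199"}}

print(first_matching_rule(a, b, rules))
```

Here `a` and `b` share an email but not a phone number, so only the single-identifier rule links them.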
Priority and Conflict Resolution
When multiple merge rules produce conflicting results, the rules are applied in priority order and the higher-priority rule takes precedence.
See Merge Rules for detailed guidance on designing effective merge rules.
Step 4: Set Limit Rules
Limit rules prevent over-merging by capping the size and connectivity of clusters. Without limits, a single shared identifier (like a generic email address or a shared device) could merge thousands of unrelated records into one giant cluster.
Available Limits
| Limit | Description | Example |
|---|---|---|
| Max cluster size | Maximum number of records in a single cluster | 50 records |
| Max identifiers per family | Maximum unique identifier values per family in a cluster | 10 emails per cluster |
| Max records per entity type | Maximum records from a single entity type in a cluster | 5 CRM contacts per cluster |
Example Configuration
| Limit | Value | Rationale |
|---|---|---|
| Max cluster size | 100 | No person should have more than 100 associated records |
| Max emails per cluster | 10 | A person rarely has more than 10 email addresses |
| Max phones per cluster | 5 | A person rarely has more than 5 phone numbers |
| Max device IDs per cluster | 20 | Shared household devices can inflate this count |
When a limit is reached, the algorithm stops adding new records to the cluster. The excess records remain unresolved and are flagged for manual review.
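The three limit types can be pictured as checks on a candidate cluster before a merge is accepted. The sketch below is an assumption-laden illustration, not SignalSmith's algorithm: the record shape, the function names, and the choice to evaluate the combined cluster up front are all hypothetical.

```python
from collections import Counter, defaultdict


def violated_limits(records, max_cluster_size, max_per_family, max_per_entity_type):
    """Return the names of the limits a candidate cluster would break."""
    violations = []
    if len(records) > max_cluster_size:
        violations.append("max_cluster_size")

    # Count unique identifier values per family across the whole cluster.
    values_by_family = defaultdict(set)
    for record in records:
        for family, values in record["identifiers"].items():
            values_by_family[family].update(values)
    for family, cap in max_per_family.items():
        if len(values_by_family[family]) > cap:
            violations.append(f"max_identifiers:{family}")

    # Count records contributed by each entity type.
    counts = Counter(record["entity_type"] for record in records)
    for entity_type, cap in max_per_entity_type.items():
        if counts[entity_type] > cap:
            violations.append(f"max_records:{entity_type}")
    return violations


def can_merge(cluster_a, cluster_b, **limits):
    """A merge is allowed only if the combined cluster stays within every limit."""
    return not violated_limits(cluster_a + cluster_b, **limits)
```

Returning the list of violated limits, rather than a bare boolean, makes it easier to see which limit blocked a merge when reviewing records that were left unresolved.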
See Limit Rules for detailed guidance on setting effective limits.
Step 5: Review and Create
The final step presents a summary of your complete identity graph configuration.
Review Checklist
- Entity types — All selected entity types with their source tables and key columns
- Identifier families — All families, variants, and column mappings
- Merge rules — All rules with their conditions and priorities
- Limit rules — All constraints to prevent over-merging
Validation
SignalSmith validates the configuration before creation:
| Validation | Description |
|---|---|
| Column existence | All mapped columns exist in the source tables |
| Column types | Identifier columns have compatible data types |
| Merge rule coverage | At least one merge rule is defined |
| Limit rule sanity | Limits are not set so low that no merging would occur |
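Two of these checks, column existence and merge rule coverage, are simple enough to sketch. The function below is a hypothetical illustration of the idea; the input shapes are assumptions, and the column type and limit sanity checks are omitted because the source does not describe how they are evaluated.

```python
def validate_config(column_mappings, table_columns, merge_rules):
    """Return a list of human-readable problems; an empty list means the checks passed.

    column_mappings: iterable of (table, column) pairs used by identifier families
    table_columns:   dict mapping a table name to the set of columns it contains
    merge_rules:     list of configured merge rules (contents are not inspected)
    """
    problems = []
    for table, column in column_mappings:
        if column not in table_columns.get(table, set()):
            problems.append(f"missing column: {table}.{column}")
    if not merge_rules:
        problems.append("no merge rules defined")
    return problems


# Made-up schema for illustration.
table_columns = {"users": {"user_id", "email"}, "crm": {"external_id"}}
print(validate_config([("users", "email"), ("crm", "contact_email")], table_columns, []))
```

Collecting every problem instead of stopping at the first one lets all issues be fixed in a single pass through the wizard.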
Create
Click Create Identity Graph to save the configuration. Saving does not run resolution; to run it, proceed to Running Resolution.
After Creation
Once the identity graph is created, you can:
- Run resolution — Execute the connected-components algorithm to link records
- View the graph — Explore the identity graph structure and cluster statistics
- Configure golden records — Set survivorship rules for the unified profiles
- Iterate — Adjust identifier families, merge rules, and limit rules as you learn from the results
- Configure probabilistic rules — Add fuzzy matching strategies for expanded reach (probabilistic graphs only)
Identity resolution configuration is iterative. Start with conservative merge rules and tight limit rules, run resolution, review the results, then loosen rules as you gain confidence in the matching quality.
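For intuition about the connected-components step mentioned above, here is a minimal union-find sketch that groups record IDs into clusters from pairwise links. It is a generic textbook version under simplifying assumptions: it ignores limit rules and confidence scores, and the function name and input shapes are hypothetical rather than SignalSmith's implementation.

```python
def connected_components(record_ids, links):
    """Group record IDs into clusters, given pairwise (id_a, id_b) links."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk to the root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in clusters.values())


records = ["u1", "u2", "a1", "c1", "c2"]
links = [("u1", "a1"), ("a1", "c1")]  # e.g. pairs matched by a merge rule
print(connected_components(records, links))
```

Note that `u1` and `c1` end up in the same cluster even though no rule linked them directly. This transitive behavior is why a single over-shared identifier can chain unrelated records together, and why starting with conservative rules and tight limits is a sensible default.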
Next Steps
- Identifier Families — Deep dive into family and variant configuration
- Merge Rules — Design effective merge rules
- Limit Rules — Prevent over-merging
- Running Resolution — Execute identity resolution