Probabilistic Matching

Probabilistic matching extends identity resolution beyond exact-match rules by using fuzzy and statistical signals to link records. Each probabilistic match produces a confidence score between 0.0 and 1.0 that flows through to audiences and syncs, letting you control the precision/recall tradeoff at query time without re-running the identity pipeline.

Graph Types

Identity graphs have a type set at creation time. The graph type determines whether confidence scoring infrastructure is enabled.

Type	Description	Confidence Features
Deterministic	Exact-match rules only (the default). All edges carry confidence 1.0.	Hidden: no confidence UI, columns, or thresholds
Probabilistic	Supports exact-match and fuzzy rules, which can be mixed in the same graph.	Enabled: confidence columns, plus thresholds at the graph, audience, and sync levels

Deterministic graphs are unaffected by probabilistic matching; their behavior does not change. Probabilistic graphs add confidence scoring infrastructure throughout the pipeline: edge tables, profile aggregates, golden record columns, and audience query filters.

Graph type is set at creation and cannot be changed. To switch from deterministic to probabilistic, create a new identity graph and re-run resolution.

Scoring Functions

When you add a merge rule with match type probabilistic to a probabilistic graph, you choose a scoring function that determines how candidate record pairs are compared. Each scoring function targets a specific identifier family and uses a method suited to that data type.

Fuzzy Name (`fuzzy_name`)

Matches records with similar names using phonetic blocking and string similarity scoring.

How it works: Records that share the same SOUNDEX code are selected as candidates. Confidence is then computed using a per-warehouse similarity algorithm: Jaro-Winkler similarity on Snowflake, Levenshtein distance on Databricks, and SOUNDEX match on BigQuery.
Requires: name family identifiers — first_name, last_name, or full_name — mapped on participating models.

Configuration:

Parameter	Default	Description
Min Similarity	0.85	Minimum computed similarity score (0.0–1.0) to create an edge

Example: “Alice Smith” and “Alice M. Smith” share the same SOUNDEX code. On Snowflake, Jaro-Winkler scores the pair at 0.92, which exceeds the 0.85 threshold, so an edge is created with confidence 0.92.
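The block-then-score flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the product's implementation: `soundex`, `levenshtein`, `name_similarity`, and `fuzzy_name_edges` are hypothetical helpers, blocking uses the SOUNDEX code of `last_name` only, and the score is a length-normalized Levenshtein distance standing in for whichever similarity function your warehouse provides.

```python
from collections import defaultdict
from itertools import combinations

# Standard SOUNDEX consonant classes.
_SOUNDEX_DIGITS = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character SOUNDEX code of a name ("" if it has no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical digits
        digit = _SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != previous:
            code += digit
            if len(code) == 4:
                break
        previous = digit  # vowels reset the run
    return code.ljust(4, "0")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(
                previous_row[j] + 1,               # deletion
                row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + (ca != cb),  # substitution
            ))
        previous_row = row
    return previous_row[-1]


def name_similarity(a: str, b: str) -> float:
    """Map edit distance to 0.0-1.0, where 1.0 means identical (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_name_edges(records, min_similarity=0.85):
    """Return (id_a, id_b, confidence) for similar pairs within a SOUNDEX block."""
    blocks = defaultdict(list)
    for record in records:
        # Assumption: candidates are blocked on the last name's SOUNDEX code.
        blocks[soundex(record["last_name"])].append(record)

    edges = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = name_similarity(
                f"{left['first_name']} {left['last_name']}",
                f"{right['first_name']} {right['last_name']}",
            )
            if score >= min_similarity:
                edges.append((left["id"], right["id"], round(score, 2)))
    return edges


records = [
    {"id": "a", "first_name": "Jon", "last_name": "Smith"},
    {"id": "b", "first_name": "John", "last_name": "Smith"},
    {"id": "c", "first_name": "Mary", "last_name": "Jones"},
]
print(fuzzy_name_edges(records))  # [('a', 'b', 0.9)]
```

Blocking keeps the number of comparisons manageable: only records in the same SOUNDEX block are compared pairwise, so record `c` is never scored against `a` or `b`.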

IP Cluster (`ip_cluster`)

Links records that share the same IP address within a configurable time window, which suggests they may belong to the same person or household.

How it works: Exact IP address match combined with a time proximity constraint. Records whose timestamps fall within the configured window are treated as candidates.
Requires: ip_address family identifiers and a timestamp column mapped on participating models.

Configuration:

Parameter	Default	Description
Time Window (hours)	24	Maximum time difference between record events to create a link

Example: Two records from IP 192.168.1.1 with events 18 hours apart fall within the 24-hour window. An edge is created with the base confidence configured for the rule (default 0.6).
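The IP-plus-time-window rule can be sketched as follows. This is an illustrative sketch, not the product's implementation: the function name `ip_cluster_edges` and the field names `record_id`, `ip_address`, and `event_ts` are hypothetical, and treating the window boundary as inclusive is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations


def ip_cluster_edges(events, window_hours=24, base_confidence=0.6):
    """Return (record_a, record_b, confidence) for same-IP events inside the window."""
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event["ip_address"]].append(event)  # exact IP match

    window = timedelta(hours=window_hours)
    edges = []
    for group in by_ip.values():
        for left, right in combinations(group, 2):
            if left["record_id"] == right["record_id"]:
                continue  # never link a record to itself
            # Assumption: a gap exactly equal to the window still qualifies.
            if abs(left["event_ts"] - right["event_ts"]) <= window:
                edges.append((left["record_id"], right["record_id"], base_confidence))
    return edges


events = [
    {"record_id": "r1", "ip_address": "192.168.1.1", "event_ts": datetime(2024, 1, 1, 6, 0)},
    {"record_id": "r2", "ip_address": "192.168.1.1", "event_ts": datetime(2024, 1, 2, 0, 0)},   # 18 hours after r1
    {"record_id": "r3", "ip_address": "192.168.1.1", "event_ts": datetime(2024, 1, 3, 12, 0)},  # more than 24 hours after both
]
print(ip_cluster_edges(events))  # [('r1', 'r2', 0.6)]
```

Every edge carries the same base confidence regardless of how close the two events are, which matches the flat score described for this rule.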

Household (`household`)

Links records that share a postal code and have sufficiently similar names, indicating they may belong to the same household.

How it works: Exact postal code match, followed by SOUNDEX name matching. Both conditions must be satisfied to create an edge.
Requires: Both address family identifiers (postal_code or zip_code) and name family identifiers mapped on participating models.

Configuration:

Parameter	Default	Description
Name Similarity Threshold	0.8	Minimum name similarity score required alongside the postal code match

Example: Two records in postal code 10001 with names “John Smith” and “J. Smith” satisfy the postal code match. SOUNDEX produces a match, and the resulting name similarity score meets the 0.8 threshold, so an edge is created.
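A rough sketch of the two-condition household check follows. It is not the product's implementation: `household_edges` and `default_similarity` are hypothetical names, `difflib.SequenceMatcher` is a stand-in for the actual name similarity metric, and `base_confidence` is passed explicitly because this page does not state a default for household rules. The edge confidence is computed as base confidence multiplied by name similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def default_similarity(a: str, b: str) -> float:
    """Stand-in name similarity in 0.0-1.0 (difflib ratio, not the product's metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def household_edges(records, base_confidence, name_threshold=0.8,
                    similarity=default_similarity):
    """Return (id_a, id_b, confidence) for pairs passing both household conditions."""
    edges = []
    for left, right in combinations(records, 2):
        if left["postal_code"] != right["postal_code"]:
            continue  # condition 1: exact postal code match
        score = similarity(left["name"], right["name"])
        if score < name_threshold:
            continue  # condition 2: names must be similar enough
        edges.append((left["id"], right["id"], round(base_confidence * score, 4)))
    return edges


records = [
    {"id": "h1", "name": "John Smith", "postal_code": "10001"},
    {"id": "h2", "name": "John Smith", "postal_code": "10001"},
    {"id": "h3", "name": "John Smith", "postal_code": "94105"},
]
print(household_edges(records, base_confidence=0.7))  # [('h1', 'h2', 0.7)]
```

Passing the similarity function as a parameter makes it easy to swap in a metric that mirrors your warehouse's behavior when you test thresholds offline.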

Confidence Scores

Every edge in the identity graph carries a confidence score. Deterministic edges always carry 1.0; probabilistic edges carry a score that depends on their scoring function.

Edge Type	Confidence
Deterministic (email, phone, customer ID)	Always 1.0
Probabilistic (`fuzzy_name`)	Computed similarity score from the warehouse-specific algorithm (0.0–1.0)
Probabilistic (`ip_cluster`)	Base confidence from rule configuration
Probabilistic (`household`)	Base confidence × name similarity score

Propagation

Confidence propagates through the connected-components algorithm. When two records are linked through one or more intermediate edges, the path confidence is the minimum edge confidence along the path from one record to the other. A single weak link therefore lowers the confidence for all records reachable through it.

Each row in _IDENTITY_GRAPH has a min_edge_confidence column — the minimum confidence on the shortest path from that record to the component root.

Profile-level aggregates in _IDENTITY_PROFILES summarize confidence across all members of the resolved cluster:

Column	Description
`min_confidence`	Lowest `min_edge_confidence` value of any member record in the profile
`avg_confidence`	Average `min_edge_confidence` across all member records in the profile
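The propagation rule and the profile aggregates can be illustrated with a small breadth-first traversal. This is a sketch under assumptions, not the pipeline's actual algorithm: `min_edge_confidence` and `profile_aggregates` are hypothetical helpers, the component root is supplied by the caller, the root itself is assigned 1.0, and when several shortest paths exist the first one discovered wins.

```python
from collections import defaultdict, deque


def min_edge_confidence(edges, root):
    """Map each record to the weakest edge on its shortest path to the root."""
    adjacency = defaultdict(list)
    for a, b, confidence in edges:
        adjacency[a].append((b, confidence))
        adjacency[b].append((a, confidence))

    result = {root: 1.0}  # assumption: the root has no edge to itself, so 1.0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor, confidence in adjacency[node]:
            if neighbor not in result:  # first visit is a shortest (fewest-hop) path
                result[neighbor] = min(result[node], confidence)
                queue.append(neighbor)
    return result


def profile_aggregates(per_record):
    """Compute the profile-level min and average over member records."""
    values = list(per_record.values())
    return {
        "min_confidence": min(values),
        "avg_confidence": round(sum(values) / len(values), 4),
    }


edges = [
    ("r1", "r2", 1.0),   # deterministic edge
    ("r2", "r3", 0.92),  # probabilistic edge
    ("r3", "r4", 1.0),   # deterministic edge behind the weak link
]
per_record = min_edge_confidence(edges, root="r1")
print(per_record)                      # {'r1': 1.0, 'r2': 1.0, 'r3': 0.92, 'r4': 0.92}
print(profile_aggregates(per_record))  # {'min_confidence': 0.92, 'avg_confidence': 0.96}
```

Record `r4` is joined by a deterministic edge, yet it inherits 0.92 because the only path to the root passes through the 0.92 edge.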

Confidence Thresholds

Thresholds control which profiles appear in audience queries and syncs. Records whose min_edge_confidence falls below the active threshold are excluded from results. Thresholds are evaluated at query time, not at pipeline time — you can adjust them without re-running resolution.

Three-Level Hierarchy

Thresholds are resolved with this precedence, highest to lowest:

1. Audience sync override: a per-sync threshold set on an individual sync destination, for destination-specific tuning
2. Audience threshold: a per-audience threshold set in the audience builder
3. Graph default: the default threshold configured on the identity graph itself

If none of the three levels has a threshold set, the effective threshold is 0.0, meaning all profiles are included regardless of confidence.
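The precedence rule amounts to taking the first threshold that is set, falling back to 0.0. A minimal sketch, where `effective_threshold` and its parameter names are hypothetical and `None` is assumed to mean "not set at this level":

```python
from typing import Optional


def effective_threshold(
    sync_override: Optional[float] = None,
    audience_threshold: Optional[float] = None,
    graph_default: Optional[float] = None,
) -> float:
    """Return the first threshold that is set, in precedence order, else 0.0."""
    for level in (sync_override, audience_threshold, graph_default):
        # Compare against None so an explicit 0.0 still counts as "set".
        if level is not None:
            return level
    return 0.0


print(effective_threshold(sync_override=0.9, audience_threshold=0.8, graph_default=0.5))  # 0.9
print(effective_threshold(audience_threshold=0.8, graph_default=0.5))                     # 0.8
print(effective_threshold(graph_default=0.5))                                             # 0.5
print(effective_threshold())                                                              # 0.0
```

Checking `is not None` rather than truthiness matters here: a sync that explicitly sets 0.0 to include every profile should not fall through to a stricter audience or graph value.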

Example Scenario

Level	Threshold	Effect
Graph default	0.5	Base filtering: profiles with confidence below 0.5 are excluded wherever no audience or sync threshold takes precedence
Email campaign audience	0.8	Stricter: only high-confidence profiles qualify for this audience
Google Ads sync	0.5	Inherits the graph default, which requires that neither the sync nor its audience sets a threshold; ad platforms tolerate lower match quality
Braze sync	0.9	Strictest override: personalized messaging requires high accuracy

Because confidence thresholds are applied at query time, changing a threshold on a graph, audience, or sync takes effect on the next query; no re-run of identity resolution is required.

Impact on Golden Records

When a probabilistic identity graph has a golden record configured, the _GOLDEN_RECORD table includes a _min_confidence column. This column holds the minimum edge confidence across all source records that contributed to that golden record row.

Audience queries using the golden record path (Path B) filter on _min_confidence >= threshold when a confidence threshold is active. This ensures that audience membership derived from golden records reflects the same confidence controls as direct graph queries.

See Golden Records for the full survivorship configuration reference.

Impact on Audiences and Syncs

Audience Queries

When an audience is linked to a probabilistic identity graph and a confidence threshold is set at any level, the audience query applies the threshold automatically:

Path B (Golden Record): Filters _GOLDEN_RECORD rows where _min_confidence >= threshold.
Path C (ROW_NUMBER dedup): Adds AND ig.min_edge_confidence >= threshold to the identity graph JOIN clause.

Lower-confidence profiles are excluded from audience membership counts, size estimates, and previews. Threshold changes propagate to estimates on the next audience refresh without requiring a pipeline re-run.
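The effect of a query-time threshold on audience size can be illustrated in memory. This sketch only mimics the Path B filter on `_min_confidence`; `audience_members`, the `profile_id` field, and the sample rows are hypothetical, and the real filter runs as SQL in your warehouse.

```python
def audience_members(golden_rows, threshold=0.0):
    """Keep golden record rows whose _min_confidence meets the threshold."""
    return [row for row in golden_rows if row["_min_confidence"] >= threshold]


golden_rows = [
    {"profile_id": "p1", "_min_confidence": 1.0},
    {"profile_id": "p2", "_min_confidence": 0.92},
    {"profile_id": "p3", "_min_confidence": 0.6},
]

# The same resolved data yields different audience sizes per threshold.
for threshold in (0.0, 0.8, 0.95):
    print(threshold, len(audience_members(golden_rows, threshold)))
```

The rows are never recomputed between iterations, which is the property that lets thresholds change without re-running resolution.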

Sync Activation

Each audience sync can specify its own confidence threshold override. This enables destination-specific precision/recall tuning:

Ad platforms (Google Ads, Meta): Use a looser threshold (e.g., 0.5) to maximize match rate and reach.
Email and SMS platforms (Braze, Iterable): Use a stricter threshold (e.g., 0.85–0.9) to protect sender reputation and deliverability.
CRM platforms (Salesforce, HubSpot): Use an intermediate threshold based on your sales team’s tolerance for fuzzy matches.

When a sync override is set, it takes precedence over the audience threshold and graph default for that sync only.

Best Practices

Start deterministic, add probabilistic later — Get your deterministic graph working first and validate cluster quality. Once you understand your data’s identity patterns, create a new probabilistic graph to expand reach with fuzzy signals.
Use limit rules aggressively — Probabilistic matching increases the risk of super-nodes because weak signals can chain records together through transitive closure. Set tight cardinality limits on weaker identifier families such as IP addresses and household postal codes. See Limit Rules.
Review confidence distributions before setting thresholds — After running resolution, query _IDENTITY_PROFILES.min_confidence to understand the distribution of confidence scores in your graph. Set graph-level defaults based on the shape of this distribution, not arbitrary values.
Set graph-level defaults conservatively — A default threshold of 0.5 is a reasonable starting point. Loosen it for specific audiences or syncs where recall matters more than precision.
Pair with Audience Boost — Probabilistic stitching expands your first-party graph by linking known records across systems. Audience Boost enriches with third-party identifiers at sync time. They address different gaps and are complementary, not alternatives.
Monitor avg_confidence over time — As new data is ingested and resolution re-runs, track avg_confidence at the graph level. A declining average can indicate data quality degradation or a misconfigured probabilistic rule that is creating weak edges at scale.
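To review the confidence distribution before choosing a threshold, a simple histogram over `min_confidence` values is often enough. The sketch below assumes the values have already been loaded from `_IDENTITY_PROFILES` into a Python list; `confidence_histogram` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import Counter


def confidence_histogram(min_confidences, bins=10):
    """Return (bin_start, bin_end, count) tuples covering 0.0-1.0."""
    # A score of exactly 1.0 is placed in the last bin.
    counts = Counter(min(int(value * bins), bins - 1) for value in min_confidences)
    width = 1.0 / bins
    return [
        (round(i * width, 2), round((i + 1) * width, 2), counts.get(i, 0))
        for i in range(bins)
    ]


sample = [1.0, 1.0, 0.92, 0.6, 0.45]  # illustrative values only
for start, end, count in confidence_histogram(sample):
    print(f"{start:.1f}-{end:.1f}: {'#' * count}")
```

A cluster of profiles just below a candidate threshold is a cue to inspect those matches manually before deciding which side of the line they belong on.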

Next Steps

Merge Rules — Configure deterministic and probabilistic rules and set rule priorities
Identifier Families — Map name, address, and IP columns to the families that probabilistic scoring functions require
Golden Records — Unified profiles with confidence metadata and survivorship rules
Running Resolution — Execute the identity resolution pipeline and review output tables
