
Databricks

This guide covers how to configure Databricks as a warehouse in SignalSmith, including workspace setup, SQL warehouse configuration, authentication, and required permissions.

Prerequisites

  • A Databricks workspace (AWS, Azure, or GCP)
  • A SQL warehouse or all-purpose cluster running in the workspace
  • A personal access token with read permissions
  • Network access from SignalSmith to your Databricks workspace

Connection Configuration

Required Fields

Field      Description                                            Example
Host       The Databricks workspace URL (without https://)        mycompany.cloud.databricks.com
HTTP Path  The HTTP path for the SQL warehouse or cluster         /sql/1.0/warehouses/abc123def456
Token      A Databricks personal access token                     dapi1234567890abcdef
Catalog    The Unity Catalog name (or hive_metastore for legacy)  main
Schema     The default schema to query                            default

Host

The workspace URL is visible in your browser’s address bar when logged in to Databricks. Use just the hostname without https://:

# AWS
mycompany.cloud.databricks.com

# Azure
adb-1234567890123456.7.azuredatabricks.net

# GCP
mycompany.gcp.databricks.com
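Since users often paste the full URL from the address bar, a small helper can reduce it to the bare hostname the Host field expects. This is our own illustrative sketch, not part of SignalSmith:

```python
from urllib.parse import urlparse

def normalize_host(value: str) -> str:
    """Strip an optional scheme, path, and trailing slash, leaving
    only the bare workspace hostname."""
    value = value.strip()
    if "://" in value:
        value = urlparse(value).netloc
    return value.rstrip("/").split("/")[0]

print(normalize_host("https://mycompany.cloud.databricks.com/"))
# mycompany.cloud.databricks.com
```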

HTTP Path

The HTTP path identifies which compute resource to use for running queries. You can find it in the Databricks UI:

For SQL Warehouses (recommended):

  1. Go to SQL Warehouses in the left sidebar
  2. Click on your warehouse
  3. Go to the Connection Details tab
  4. Copy the HTTP Path value

The format is typically:

/sql/1.0/warehouses/<warehouse-id>

For All-Purpose Clusters:

  1. Go to Compute in the left sidebar
  2. Click on your cluster
  3. Go to Advanced Options > JDBC/ODBC
  4. Copy the HTTP Path value

The format is typically:

/sql/protocolv1/o/<org-id>/<cluster-id>

SQL Warehouses are recommended over all-purpose clusters for SignalSmith because they start faster, scale automatically, and provide better cost isolation.
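The two path shapes can be told apart mechanically, which is handy for validating configuration before saving it. The patterns below are illustrative (real warehouse and cluster IDs may vary), so treat this as a sanity check rather than strict validation:

```python
import re

# Illustrative patterns for the two HTTP path formats shown above.
WAREHOUSE_PATH = re.compile(r"^/sql/1\.0/warehouses/[0-9a-z]+$")
CLUSTER_PATH = re.compile(r"^/sql/protocolv1/o/\d+/[\w-]+$")

def compute_kind(http_path: str) -> str:
    """Classify an HTTP path as a SQL warehouse, an all-purpose
    cluster, or unknown."""
    if WAREHOUSE_PATH.match(http_path):
        return "sql_warehouse"
    if CLUSTER_PATH.match(http_path):
        return "all_purpose_cluster"
    return "unknown"

print(compute_kind("/sql/1.0/warehouses/abc123def456"))  # sql_warehouse
```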

Catalog and Schema

If your Databricks workspace uses Unity Catalog, specify the catalog name (e.g., main, production). If you’re using the legacy Hive Metastore, set the catalog to hive_metastore.

-- Unity Catalog: catalog.schema.table
SELECT * FROM main.customer_data.users
 
-- Hive Metastore: schema.table (catalog is hive_metastore)
SELECT * FROM customer_data.users

Databricks treats schema and table names as case-insensitive, so Customer_Data and customer_data reference the same schema.
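The two naming schemes can be summarized in a small helper. This is our own illustration (SignalSmith does not expose such a function), assuming the common convention of omitting the implicit hive_metastore catalog:

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build the table reference used in model SQL: three-part names
    under Unity Catalog, two-part names for the legacy Hive Metastore
    (where the hive_metastore catalog is implied)."""
    if catalog == "hive_metastore":
        return f"{schema}.{table}"
    return f"{catalog}.{schema}.{table}"

print(qualified_name("main", "customer_data", "users"))
# main.customer_data.users
print(qualified_name("hive_metastore", "customer_data", "users"))
# customer_data.users
```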

Authentication

Personal Access Token

SignalSmith uses Databricks personal access tokens (PATs) for authentication.

To create a PAT:

  1. In the Databricks workspace, click your username in the top-right corner
  2. Select Settings
  3. Go to Developer > Access Tokens
  4. Click Generate New Token
  5. Set a description (e.g., “SignalSmith CDP”) and expiration
  6. Copy the token; it is shown only once

Token format: dapi1234567890abcdef1234567890abcdef

Token expiration: Set a reasonable expiration (e.g., 90 days) and establish a rotation process. When the token expires, update the source configuration in SignalSmith and re-test the connection.
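A rotation process can be as simple as tracking when each token was created and alerting before it lapses. A minimal sketch (the helper name is ours, purely illustrative):

```python
from datetime import date, timedelta

def days_until_expiry(created: date, lifetime_days: int, today: date) -> int:
    """Days remaining before a PAT created on `created` with the
    given lifetime expires; negative means it has already expired."""
    return (created + timedelta(days=lifetime_days) - today).days

# A token created Jan 1 with a 90-day lifetime expires Apr 1.
print(days_until_expiry(date(2025, 1, 1), 90, date(2025, 3, 1)))  # 31
```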

Service Principal (Alternative)

For production environments, consider using a Databricks service principal instead of a personal access token:

  1. Create a service principal in Account Console > User Management
  2. Generate an OAuth secret or PAT for the service principal
  3. Grant the service principal access to the required catalog, schema, and tables

Required Permissions

Unity Catalog Permissions

-- Grant catalog access
GRANT USE CATALOG ON CATALOG main TO `signalsmith-service-principal`;
 
-- Grant schema access
GRANT USE SCHEMA ON SCHEMA main.customer_data TO `signalsmith-service-principal`;
 
-- Grant read access to all tables in the schema
GRANT SELECT ON SCHEMA main.customer_data TO `signalsmith-service-principal`;
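If you provision access for several principals, the three statements can be generated rather than hand-edited. A throwaway sketch of that idea:

```python
def read_grants(catalog: str, schema: str, principal: str) -> list[str]:
    """Emit the three read-only grants above for one principal."""
    target = f"{catalog}.{schema}"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {target} TO `{principal}`;",
        f"GRANT SELECT ON SCHEMA {target} TO `{principal}`;",
    ]

for stmt in read_grants("main", "customer_data", "signalsmith-service-principal"):
    print(stmt)
```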

Hive Metastore Permissions

-- Grant database access
GRANT USAGE ON DATABASE customer_data TO `cdp_user`;
 
-- Grant read access to all tables in the database
GRANT SELECT ON DATABASE customer_data TO `cdp_user`;

SQL Warehouse Access

The user or service principal must also have Can Use permission on the SQL warehouse:

  1. Go to SQL Warehouses
  2. Click on the warehouse
  3. Go to the Permissions tab
  4. Add the user/service principal with Can Use permission

Data Types

Databricks Type               SignalSmith Handling
STRING                        Mapped as text
INT, BIGINT, DOUBLE, DECIMAL  Mapped as number
BOOLEAN                       Mapped as boolean
TIMESTAMP, DATE               Mapped as date/datetime
STRUCT                        Supported in queries; flattened for sync
ARRAY                         Supported in queries; flattened for sync
MAP                           Supported in queries; serialized as JSON for sync
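As a rough illustration of what "flattened for sync" means, nested fields become dotted column names and array elements get index suffixes. This sketch is our own and may differ from SignalSmith's actual sync logic (and plain Python dicts cannot distinguish a MAP from a STRUCT; MAP values would instead be serialized as JSON strings):

```python
def flatten_record(record: dict, prefix: str = "") -> dict:
    """Flatten STRUCT-like values into dotted column names and
    ARRAY-like values into indexed columns (illustrative only)."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten_record(value, f"{name}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                out[f"{name}.{i}"] = item
        else:
            out[name] = value
    return out

print(flatten_record({"id": 1, "address": {"city": "Oslo"}, "tags": ["a", "b"]}))
# {'id': 1, 'address.city': 'Oslo', 'tags.0': 'a', 'tags.1': 'b'}
```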

Delta Lake Features

Tables in Databricks use the Delta Lake format by default, which provides:

  • Time travel — Query historical versions of tables in your model SQL
  • Schema evolution — Tables can change schema over time; SignalSmith detects column changes
  • ACID transactions — Consistent reads even during concurrent writes

-- Time travel: query data as of a specific timestamp
SELECT * FROM customer_data.users TIMESTAMP AS OF '2025-01-01'
 
-- Time travel: query a specific version
SELECT * FROM customer_data.users VERSION AS OF 42

Example Configuration

curl -X POST https://your-workspace.signalsmith.dev/api/v1/sources \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Databricks",
    "type": "databricks",
    "config": {
      "host": "mycompany.cloud.databricks.com",
      "http_path": "/sql/1.0/warehouses/abc123def456",
      "token": "dapi1234567890abcdef",
      "catalog": "main",
      "schema": "customer_data"
    }
  }'

Network Configuration

If your Databricks workspace uses private networking (Private Link, VNet injection), ensure that SignalSmith can reach the workspace’s public or private endpoint. Options include:

  • Public endpoint with IP allowlisting — Add SignalSmith IPs to the workspace’s IP access list
  • Private connectivity — Contact SignalSmith support for private link options

To configure IP access lists in Databricks:

  1. Go to Admin Console > Workspace Settings
  2. Enable IP Access Lists
  3. Add SignalSmith’s IP addresses to the allowlist
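IP access lists can also be managed through the Databricks REST API (POST /api/2.0/ip-access-lists). A hedged sketch of the request body; the label is our own choice and the addresses are documentation placeholders, not real SignalSmith egress IPs:

```python
import json

# Request body for Databricks' IP access list API. Replace the
# placeholder addresses with SignalSmith's published IPs.
payload = {
    "label": "signalsmith-egress",
    "list_type": "ALLOW",
    "ip_addresses": ["203.0.113.10/32", "203.0.113.11/32"],
}
print(json.dumps(payload, indent=2))
```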

Troubleshooting

Issue                           Solution
"Invalid access token"          Verify the token is correct and has not expired; generate a new one if needed
"SQL warehouse is not running"  Start the SQL warehouse or enable auto-start in its configuration
"Catalog 'X' not found"         Verify the catalog name; use hive_metastore if not using Unity Catalog
"Connection timed out"          Check network access: ensure SignalSmith IPs are allowed and the workspace is reachable
"Insufficient privileges"       Verify USE CATALOG, USE SCHEMA, and SELECT permissions are granted
"HTTP Path is invalid"          Verify the HTTP path from the warehouse/cluster Connection Details tab

Next Steps