GitHub Loader

The GitHub loader pulls repository, issue, pull request, and commit data from your GitHub organizations and repositories into your data warehouse. It uses the GitHub REST API and GraphQL API, and supports incremental sync via the updated_at timestamp for most resources.

Prerequisites

  • A GitHub account with access to the repositories you want to sync
  • Read access to the target repositories (public or private)
  • A connected Warehouse (target warehouse) with write permissions on the target schema

Authentication

The GitHub loader supports two authentication methods.

OAuth 2.0

  1. In SignalSmith, click Add Loader and select GitHub
  2. Choose OAuth 2.0 as the authentication method
  3. Click Connect with GitHub
  4. You’ll be redirected to GitHub’s authorization page
  5. Grant SignalSmith access to your repositories
  6. Select which organizations and repositories to authorize (if using GitHub Apps)
  7. You’ll be redirected back to SignalSmith with the connection established

SignalSmith requests the following scopes:

| Scope | Purpose |
|---|---|
| repo | Read access to repositories, issues, pull requests, and commits |
| read:org | Read organization membership and teams |

Personal Access Token (PAT)

For environments where OAuth isn’t practical, you can use a GitHub Personal Access Token:

  1. In GitHub, go to Settings > Developer settings > Personal access tokens > Fine-grained tokens
  2. Click Generate new token
  3. Set the token name (e.g., “SignalSmith Loader”)
  4. Under Repository access, select the repositories you want to sync
  5. Under Permissions, grant:
    • Repository permissions: Issues (Read), Pull requests (Read), Contents (Read), Metadata (Read)
    • Organization permissions: Members (Read) — if syncing org data
  6. Click Generate token and copy it
  7. In SignalSmith, paste the token into the Personal Access Token field

Classic personal access tokens also work. Grant the repo and read:org scopes.
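
For a classic token, one way to confirm the scopes before pasting it into SignalSmith is to compare the X-OAuth-Scopes response header that GitHub returns on an authenticated API call against the two scopes listed above. The sketch below is illustrative only: missing_scopes and REQUIRED_SCOPES are hypothetical names, not part of SignalSmith, and the check is meant for classic tokens.

```python
# Scopes this document asks for on classic tokens.
REQUIRED_SCOPES = {"repo", "read:org"}


def missing_scopes(x_oauth_scopes):
    """Return the required scopes absent from an X-OAuth-Scopes header value.

    The header is a comma-separated list such as "repo, read:org".
    Note: broader scopes (for example admin:org) imply narrower ones on
    GitHub's side; this simplified sketch only does an exact-name comparison.
    """
    granted = {s.strip() for s in x_oauth_scopes.split(",") if s.strip()}
    return REQUIRED_SCOPES - granted


if __name__ == "__main__":
    print(missing_scopes("repo"))
```

An empty result means both scopes are present; anything else names the scopes still to grant.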

Repository Selection

After authentication, SignalSmith discovers accessible organizations and repositories. You can select:

  • Entire organizations — Sync all repositories in the organization
  • Individual repositories — Select specific repositories to sync
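
The selection rules described in this document (specific repositories override the organization selection, and forks are excluded from organization syncs unless “Include Forks” is enabled) can be sketched as a small resolver. This is an assumption about how such logic might look, not SignalSmith's implementation; resolve_repositories and its parameters are hypothetical, while full_name and fork are fields GitHub returns on repository objects.

```python
def resolve_repositories(org_repos, selected_orgs, selected_repos, include_forks=False):
    """Return the sorted list of repository full names to sync.

    org_repos maps an organization name to a list of repository dicts,
    each carrying GitHub's "full_name" and "fork" fields.
    """
    # An explicit repository list overrides the organization selection.
    if selected_repos:
        return sorted(selected_repos)

    resolved = []
    for org in selected_orgs:
        for repo in org_repos.get(org, []):
            # Forks are skipped in organization syncs unless opted in.
            if repo["fork"] and not include_forks:
                continue
            resolved.append(repo["full_name"])
    return sorted(resolved)
```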

Available Objects

| Object | API Name | Description | Default Sync Mode |
|---|---|---|---|
| Repositories | repos | Repository metadata, settings, and statistics | Incremental |
| Issues | issues | Issues with labels, assignees, milestones, and state | Incremental |
| Pull Requests | pull_requests | Pull requests with review status, merge state, and branch info | Incremental |
| Commits | commits | Commit history with author, message, and file changes | Incremental |
| Comments | comments | Comments on issues and pull requests | Incremental |
| Reviews | reviews | Pull request review records with state and body | Incremental |
| Labels | labels | Label definitions per repository | Full Refresh |
| Milestones | milestones | Milestone definitions with due dates and progress | Full Refresh |
| Releases | releases | Release tags with notes and assets | Incremental |
| Teams | teams | Organization team definitions and membership | Full Refresh |
| Members | members | Organization member profiles | Full Refresh |
| Workflows | workflows | GitHub Actions workflow definitions | Full Refresh |
| Workflow Runs | workflow_runs | GitHub Actions workflow execution history | Incremental |

Issues vs. Pull Requests

In GitHub’s REST API, every pull request is also an issue, but not every issue is a pull request, so the issues endpoint returns both. SignalSmith extracts them as separate objects to simplify querying: issues that are also pull requests go into the pull_requests table and are excluded from the issues table.
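
If you work with the raw API yourself, the same separation can be reproduced by checking for the pull_request key, which GitHub sets only on issue records that are really pull requests. The helper below is a minimal sketch with a hypothetical name; it is not SignalSmith code.

```python
def split_issues_and_pull_requests(items):
    """Split records from GitHub's issues endpoint into (issues, pull_requests).

    A record that carries a "pull_request" key is a pull request;
    every other record is a plain issue.
    """
    issues, pull_requests = [], []
    for item in items:
        if "pull_request" in item:
            pull_requests.append(item)
        else:
            issues.append(item)
    return issues, pull_requests
```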

Commits

Commits are extracted per repository and include:

  • Commit SHA, message, and timestamp
  • Author and committer information
  • File change statistics (additions, deletions, changed files count)
  • Parent commit SHAs (for merge detection)

For large repositories with extensive history, the initial backfill may be limited to a configurable number of recent commits (default: last 10,000).
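
As an illustration of how the parent SHAs and the backfill limit can be used downstream, the sketch below flags merge commits (more than one parent) and trims a commit list to the most recent entries. Both function names are hypothetical, and the trimming assumes the list is ordered newest first, which is how GitHub's commit listing returns it.

```python
def is_merge_commit(commit):
    """A commit with more than one parent SHA is a merge commit."""
    return len(commit.get("parents", [])) > 1


def limit_backfill(commits, max_commits=10_000):
    """Keep only the most recent commits.

    Assumes `commits` is ordered newest first. The default mirrors the
    10,000-commit backfill default mentioned above.
    """
    return commits[:max_commits]
```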

Configuration

| Setting | Description | Default |
|---|---|---|
| Auth Method | OAuth 2.0 or Personal Access Token | OAuth 2.0 |
| Organizations | Organizations to sync repositories from | None (optional) |
| Repositories | Specific repositories to sync (overrides org selection) | None (optional) |
| Objects | List of objects to sync | None (you choose during setup) |
| Sync Mode | Full Refresh or Incremental (per object) | Incremental |
| Cursor Field | Field used for incremental sync | updated_at |
| Primary Key | Field(s) that uniquely identify a record | id |
| Target Schema | Warehouse schema for GitHub tables | None (required) |
| Table Prefix | Optional prefix for table names | gh_ |
| Schedule | Sync frequency | Hourly |
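
To make the Cursor Field and Primary Key settings concrete, here is a simplified sketch of one incremental run: records older than the stored cursor are skipped, the rest are upserted by primary key, and the cursor advances to the newest timestamp seen. It is an assumption about the general technique, not SignalSmith's actual sync engine; incremental_sync is a hypothetical name and the "table" is just an in-memory dict. It also assumes timestamps are ISO 8601 UTC strings in a uniform format, so string comparison matches chronological order.

```python
def incremental_sync(table, records, cursor, cursor_field="updated_at", primary_key="id"):
    """Upsert records newer than `cursor` into `table` and return the new cursor.

    `table` is a dict keyed by primary key. `cursor` is None on the first run.
    """
    new_cursor = cursor
    for record in records:
        value = record[cursor_field]
        # Skip records strictly older than the cursor. Records equal to the
        # cursor are reprocessed, which is safe because upserts are idempotent.
        if cursor is not None and value < cursor:
            continue
        table[record[primary_key]] = record  # insert or overwrite by key
        if new_cursor is None or value > new_cursor:
            new_cursor = value
    return new_cursor
```

Persisting the returned cursor between runs is what lets the next run pick up only the changes made since the last one.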

Scheduling Notes

  • Rate limits: GitHub’s REST API allows 5,000 requests per hour for authenticated users. The GraphQL API has a separate point-based limit of 5,000 points per hour. SignalSmith manages both limits with automatic backoff.
  • Secondary rate limits: GitHub also enforces secondary (abuse) rate limits for rapid consecutive requests. SignalSmith paces requests to avoid these limits.
  • Large repositories: Repositories with hundreds of thousands of issues or commits may require multiple sync runs to complete the initial backfill. SignalSmith uses pagination cursors to resume across runs.
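  • GitHub Enterprise: For GitHub Enterprise Server deployments, provide your GitHub Enterprise base URL instead of github.com. The API resources are the same, but they are served from your own host (on GitHub Enterprise Server, the REST API lives under /api/v3 and the GraphQL API under /api/graphql).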
  • Fork repositories: By default, forked repositories are excluded from organization syncs. Enable “Include Forks” if you need fork data.
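
For readers calling the GitHub API directly, the backoff idea behind the rate limit notes above can be sketched from GitHub's response headers: retry-after (sent with some secondary rate limit responses), x-ratelimit-remaining, and x-ratelimit-reset (a UTC epoch timestamp). The function below is a hedged illustration with a hypothetical name, not a description of SignalSmith's internal pacing, and it assumes header names have already been lowercased.

```python
import time


def seconds_until_retry(headers, now=None):
    """Return how many seconds to wait before the next request.

    `headers` is a dict of lowercased response header names to string values.
    `now` is the current epoch time in seconds (injectable for testing).
    """
    now = time.time() if now is None else now

    # Secondary rate limits may tell the client exactly how long to wait.
    if "retry-after" in headers:
        return float(headers["retry-after"])

    # Primary limit: only wait once the remaining budget is exhausted.
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0

    reset_at = int(headers.get("x-ratelimit-reset", "0"))
    return max(0.0, float(reset_at - now))
```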

Schema Mapping

GitHub field types are mapped to warehouse-compatible types:

| GitHub Type | Warehouse Type | Notes |
|---|---|---|
| string | VARCHAR | |
| integer | BIGINT | IDs, counts |
| boolean | BOOLEAN | |
| datetime | TIMESTAMP | UTC normalized from ISO 8601 |
| array | JSON / VARCHAR | Labels, assignees, requested reviewers |
| object | JSON / VARCHAR | Nested structures like user, head, base |
| null | NULL | Optional fields that may not be set |
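
The mapping above can be illustrated with a small conversion sketch. It is an assumption about one reasonable approach, not SignalSmith's code: convert_value and parse_github_datetime are hypothetical names, datetimes are normalized to UTC, and arrays and objects are serialized to JSON text (the form a VARCHAR fallback column would hold).

```python
import json
from datetime import datetime, timezone


def parse_github_datetime(value):
    """Parse an ISO 8601 timestamp and normalize it to UTC.

    The trailing "Z" is rewritten to "+00:00" so that older Python
    versions of datetime.fromisoformat accept it.
    """
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


def convert_value(github_type, value):
    """Convert a GitHub field value for loading, following the table above."""
    if value is None:
        return None  # optional fields stay NULL
    if github_type == "datetime":
        return parse_github_datetime(value)
    if github_type in ("array", "object"):
        return json.dumps(value, sort_keys=True)  # JSON text
    return value  # string, integer, and boolean pass through unchanged
```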

Troubleshooting

| Issue | Solution |
|---|---|
| “401 Bad credentials” | The token has expired or been revoked. Re-authenticate or regenerate the PAT |
| “403 API rate limit exceeded” | Wait for the rate limit to reset (check the X-RateLimit-Reset header) or reduce the sync frequency |
| “404 Not Found” on specific repos | The authenticated user may lack access to the repository. Verify permissions |
| Missing private repositories | Ensure the OAuth app or PAT has the repo scope for private repository access |
| Commits table is very large | Limit the commit history depth in the loader configuration, or sync only recent commits |
| Issues include pull requests | GitHub’s API returns pull requests as issues. SignalSmith separates them, but if you use the raw API, filter on the pull_request field |
| Workflow runs not appearing | Workflow data requires read access to GitHub Actions. The classic repo scope covers this; a fine-grained token also needs the Actions (Read) repository permission. Also verify that the repository uses GitHub Actions |

Next Steps

  • Create a model to transform your raw GitHub data
  • Build engineering metrics dashboards (cycle time, PR review time, deployment frequency)
  • Correlate product development activity with customer outcomes