GitHub Loader
The GitHub loader pulls repository, issue, pull request, and commit data from your GitHub organizations and repositories into your data warehouse. It uses the GitHub REST API and GraphQL API, and supports incremental sync via the updated_at timestamp for most resources.
Prerequisites
- A GitHub account with access to the repositories you want to sync
- Read access to the target repositories (public or private)
- A connected target warehouse with write permissions on the target schema
Authentication
The GitHub loader supports two authentication methods.
OAuth 2.0
- In SignalSmith, click Add Loader and select GitHub
- Choose OAuth 2.0 as the authentication method
- Click Connect with GitHub
- You’ll be redirected to GitHub’s authorization page
- Grant SignalSmith access to your repositories
- Select which organizations and repositories to authorize (if using GitHub Apps)
- You’ll be redirected back to SignalSmith with the connection established
SignalSmith requests the following scopes:
| Scope | Purpose |
|---|---|
| repo | Read access to repositories, issues, pull requests, and commits |
| read:org | Read organization membership and teams |
Personal Access Token (PAT)
For environments where OAuth isn’t practical, you can use a GitHub Personal Access Token:
- In GitHub, go to Settings > Developer settings > Personal access tokens > Fine-grained tokens
- Click Generate new token
- Set the token name (e.g., “SignalSmith Loader”)
- Under Repository access, select the repositories you want to sync
- Under Permissions, grant:
  - Repository permissions: Issues (Read), Pull requests (Read), Contents (Read), Metadata (Read)
  - Organization permissions: Members (Read), if syncing org data
- Click Generate token and copy it
- In SignalSmith, paste the token into the Personal Access Token field
Classic personal access tokens also work. Grant the repo and read:org scopes.
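Before pasting a token into SignalSmith, it can help to confirm that it authenticates. The following is a minimal sketch, not part of SignalSmith: it builds the headers GitHub's REST API expects and, for classic tokens, compares the scopes reported in the X-OAuth-Scopes response header with the two scopes listed above. The function names are hypothetical, and the sketch assumes fine-grained tokens may not report scopes this way.

```python
from __future__ import annotations

import json
import urllib.request

API_ROOT = "https://api.github.com"
REQUIRED_CLASSIC_SCOPES = {"repo", "read:org"}


def build_headers(token: str) -> dict[str, str]:
    """Headers for an authenticated GitHub REST API request."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }


def missing_scopes(scopes_header: str | None) -> set[str]:
    """Return required classic scopes absent from an X-OAuth-Scopes value.

    This is a literal comparison: a broader scope that implies a required
    one (for example admin:org) is still reported as missing. A missing
    header (None) returns every required scope, which should be read as
    "cannot verify" rather than "invalid token".
    """
    granted = {s.strip() for s in (scopes_header or "").split(",") if s.strip()}
    return REQUIRED_CLASSIC_SCOPES - granted


def check_token(token: str) -> tuple[str, set[str]]:
    """Call GET /user and return (login, missing scopes). Requires network access."""
    request = urllib.request.Request(f"{API_ROOT}/user", headers=build_headers(token))
    with urllib.request.urlopen(request, timeout=10) as response:
        login = json.load(response)["login"]
        return login, missing_scopes(response.headers.get("X-OAuth-Scopes"))
```

A 401 response from `check_token` (raised as `urllib.error.HTTPError`) indicates that the token is expired, revoked, or mistyped.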
Repository Selection
After authentication, SignalSmith discovers accessible organizations and repositories. You can select:
- Entire organizations — Sync all repositories in the organization
- Individual repositories — Select specific repositories to sync
Available Objects
| Object | API Name | Description | Default Sync Mode |
|---|---|---|---|
| Repositories | repos | Repository metadata, settings, and statistics | Incremental |
| Issues | issues | Issues with labels, assignees, milestones, and state | Incremental |
| Pull Requests | pull_requests | Pull requests with review status, merge state, and branch info | Incremental |
| Commits | commits | Commit history with author, message, and file changes | Incremental |
| Comments | comments | Comments on issues and pull requests | Incremental |
| Reviews | reviews | Pull request review records with state and body | Incremental |
| Labels | labels | Label definitions per repository | Full Refresh |
| Milestones | milestones | Milestone definitions with due dates and progress | Full Refresh |
| Releases | releases | Release tags with notes and assets | Incremental |
| Teams | teams | Organization team definitions and membership | Full Refresh |
| Members | members | Organization member profiles | Full Refresh |
| Workflows | workflows | GitHub Actions workflow definitions | Full Refresh |
| Workflow Runs | workflow_runs | GitHub Actions workflow execution history | Incremental |
Issues vs. Pull Requests
In GitHub’s API, every pull request is also an issue, so pull requests are a subset of issues and the issues endpoints return both. SignalSmith extracts them as separate objects to simplify querying: issues that are also pull requests land in the pull_requests table, not the issues table.
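If you work with the raw API yourself, the same separation can be sketched as follows. This is an illustration rather than SignalSmith's implementation; it relies on the fact that items returned by GitHub's REST issues endpoints carry a pull_request key when they are pull requests. The function name and sample records are made up for the example.

```python
def split_issues_and_pull_requests(items):
    """Partition raw items from the issues API into (issues, pull_requests)."""
    issues, pull_requests = [], []
    for item in items:
        # Only pull requests carry a "pull_request" key in the issues API payload.
        if "pull_request" in item:
            pull_requests.append(item)
        else:
            issues.append(item)
    return issues, pull_requests


raw_items = [
    {"number": 101, "title": "Crash on startup"},
    {"number": 102, "title": "Fix crash", "pull_request": {"url": "..."}},
]
issues, pull_requests = split_issues_and_pull_requests(raw_items)
print([i["number"] for i in issues])         # [101]
print([p["number"] for p in pull_requests])  # [102]
```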
Commits
Commits are extracted per repository and include:
- Commit SHA, message, and timestamp
- Author and committer information
- File change statistics (additions, deletions, changed files count)
- Parent commit SHAs (for merge detection)
For large repositories with extensive history, the initial backfill may be limited to a configurable number of recent commits (default: last 10,000).
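As a rough sketch of how the parent SHAs and the backfill cap can be used, the example below flags merge commits (more than one parent) and trims a commit list to the most recent entries. It assumes commits arrive newest first and that each record has a parents list shaped like GitHub's REST payload; the function names are hypothetical and the logic is not SignalSmith's own.

```python
def is_merge_commit(commit: dict) -> bool:
    """A commit with more than one parent is a merge commit."""
    return len(commit.get("parents", [])) > 1


def limit_backfill(commits: list, max_commits: int = 10_000) -> list:
    """Keep only the most recent commits, assuming newest-first ordering."""
    return commits[:max_commits]


history = [
    {"sha": "c3", "parents": [{"sha": "c1"}, {"sha": "c2"}]},  # merge of c1 and c2
    {"sha": "c2", "parents": [{"sha": "c0"}]},
    {"sha": "c1", "parents": [{"sha": "c0"}]},
]
recent = limit_backfill(history, max_commits=2)
print([c["sha"] for c in recent])                    # ['c3', 'c2']
print([c["sha"] for c in recent if is_merge_commit(c)])  # ['c3']
```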
Configuration
| Setting | Description | Default |
|---|---|---|
| Auth Method | OAuth 2.0 or Personal Access Token | OAuth 2.0 |
| Organizations | Organizations to sync repositories from | — (optional) |
| Repositories | Specific repositories to sync (overrides org selection) | — (optional) |
| Objects | List of objects to sync | — (you choose during setup) |
| Sync Mode | Full Refresh or Incremental (per object) | Incremental |
| Cursor Field | Field used for incremental sync | updated_at |
| Primary Key | Field(s) that uniquely identify a record | id |
| Target Schema | Warehouse schema for GitHub tables | — (required) |
| Table Prefix | Optional prefix for table names | gh_ |
| Schedule | Sync frequency | Hourly |
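To illustrate how the Cursor Field and Primary Key settings interact during an incremental sync, here is a minimal sketch. It assumes an in-memory dict standing in for the warehouse table and GitHub-style UTC timestamps such as 2024-05-01T12:00:00Z; the function names are hypothetical and the actual loader may behave differently.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a GitHub-style UTC timestamp, e.g. 2024-05-01T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def apply_incremental_batch(table: dict, records: list, cursor=None):
    """Upsert records at or after the cursor, keyed by id; return the new cursor.

    Records whose updated_at equals the cursor are re-read on purpose:
    the upsert by primary key makes that idempotent, and it avoids missing
    rows that share a timestamp with the previous high-water mark.
    """
    new_cursor = cursor
    for record in records:
        updated = parse_ts(record["updated_at"])
        if cursor is not None and updated < cursor:
            continue  # older than the last sync, already loaded
        table[record["id"]] = record  # insert or overwrite by primary key
        if new_cursor is None or updated > new_cursor:
            new_cursor = updated
    return new_cursor
```

A first run with `cursor=None` loads everything; later runs pass the returned cursor back in so only changed records are written.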
Scheduling Notes
- Rate limits: GitHub’s REST API allows 5,000 requests per hour for authenticated users. The GraphQL API has a separate point-based limit of 5,000 points per hour. SignalSmith manages both limits with automatic backoff.
- Secondary rate limits: GitHub also enforces secondary (abuse) rate limits for rapid consecutive requests. SignalSmith paces requests to avoid these limits.
- Large repositories: Repositories with hundreds of thousands of issues or commits may require multiple sync runs to complete the initial backfill. SignalSmith uses pagination cursors to resume across runs.
- GitHub Enterprise: For GitHub Enterprise Server deployments, provide your GitHub Enterprise base URL instead of github.com. The API endpoints are the same but hosted on your domain.
- Fork repositories: By default, forked repositories are excluded from organization syncs. Enable “Include Forks” if you need fork data.
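The rate-limit handling described above can be sketched as a small function that decides how long to wait before retrying a request. This is an assumption-laden illustration, not SignalSmith's backoff logic: it follows GitHub's published guidance (honor Retry-After, otherwise wait until X-RateLimit-Reset when no requests remain, otherwise wait about a minute), and the function name and 60-second fallback are choices made for the example.

```python
import time


def seconds_until_retry(status: int, headers: dict, now: float = None) -> float:
    """Return how many seconds to wait before retrying, or 0.0 if no wait applies."""
    now = time.time() if now is None else now
    if status not in (403, 429):
        return 0.0
    # Header names are case-insensitive, so normalize the keys first.
    h = {k.lower(): v for k, v in headers.items()}
    if "retry-after" in h:
        # Secondary rate limit: the server states the wait in seconds.
        return float(h["retry-after"])
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        # Primary rate limit: the reset time is given in Unix epoch seconds.
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    # No guidance in the headers: fall back to a fixed one-minute wait.
    return 60.0
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to test without touching the clock.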
Schema Mapping
GitHub field types are mapped to warehouse-compatible types:
| GitHub Type | Warehouse Type | Notes |
|---|---|---|
| string | VARCHAR | |
| integer | BIGINT | IDs, counts |
| boolean | BOOLEAN | |
| datetime | TIMESTAMP | UTC normalized from ISO 8601 |
| array | JSON / VARCHAR | Labels, assignees, requested reviewers |
| object | JSON / VARCHAR | Nested structures like user, head, base |
| null | NULL | Optional fields that may not be set |
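The mapping in the table can be expressed as a small type-inference function over decoded JSON values. This is a sketch under stated assumptions: datetimes are recognized by a simple pattern for timestamps like 2024-05-01T12:00:00Z, arrays and objects are reported as JSON (a warehouse without a JSON type would store them as VARCHAR), and the function name is hypothetical.

```python
import re

# Matches GitHub-style UTC timestamps such as 2024-05-01T12:00:00Z.
ISO_8601_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def warehouse_type(value) -> str:
    """Map a decoded JSON value from the GitHub API to a warehouse type name."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, str):
        return "TIMESTAMP" if ISO_8601_UTC.match(value) else "VARCHAR"
    if isinstance(value, (list, dict)):
        return "JSON"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


record = {
    "id": 42,
    "title": "Fix crash",
    "locked": False,
    "created_at": "2024-05-01T12:00:00Z",
    "labels": [{"name": "bug"}],
    "milestone": None,
}
print({field: warehouse_type(value) for field, value in record.items()})
```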
Troubleshooting
| Issue | Solution |
|---|---|
| “401 Bad credentials” | Token has expired or been revoked. Re-authenticate or regenerate the PAT |
| “403 API rate limit exceeded” | Wait for the rate limit to reset (check X-RateLimit-Reset header) or reduce sync frequency |
| “404 Not Found” on specific repos | The authenticated user may lack access to the repository. Verify permissions |
| Missing private repositories | Ensure the OAuth app or PAT has the repo scope for private repository access |
| Commits table is very large | Limit the commit history depth in the loader configuration, or sync only recent commits |
| Issues include pull requests | GitHub’s API returns pull requests as issues. SignalSmith separates them, but if you query the raw API, treat any item that has a pull_request field as a pull request |
| Workflow runs not appearing | For OAuth and classic tokens, GitHub Actions data is covered by the repo scope. Fine-grained tokens need the Actions (Read) repository permission. Also verify the repository uses GitHub Actions |
Next Steps
- Create a model to transform your raw GitHub data
- Build engineering metrics dashboards (cycle time, PR review time, deployment frequency)
- Correlate product development activity with customer outcomes