Extract insights from GitHub using PyAirbyte


In this demo, we use PyAirbyte to extract data from GitHub, then run a series of transformations and analyses to derive meaningful insights from that data. In particular, we demonstrate PyAirbyte's ability to extract data incrementally.

Prerequisites:

  • A GitHub personal access token. For details on configuring authentication credentials, refer to the GitHub source connector documentation.
# Add virtual environment support in Google Colab
!apt-get install -qq python3.10-venv

# Install PyAirbyte (the `airbyte` package)
%pip install --quiet airbyte

Load source data from GitHub to local cache

In this section, we establish a connection to GitHub via PyAirbyte. The source connector is configured with the necessary parameters, such as the credentials and the repository we are interested in. Check the docs for more details on these parameters.

After configuring the source connector, we perform a check() to ensure that the configuration is correct and the connection to the API is successful. Then, we list the available streams for this source and select the ones we are interested in syncing.

Finally, we read from the source into the internal DuckDB cache.

Note: The credentials are retrieved securely using the get_secret() method. This will automatically locate a matching Google Colab secret or environment variable, ensuring they are not hard-coded into the notebook. Make sure to add your key to the Secrets section on the left.
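If you are not running in Colab, one option is to expose the token as an environment variable before creating the source, so that `get_secret()` can find it there. A minimal sketch (the token value below is a placeholder, not a real credential):

```python
import os

# Placeholder token; substitute your own GitHub personal access token.
# ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN") falls back to environment
# variables when no matching Colab secret exists.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "ghp_your_token_here"
```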

import airbyte as ab

# Create and configure the source:
source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/quickstarts"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)

# Verify the config and creds by running `check`:
source.check()

Installing 'source-github' into virtual environment '/content/.venv-source-github'.
Running 'pip install airbyte-source-github'...

Connector 'source-github' installed successfully!
For more information, see the source-github documentation:
https://docs.airbyte.com/integrations/sources/github#reference

Connection check succeeded for `source-github`.

# List the streams available for the GitHub source
source.get_available_streams()

['issue_timeline_events',
'assignees',
'branches',
'collaborators',
'comments',
'commit_comment_reactions',
'commit_comments',
'commits',
'contributor_activity',
'deployments',
'events',
'issue_comment_reactions',
'issue_events',
'issue_labels',
'issue_milestones',
'issue_reactions',
'issues',
'organizations',
'project_cards',
'project_columns',
'projects',
'pull_request_comment_reactions',
'pull_request_commits',
'pull_request_stats',
'projects_v2',
'pull_requests',
'releases',
'repositories',
'review_comments',
'reviews',
'stargazers',
'tags',
'teams',
'team_members',
'users',
'workflows',
'workflow_runs',
'workflow_jobs',
'team_memberships']

# Select the streams we are interested in loading to cache
source.set_streams(["pull_requests", "issues", "reviews", "stargazers"])

# Read into DuckDB local default cache
cache = ab.get_default_cache()
result = source.read(cache=cache)

Read Progress

Started reading at 22:16:25.

Read 314 records over 6 seconds (52.3 records / second).

Wrote 314 records over 4 batches.

Finished reading at 22:16:32.

Started finalizing streams at 22:16:32.

Finalized 4 batches over 2 seconds.

Completed 4 out of 4 streams:

  • reviews
  • stargazers
  • pull_requests
  • issues

Completed writing at 22:16:34. Total time elapsed: 8 seconds

Completed `source-github` read operation at 22:16:34.

Read again to sync changes

The PyAirbyte GitHub source connector reads data incrementally by default, meaning that after the first successful data sync, only updates and new records are synced in subsequent reads.

For more information on sync modes for this source, you can refer to the docs.
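Conceptually, an incremental sync saves a per-stream cursor value, such as the latest `updated_at` seen, and on the next read keeps only records beyond that cursor. The toy snippet below is an illustration of that idea, not PyAirbyte internals; PyAirbyte manages this state for you in the cache.

```python
# Toy sketch of incremental reading (illustration only).
records = [
    {"id": 1, "updated_at": "2023-10-01T00:00:00Z"},
    {"id": 2, "updated_at": "2023-10-20T12:00:00Z"},
]

cursor = "2023-10-15T00:00:00Z"  # saved after the previous sync

# Only records newer than the cursor are emitted on this read.
# ISO 8601 timestamps compare correctly as strings.
new_records = [r for r in records if r["updated_at"] > cursor]

# Advance the cursor so the next read skips these records too.
if new_records:
    cursor = max(r["updated_at"] for r in new_records)
```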

Let's read again and see that no new records are loaded into the cache.

result = source.read(cache=cache) 

Read Progress

Started reading at 22:17:30.

Read 0 records over 3 seconds (0.0 records / second).

Finished reading at 22:17:33.

Started finalizing streams at 22:17:33.

Finalized 0 batches over 0 seconds.

Completed 4 out of 4 streams:

  • reviews
  • stargazers
  • pull_requests
  • issues

Completed writing at 22:17:33. Total time elapsed: 3 seconds

Completed `source-github` read operation at 22:17:33.

Read data from the cache

Read from the already-written DuckDB cache into a pandas DataFrame. Once the data is in the cache, you can read it without re-configuring or re-creating the source object. You can also select a specific stream to read from.


# Read from the cache into pandas DataFrames:
reviews = cache["reviews"].to_pandas()
stargazers = cache["stargazers"].to_pandas()
pull_requests = cache["pull_requests"].to_pandas()
issues = cache["issues"].to_pandas()

Transform and analyze the data

Let's take our GitHub data and run some analysis with pandas and matplotlib.


reviews.head(3)

(Output: the first three rows of the reviews DataFrame, with columns including repository, id, state, _links, submitted_at, created_at, updated_at, commit_id, and author_association.)
stargazers.head(3) 
  repository             user_id    starred_at           user
0 airbytehq/quickstarts  95236817   2023-10-02 16:40:52  {'avatar_url': https://avatars.githubuserconte...
1 airbytehq/quickstarts  123734227  2023-10-17 10:50:04  {'avatar_url': https://avatars.githubuserconte...
2 airbytehq/quickstarts  70362748   2023-10-31 02:20:26  {'avatar_url': https://avatars.githubuserconte...
pull_requests.head(3) 
(Output: the first three rows of the pull_requests DataFrame, with columns including repository, url, id, node_id, html_url, diff_url, patch_url, issue_url, head, base, _links, author_association, and draft.)
issues.head(3)
(Output: the first three rows of the issues DataFrame, with columns including repository, id, node_id, url, html_url, number, closed_at, created_at, updated_at, author_association, and state_reason.)

Pull Requests Trend Over Time

This will visualize the number of pull requests created over time.

import pandas as pd
import matplotlib.pyplot as plt

# Assuming 'pull_requests' dataframe is available and has a 'created_at' column in datetime format
pull_requests["created_at"] = pd.to_datetime(pull_requests["created_at"])
pr_trend = pull_requests.resample("M", on="created_at").size()

plt.figure(figsize=(12, 6))
pr_trend.plot(kind="line", color="blue", marker="o")
plt.title("Monthly Pull Requests Trend")
plt.xlabel("Month")
plt.ylabel("Number of Pull Requests")
plt.grid(True)
plt.show()

Open vs Closed Issues

Compare the number of open versus closed issues.

# Assuming 'issues' dataframe has a 'state' column
issue_status_count = issues["state"].value_counts()

plt.figure(figsize=(8, 6))
issue_status_count.plot(kind="bar", color=["blue", "green"])
plt.title("Open vs Closed Issues")
plt.xlabel("State")
plt.ylabel("Number of Issues")
plt.xticks(rotation=0)
plt.show()

Star Growth Over Time

Plot the cumulative count of stars over time.

# Assuming 'stargazers' dataframe has a 'starred_at' column in datetime format
stargazers["starred_at"] = pd.to_datetime(stargazers["starred_at"])
star_growth = stargazers.resample("M", on="starred_at").size().cumsum()

plt.figure(figsize=(12, 6))
star_growth.plot(kind="line", color="gold", marker="*")
plt.title("Cumulative Star Growth Over Time")
plt.xlabel("Month")
plt.ylabel("Cumulative Stars")
plt.grid(True)
plt.show()


Issue Resolution Followed by Pull Requests

This analysis checks whether issue resolutions tend to be followed by pull requests, which might indicate active development in response to issues.

# Ensure created_at is datetime, and drop the timezone from closed_at
pull_requests["created_at"] = pd.to_datetime(pull_requests["created_at"])
issues["closed_at"] = pd.to_datetime(issues["closed_at"]).dt.tz_localize(None)

# Resample both dataframes to get monthly counts
monthly_closed_issues = issues.resample("M", on="closed_at").size()
monthly_created_prs = pull_requests.resample("M", on="created_at").size()

# Combine into a single dataframe
combined_issues_prs_df = pd.concat([monthly_closed_issues, monthly_created_prs], axis=1)
combined_issues_prs_df.columns = ["Monthly Closed Issues", "Monthly Created PRs"]
combined_issues_prs_df.plot(kind="line", figsize=(12, 6))
plt.title("Monthly Closed Issues vs Monthly Created Pull Requests")
plt.xlabel("Month")
plt.ylabel("Count")
plt.grid(True)
plt.show()

