In this demo, we use PyAirbyte to extract data from Github, followed by a series of transformations and analyses to derive meaningful insights from this data. In particular, we demonstrate PyAirbyte capabilities for extracting data incrementally.
Prerequisites:
A Github personal access token. For details on configuring authentication credentials, refer to the Github source connector documentation.
# Add virtual environment support in Google Colab
!apt-get install -qq python3.10-venv
# Install airbyte
%pip install --quiet airbyte
Load source data from Github to local cache
In this section, we establish a connection to Github via PyAirbyte. The source connector is configured with the necessary parameters, such as the credentials and the repository we are interested in. Check the docs for more details on these parameters.
After configuring the source connector, we perform a check() to ensure that the configuration is correct and the connection to the API is successful. Then, we list the available streams for this source and select the ones we are interested in syncing.
Then, we proceed to read from the source into the internal DuckDB cache.
Note: The credentials are retrieved securely using the get_secret() method. This will automatically locate a matching Google Colab secret or environment variable, ensuring they are not hard-coded into the notebook. Make sure to add your key to the Secrets section on the left.
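If you are running this outside of Google Colab, you can provide the same value through an environment variable instead, and get_secret() will find it by name. A minimal sketch (the token string below is a placeholder, not a real credential):
import os
# Hypothetical setup: export the token as an environment variable before creating the source,
# so ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN") can locate it when no Colab secret exists.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "ghp_your_token_here"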
import airbyte as ab
# Create and configure the source:
source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/quickstarts"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)
# Verify the config and creds by running `check`:
source.check()
Installing 'source-github' into virtual environment '/content/.venv-source-github'. Running 'pip install airbyte-source-github'...
Connector 'source-github' installed successfully! For more information, see the source-github documentation: https://docs.airbyte.com/integrations/sources/github#reference
Connection check succeeded for `source-github`.
# List the streams available for the Github source
source.get_available_streams()
# Select the streams we are interested in loading to cache
source.set_streams(["pull_requests", "issues", "reviews", "stargazers"])
# Read into DuckDB local default cache
cache = ab.get_default_cache()
result = source.read(cache=cache)
Read Progress
Started reading at 22:16:25.
Read 314 records over 6 seconds (52.3 records / second).
Wrote 314 records over 4 batches.
Finished reading at 22:16:32.
Started finalizing streams at 22:16:32.
Finalized 4 batches over 2 seconds.
Completed 4 out of 4 streams:
reviews
stargazers
pull_requests
issues
Completed writing at 22:16:34. Total time elapsed: 8 seconds
Completed `source-github` read operation at 22:16:34.
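At this point you can sanity-check what landed in the cache. Each selected stream is exposed as a dataset on the cache object, so a quick record count might look like this (a small sketch using the same cache access pattern shown later in this demo):
# Count the records cached for each selected stream
for stream_name in ["pull_requests", "issues", "reviews", "stargazers"]:
    print(stream_name, len(cache[stream_name].to_pandas()))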
Read again to sync changes
The PyAirbyte Github source connector reads data incrementally by default, meaning that after the first successful data sync, only new and updated records will be synced in subsequent reads.
For more information on sync modes for this source, you can refer to the docs.
Let's read again and see that no new records are loaded into the cache.
result = source.read(cache=cache)
Read Progress
Started reading at 22:17:30.
Read 0 records over 3 seconds (0.0 records / second).
Finished reading at 22:17:33.
Started finalizing streams at 22:17:33.
Finalized 0 batches over 0 seconds.
Completed 4 out of 4 streams:
reviews
stargazers
pull_requests
issues
Completed writing at 22:17:33. Total time elapsed: 3 seconds
Completed `source-github` read operation at 22:17:33.
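Incremental reads rely on state that PyAirbyte keeps alongside the cache. If you want to discard that state and pull everything again, the read call accepts a full-refresh flag; the sketch below assumes force_full_refresh is available in your PyAirbyte version:
# Re-read all records from scratch, ignoring previously saved incremental state
# (force_full_refresh is assumed to exist in your PyAirbyte version)
result_full = source.read(cache=cache, force_full_refresh=True)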
Read data from the cache
Read from the already-written DuckDB cache into a pandas DataFrame. Once the data is in the cache, you can read it without re-configuring or re-creating the source object. You can also select a specific stream to read from.
# Read from the cache into a pandas DataFrame:
reviews = cache["reviews"].to_pandas()
stargazers = cache["stargazers"].to_pandas()
pull_requests = cache["pull_requests"].to_pandas()
issues = cache["issues"].to_pandas()
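A quick row count per DataFrame is an easy way to confirm everything loaded as expected, for example:
# Print the number of rows loaded for each stream
for name, df in [("reviews", reviews), ("stargazers", stargazers), ("pull_requests", pull_requests), ("issues", issues)]:
    print(f"{name}: {len(df)} rows")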
Transform and analyze the data
Let's take our Github data and run some analysis with pandas and matplotlib.
This will visualize the number of pull requests created over time.
import pandas as pd
import matplotlib.pyplot as plt
# Assuming 'pull_requests' dataframe is available and has a 'created_at' column in datetime format
pull_requests["created_at"] = pd.to_datetime(pull_requests["created_at"])
pr_trend = pull_requests.resample("M", on="created_at").size()
plt.figure(figsize=(12, 6))
pr_trend.plot(kind="line", color="blue", marker="o")
plt.title("Monthly Pull Requests Trend")
plt.xlabel("Month")
plt.ylabel("Number of Pull Requests")
plt.grid(True)
plt.show()
Open vs Closed Issues
Compare the number of open versus closed issues.
# Assuming 'issues' dataframe has a 'state' column
issue_status_count = issues["state"].value_counts()
plt.figure(figsize=(8, 6))
issue_status_count.plot(kind="bar", color=["blue", "green"])
plt.title("Open vs Closed Issues")
plt.xlabel("State")
plt.ylabel("Number of Issues")
plt.xticks(rotation=0)
plt.show()
Star Growth Over Time
Plot the cumulative count of stars over time.
# Assuming 'stargazers' dataframe has a 'starred_at' column in datetime format
stargazers["starred_at"] = pd.to_datetime(stargazers["starred_at"])
star_growth = stargazers.resample("M", on="starred_at").size().cumsum()
plt.figure(figsize=(12, 6))
star_growth.plot(kind="line", color="gold", marker="*")
plt.title("Cumulative Star Growth Over Time")
plt.xlabel("Month")
plt.ylabel("Cumulative Stars")
plt.grid(True)
plt.show()
Issue Resolution Followed by Pull Requests
This analysis checks if there's a trend in issue resolutions followed by pull requests, which might indicate active development in response to issues.
# Make sure created_at and closed_at are datetimes and timezone-naive,
# so the monthly indexes line up when we combine them below
pull_requests["created_at"] = pd.to_datetime(pull_requests["created_at"], utc=True).dt.tz_localize(None)
issues["closed_at"] = pd.to_datetime(issues["closed_at"], utc=True).dt.tz_localize(None)
# Resample both dataframes to get monthly counts
monthly_closed_issues = issues.resample("M", on="closed_at").size()
monthly_created_prs = pull_requests.resample("M", on="created_at").size()
# Combine into a single dataframe
combined_issues_prs_df = pd.concat([monthly_closed_issues, monthly_created_prs], axis=1)
combined_issues_prs_df.columns = ["Monthly Closed Issues", "Monthly Created PRs"]
combined_issues_prs_df.plot(kind="line", figsize=(12, 6))
plt.title("Monthly Closed Issues vs Monthly Created Pull Requests")
plt.xlabel("Month")
plt.ylabel("Count")
plt.grid(True)
plt.show()
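If you want to put a rough number on that relationship, one simple option (not part of the original analysis) is the correlation between the two monthly series:
# Pearson correlation between monthly closed issues and monthly created PRs
# (months missing from either series are treated as zero activity)
correlation = combined_issues_prs_df.fillna(0).corr().iloc[0, 1]
print(f"Correlation between closed issues and created PRs: {correlation:.2f}")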