End-to-end RAG using GitHub, PyAirbyte, and Weaviate

Learn how to load data from GitHub into Weaviate using PyAirbyte. This recipe uses the source-github connector and its 'issues' stream, but you can swap in any source you need.

In this notebook, we'll illustrate how to load data from GitHub into Weaviate using PyAirbyte and then retrieve it with Weaviate's generative search. We use source-github and its 'issues' stream here, but you can replace the source according to your requirements.

Prerequisites

  1. GitHub:
    • Personal Access Token: Create a GitHub personal access token with access to the repository you want to load. For detailed instructions, refer to the GitHub documentation.
  2. Weaviate Account:
    • Create a Weaviate Account: Sign up for an account on the Weaviate website.
    • Create a Cluster: Follow the Weaviate instructions to create a cluster and obtain the Weaviate API key and URL.
  3. OpenAI API Key:
    • Create an OpenAI Account: Sign up for an account on OpenAI.
    • Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.

Install PyAirbyte and other dependencies

# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv

# First, we need to install the necessary libraries.
!pip3 install airbyte "weaviate-client<4" python-dotenv  # this notebook uses the v3 weaviate-client API


Set up the GitHub source

Note: The credentials are retrieved securely using the get_secret() method. This will automatically locate a matching Google Colab secret or environment variable, ensuring they are not hard-coded into the notebook. Make sure to add your key to the Secrets section on the left.
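
If you're running outside Google Colab, one option is to keep the secrets in a local .env file and load them as environment variables before calling get_secret(), which falls back to environment variables when no Colab secret is found. A minimal sketch, assuming a .env file that defines the variable names used in this notebook (python-dotenv was installed above):

# Load secrets from a local .env file into environment variables.
# Assumes the .env file contains GITHUB_REPOSITORY, GITHUB_ACCESS_TOKEN, WCD_URL, WCD_API_KEY, and OPENAI_API_KEY.
from dotenv import load_dotenv

load_dotenv()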

import airbyte as ab

source = ab.get_source(
    "source-github",
    config={
        "repositories": ab.get_secret('GITHUB_REPOSITORY'),
        "credentials": {
            "personal_access_token": ab.get_secret('GITHUB_ACCESS_TOKEN'),
        },
    },
)
source.check()


Next, read the data from the selected issues stream to extract the GitHub issues for further processing.

# In this notebook we are focused on only the issues stream
# check out all streams here: https://docs.airbyte.com/integrations/sources/github#supported-streams

print(source.get_available_streams())
source.select_streams(["issues"])
cache = ab.get_default_cache()
result = source.read(cache=cache, force_full_refresh=True)
issues_details = result['issues'].to_pandas()  # converting data from the issues stream to a pandas DataFrame

print(issues_details.columns)
print(issues_details.iloc[10])  # inspect a single issue record
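
Depending on your use case, you may want to embed only the most relevant fields rather than the whole record. A small sketch, assuming the issues stream exposes 'title' and 'body' columns (check issues_details.columns for the exact names in your sync):

# Combine the issue title and body into a single text field for embedding.
# Column names are assumptions based on the GitHub issues stream; verify them against issues_details.columns.
issues_details["issue_text"] = (
    issues_details["title"].fillna("") + "\n\n" + issues_details["body"].fillna("")
)
print(issues_details["issue_text"].iloc[0])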


Setting up Weaviate

Connect to the Weaviate instance using your Weaviate cluster URL and API key.

import weaviate

client = weaviate.Client(
    url = ab.get_secret('WCD_URL'),  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=ab.get_secret('WCD_API_KEY')),  # Replace with your Weaviate instance API key
    additional_headers = {
        "X-OpenAI-Api-Key": ab.get_secret('OPENAI_API_KEY')  # Replace with your Openai API key
    }
)

Weaviate stores data in collections. Each data object in a collection has a set of properties and a vector representation.

collection_name = "issues" # name of collection
class_obj = {
    "class": collection_name,
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}  # Ensure the `generative-openai` module is used for generative queries
    }
}

client.schema.create_class(class_obj)
# Batch imports are an efficient way to add multiple data objects and cross-references.
client.batch.configure(batch_size=100)  # Configure batch

#The following example adds objects to the collection.
with client.batch as batch:  # Initialize a batch process
    for _, row in issues_details.iterrows():  # Batch import data row by row
        properties = {
            "issue_details": str(row.to_dict()),  # You can also rename this property or add multiple properties
        }
        batch.add_data_object(
            data_object=properties,
            class_name=collection_name
        )
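
As a quick sanity check, you can count how many objects landed in the collection. A small sketch using the v3 client's aggregate query (note that Weaviate stores the class name capitalized as "Issues"):

# Count the objects imported into the collection as a sanity check.
count = (
    client.query
    .aggregate("Issues")
    .with_meta_count()
    .do()
)
print(count["data"]["Aggregate"]["Issues"][0]["meta"]["count"])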

Weaviate has integrated generative search capabilities, so that the retrieval and generation steps are combined into a single query. This means that you can use Weaviate's search capabilities to retrieve the data you need, and then in the same query, prompt the LLM with the same data.

This makes it easier, faster and more efficient to implement generative search workflows in your application.

You can check out more ways to query in the Weaviate documentation.

response = (
    client.query
    .get(class_name=collection_name, properties=["issue_details"])
    .with_near_text({"concepts": ["title","comments"]})
    .with_generate(single_prompt="Use {issue_details}. Give me a summary of Pagination Handling in GitHub Connector issues in the airbytehq/quickstarts repository")  # remember to reference the property, e.g. {issue_details}, in the prompt
    .with_limit(1)
    .do()
)

print(response["data"]["Get"]["Issues"][0]["_additional"]["generate"]["singleResult"])
Sample output:

Summary of Pagination Handling in Github Connector issues in airbytehq/quickstarts repository:

- Pagination handling in the Github Connector issues involves retrieving a limited number of issues at a time from the Github API and then using pagination to fetch the next set of issues.
- This ensures that large datasets of issues can be efficiently retrieved without overwhelming the API or the system.
- The Github Connector in the airbytehq/quickstarts repository likely implements pagination logic to handle the retrieval of issues in a systematic and efficient manner.
- Pagination parameters such as page number and page size are typically used to control the retrieval of issues in batches.
- Proper pagination handling is crucial for managing large volumes of data and ensuring smooth and efficient data retrieval from the Github API.
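
Beyond single_prompt, the v3 client also supports grouped generative queries, where all retrieved objects are passed to the LLM together. A minimal sketch (the search concept and prompt below are just examples):

# Grouped generative query: the LLM receives all retrieved objects in one prompt.
grouped_response = (
    client.query
    .get(class_name=collection_name, properties=["issue_details"])
    .with_near_text({"concepts": ["pagination"]})
    .with_generate(grouped_task="Summarize the common themes across these issues in three bullet points.")
    .with_limit(5)
    .do()
)
# The grouped result is attached to the first returned object.
print(grouped_response["data"]["Get"]["Issues"][0]["_additional"]["generate"]["groupedResult"])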
