End-to-end RAG using GitHub, PyAirbyte and Chroma Vector DB

Learn how to set up a RAG pipeline that extracts data from GitHub with PyAirbyte, stores it in Chroma, and uses LangChain to perform RAG on the stored data.


This notebook illustrates the complete setup of a Retrieval-Augmented Generation (RAG) pipeline.
We extract data from a GitHub repository using PyAirbyte, store the data in a Chroma vector store, and use LangChain to perform RAG on the stored data.

Prerequisites

1) OpenAI API Key:

  • Create an OpenAI account: Sign up for an account on OpenAI.
  • Generate an API key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.

2) GitHub Personal Access Token:

  • Create a GitHub account: Sign up for an account on GitHub.
  • Generate a personal access token: Click on your profile icon -> Settings -> Developer Settings and generate a new token. For detailed instructions, refer to the GitHub documentation.
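A common way to keep these credentials out of your code is a local .env file loaded with python-dotenv (installed in the next step). This is a minimal sketch; the variable names below are our own choice:

# .env -- a local secrets file (never commit it to version control):
#   OPENAI_API_KEY=sk-...
#   GITHUB_PERSONAL_ACCESS_TOKEN=ghp_...

from dotenv import load_dotenv
import os

load_dotenv()  # read .env into the process environment
github_token = os.getenv("GITHUB_PERSONAL_ACCESS_TOKEN")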

Installing Dependencies

First things first!
Let's get the dependencies installed before anything else.

# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv

# First, we need to install the necessary libraries.
!pip3 install airbyte langchain langchain-openai chromadb python-dotenv langchainhub langchain-chroma
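Optionally, confirm the key packages installed cleanly before moving on:

# Optional sanity check: print the installed package versions.
from importlib.metadata import version

for pkg in ("airbyte", "langchain", "chromadb"):
    print(pkg, version(pkg))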

Source Setup: GitHub with PyAirbyte

The code below configures an Airbyte source to pull data from a GitHub repository.

You can customize the configuration to your own needs; see the source-github connector documentation for the full configuration reference.

Note that here we only fetch data from the commits stream. To learn about all the available streams, refer to the GitHub source documentation.

import airbyte as ab

# Configure the GitHub source connector with your personal access token
# and the repository to read from.
source = ab.get_source(
    "source-github",
    config={
        "credentials": {
            "personal_access_token": "your_personal_access_token"
        },
        "repositories": ["your_github_username/your_repository_name"]
    }
)
# Verify that the configuration and credentials are valid.
source.check()

# List the streams this connector exposes, then select only "commits".
print(source.get_available_streams())
source.select_streams(["commits"])

# Read the selected stream into PyAirbyte's default local cache (DuckDB).
cache = ab.get_default_cache()
result = source.read(cache=cache)

# Convert the cached commit records into LangChain documents.
commits_details = list(result["commits"].to_documents())

print(str(commits_details[0]))
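If you want to eyeball the raw records before working with the LangChain documents, PyAirbyte datasets can also be loaded as pandas DataFrames:

# Optional: inspect the raw commit records as a pandas DataFrame.
commits_df = result["commits"].to_pandas()
print(commits_df.shape)
print(list(commits_df.columns)[:10])  # first few column names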

Split Documents into Chunks

Large documents are split into smaller chunks to make them easier to handle. This also improves retrieval efficiency, since smaller chunks can be more relevant to specific queries.
Here we set each chunk to 512 characters, with adjacent chunks overlapping by 50 characters to preserve continuity of context.
The loop then converts all metadata values to strings to ensure consistent processing later in the pipeline.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each document into 512-character chunks with 50 characters of overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(commits_details)

# Chroma accepts only scalar metadata values (str, int, float, bool),
# so stringify every metadata value before indexing.
for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])
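As a quick optional check, you can confirm how many chunks the splitter produced and preview one:

# Optional: confirm the split worked and preview the first chunk.
print(f"{len(commits_details)} documents -> {len(chunked_docs)} chunks")
print(chunked_docs[0].page_content[:200])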

from langchain_openai import OpenAIEmbeddings
import os

# Load the OpenAI API key and create the embedding model used for
# both indexing and querying.
os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings()
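Optionally, embed a sample string to verify the model responds and to see the vector dimensionality:

# Optional: embed a sample string and check the vector size.
sample_vector = embeddings.embed_query("sample commit message")
print(len(sample_vector))  # e.g., 1536 for text-embedding-ada-002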

Setting up Chroma

Create and configure a Chroma vector store to hold the document embeddings:
First, we initialize a Chroma client.
Then, we create a Chroma vector store from the chunked documents.
Finally, we pass an embedding function when accessing the collection directly.
Since Chroma's hosted offering currently has a waitlist, we initialize the Chroma client in persistent mode, which stores the database in a local directory.


import chromadb
from langchain_chroma import Chroma
from chromadb.utils import embedding_functions

# Persistent client: the database is stored on disk under chroma_db/.
persist_directory = 'chroma_db'
client = chromadb.PersistentClient(path=persist_directory)
collection_name = "github_commits"

# Embed the chunked documents and write them into the collection.
openai_lc_client = Chroma.from_documents(
    documents=chunked_docs,
    embedding=embeddings,
    persist_directory=persist_directory,
    collection_name=collection_name
)

# To query the same collection through the native Chroma client,
# attach a matching OpenAI embedding function.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-ada-002"
)
collection = client.get_collection(name=collection_name, embedding_function=openai_ef)
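Before wiring up the full RAG chain, it is worth verifying retrieval with a direct similarity search against the vector store ("bug fix" below is just an arbitrary sample query):

# Optional: run a direct similarity search to verify the index.
hits = openai_lc_client.similarity_search("bug fix", k=3)
for hit in hits:
    print(hit.page_content[:120])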

Querying Chroma and RAG Pipeline

Finally, we use LangChain to retrieve documents from Chroma and generate responses with an OpenAI chat model.


from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Set up the retriever from the Chroma vector store
retriever = openai_lc_client.as_retriever()

# Set up the prompt
prompt = hub.pull("rlm/rag-prompt")

# Function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("Langchain RAG pipeline set up successfully.")

# Example query
response = rag_chain.invoke("Which are the commit messages of latest commits?")
print(response)
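By default, as_retriever() returns the top 4 chunks per query. If answers feel thin, you can widen the retrieval window via search_kwargs; k=8 below is an arbitrary example:

# Retrieve more chunks per query (k=8 is an arbitrary example value).
wide_retriever = openai_lc_client.as_retriever(search_kwargs={"k": 8})
wide_rag_chain = (
    {"context": wide_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(wide_rag_chain.invoke("Who are the most frequent commit authors?"))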
