PyAirbyte Demo
This demo uses the PyAirbyte library to read records from GitHub and convert them into documents, which can then be passed to LangChain for RAG.
Prerequisites:
- A GitHub personal access token. For details on configuring authentication credentials, refer to the GitHub source connector documentation.
- An OpenAI API key. You can create one by signing up at https://openai.com/ and opening the "Keys" tab in the left sidebar. (One way to supply both secrets outside Colab is sketched below.)
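If you're running outside Google Colab, one way to supply these secrets is via environment variables, which ab.get_secret() (used later in this demo) can resolve. A minimal sketch; the placeholder values are yours to fill in:
import os
# Hypothetical local setup: ab.get_secret() can read secrets from
# environment variables when Colab's secrets tab isn't available.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "<your GitHub personal access token>"
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"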
Install PyAirbyte and other dependencies
# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv
# Install PyAirbyte & Langchain modules
%pip install --quiet airbyte langchain langchain_openai langchainhub chromadb
Load the Source Data using PyAirbyte
import airbyte as ab
# Configure and read from the source
read_result = ab.get_source(
"source-github",
config={
"repositories": ["airbytehq/pyAirbyte"],
"credentials": {
"personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN")
}
},
streams=["issues"],
).read()
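As an aside, you can also build the source object separately to validate the configuration and list the streams the connector exposes before reading. A short sketch reusing the same config:
# Build the source separately to inspect it before calling read()
source = ab.get_source(
    "source-github",
    config={
        "repositories": ["airbytehq/PyAirbyte"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN")
        }
    },
)
source.check()  # Verify credentials and config against the connector
print(source.get_available_streams())  # e.g. "issues", "pull_requests", ...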
Read a single record from stream to examine the fields
first_record = next(iter(read_result["issues"]))
# Print the fields list, followed by the first full record.
display(list(first_record.keys()))
display(first_record)
Use PyAirbyte's to_documents() method on a dataset
This demo uses the new to_documents() method, which accepts record property names that map the record onto the different parts of a document (title, content, and metadata). When render_metadata=True is set, metadata properties are also rendered into the resulting markdown. This option is helpful for small-ish documents when passing the entire document to the LLM; it is less helpful for long documents that will be split into smaller chunks.
Note: We use rich to print the documents as markdown, although that's not strictly necessary.
from rich.console import Console
from rich.markdown import Markdown
# convert incoming stream data into documents
docs = list(read_result["issues"].to_documents(
title_property="title",
content_properties=["body"],
metadata_properties=["state", "url", "number"],
render_metadata=True,
))
# Print one document built from a GitHub issue, rendered as markdown
console = Console()
console.print(Markdown(str(docs[10])))
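The rendered markdown bundles the metadata into the text; the same values also remain available as structured data on each document. A small sketch, assuming the documents expose the metadata attribute that the LangChain splitter relies on below:
# The metadata_properties passed to to_documents() are kept as structured data
print(docs[10].metadata)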
Use LangChain to build a RAG pipeline.
Here we show a generic way of splitting the docs. This and the following steps are adapted from a standard LangChain RAG tutorial.
# Split the docs so they can be stored in vector database downstream
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)
print(f"Created {len(chunked_docs)} document chunks.")
Now we can publish the chunks to a vector store. Ensure you have added your OPENAI_API_KEY to the secrets tab on the left.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import os
os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
# store into vector db
vectorstore = Chroma.from_documents(documents=chunked_docs, embedding=OpenAIEmbeddings())
print("Chunks successfully stored in vectorstore.")
Set up a RAG application using LangChain.
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
Ask a question.
console.print(rag_chain.invoke("Show me all documentation related issues, along with issue number, each on a new line."))
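One caveat when iterating on this notebook: re-running Chroma.from_documents in the same session can accumulate duplicate chunks in the collection. A sketch of one way to reset the store between runs, using the wrapper's delete_collection method:
# Optional cleanup: drop the collection so reruns start from an empty store
vectorstore.delete_collection()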