PyAirbyte Demo
This demo uses the PyAirbyte library to read records from GitHub and convert them into documents, which can then be passed to LangChain for RAG.
Prerequisites:
- A GitHub personal access token. For details on configuring authentication credentials, refer to the GitHub source connector documentation.
- An OpenAI API key. You can create one by signing up at https://openai.com/ and opening the "Keys" tab in the left sidebar. (One way to supply both secrets outside Colab is sketched below.)
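If you're running outside Google Colab, one way to supply these secrets is via environment variables, which ab.get_secret() (used later in this demo) can resolve. A minimal sketch; the placeholder values are yours to fill in:
import os
# Hypothetical local setup: ab.get_secret() can read secrets from
# environment variables when Colab's secrets tab isn't available.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "<your GitHub personal access token>"
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"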
Install PyAirbyte and other dependencies
# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv
# Install PyAirbyte & Langchain modules
%pip install --quiet airbyte langchain langchain_openai langchainhub chromadb
Load the Source Data using PyAirbyte
import airbyte as ab
# Configure and read from the source
read_result = ab.get_source(
"source-github",
config={
"repositories": ["airbytehq/pyAirbyte"],
"credentials": {
"personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN")
}
},
streams=["issues"],
).read()
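As an aside, you can also build the source object separately to validate the configuration and list the streams the connector exposes before reading. A short sketch reusing the same config:
# Build the source separately to inspect it before calling read()
source = ab.get_source(
    "source-github",
    config={
        "repositories": ["airbytehq/PyAirbyte"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN")
        }
    },
)
source.check()  # Verify credentials and config against the connector
print(source.get_available_streams())  # e.g. "issues", "pull_requests", ...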
Read a single record from stream to examine the fields
first_record = next(iter(read_result["issues"]))
# Print the fields list, followed by the first full record.
display(list(first_record.keys()))
display(first_record)
Use PyAirbyte's to_documents() method on a dataset
This demo uses the new to_documents() method, which accepts record property names that map the record onto the different parts of a document (title, content, and metadata). When render_metadata=True is set, metadata properties are also rendered into the resulting markdown. This option is helpful for small-ish documents when passing the entire document to the LLM; it is less helpful for long documents that will be split into smaller chunks.
Note: We use rich to print the documents as markdown, although that's not strictly necessary.
from rich.console import Console
from rich.markdown import Markdown
# convert incoming stream data into documents
docs = list(read_result["issues"].to_documents(
title_property="title",
content_properties=["body"],
metadata_properties=["state", "url", "number"],
render_metadata=True,
))
# Print one document built from a GitHub issue, rendered as markdown
console = Console()
console.print(Markdown(str(docs[10])))
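The rendered markdown bundles the metadata into the text; the same values also remain available as structured data on each document. A small sketch, assuming the documents expose the metadata attribute that the LangChain splitter relies on below:
# The metadata_properties passed to to_documents() are kept as structured data
print(docs[10].metadata)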
Use LangChain to build a RAG pipeline.
Here we show a generic way of splitting the docs. This and the following steps are adapted from a standard LangChain RAG tutorial.
# Split the docs so they can be stored in vector database downstream
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)
print(f"Created {len(chunked_docs)} document chunks.")
Now we can publish the chunks to a vector store. Ensure you have added your OPENAI_API_KEY to the secrets tab on the left.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import os
os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
# store into vector db
vectorstore = Chroma.from_documents(documents=chunked_docs, embedding=OpenAIEmbeddings())
print("Chunks successfully stored in vectorstore.")
Set up a RAG application using LangChain.
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
Ask a question.
console.print(rag_chain.invoke("Show me all documentation related issues, along with issue number, each on a new line."))
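One caveat when iterating on this notebook: re-running Chroma.from_documents in the same session can accumulate duplicate chunks in the collection. A sketch of one way to reset the store between runs, using the wrapper's delete_collection method:
# Optional cleanup: drop the collection so reruns start from an empty store
vectorstore.delete_collection()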