This notebook illustrates the complete setup of a Retrieval-Augmented Generation (RAG) pipeline.
We extract data from a GitHub repository using PyAirbyte, store the data in a Chroma vector store, and use LangChain to perform RAG on the stored data.
Prerequisites
1) OpenAI API Key:
- Create an OpenAI Account: Sign up for an account on OpenAI.
- Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.
2) GitHub Personal Access Token:
- Create a GitHub Account: Sign up for an account on GitHub.
- Generate a Personal Access Token: Click on your profile icon -> Settings -> Developer Settings and generate a new personal access token. For detailed instructions, refer to the GitHub documentation. (See the sketch after this list for one way to keep both credentials out of the notebook.)
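Once you have both credentials, it is worth keeping them out of the notebook itself. Below is a minimal sketch that reads them interactively and exposes them as environment variables; the variable names are placeholders, and you should adapt this to your own setup.
import os
from getpass import getpass

# Prompt for the secrets instead of hard-coding them in the notebook.
# The environment variable names are placeholders; adjust them to your setup.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = getpass("GitHub personal access token: ")
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")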
Installing Dependencies
First things first!
Let's get the dependencies installed before anything else!
# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv
# First, we need to install the necessary libraries.
!pip3 install airbyte langchain langchain-openai chromadb python-dotenv langchainhub langchain-chroma
Source Setup: GitHub with PyAirbyte
The code below configures an Airbyte source to pull data from a GitHub repository.
You can also customize the configuration according to your own needs. See this
Note that here we only fetch data from the commits stream.
To learn about all the available streams, go here
import airbyte as ab
# Configure the GitHub source connector with your personal access token and target repository
source = ab.get_source(
    "source-github",
    config={
        "credentials": {
            "personal_access_token": "your_personal_access_token"
        },
        "repositories": ["your_github_username/your_repository_name"]
    }
)
source.check()  # Verify the connector configuration and credentials
source.get_available_streams()  # List all streams exposed by source-github
source.select_streams(["commits"])  # Sync only the commits stream
cache = ab.get_default_cache()  # Local DuckDB-backed cache
result = source.read(cache=cache)
commits_details = list(result["commits"].to_documents())  # Convert cached commit records to documents
print(str(commits_details[0]))
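If you would like to inspect the synced records in tabular form before building the pipeline, the cached stream can also be read as a pandas DataFrame. A quick sketch, assuming pandas is available in the environment:
# Optional: peek at the synced commits as a DataFrame (columns depend on the stream schema)
commits_df = cache["commits"].to_pandas()
print(commits_df.shape)
commits_df.head()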
Split Documents into Chunks
Large documents are split into smaller chunks to make them easier to handle. This also helps in improving the efficiency of the retrieval process, as smaller chunks can be more relevant to specific queries.
Here we set each chunk size to 512 characters, and adjacent chunks overlap by 50 characters to preserve continuity of context.
The loop that follows converts all metadata values to strings so they can be processed consistently later in the pipeline.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(commits_details)
for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])
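As a quick sanity check, you can look at how many chunks were produced and preview one of them:
# Inspect the chunking output
print(f"Number of chunks: {len(chunked_docs)}")
print(chunked_docs[0].page_content[:200])
print(chunked_docs[0].metadata)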
Creating Embeddings
We use OpenAI embeddings to encode each chunk before storing it in the vector store.
from langchain_openai import OpenAIEmbeddings
import os

# Load the OpenAI API key as a PyAirbyte secret and expose it as an environment variable
os.environ['OPENAI_API_KEY'] = ab.get_secret("YOUR_OPENAI_API_KEY")
embeddings = OpenAIEmbeddings()
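As a quick check that the API key is picked up correctly, you can embed a short test string (the text here is just an example):
# Embed a test string; the vector length depends on the embedding model
test_vector = embeddings.embed_query("hello world")
print(f"Embedding dimension: {len(test_vector)}")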
Setting up Chroma
Create and configure a Chroma vector store to store the document embeddings.
First, we initialize the Chroma client.
Then we create a Chroma vector store from the documents.
Finally, we use the embedding function when accessing the collection.
Since Chroma's hosted offering currently has a waitlist, we initialize the Chroma client in persistent mode (local file).
import chromadb
from langchain_chroma import Chroma
from chromadb.utils import embedding_functions
persist_directory = 'chroma_db'
client = chromadb.PersistentClient(path=persist_directory)
collection_name = "github_commits"
# Create the Chroma vector store from the chunked documents
openai_lc_client = Chroma.from_documents(
    documents=chunked_docs,
    embedding=embeddings,
    persist_directory=persist_directory,
    collection_name=collection_name
)

# Embedding function used when accessing the collection directly through the Chroma client
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-ada-002"
)
collection = client.get_collection(name=collection_name, embedding_function=openai_ef)
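Before wiring up the full RAG chain, you can verify the vector store directly with a similarity search (the query string below is just an example):
# Sanity check: count stored embeddings and retrieve the most similar chunks
print(f"Documents in collection: {collection.count()}")
sample_hits = openai_lc_client.similarity_search("recent bug fixes", k=2)
for hit in sample_hits:
    print(hit.page_content[:200])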
Querying Chroma and RAG Pipeline
Finally, we use LangChain to retrieve documents from Chroma and generate responses using an OpenAI chat model.
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Set up the retriever from the Chroma vector store
retriever = openai_lc_client.as_retriever()
# Set up the prompt
prompt = hub.pull("rlm/rag-prompt")
# Function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
# Example query
response = rag_chain.invoke("What are the commit messages of the latest commits?")
print(response)
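If you also want to see which chunks the model used, a common LangChain pattern is to return the retrieved documents alongside the answer. Below is an optional sketch of that variant using RunnableParallel; it reuses the prompt, retriever, and format_docs defined above.
from langchain_core.runnables import RunnableParallel

# Build the answer from documents that have already been retrieved
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

# Run retrieval once and return both the source documents and the answer
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

output = rag_chain_with_sources.invoke("What are the commit messages of the latest commits?")
print(output["answer"])
for doc in output["context"]:
    print(doc.metadata)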