PyAirbyte Demo

This demo uses the PyAirbyte library to read records from GitHub and convert those records to documents, which can then be passed to LangChain for RAG.
Prerequisites:

- A GitHub personal access token. For details on configuring authentication credentials, refer to the GitHub source connector documentation.
- An OpenAI API key. You can create one by signing up at https://openai.com/ and going to the "Keys" tab on the left sidebar.

Install PyAirbyte and other dependencies

    # Add virtual environment support for running in Google Colab
    !apt-get install -qq python3.10-venv

    # Install PyAirbyte & LangChain modules
    %pip install --quiet airbyte langchain langchain_openai langchainhub chromadb

Load the Source Data using PyAirbyte

    import airbyte as ab

    # Configure and read from the source
    read_result = ab.get_source(
        "source-github",
        config={
            "repositories": ["airbytehq/pyAirbyte"],
            "credentials": {
                "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN")
            },
        },
        streams=["issues"],
    ).read()

Read a single record from stream to examine the fields

    first_record = next((record for record in read_result["issues"]))

    # Print the fields list, followed by the first full record.
    display(list(first_record.keys()))
    display(first_record)

Use PyAirbyte to_documents() method on a dataset

This demo uses a new to_documents() method, which accepts record property names that point to specific aspects of the document: its title, its content, and its metadata.
When we set render_metadata=True, the metadata properties are also rendered into the markdown output. This option is helpful for smaller documents that are passed to the LLM in full. It is less helpful for long documents that will be split into smaller chunks.
Note: We use rich to print the documents as markdown, although that's not strictly necessary.
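To make the effect of render_metadata concrete, here is a minimal plain-Python sketch of rendering one record to markdown with and without its metadata. The helper render_document and the sample issue are hypothetical, written only for illustration; this is not PyAirbyte's implementation, and the markdown that to_documents() actually produces may be laid out differently.

```python
def render_document(record, title_property, content_properties,
                    metadata_properties, render_metadata=False):
    """Hypothetical helper: build a markdown string from a single record."""
    lines = [f"# {record[title_property]}", ""]
    if render_metadata:
        # Metadata is written into the text itself, so an LLM can read it.
        for name in metadata_properties:
            lines.append(f"- **{name}**: {record.get(name)}")
        lines.append("")
    for name in content_properties:
        lines.append(str(record.get(name) or ""))
    return "\n".join(lines)


# Made-up record shaped like the fields used in this demo.
issue = {
    "title": "Docs: clarify install steps",
    "body": "The README should mention virtual environments.",
    "state": "open",
    "url": "https://example.com/issues/1",
    "number": 1,
}

with_meta = render_document(
    issue, "title", ["body"], ["state", "url", "number"], render_metadata=True
)
without_meta = render_document(issue, "title", ["body"], ["state", "url", "number"])
print(with_meta)
```

With metadata rendered into the text, a question such as "which issues are open?" can be answered from the document body alone, which is why the option suits documents that are passed to the LLM whole.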
    from rich.console import Console

    # Convert incoming stream data into documents
    docs = list(read_result["issues"].to_documents(
        title_property="title",
        content_properties=["body"],
        metadata_properties=["state", "url", "number"],
        render_metadata=True,
    ))

    # Print one document, which represents a single GitHub issue
    console = Console()
    console.print(str(docs[10]))

Use LangChain to build a RAG pipeline.

Here, we just show a generic method of splitting docs. This and the following steps are copied from a generic LangChain tutorial.
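Splitters of this kind are governed by two numbers: a chunk size and a chunk overlap. The sketch below shows what those two numbers mean using a fixed-size sliding window in plain Python. It is a deliberate simplification under that assumption, not the algorithm that LangChain's RecursiveCharacterTextSplitter uses (that splitter also tries to break on separators such as paragraph and line boundaries). The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1200 characters of sample text, using the same sizes as this demo.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=512, chunk_overlap=30)
print([len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.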
    # Split the docs so they can be stored in a vector database downstream
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
    chunked_docs = splitter.split_documents(docs)
    print(f"Created {len(chunked_docs)} document chunks.")

Now we can publish the chunks to a vector store. Ensure you have added your OPENAI_API_KEY to the secrets tab on the left.
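As a rough picture of what a vector store does with the chunks, the following sketch stores each chunk alongside a vector and returns the chunks most similar to a query. Everything here is an assumption made for illustration: ToyVectorStore, embed, and the sample chunks are hypothetical, and the word-count "embedding" stands in for a real embedding model. It does not reflect how Chroma or OpenAI embeddings work internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs."""

    def __init__(self):
        self._entries = []

    def add(self, chunk):
        self._entries.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        """Return the k stored chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
for chunk in [
    "update the documentation for install steps",
    "fix crash when reading cache",
    "add type hints to source module",
]:
    store.add(chunk)

results = store.search("documentation update", k=1)
print(results)
```

A real embedding model replaces the word counts with dense vectors that capture meaning, so related chunks can match even when they share no exact words with the query.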
    import os

    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    os.environ["OPENAI_API_KEY"] = ab.get_secret("OPENAI_API_KEY")

    # Store the chunks in the vector database
    vectorstore = Chroma.from_documents(documents=chunked_docs, embedding=OpenAIEmbeddings())
    print("Chunks successfully stored in vectorstore.")

Set up a RAG application using LangChain.
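Conceptually, a RAG application retrieves the chunks relevant to a question, joins them into a context string, fills a prompt with that context and the question, and sends the prompt to a model. The sketch below traces that data flow in plain Python with stub components. The names rag_answer, build_prompt, and echo_llm, and the prompt wording, are hypothetical; a real pipeline would use an actual retriever, prompt template, and LLM, and its documents would be objects with a page_content attribute rather than plain strings.

```python
def format_docs(docs):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(docs)


def build_prompt(context, question):
    """Hypothetical prompt template; real templates will differ."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def rag_answer(question, retriever, llm):
    """Retrieve, format, prompt, and generate, in that order."""
    context = format_docs(retriever(question))
    prompt = build_prompt(context, question)
    return llm(prompt)


seen_prompts = []


def echo_llm(prompt):
    """Stub model: records the prompt it receives and returns a fixed reply."""
    seen_prompts.append(prompt)
    return "stub answer"


def stub_retriever(question):
    """Stub retriever: always returns the same two made-up chunks."""
    return ["Issue 1: docs typo", "Issue 2: docs missing"]


answer = rag_answer("Which issues mention docs?", stub_retriever, echo_llm)
print(seen_prompts[0])
```

Printing the prompt the stub model received shows exactly what a real LLM would see: the retrieved context first, then the user's question.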
    from langchain import hub
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnablePassthrough
    from langchain_openai import ChatOpenAI

    retriever = vectorstore.as_retriever()
    prompt = hub.pull("rlm/rag-prompt")
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


    def format_docs(docs):
        # Join the retrieved chunks into a single context string
        return "\n\n".join(doc.page_content for doc in docs)


    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    print("Langchain RAG pipeline set up successfully.")

Ask a question.
    console.print(
        rag_chain.invoke(
            "Show me all documentation related issues, along with issue number, each on a new line."
        )
    )