Illustrating the Usage of langchain_airbyte Package


The langchain-airbyte package integrates LangChain with Airbyte.

Its AirbyteLoader class loads data from any Airbyte source into LangChain as documents.

This notebook demonstrates how to use langchain_airbyte to load data from an Airbyte source (a GitHub repository), store that data in a FAISS vector store using OpenAI embeddings, and run basic question answering (QnA) over it.

Prerequisites

1) OpenAI API Key:

Create an OpenAI Account: Sign up for an account on OpenAI.
Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.

2) GitHub Personal Access Token:

Create a GitHub Account: Sign up for an account on GitHub.
Generate a Personal Access Token: Click on your profile icon -> Settings -> Developer settings and generate a new personal access token. For detailed instructions, refer to the GitHub documentation.
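
Rather than pasting either credential into a notebook cell, one option is to read it from an environment variable. The following is a minimal sketch, not part of the original notebook: the helper name require_env and the variable name GITHUB_PERSONAL_ACCESS_TOKEN are assumptions chosen for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Example usage (variable names are assumptions, set them in your own environment):
# openai_key = require_env("OPENAI_API_KEY")
# github_token = require_env("GITHUB_PERSONAL_ACCESS_TOKEN")
```

Failing early with a named variable makes a missing credential easier to diagnose than an authentication error raised later by the source or the embedding call.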

Installing Dependencies

Let's start by installing the required dependencies.
First we add virtual environment support (needed when running in Google Colab), and then we install the libraries.

# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv

# Install the necessary libraries
!pip3.10 install -qU langchain-airbyte faiss-cpu langchain-community langchain-openai

Load Data using AirbyteLoader

Now we use AirbyteLoader to fetch data from the source-github source.
You can use any other Airbyte source, but adjust the stream and config to match it.
Don't forget to fill in all the required config fields.
Refer to the Airbyte guide for your source to see which fields it requires.

For more information about this package, refer to the langchain-airbyte documentation.

The last step of converting data to documents ensures that the raw data (GitHub commits) is converted into a standardized format that includes both the main content and any associated metadata.


from langchain_airbyte import AirbyteLoader
from langchain_core.documents import Document

# Configure the AirbyteLoader to load data from a GitHub repository
loader = AirbyteLoader(
    source="source-github",
    stream="commits",
    config={
        "credentials": {
            "personal_access_token": "your_personal_access_token"
        },
        "repositories": ["your_username/repository_name"]
    }
)

# Load documents from the specified GitHub source
docs = loader.load()

# Convert incoming stream data into documents
docs = [Document(page_content=record.page_content, metadata=record.metadata) for record in docs]
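
To make the "main content plus metadata" shape concrete, here is a small sketch that does not call Airbyte or LangChain. SimpleDocument is a stand-in for LangChain's Document, and the record fields (sha, message, author) are hypothetical; real records from the commits stream carry more fields, and this is not how AirbyteLoader itself builds its documents.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Stand-in for LangChain's Document: a text body plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def record_to_document(record: dict[str, Any], text_key: str) -> SimpleDocument:
    """Use one field as the main text and keep the remaining fields as metadata."""
    text = str(record.get(text_key, ""))
    metadata = {key: value for key, value in record.items() if key != text_key}
    return SimpleDocument(page_content=text, metadata=metadata)


# Hypothetical commit record, for illustration only
sample = {"sha": "abc123", "message": "Fix typo in README", "author": "octocat"}
doc = record_to_document(sample, text_key="message")
print(doc.page_content)
print(doc.metadata)
```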

Split Documents into Chunks and Store these Chunks in Vector Store using FAISS

Large documents are split into smaller chunks to make them easier to handle. This also helps in improving the efficiency of the retrieval process, as smaller chunks can be more relevant to specific queries.
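
The effect of a chunk size and an overlap can be seen with a much simpler splitter than the one used in this notebook. The sketch below cuts text at fixed character offsets; it is an illustration only and does not reproduce RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and lines. The function name chunk_text is an assumption.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "x" * 1000
chunks = chunk_text(text, chunk_size=512, chunk_overlap=30)
print(len(chunks))  # 3
```

The overlap means the last 30 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk when it is short enough.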

The chunks of documents are transformed into vectors using an embedding model (OpenAI embeddings).
These vectors are then stored in a FAISS vector store, which allows for efficient similarity search.
The vector store indexes the vectors and enables fast retrieval of similar vectors based on a query.
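
Conceptually, a similarity search ranks stored vectors by their distance to the query vector and returns the closest ones. The brute-force sketch below shows that idea with toy two-dimensional vectors; it assumes Euclidean distance as the metric and is not how FAISS is implemented internally. The function names are assumptions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return ranked[:k]


# Toy "embeddings": real embedding vectors have hundreds or thousands of dimensions
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.05]
print(top_k(query_vec, vectors, k=2))  # [0, 2]
```

A vector store does the same ranking, but with index structures that avoid comparing the query against every stored vector one by one in pure Python.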


from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)

print(f"Created {len(chunked_docs)} document chunks.")

# Store Chunks in Vector Store using FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os

# Set the OpenAI API Key (make sure to set your own API key here)
os.environ['OPENAI_API_KEY'] = "YOUR_OPENAI_API_KEY"

# Ensure chunked_docs is not empty
if not chunked_docs:
    raise ValueError("No valid documents to store in the vector store.")

# Store document chunks (text and metadata) in a FAISS vector store
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
vector_store = FAISS.from_documents(chunked_docs, embeddings)

print("Chunks successfully stored in vectorstore.")

Perform QnA on Stored Data

Finally, we run question answering over the stored data.

When a query is made, the vector store retrieves relevant document chunks based on their vector similarity to the query. The language model (OpenAI) then generates answers based on the retrieved chunks.
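
The generation step can be pictured as placing the retrieved chunks and the question into a single prompt, which is the idea behind a "stuff" style chain. The sketch below builds such a prompt as a plain string; the function name and the prompt wording are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
retrieved = ["Fix typo in README", "Add CI workflow"]
prompt = build_stuff_prompt("What are the latest commits?", retrieved)
print(prompt)
```

Because every chunk is placed in one prompt, this approach only works when the retrieved chunks fit within the model's context window, which is one reason to retrieve a small number of relevant chunks rather than pass everything.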


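# Perform QnA on the stored data
from langchain.chains.question_answering import load_qa_chain

# Initialize the LLM (OpenAI)
llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"))

# Create a QnA chain that places the supplied documents into one prompt
qa_chain = load_qa_chain(llm=llm, chain_type="stuff")

# Retrieve the chunks most similar to the query from the vector store
query = "What are the latest commits in the repository?"
relevant_docs = vector_store.similarity_search(query, k=4)

# Answer the question using only the retrieved chunks
inputs = {"question": query, "input_documents": relevant_docs}
answer = qa_chain.invoke(inputs)

print("QnA Result:", answer["output_text"])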
