Quickstart for End-to-end RAG using GitLab, PyAirbyte, and Qdrant

Learn how to build an end-to-end RAG pipeline, extracting data from GitLab using PyAirbyte, storing it in Qdrant, and then using LangChain to perform RAG on the stored data.


This notebook demonstrates an end-to-end Retrieval-Augmented Generation (RAG) pipeline. We will extract data from GitLab using PyAirbyte, store it in a Qdrant vector store, and then use LangChain to perform RAG on the stored data. This workflow showcases how to integrate these tools to build a scalable RAG system.

Prerequisites

  1. GitLab Account:
    • Follow the instructions in the GitLab docs to set up your GitLab account and obtain the necessary access token.
  2. Qdrant Account:
    • Create a Qdrant Account: Sign up for an account on the Qdrant website.
    • Create a Cluster: Open the Qdrant dashboard and create a new cluster. Once the cluster is ready, you will see an option to generate an API key; copy the cluster URL and API key from there.
  3. OpenAI API Key:
    • Create an OpenAI Account: Sign up for an account on OpenAI.
    • Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.

Install PyAirbyte and other dependencies

# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv

# First, we need to install the necessary libraries.
!pip3 install airbyte langchain langchain-openai qdrant-client python-dotenv langchainhub

Set up the GitLab source with PyAirbyte

The provided code configures an Airbyte source to extract data from GitLab.

To adjust the configuration to your requirements, refer to the GitLab source connector reference.

Note: The credentials are retrieved securely using the get_secret() method. This will automatically locate a matching Google Colab secret or environment variable, ensuring they are not hard-coded into the notebook. Make sure to add your key to the Secrets section on the left.
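If you are running outside Colab, one option is to export these values as environment variables before starting the notebook, for example from a local .env file loaded with python-dotenv (installed above). A minimal sketch, using the same secret names as the cells below:

# Minimal sketch: load secrets from a local .env file into environment
# variables, where ab.get_secret() can find them.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into os.environ

# Fail fast if anything is missing:
for key in ("GITLAB_ACCESS_TOKEN", "GITLAB_PROJECT",
            "QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY"):
    assert os.getenv(key), f"Missing environment variable: {key}"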

import airbyte as ab

source = ab.get_source(
    "source-gitlab",
    config={
        "credentials": {
            "auth_type": "access_token",
            "access_token": ab.get_secret("GITLAB_ACCESS_TOKEN"),
        },
        "projects": ab.get_secret("GITLAB_PROJECT"),
    },
)
source.check()
# In this notebook we focus only on the issues stream.
# Check out all supported streams here: https://docs.airbyte.com/integrations/sources/gitlab#supported-streams

source.get_available_streams()
source.select_streams(["issues"])
cache = ab.get_default_cache()
result = source.read(cache=cache)

issues_details = [doc for doc in result["issues"].to_documents()]  # Convert the issues stream records to documents

print(str(issues_details[10]))
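Before chunking, it can help to eyeball the extracted records as a table. PyAirbyte datasets can also be converted to a pandas DataFrame (pandas ships as a PyAirbyte dependency); a quick sanity check:

# Optional: inspect the issues stream as a pandas DataFrame
issues_df = result["issues"].to_pandas()
print(issues_df.shape)             # (number of issues, number of columns)
print(issues_df.columns.tolist())  # available fields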

Use LangChain to build a RAG pipeline

The code uses RecursiveCharacterTextSplitter to break the documents into 512-character chunks with a 50-character overlap, then converts each chunk's metadata values to strings so they serialize cleanly. This keeps each piece small enough to embed and retrieve efficiently.

from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(issues_details)
print(f"Created {len(chunked_docs)} document chunks.")

for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])

from langchain_openai import OpenAIEmbeddings
import os

# OpenAI embeddings (OpenAIEmbeddings defaults to text-embedding-ada-002)
os.environ["OPENAI_API_KEY"] = ab.get_secret("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings()
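As a quick sanity check, you can embed a single string and confirm the vector dimension, which must match the Qdrant collection configured in the next section. A minimal sketch:

# text-embedding-ada-002 returns 1536-dimensional vectors; the Qdrant
# collection created below must be configured with this same size.
sample_vector = embeddings.embed_query("hello gitlab issues")
print(len(sample_vector))  # expected: 1536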

Setting up Qdrant

Qdrant is a leading open-source vector database and similarity search engine designed to handle high-dimensional vectors for high-performance, massive-scale AI applications.

from qdrant_client import QdrantClient, models

client = QdrantClient(
    location=ab.get_secret("QDRANT_URL"),  # cluster URL obtained above
    api_key=ab.get_secret("QDRANT_API_KEY"),
)

collection_name = "gitlab_issue" # Give collection a name
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1536, # vector dimensions
        distance=models.Distance.COSINE,
    ),
)
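Note that create_collection raises an error if the collection already exists, so re-running the notebook fails at this step. One way to make the cell idempotent is to check the existing collections first; a sketch:

# Idempotent variant: only create the collection if it does not exist yet.
existing = {c.name for c in client.get_collections().collections}
if collection_name not in existing:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=1536,
            distance=models.Distance.COSINE,
        ),
    )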
from langchain.vectorstores.qdrant import Qdrant

qdrant = Qdrant(
    client=client,
    collection_name=collection_name,
    embeddings=embeddings,
)

qdrant.add_documents(chunked_docs, batch_size=20)
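To verify the documents actually landed in Qdrant, you can run a similarity search directly against the vector store before wiring up the full chain. A minimal sketch; the query string here is just an example:

# Sanity check: retrieve the closest chunks for an example query.
hits = qdrant.similarity_search("open bugs about authentication", k=3)
for hit in hits:
    print(hit.page_content[:200])
    print("---")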

Now we set up the RAG pipeline with LangChain, combining document retrieval from Qdrant, a prompt pulled from the LangChain Hub, and an OpenAI chat model for response generation.

from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = qdrant.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
Langchain RAG pipeline set up successfully.
print(rag_chain.invoke("Which programing languages are mentioned in issues most?"))
The programming languages mentioned in the context are Java and JavaScript.
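To see which chunks ground a given answer, you can also query the retriever directly. A brief sketch; the "title" metadata key is an assumption about the issue records and may differ in your data:

# Inspect the chunks the retriever feeds into the prompt.
docs = retriever.get_relevant_documents(
    "Which programming languages are mentioned in issues most?"
)
for d in docs:
    print(d.metadata.get("title", "<no title>"), "->", d.page_content[:80])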

