Building a Knowledge Management System with PyAirbyte and Vector Databases

Discover how to build efficient knowledge management systems using PyAirbyte and vector databases for streamlined data access.

Data is often dispersed across various sources. Accessing this scattered data from a unified system can aid you in generating impactful insights that optimize business processes. This approach can also support efficient management and retrieval of information.

To integrate your organization’s data, a well-designed knowledge management system (KMS) can help centralize, structure, and enable quick data retrieval. Building such a system requires data consolidation and organization in a way that supports user-friendly accessibility.

The next sections comprehensively outline what a knowledge management system is and how you can build one using PyAirbyte and a vector database.

What Is a Knowledge Management System?

Knowledge Management Cycle

A knowledge management system, or KMS, is a platform that allows you to create, curate, organize, share, and utilize information. It encourages you to centralize data in a single source of truth to amplify data accessibility within your organization. By consolidating information from dispersed sources, you can strengthen collaboration while eliminating the requirement to rediscover data.

Modern knowledge management systems are evolving to include intelligent features like automated tagging and natural language processing. These features can help optimize operational efficiency and inform business decisions. An AI-enabled KMS often uses a vector database as a central hub for storing data and enabling robust retrieval.

Building Knowledge Management Systems with PyAirbyte and Vector Databases

This section will describe a step-by-step guide to building a knowledge management system. Before getting started, it is important to understand the key technologies that we will be using. Here’s an overview of the necessary tools:

Vector Databases: An Overview

Vector databases are storage systems designed for high-dimensional vector representations (embeddings) of complex data such as text, images, and videos. You could store the underlying data in other storage systems, like a data lake, so why take the additional step of using a different tool?

The key reason for storing data in a vector database is that it facilitates robust similarity search. With this feature, you can quickly extract data that is semantically similar to the provided context.
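The idea behind similarity search can be illustrated with a minimal sketch in plain Python: embeddings are compared by cosine similarity, and the stored vectors closest to the query are returned. The three-dimensional vectors and labels below are toy values for illustration only; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes; 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    # Rank all stored vectors by similarity to the query, keep the best k.
    scored = [(cosine_similarity(query, v), label) for label, v in vectors.items()]
    return [label for _, label in sorted(scored, reverse=True)[:k]]

# Toy 3-dimensional "embeddings".
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "car": [0.1, 0.9, 0.3],
}
print(top_k([0.88, 0.15, 0.02], store))  # the cat-like vectors rank first
```

Production vector databases such as Qdrant implement the same idea with approximate-nearest-neighbor indexes so the search stays fast at millions of vectors.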

Another reason to integrate your organizational data with a vector database is its compatibility with a modern AI tech stack. With this capability, you can build robust AI applications that work as a knowledge management system, resolving your queries in real time.

Airbyte: An Overview


Airbyte is an AI-powered data integration tool that empowers you to replicate data from various sources to the destination of your choice. It offers over 550 pre-built connectors, enabling you to move structured, semi-structured, and unstructured data between numerous platforms. If the connector you seek is unavailable, you can build custom connectors using Airbyte’s Connector Builder and Connector Development Kits (CDKs).

Some of the features provided by Airbyte include:

  • AI-Enabled Connector Builder: The Connector Builder comes with an AI assistant that reads the API docs and fills out most configuration fields, simplifying your connector development process.
  • Vector Database Support: To streamline the development of AI applications, Airbyte extends support to vector databases. You can store vector embeddings in these databases for tasks such as performing similarity searches, powering recommendation systems, and enhancing the summarization of complex documents.
  • Automated RAG Techniques: With automated chunking, embedding, and indexing, Airbyte enables you to convert raw data into vector representations and organize them in vector stores.

Along with these features, Airbyte also offers a Python library, PyAirbyte, which lets you leverage Airbyte connectors in a development environment. With it, you can extract data from multiple sources into local SQL caches such as DuckDB. The resulting caches are compatible with popular Python libraries like Pandas and AI frameworks like LangChain and LlamaIndex.

Now that you have an understanding of the necessary tools, let's get started with the steps. For this tutorial, we will develop a pipeline that moves GitLab data into Qdrant. Before starting, ensure that you satisfy the following prerequisites:

Prerequisites

  • You must have a GitLab account and a personal access token with the required scopes. For more information, follow the official GitLab documentation.
  • Sign up for a Qdrant account. Navigate to the dashboard to create a new cluster and copy the URL and API key.
  • Register for an OpenAI account and generate a new API key from the API section.

Step 1: Installing Dependencies

Open your preferred code editor. For this example, we will use a Google Colab notebook.

Install the venv package so you can create a virtual environment that isolates dependencies and keeps installed packages manageable. Run the following:

!apt-get install -qq python3.10-venv

Now, install the necessary libraries.

!pip3 install airbyte langchain langchain-openai qdrant-client python-dotenv langchainhub

Step 2: Importing Useful Libraries

After installing all the required libraries, import the important ones by executing the code in this section.

import airbyte as ab
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os

The RecursiveCharacterTextSplitter class lets you perform the chunking operation on data. OpenAIEmbeddings, on the other hand, transforms the chunks produced by that operation into vector embeddings. For the embeddings to work, provide your OPENAI_API_KEY in the code below.

os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings()

Import the Qdrant client libraries that will help you migrate data to, and retrieve it from, Qdrant.

from qdrant_client import QdrantClient, models
from langchain.vectorstores.qdrant import Qdrant

Finally, you can also import libraries that can enable you to chat with your data. For this, you can use the diverse set of functionalities offered by LangChain. Run the following code:

from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

Step 3: Configure GitLab Source in PyAirbyte

To set up a source connector, specify source-gitlab and enter your credentials in the config parameter, as shown here:

source = ab.get_source(
    "source-gitlab",
    config={
        "credentials": {
            "auth_type": "access_token",
            "access_token": ab.get_secret("GITLAB_ACCESS_TOKEN"),
        },
        "projects": ab.get_secret("GITLAB_PROJECT"),
    }
)

The get_secret() method securely retrieves your credentials from environment variables, ensuring they are not hard-coded into the notebook.
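Conceptually, the environment-variable case works like the simplified stand-in below. This is a sketch only: the real ab.get_secret also checks other secret sources (such as .env files) and can prompt you interactively, and get_secret_from_env is an invented name for illustration.

```python
import os

def get_secret_from_env(name):
    # Simplified stand-in for ab.get_secret: look the name up in the
    # process environment and fail loudly if it is missing.
    value = os.environ.get(name)
    if not value:
        raise ValueError(f"Secret {name!r} not found in environment variables.")
    return value

os.environ["GITLAB_ACCESS_TOKEN"] = "glpat-example"  # placeholder value
print(get_secret_from_env("GITLAB_ACCESS_TOKEN"))
```

Failing loudly when a secret is missing is preferable to silently passing an empty string to the connector, which would only surface later as an opaque authentication error.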

You can now check the connection status by running:

source.check()

The above code should return a success message.

In the GitLab source, there might be multiple available data streams. To check all the available streams, execute:

source.get_available_streams()

For the sake of simplicity, we will only be using the issues stream.

source.select_streams(["issues"])

Convert the issues data stream into PyAirbyte’s default DuckDB cache.

cache = ab.get_default_cache()
result = source.read(cache=cache)

To leverage this data fully, convert the cached records into a list of documents. This prepares the data for the transformation steps of your pipeline.

issues_details = list(result["issues"].to_documents())

Step 4: Data Transformation

Embedding Model

Before storing the data in Qdrant, it is crucial to convert the data into vector embeddings. This requires you first to break down large files into smaller, manageable components and then perform the embedding operation. To perform the document chunking method, execute the code below:

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(issues_details)
print(f"Created {len(chunked_docs)} document chunks.")

# Convert every metadata value to a string so it serializes cleanly
# when the chunks are stored as payloads in the vector database.
for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])

The above code uses LangChain's text_splitter module to segment the docs stored in the issues_details list. Each chunk contains up to 512 characters, with a 50-character overlap between neighboring chunks for better contextual continuity.
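The effect of chunk_size and chunk_overlap can be seen in this toy sketch of fixed-size chunking with a sliding window. Note this is a simplification: the real RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and sentence boundaries rather than cutting at exact character positions.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    # Slide a window of chunk_size characters over the text, stepping
    # forward by (chunk_size - chunk_overlap) each time so that
    # neighboring chunks share some context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("The quick brown fox jumps over the lazy dog",
                    chunk_size=20, chunk_overlap=5)
for c in chunks:
    print(repr(c))
```

The last 5 characters of each chunk reappear at the start of the next one; that shared context is what keeps a sentence from being severed mid-thought between two chunks at retrieval time.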

Step 5: Configuring Qdrant Destination

After performing the transformation steps, you can set Qdrant as the destination. To initialize the Qdrant client, provide your QDRANT_URL and QDRANT_API_KEY secrets and execute this code:

client = QdrantClient(
    location=ab.get_secret("QDRANT_URL"),
    api_key=ab.get_secret("QDRANT_API_KEY"),
)

Specify a collection name and create the collection where you will store the data. The vector size of 1536 matches the dimensionality of OpenAI's default embedding model, and cosine distance is used to measure similarity.

collection_name = "gitlab_issue"
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
    ),
)

Let’s use this information to create a new Qdrant instance.

qdrant = Qdrant(
    client=client,
    collection_name=collection_name,
    embeddings=embeddings,
)

Using this qdrant instance, add the chunked documents to the database.

qdrant.add_documents(chunked_docs, batch_size=20)

In the above code, batch_size=20 means the documents are processed and uploaded to the database in batches of 20. This creates a centralized repository for efficient data retrieval. With a few additional steps, you can build a conversational chatbot that simplifies similarity search.
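Batching simply slices the document list into fixed-size groups before uploading, along these lines (a generic sketch of the pattern, not Qdrant's internal implementation):

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items each.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = list(range(45))  # stand-in for 45 chunked documents
batches = list(batched(docs, 20))
print([len(b) for b in batches])  # → [20, 20, 5]
```

Uploading in batches keeps each request small, so one network hiccup costs you at most 20 documents rather than the whole upload.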

Step 6: Retrieving Data from the Database

As a final step, you can retrieve data from the database and have a conversational interface that allows you to talk to your KMS. To accomplish this, you must configure a few elements, including a data retriever, prompt, and LLM.

The retriever fetches relevant data from the database, while the prompt provides a structured outline for the output. The LLM then combines the prompt and the retrieved context to generate human-like responses.

retriever = qdrant.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

Let's create a helper function that formats the retrieved documents coherently, joining each document's content with a blank line:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
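To see what format_docs produces, here is a quick check using a minimal stand-in for LangChain's Document class (FakeDoc is invented for illustration; only the page_content attribute is assumed):

```python
class FakeDoc:
    # Minimal stand-in for a LangChain Document.
    def __init__(self, page_content):
        self.page_content = page_content

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

merged = format_docs([FakeDoc("Issue #1: fix login bug"),
                      FakeDoc("Issue #2: add CI job")])
print(merged)
```

The blank line between documents gives the LLM a clear visual boundary between retrieved passages, which tends to help it attribute facts to the right source.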

Construct a retrieval-augmented generation (RAG) chain with all the above parameters, including retriever, prompt, and llm.

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
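Conceptually, the chain above is just function composition: the question fans out to the retriever (whose results are formatted as context) while also passing through unchanged, the prompt template merges both, the LLM answers, and the parser extracts the text. A plain-Python sketch with stub components (stub_retrieve and stub_llm are invented stand-ins that return canned values):

```python
def stub_retrieve(question):
    # Stand-in for `retriever | format_docs`: returns context text.
    return "Issue #7 mentions Python and Go."

def stub_llm(filled_prompt):
    # Stand-in for the chat model: returns a canned answer.
    return "Python and Go."

def rag_answer(question):
    # Mirrors the LCEL chain: assemble inputs, fill the prompt,
    # call the model, and return the parsed string output.
    inputs = {"context": stub_retrieve(question), "question": question}
    filled = (f"Answer using this context:\n{inputs['context']}\n"
              f"Question: {inputs['question']}")
    return stub_llm(filled)

print(rag_answer("Which languages are mentioned?"))  # → Python and Go.
```

Swapping the stubs for the real retriever, prompt, and ChatOpenAI model recovers exactly the pipeline the LCEL `|` operators express.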

Output:

RAG Pipeline Set Up Status

Now that the model is ready, you can test its functionality by asking a question using the .invoke() method.

print(rag_chain.invoke("Which programming languages are mentioned in issues most?"))

Output:

Testing Model Response

The above code queries the Qdrant database for the data relevant to the question. This is how you can build a bot on top of your knowledge management system that answers your queries.

Practical Examples of Knowledge Management Systems

Here are some practical examples of knowledge management systems:

Document Management Systems: Working as a central file cabinet, a document management system permits document retrieval while supporting regulatory compliance. With this system, you can access data from anywhere in the world with proper credentials. Advanced security features offered by these systems, such as role-based access control (RBAC), restrict unauthorized access to data.

Content Management Systems: Content management systems extend the functionality of document management systems by allowing the management of audio and video media types. For enterprise-level data management, it is vital to integrate workflows with enterprise content management systems.

Database: Data storage systems like databases permit you to store and interact with your data. Databases are indexed to increase the speed of data retrieval, and you can interact with a database through a database management system (DBMS).

Data Warehouse: Data warehouses are a type of knowledge management system that empowers you to perform analytics and reporting operations on your data. Consolidating data into a data warehouse lets you produce effective insights from a single repository.

Wikis: As easy-to-use collaboration tools, wikis allow you to publish and store data on web pages. These pages are useful for saving business documentation and product information.

Benefits of Building a Knowledge Management System

Building a knowledge management system has multiple benefits, from better communication to streamlining customer service. Let’s explore some of the advantages.

  • Boosting Internal Communication: When data is shared across multiple platforms, general information may get lost over time. Creating a KMS preserves current updates alongside historical data, letting you store company information in a centralized, readily accessible location.
  • Improved Data Quality: Continuous integration of data in a knowledge management system significantly enriches data quality. Before storage, the data goes through multiple transformations, which makes it easily understandable. Any inaccurate or incomplete information is discarded for better decision-making.
  • Increased Collaboration: By connecting various applications into a single source of information, you can harness the benefits of collaboration. Different teams within your organization can come together to build solutions that solve customer issues. This can empower you to boost customer experience.

Learn How Perplexity Built Its Knowledge Engine with the Help of Airbyte

As a robust AI-powered search engine, Perplexity provides effortless access to information. However, growing data volumes and team size brought a frequently encountered challenge: scalability. With expanding workloads, it is crucial to maintain a scalable solution that delivers timely results, and traditional data migration approaches were no longer an efficient way to handle the growing demands.

The turning point for Perplexity came when the team incorporated Airbyte to conduct data operations. The ease of use, reliability, freedom from vendor lock-in, and cost-effective scalability offered by Airbyte enabled Perplexity to scale data management.

Previously, Perplexity’s backend team used manual methods to migrate data from the PostgreSQL database to Snowflake. Conducting data tasks through manual methods increased the probability of encountering errors, which were time-consuming to resolve. To resolve this issue, Perplexity relied on Airbyte. Its seamless integration with Perplexity’s existing data infrastructure allowed the team to adopt it into their workflow effortlessly. For more details, explore Perplexity’s success story.

Conclusion

Through this tutorial, you gained a detailed understanding of what a knowledge management system is. Incorporating a KMS into your organizational data ecosystem will improve data sharing and foster the development of innovative solutions.

With this system, you can refine data operations, save overall costs, and simplify complex processes. However, building a KMS can be a challenging task, requiring the development of custom connections between various platforms. To streamline this task, you can consider leveraging PyAirbyte to facilitate optimal data integration. PyAirbyte enables you to develop and manage efficient data pipelines, connecting diverse data sources to your KMS.
