How to Create an LLM Application with ChromaDB & Airbyte

Learn how to build a robust Large Language Model application using ChromaDB for vector storage and Airbyte for data integration, simplifying your AI development workflow.


Incorporating an LLM application into your workflow has become crucial for automating modern organizational tasks. It can help improve customer service, build a knowledge base, and assist with code generation, text summarization, and data analytics. However, engineering an LLM solution from scratch can be challenging. To reduce this complexity, you can pair existing models with a vector database like ChromaDB.

ChromaDB is a prominent vector store that the developer community widely uses to build robust AI systems. The graph below shows the download trend of the chromadb Python package over the past 30 days, according to PyPI Stats.

Daily Download Quantity of ChromaDB Package

This article will discuss the steps you can take to create a ChromaDB LLM application that can aid you in optimizing data-driven workflows.

What Is ChromaDB?

ChromaDB is an open-source vector database that allows you to store vector embeddings along with metadata. These embeddings give the LLM contextual knowledge that it can draw on to produce relevant results. Depending on the vector data, the response generated by the model can cater to your specific needs.

Use of ChromaDB in Building an LLM

  • In-memory Capabilities: ChromaDB delivers high-throughput operations using in-memory storage mechanisms. This makes it an excellent choice for AI-driven applications.
  • Open-Source: ChromaDB’s source code is publicly available, encouraging collaboration and contribution within the developer community. Public availability of the database promotes continuous improvement with feedback from diverse tech professionals across the globe.
  • Metadata Filtering: By storing data in ChromaDB, you can perform powerful filtering operations on the metadata accompanying your embedding data. With this functionality, you can narrow the search space down to the most relevant information.
  • Advanced Search Capabilities: Unlike traditional databases, ChromaDB is designed to facilitate advanced vector similarity searches. It enables you to optimize data retrieval via vector indexing, enhancing data accessibility across different teams within your organization. A minimal sketch of both capabilities follows this list.
  • Multi-Language Support: To enable smooth interactions with the database, ChromaDB offers Software Development Kits (SDKs) in multiple languages, including Python and JavaScript.
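
To illustrate the metadata filtering and similarity search described above, here is a minimal sketch using the ChromaDB Python client; the collection name and example records are hypothetical:

import chromadb

# In-memory client; swap in chromadb.PersistentClient() for on-disk storage
client = chromadb.Client()

# Create a collection and add example documents with metadata
collection = client.create_collection(name="github_issues")
collection.add(
    ids=["1", "2"],
    documents=["Docs page returns 404 on mobile", "CI pipeline fails on merge"],
    metadatas=[{"state": "open"}, {"state": "closed"}],
)

# Similarity search, narrowed by a metadata filter to open issues only
results = collection.query(
    query_texts=["documentation problem"],
    n_results=1,
    where={"state": "open"},
)
print(results["documents"])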

What Is Airbyte?

Airbyte is a no-code data integration platform that allows you to replicate data from multiple sources to the destination of your preference. It offers over 550 pre-built connectors, enabling you to move structured, semi-structured, and unstructured data into a vector database like ChromaDB. If the source you seek is unavailable, Airbyte provides Connector Builder and a suite of Connector Development Kits (CDKs) for generating custom connectors.

Use of Airbyte in Building an LLM

  • Support for Vector Databases: Airbyte supports popular vector databases, including ChromaDB, Pinecone, and Qdrant. Storing vector embeddings in these data stores can aid in streamlining the development of AI applications.
  • Automated RAG Techniques: With automated chunking, embedding, and indexing operations, you can transform raw data into vector embeddings and store them in prominent vector databases. To support vector embedding operations, Airbyte offers pre-built integrations with LLM providers such as OpenAI, Anthropic, and Cohere.
  • AI-Powered Connector Builder: The Connector Builder comes with an AI-assist functionality that reads through your preferred platform’s API documentation and auto-fills configuration fields. This reduces the manual intervention required to build custom connectors.
  • Developer-Friendly Pipelines: PyAirbyte, a Python library, enables you to leverage Airbyte connectors in a developer environment. Using this library, you can extract data from different sources and load it into prominent SQL caches, such as DuckDB, Snowflake, and BigQuery. You can then transform the cached data into vector embeddings using embedding models like OpenAIEmbeddings, making it compatible with a vector database.
  • Incremental Syncs: With Airbyte, you can synchronize data incrementally so that only newly added source records are processed and migrated to the destination. When building an LLM application, this keeps the model updated with the latest information while optimizing time and resource consumption. A minimal sketch follows this list.
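
Here is a minimal sketch of how incremental syncs behave with PyAirbyte's default cache; the source configuration mirrors the walkthrough later in this article:

import airbyte as ab

# Configure the GitHub source (same repository and secret as in the tutorial below)
source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/quickstarts"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)
source.select_streams(["issues"])

# The first read performs a full sync into the local DuckDB cache
cache = ab.get_default_cache()
source.read(cache=cache)

# Later reads against the same cache pull only new or updated records for
# streams that support incremental sync; pass force_full_refresh=True to
# re-sync everything instead
source.read(cache=cache)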

Understanding LLMs & RAGs

LLMs, or Large Language Models, are large-scale machine learning models built specifically for language-related tasks. Everyday use cases of LLMs include text summarization, code generation, interaction with APIs, and sentiment analysis.

In contrast, RAG, or Retrieval-Augmented Generation, is an AI framework that retrieves domain-specific, factual information from an external data store. Instead of relying solely on the LLM's training data, RAG produces highly relevant content grounded in the data source you specify. For instance, storing your GitHub issues data in a vector database lets the LLM generate responses based on your GitHub repository.

Let’s use ChromaDB with Airbyte to build a chatbot that responds to your queries based on GitHub issues:

Prerequisites

  • A Google Colab notebook or a local Python environment
  • An OpenAI API key for generating embeddings and powering the chat model
  • A GitHub personal access token for the repository you want to query

Step-by-Step Guide to Creating LLM Application Using ChromaDB & Airbyte

Step 1: Installing the Necessary Libraries

Open Google Colab and install the python3.10-venv package so you can create a virtual environment to isolate dependencies:

!apt-get install -qq python3.10-venv

Then install all the libraries required for this tutorial:

%pip install --quiet airbyte langchain langchain_openai langchainhub chromadb rich

Step 2: Importing the Libraries

After installing all the required libraries, you can import them by executing the code below:

Import PyAirbyte:

import airbyte as ab

To display the response generated by the LLM application in a structured way, import textwrap and rich.console.Console.

import textwrap
from rich.console import Console

Import the os module to access and manage environment variables.

import os

Using the os module, set the OpenAI API key as an environment variable; it will be crucial for generating embeddings.

os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")

Now, import the LangChain-specific modules that simplify LLM application development. Run:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In the above code:

  • RecursiveCharacterTextSplitter lets you split documents into chunks that can be stored in ChromaDB.
  • The langchain_community.vectorstores module provides access to the Chroma vector store.
  • OpenAIEmbeddings allows you to convert the chunked docs into vector embeddings.
  • ChatOpenAI is the chat model, or LLM, that will enable the conversation with your Chroma data.
  • LangChain's hub module lets you pull shared prompts and chain configurations.
  • The StrOutputParser transforms LLM responses into text strings, while RunnablePassthrough passes the input through without making any changes.

Step 3: Extract GitHub Data Using PyAirbyte

Let’s now configure GitHub as the data source using the get_source method. Provide your personal access token through the GITHUB_PERSONAL_ACCESS_TOKEN secret and execute this code:

source = ab.get_source(
    "source-github",
    install_if_missing=True,
    config={
        "repositories": ["airbytehq/quickstarts"],
        "credentials": {
            "personal_access_token": ab.get_secret("GITHUB_PERSONAL_ACCESS_TOKEN"),
        },
    },
)

As an example, the above code uses Airbyte’s quickstarts repository.

To verify if the connection works properly, run the following command:

source.check()

This command should report a successful connection.

Check all the streams the GitHub source connector makes available by executing the following code:

source.get_available_streams()

For this tutorial, let’s select the issues stream. Run the following:

source.select_streams(["issues"])

To temporarily store the data in PyAirbyte’s default DuckDB cache, run:

cache = ab.get_default_cache()
read_result = source.read(cache=cache)

Examine the available fields by printing a single record from the stream.

first_record = next(iter(read_result["issues"]))

display(list(first_record.keys()))
display(first_record)

Step 4: Data Transformation

To perform data transformations, convert the records into a list of documents. This step arranges the records into well-structured documents, which can then be split into smaller chunks and converted into embeddings for efficient processing. Execute the code below:

docs = list(read_result["issues"].to_documents(
    title_property="title",
    content_properties=["body"],
    metadata_properties=["state", "url", "number"],
    render_metadata=True,
))

In this code, the render_metadata parameter renders the metadata properties into the document content as Markdown. This is useful for comparatively smaller documents, as it provides additional context about the data.

To store data in the vector database, you must split the documents into smaller, more manageable chunks. This can be done using the RecursiveCharacterTextSplitter as follows:

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)

The above code splits the documents into chunks of 512 characters, and chunk_overlap specifies how many characters consecutive chunks share.
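
To sanity-check the chunking, you can inspect how many chunks were produced and preview one of them:

# Number of chunks produced from the documents
print(len(chunked_docs))

# Preview the first 200 characters of the first chunk
print(chunked_docs[0].page_content[:200])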

Step 5: Loading Data into ChromaDB

Define a vectorstore variable that uses OpenAI embeddings to convert the chunked documents into vector embeddings and stores them in ChromaDB:

vectorstore = Chroma.from_documents(
    documents=chunked_docs,
    embedding=OpenAIEmbeddings(),
)

When defining the vectorstore variable, you can also use the persist_directory parameter to save the embeddings to a specific directory.
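
For example, here is a minimal sketch that persists the embeddings to a hypothetical local ./chroma_db directory and later reloads them without re-computing the embeddings:

# Persist the embeddings to disk (the ./chroma_db path is an example)
vectorstore = Chroma.from_documents(
    documents=chunked_docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)

# Later, reload the persisted store instead of rebuilding it
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(),
)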

Step 6: Building an LLM Application

Now that the data is stored in ChromaDB in the form of embeddings, define a model to interact with your data. In this way, you can create an LLM application, such as a responsive chatbot. The model can summarize the content of the GitHub issues, give relevant content to specific products, and aid you in identifying unresolved issues.

For example, to create a RAG solution, first use ChromaDB as a retriever for the application.

retriever = vectorstore.as_retriever()

Define a model and a prompt that specifies how the model should behave. For instance, use the LangChain Hub to pull the existing rlm/rag-prompt prompt.

Run the following script:

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

Also define a helper function that formats the retrieved documents before they are passed into the prompt as context:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

Use all this information to build a RAG chain that can reply to your questions. Execute:

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Finally, create a console instance to output the data and ask questions to your ChromaDB LLM application.

console = Console()

console.print(rag_chain.invoke("Show me all documentation related issues, along with issue number, each on a new line."))

This code prints all the documentation-related issues with their issue numbers, one per line, as requested in the query. Check our PyAirbyte GitHub repository demo for detailed steps.
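
You can reuse the same chain for any other question about the stored issues. For example, a hypothetical follow-up query:

console.print(rag_chain.invoke("Summarize the three most recently opened issues."))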

Real-World Use Cases for an LLM Application Powered by ChromaDB & Airbyte

Here are a few real-world use cases of the ChromaDB LLM application:

Customer Support Chatbot

By storing customer support documentation in ChromaDB, you can boost customer experience. The customer support chatbot can automate the customer handling process by providing replies to user queries, saving time and resources. Based on official documentation and past conversations, the model can respond to your customer’s specific requirements.

Document Intelligence

You can build a knowledge management system (KMS) by integrating official documents into a single source of truth, like ChromaDB. The KMS can improve data quality, boost internal communication, and enhance collaboration within your organization.

Code Assistance

ChromaDB and Airbyte integration can be used to develop code assistance applications. The code assistant is beneficial in improving productivity by detecting errors, suggesting code completion, and enhancing code quality.

Language Translation

Storing multilingual data in ChromaDB enables the development of LLM applications that can translate one language to another. In this way, you can create your own Google Translate to learn or interpret different languages without a tutor.

Data Analysis

The ChromaDB LLM application can function as a junior data analyst. For example, the RAG model built using this tutorial can analyze recent GitHub issues, outline resolved ones, and highlight the issue resolution rate. You can customize the application by defining a prompt that allows the RAG to operate as a data analyst.

Conclusion

Through this tutorial, you have gained a comprehensive overview of how to create a ChromaDB LLM application.

By extracting data from different sources into a vector data store like ChromaDB, you can create a centralized knowledge base for developing AI-driven applications. This stored data helps your LLM deliver quick, relevant responses.

Although this tutorial emphasizes ChromaDB, you can also use other vector databases, like Qdrant, which is supported by Airbyte. Learn more about the key differences between ChromaDB vs Qdrant.
