Leveraging ChromaDB for Vector Embeddings - A Comprehensive Guide

September 4, 2024
30 min read

Vector embeddings have become a key tool in data science and machine learning. Embeddings help you capture complex relationships within the data. 

You can use different tools and algorithms to convert high-dimensional information, such as text and images, into dense numerical vectors. This conversion enables better search and retrieval operations, making embeddings important for generative large language model (LLM) applications like GPT or PaLM.

ChromaDB is a dedicated vector database built to store, manage, and query vector embeddings. With its specialized indexing and retrieval features, ChromaDB ensures fast, accurate data processing, even as the volume of vector embeddings grows.

This guide will show you how ChromaDB, with Chroma embeddings, enhances the functionality of various generative applications.

Understanding Vectors and Their Importance

Before learning about embeddings and vector databases, it is important to understand the basics of vectors and their relevance. This foundation will make it easier for you to grasp how ChromaDB works. 

Vectors are numerical representations of your data. These numeric values are fundamental in measuring similarities and differences in machine learning algorithms. 

Key Properties of Vectors 

  • Dimension: The number of elements a vector contains. For instance, a vector containing two elements is a two-dimensional vector, such as v = [0.1, 2.4]. You can easily visualize a vector with three dimensions or fewer, but vectors that encode words and text can have hundreds or thousands of dimensions.
  • Magnitude: A non-negative number representing a vector's size or length, often computed using the Euclidean norm.
  • Direction: The orientation in which a vector points, which can be represented using angles or coordinates. 
  • Dot Product: A scalar value that measures how strongly two vectors align. It can be computed from the vectors' magnitudes and the cosine of the angle between them (see the sketch after this list).
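
To make these properties concrete, here is a minimal sketch, assuming NumPy is installed, that computes the dimension, magnitude, and dot product of two small vectors:


import numpy as np

u = np.array([0.1, 2.4])
v = np.array([1.0, 0.5])

print(u.shape[0])         # dimension: the number of elements (2)
print(np.linalg.norm(u))  # magnitude: the Euclidean norm of u
print(np.dot(u, v))       # dot product: a scalar measure of how the vectors align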

What are Vector Embeddings? 

Vector embeddings are a way of representing complex data, such as text, images, and graphs, as fixed-size numerical arrays known as vectors. Each element in a vector is a number that encodes a different aspect of the data. Together, these numbers capture the meaning of and relationships between various data entities.

Transforming high-dimensional data into a simpler format makes data analysis and comparison easier. This leads to more effective data processing in applications such as machine learning algorithms, large language models, or generative AI applications.

Vector Similarity

Vector similarity measures how closely two vectors are related to each other within a vector space. It quantifies the extent to which two vectors point in the same direction, or how they are oriented relative to each other. This concept is fundamental in fields like machine learning, natural language processing, and data analysis, where vectors are used to represent data.

Methods to calculate vector similarity:

  • Dot Product: This method measures similarity using both the magnitudes of the vectors and the angle between them. A higher dot product indicates that the vectors are more similar. However, the dot product can take on any value, making it difficult to interpret in absolute terms.
  • Cosine Similarity: Cosine similarity normalizes the dot product, providing a value between -1 and 1. You can compute it by rearranging the cosine definition of the dot product to solve for cos(θ), the cosine of the angle between the two vectors. 

The equation for cosine similarity is:

cos(θ) = (u · v) / (||u|| ||v||)

In the above equation, u · v is the dot product, and ||u|| and ||v|| are the magnitudes of the two vectors. Cosine similarity disregards the magnitude of both vectors, focusing solely on their direction. This constrains the value to lie between -1 and 1, with the following outcomes (a short example follows the list):

  • 1 indicates that vectors are similar, pointing in the same direction. 
  • 0 indicates that the vectors are unrelated or orthogonal, with a 90-degree angle between them. 
  • -1 indicates that vectors are dissimilar, pointing in opposite directions.
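
As an illustration, here is a short sketch, again assuming NumPy, that computes cosine similarity directly from the formula above and reproduces the three outcomes:


import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0: orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0: opposite directions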

What is ChromaDB?


ChromaDB is an open-source database designed to manage and query vector embeddings. These embeddings are numerical representations of your data, making it easier for computers to process and understand. The primary function of ChromaDB is to store vector embeddings along with their associated metadata, which LLMs can use later. 

The open-source nature of ChromaDB allows you to customize it and integrate it with other tools and systems. For example, you can use it with PyTorch to manage and query Chroma embeddings within machine learning frameworks. These features help you efficiently access and manage high-dimensional complex data, enabling precise querying across various data types. 

Retrieval Features of ChromaDB 

Here are a few retrieval features of ChromaDB:

  • Vector Search: ChromaDB’s vector search feature allows you to search for data by comparing numerical vector representations, also known as Chroma embeddings. You can use these vectors to find contextually similar elements, enabling fast data retrieval. 
  • Document Storage: ChromaDB allows you to manage and store documents alongside their vector embeddings and metadata. Metadata includes information about your data, such as categories, tags, or attributes. ChromaDB’s metadata filtering allows you to filter search results based on this metadata, facilitating efficient data organization and quick data retrieval. 
  • Full-Text Search: With full-text search, you can perform a thorough search across the entire content of your data. Unlike vector search, this feature helps you find documents based on exact or partial text matches, which is useful for locating specific phrases within your stored documents.
  • Multimodal Retrieval: ChromaDB supports multimodal retrieval, which permits searching and retrieving information across multiple data types, such as text, images, and other formats. This facilitates the analysis of diverse datasets within one system.

How ChromaDB Stores and Uses Vector Embeddings

ChromaDB uses vector embeddings for data storage, retrieval, and comparison based on underlying features rather than simple keyword matching. This approach is particularly useful for applications like search engines, recommendation systems, and AI-driven analysis. 


Here’s how ChromaDB manages this process: 

  • In ChromaDB, you begin by creating a collection. Your data, including text documents and their associated metadata, will be stored here. 
  • ChromaDB automatically converts the text into embeddings when you add data to a collection. By default, it uses the all-MiniLM-L6-v2 model to generate these embeddings, but you can choose a different model. 
  • These Chroma embeddings are stored with a unique ID and any additional metadata you provide, ready for querying. 
  • You can search your collection by inputting a text query. ChromaDB embeds the query, compares it to your stored embeddings, and returns the most similar documents.

Embedding Functions in ChromaDB 

Embedding functions in ChromaDB are essential tools for converting text, images, and other data into vector representations that AI algorithms can efficiently process. ChromaDB supports popular embedding models from leading platforms like OpenAI, Google Generative AI, Cohere, and Hugging Face, offering flexibility in creating embeddings. 

Default Embedding Function in ChromaDB

By default, Chroma uses the all-MiniLM-L6-v2 model from Sentence Transformers, which runs locally and automatically downloads the necessary model files. To use an embedding function in ChromaDB, you can either set it up when creating a Chroma collection or call it directly. The embedding function can be used for tasks like adding, updating, or querying data. 

For example, using the default embedding function is straightforward and requires minimal setup.


from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()
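
Once created, the embedding function can be called directly on a list of texts to produce embeddings. A minimal sketch:


# Each returned embedding is a list of floats; all-MiniLM-L6-v2 produces 384-dimensional vectors
embeddings = default_ef(["This is a document about pineapple"])
print(len(embeddings), len(embeddings[0]))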

Customized Embedding Function in ChromaDB 

You can also create a custom embedding function to use with Chroma. This function should implement the EmbeddingFunction protocol. 

For example:


from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents somehow
        return embeddings

In the above code, you are setting up a template for creating custom embeddings in ChromaDB. The MyEmbeddingFunction class follows the EmbeddingFunction protocol, ensuring it can be used with ChromaDB. When you call this function with some documents, it will return their embeddings.
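
As a concrete illustration, here is one possible implementation, a sketch that assumes the sentence-transformers package is installed and wraps a SentenceTransformer model:


from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        # Load the model once so repeated calls reuse it
        self.model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # Encode the documents and return plain Python lists of floats
        return self.model.encode(list(input)).tolist()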

Using ChromaDB to Store and Query Vector Embeddings

To effectively manage and query vector embeddings, you need to start by setting up ChromaDB.

Step 1: Installing ChromaDB 

To begin working with ChromaDB, download and install the necessary files, libraries, and dependencies on your machine.  

ChromaDB offers flexibility by supporting both Python and JavaScript environments. You can choose either one, depending on your project requirements. The following process will demonstrate using Python to work with vector embeddings.


pip install chromadb

When you run this command, pip, the Python package installer, downloads and installs ChromaDB on your machine along with its dependencies. This allows you to use ChromaDB in your Python environment.

Step 2: Creating a Chroma Client 

The Chroma client acts as an interface between your code and the ChromaDB. It allows you to interact with the database, create collections, add data, and perform queries.


import chromadb
chroma_client = chromadb.Client()

In the above code: 

  • import chromadb imports the ChromaDB library, making its functions available in your script. 
  • chroma_client = chromadb.Client() creates an instance of the ChromaDB client. By default, this client keeps data in memory (see the sketch below for a persistent alternative).  
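
The in-memory client loses its data when the process ends. If you want collections saved to disk between runs, recent ChromaDB releases also provide a persistent client; a minimal sketch, with an illustrative path:


import chromadb

# Store collections on disk under ./chroma_db instead of only in memory
chroma_client = chromadb.PersistentClient(path="./chroma_db")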

Step 3: Creating a Collection 

A collection is like a container that stores your data, specifically the text documents, their corresponding vector embeddings, and associated metadata. 

Run the following command to create a collection within your ChromaDB instance. 


collection = chroma_client.create_collection(name="my_collection")
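
If you rerun a script, creating a collection with a name that already exists raises an error. A common pattern, sketched below, is to use get_or_create_collection, optionally passing the embedding function explicitly:


from chromadb.utils import embedding_functions

# Reuse the collection if it already exists, and make the embedding model explicit
collection = chroma_client.get_or_create_collection(
    name="my_collection",
    embedding_function=embedding_functions.DefaultEmbeddingFunction(),
)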

Step 4: Add Documents to the Collection 

After the collection is created, you can add documents to it. ChromaDB will automatically create vector embeddings for these documents and store them in the collection.

For instance, run the following command to add documents to your collection: 


collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)

In the above example, you added two documents to the collection: one about pineapples and the other about oranges.  

Here, 

  • collection.add() is the method for adding documents and their corresponding embeddings to the collection. 
  • documents=[] is the parameter that holds the text documents you want to add to your collection.
  • ids=[] is the parameter that assigns a unique identifier to each document (a variant of this call that also attaches metadata is sketched below). 
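
Since metadata was mentioned earlier, here is a variant of the same call, a sketch in which the category values are purely illustrative, that attaches metadata to each document:


collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    metadatas=[{"category": "fruit"}, {"category": "fruit"}],
    ids=["id1", "id2"]
)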

Step 5: Query the Collection

You can perform queries on the data in your collection, such as finding the most similar document based on your query text. ChromaDB will embed your query text and compare it to the embeddings of the documents in your collection.

For instance, you can enter a query text to search for documents related to Hawaii: 


results = collection.query(
    query_texts=["This is a query document about hawaii"],
    n_results=2
)
print(results)

In the above example, 

  • query_texts=[] holds the text you are querying against the ChromaDB collection. ChromaDB embeds this query text and compares it with the Chroma embeddings of the documents in your collection. 
  • n_results=2 specifies the number of similar results you want the ChromaDB instance to return. Here, you are asking for the two most similar results. 
  • results = collection.query() returns the documents most similar to your query text, based on the vector embeddings ChromaDB generated earlier. The result contains the matching documents along with their IDs and similarity scores (distances). A metadata-filtered variant of this query is sketched below. 
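
If you attached metadata when adding documents, you can also narrow the search with a metadata filter. A sketch, assuming the illustrative "category" field from the earlier variant:


results = collection.query(
    query_texts=["This is a query document about hawaii"],
    n_results=2,
    where={"category": "fruit"}  # only consider documents whose metadata matches
)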

Step 6: Inspect the Result 

The results returned by running the query provide insights into how closely the documents inside the collection match the query text. You can inspect these results to understand which documents are most similar and by what margin.


{
  'documents': [[
      'This is a document about pineapple',
      'This is a document about oranges'
  ]],
  'ids': [['id1', 'id2']],
  'distances': [[1.0404009819030762, 1.243080496788025]],
  'uris': None,
  'data': None,
  'metadatas': [[None, None]],
  'embeddings': None,
}

In the above output: 

  • 'documents': [[ ]] shows the documents that are most similar to the query. In this case, the documents about pineapple and oranges are returned.
  • 'distances': [[ ]] shows the similarity scores (distances) between the query text and the returned documents. A lower distance indicates higher similarity.

Since the distance for the pineapple document (1.0404) is lower than that for the oranges document (1.2432), the pineapple document is more similar to the query about Hawaii. 

Using LangChain with ChromaDB to Query Text Data 

LangChain is a framework that helps simplify the development of applications that integrate LLMs with external data sources, APIs, and databases. It provides tools to handle tasks like loading documents, splitting them into manageable chunks, embedding them into vectors, and then storing and querying these vectors.

Using LangChain with ChromaDB streamlines the process of embedding text data into numerical vectors and storing them in ChromaDB. This integration allows you to perform advanced similarity searches on text data, retrieving the most relevant information based on vector similarities.

The following example demonstrates how to use LangChain with ChromaDB to store and query text data. Here, you will use OpenAI embeddings. 

Setting up Environment 

Start by installing the necessary libraries. Run the following command to do that (in addition to langchain-chroma, the example below uses the langchain-openai, langchain-community, and langchain-text-splitters packages for the embeddings, document loader, and text splitter): 


pip install langchain-chroma langchain-openai langchain-community langchain-text-splitters

Get Your OpenAI API Key 

To use OpenAI embeddings, you need an API key.


import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

Load the Document 

You can load a text document and split it into smaller chunks, making it easier to process and store data in your database instance.


from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

# Load the document
raw_documents = TextLoader('../../../state_of_the_union.txt').load()

# Split the document into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

Embedding the Text and Storing It in ChromaDB

After you split the document into smaller chunks, you can convert these pieces into numerical vectors (embeddings) and store them in ChromaDB. 


from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Embed the document chunks and store them in ChromaDB
db = Chroma.from_documents(documents, OpenAIEmbeddings())
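
By default, this stores the embeddings only for the lifetime of the process. If you want them written to disk so the collection can be reloaded later, Chroma.from_documents also accepts a persist_directory argument; a sketch with an illustrative path:


# Persist the vector store to disk so it can be reloaded in a later session
db = Chroma.from_documents(
    documents,
    OpenAIEmbeddings(),
    persist_directory="./chroma_langchain_db"
)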

Perform Query on Text Embeddings 

You can perform a similarity search on these Chroma embeddings within the collection stored in ChromaDB. 


query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
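
If you also want to see how close each match is, the vector store exposes similarity_search_with_score, which returns (document, distance) pairs where lower distances mean closer matches. A short sketch:


docs_and_scores = db.similarity_search_with_score(query)
for doc, score in docs_and_scores:
    print(score, doc.page_content[:80])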

Simplifying Data Integration for ChromaDB with Airbyte 


Airbyte is a robust data integration tool that simplifies data migration from multiple sources into vector databases like ChromaDB, Pinecone, and Weaviate. It features a library of 350 pre-built connectors, which you can use to create a data pipeline between different sources and ChromaDB in a few minutes. It also provides a Connector Development Kit, which gives you the flexibility to create custom connectors.

Here’s how Airbyte enhances data management for ChromaDB and vector embeddings:

  • Gen AI Workflow: Airbyte’s Gen AI workflow allows you to efficiently load unstructured data into ChromaDB, which can be further optimized for AI-driven applications. This workflow supports the integration of data into ChromaDB, enhancing its utility for vector embeddings. 
  • Change Data Capture: Airbyte’s CDC capability ensures that ChromaDB remains updated by tracking and replicating changes from the source. This keeps your destination system data current and accurate.
  • Data Security and Compliance: Airbyte ensures secure data handling with features like single sign-on (SSO), role-based access control, and data encryption. These features help maintain data integrity and compliance during the integration process.

Conclusion 

ChromaDB is a powerful vector database for managing and querying vector embeddings, making it indispensable for applications like search engines and AI-driven applications. By converting complex data into numerical vectors, this open-source solution allows for precise similarity searches, enabling more accurate and contextually relevant results.
