Learn how to leverage Milvus and Airbyte to embed smart similarity search functionality into your applications
Milvus is a popular open-source vector database. Vectors are high-dimensional arrays of numbers. When working with Large Language Models (LLMs), "embeddings" specifically refer to the last hidden layer within the Encoder component of a transformer-based Deep Neural Network. This embedding layer is a set of vectors representing the semantic meaning of words or pixels (for text or images).
Given the wealth of information in the embedding layer of a trained LLM, efficient storage and retrieval of vectors has become crucial. Milvus is a vector database purpose-built for storing, indexing, and efficiently searching high-dimensional vector data. Vector databases are typically used for similarity searches across unstructured data, enabling improvements in Generative Chat responses, product recommendations, and other applications.
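To make the idea of a similarity search concrete, here is a minimal brute-force sketch in plain Python. It is an illustration only: the three-dimensional vectors and their labels are made up, and a vector database such as Milvus typically relies on index structures so that it does not have to compare the query against every stored vector.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, items, limit=2):
    """Return the `limit` closest (distance, label) pairs, nearest first."""
    scored = [(l2_distance(query, vec), label) for label, vec in items.items()]
    return sorted(scored)[:limit]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
docs = {
    "reset password": [0.9, 0.1, 0.0],
    "forgot login": [0.8, 0.2, 0.1],
    "invoice overdue": [0.0, 0.1, 0.9],
}

results = nearest([0.9, 0.1, 0.05], docs)
for distance, label in results:
    print(f"{label}: {distance:.3f}")
```

The two login-related entries come back first because their vectors lie closest to the query, even though none of the labels share a word with each other.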
By using Airbyte, it's straightforward to transfer data from many different sources into Milvus, calculating vector embeddings of texts along the way.
The power of embeddings is to be able to search for relevant pieces of information, even if similar concepts are phrased differently. This article will use this functionality to make a website support form smarter by looking up relevant information on the fly. This will be used to inform the user about similar tickets that are already processed and highlight relevant knowledge base articles that could help resolve the problem without a support agent's help.
We will use Zilliz Cloud as our vector store, Airbyte to extract and load the data, the OpenAI embedding API to calculate embeddings, and Streamlit to build a smart submission form showing relevant data.
You will need:
On cloud.zilliz.com, you can sign up for a free cluster to store your embedding vectors for the similarity search. Once you have created an account, set up a new cluster.
Individual entities (in our case, support tickets and knowledge base articles) are stored in a “collection” — after your cluster is set up, you need to create a collection. Choose a suitable name and set the Dimension to 1536 to match the vector dimensionality generated by the OpenAI embeddings service:
After creation, Zilliz will show you the endpoint and API key. Note these down, as we will need them in the next step.
Our database is ready, so let’s move some data over! To do this, we need to configure a connection in Airbyte. Either sign up for an Airbyte Cloud account at cloud.airbyte.com or fire up a local instance as described in the documentation.
Once your instance is running, we need to set up the connection — click “New connection” and pick the “Zendesk Support” connector as the source.
On Airbyte Cloud, you can easily authenticate by clicking the Authenticate button. When using a local Airbyte instance, follow the directions outlined on the documentation page.
You can also use another data source: the rest of this article applies to all kinds of text-based sources.
After clicking the “Test and Save” button, Airbyte will check whether the connection can be established. If everything is working correctly, the next step is to set up the destination to move data to. Here, pick the “Milvus” connector.
The Milvus connector does three things:
Clicking “Test and save” will check whether everything is lined up correctly (valid credentials, collection exists and has the same vector dimensionality as the configured embedding, etc.)
The last step before data is ready to flow is selecting which “streams” to sync. A stream is a collection of records in the source. As Zendesk supports a large number of streams that are not relevant to our use case, let’s only select “tickets” and “articles” and disable all others to save bandwidth and make sure only the relevant information will show up in searches:
You can select which fields to extract from the source by clicking the stream name. The “Incremental | Append + Deduped” sync mode means that subsequent connection runs keep Zendesk and Milvus in sync while transferring minimal data (only the articles and tickets that have changed since the last run).
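The idea behind an incremental, deduplicated sync can be sketched in a few lines. This is a simplified illustration, not Airbyte's implementation: the `id` and `updated_at` field names, the dictionary used as the destination, and the sample records are all assumptions made for the example.

```python
def incremental_sync(source_records, destination, cursor):
    """Upsert records changed since `cursor` and return the new cursor.

    `destination` maps a primary key to the latest version of a record, so a
    record that changes several times is still stored exactly once (deduped).
    """
    new_cursor = cursor
    for record in source_records:
        if cursor is not None and record["updated_at"] <= cursor:
            continue  # unchanged since the last run, nothing to transfer
        destination[record["id"]] = record  # insert or overwrite by key
        if new_cursor is None or record["updated_at"] > new_cursor:
            new_cursor = record["updated_at"]
    return new_cursor


destination = {}
first_run = [
    {"id": 1, "updated_at": "2023-10-01", "subject": "Login fails"},
    {"id": 2, "updated_at": "2023-10-02", "subject": "Invoice missing"},
]
cursor = incremental_sync(first_run, destination, cursor=None)

# On the second run only ticket 2 has changed, so only it is written again.
second_run = [
    first_run[0],
    {"id": 2, "updated_at": "2023-10-05", "subject": "Invoice still missing"},
]
cursor = incremental_sync(second_run, destination, cursor)
```

ISO-formatted date strings are used as the cursor here because they sort chronologically when compared as plain strings.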
As soon as the connection is set up, Airbyte will start syncing data. It can take a few minutes for the data to appear in your Milvus collection.
If you select a replication frequency, Airbyte will run regularly to keep your Milvus collection up to date with changes to Zendesk articles and newly created issues.
You can check in the Zilliz Cloud UI how the data is structured in the collection by navigating to the playground and executing a “Query Data” query with the following filter: _ab_stream == "tickets"
As you can see in the Result view, each record coming from Zendesk is stored as a separate entity in Milvus with all the specified metadata. The text chunk the embedding is based on is shown as the “text” property. This is the text that was embedded using OpenAI, and it is what we will search on.
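The same check can be done from Python. The following is a minimal sketch, assuming the `MilvusClient` interface from the pymilvus library, whose `query` method accepts a `filter` expression and returns a list of dictionaries. The helper names and the collection name passed in are hypothetical; only the `_ab_stream` and `text` fields come from the description above.

```python
def stream_filter(stream):
    """Build the boolean expression that selects one Airbyte stream."""
    return f'_ab_stream == "{stream}"'


def sample_chunks(client, collection_name, stream, limit=5):
    """Return the embedded text chunks of a few entities from `stream`.

    `client` is expected to behave like pymilvus.MilvusClient.
    """
    rows = client.query(
        collection_name=collection_name,
        filter=stream_filter(stream),
        output_fields=["text"],
        limit=limit,
    )
    return [row["text"] for row in rows]
```

With a real cluster, `client` would be created with something like `MilvusClient(uri=..., token=...)`, using the endpoint and API key shown by Zilliz.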
Our data is ready — now we need to build the application to use it. In this case, the application will be a simple support form for users to submit support cases. When the user hits submit, we will do two things:
In both cases, we will leverage semantic search using OpenAI embeddings. To do this, the description of the problem the user entered is also embedded and used to retrieve similar entities from the Milvus cluster. If there are relevant results, they are shown below the form.
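Embedding the user's text could look roughly like this. The sketch assumes version 1 or later of the OpenAI Python client and the `text-embedding-ada-002` model, which returns 1536-dimensional vectors. The model name is an assumption: whatever you use here must be the same model that is configured in the Airbyte destination, otherwise the query vector and the stored vectors are not comparable.

```python
# Assumption: must match the embedding model configured in Airbyte.
EMBEDDING_MODEL = "text-embedding-ada-002"


def embed_text(openai_client, text):
    """Return the embedding vector for `text` as a list of floats.

    `openai_client` is expected to behave like an `openai.OpenAI()` instance.
    """
    response = openai_client.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return response.data[0].embedding
```

A real client can be created with `from openai import OpenAI` followed by `OpenAI()`, which reads the API key from the `OPENAI_API_KEY` environment variable.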
You will need a local Python installation as we will use Streamlit to implement the application.
First, install Streamlit, the Milvus client library, and the OpenAI client library locally:
pip install streamlit pymilvus openai
To render a basic support form, create a Python file app.py:
To run your application, use the streamlit run command (streamlit run app.py):
This will render a basic form:
The code for this example can also be found on GitHub.
Next, let’s check for existing open tickets that might be relevant. To do this, we embed the text the user entered using OpenAI, then do a similarity search on our collection, filtering for tickets that are still open. If there is one with a very low distance between the supplied ticket and the existing ticket, we let the user know and don’t submit:
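A minimal sketch of this check, under several assumptions: the `MilvusClient` interface from pymilvus, an index whose metric treats smaller distances as closer matches (such as L2), and synced metadata fields named `status`, `organization_id`, and `subject`. The threshold value and the function name are hypothetical and need tuning on real data.

```python
# Hypothetical threshold: tune it on your own data.
MAX_DUPLICATE_DISTANCE = 0.35


def find_similar_open_ticket(client, collection_name, embedding, org_id):
    """Return the closest open ticket if it is near enough to be a duplicate.

    `client` is expected to behave like pymilvus.MilvusClient, whose `search`
    returns one list of hits per query vector, best match first. Assumes a
    metric where a smaller distance means a closer match (for example L2).
    """
    results = client.search(
        collection_name=collection_name,
        data=[embedding],
        filter=f'status == "new" and organization_id == {int(org_id)}',
        limit=2,
        output_fields=["subject"],
    )
    hits = results[0]  # we sent a single query vector
    if hits and hits[0]["distance"] < MAX_DUPLICATE_DISTANCE:
        return hits[0]
    return None
```

In the form handler, a non-`None` result would be shown to the user instead of submitting the ticket. The matching ticket's subject is available as `hit["entity"]["subject"]`.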
Several things are happening here:
To run the new app, you need to set the environment variables for OpenAI and Milvus first:
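It can help to let the app fail early with a clear message when one of these variables is missing. The sketch below is one way to do that. `OPENAI_API_KEY` is the variable the OpenAI client reads by default; `MILVUS_URL` and `MILVUS_TOKEN` are assumed names, so substitute whatever your app actually reads.

```python
import os

# MILVUS_URL and MILVUS_TOKEN are assumed names, not fixed by any library.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "MILVUS_URL", "MILVUS_TOKEN")


def load_settings(environ=None):
    """Return the required settings, or raise if any of them is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_ENV_VARS}
```

Calling `load_settings()` at the top of app.py surfaces a configuration problem before the first request to OpenAI or Milvus is made.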
When trying to submit a ticket that exists already, this is how the result will look:
The code for this example can also be found on GitHub.
As you can see in the green debug output (hidden in the final version), two tickets matched our search (in status new, from the current organization, and close to the embedding vector). However, the first (relevant) ticket ranked higher than the second (irrelevant in this situation), which is reflected in its lower distance value. This relationship is captured in the embedding vectors without directly matching words, as a regular full-text search would.
To wrap it up, let’s show helpful information after the ticket gets submitted to give the user as much relevant information upfront as possible.
To do this, we are going to do a second search after the ticket gets submitted to fetch the top-matching knowledge base articles:
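A minimal sketch of this second search, assuming the `MilvusClient` interface from pymilvus and that the synced article records carry a `title` field. The function name and the default limit are hypothetical.

```python
def top_articles(client, collection_name, embedding, limit=3):
    """Return (title, distance) pairs for the best-matching articles.

    `client` is expected to behave like pymilvus.MilvusClient. Only entities
    from the "articles" stream are considered.
    """
    results = client.search(
        collection_name=collection_name,
        data=[embedding],
        filter='_ab_stream == "articles"',
        limit=limit,
        output_fields=["title"],
    )
    # One list of hits per query vector; we sent exactly one vector.
    return [(hit["entity"].get("title"), hit["distance"]) for hit in results[0]]
```

Unlike the duplicate check, no distance threshold is applied here: the closest articles are always shown, and the user decides whether they are helpful.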
If there is no open support ticket with a high similarity score, the new ticket gets submitted and relevant knowledge articles are shown below:
The code for this example can also be found on GitHub.
The UI shown here is only an example to illustrate the use case, not an actual support form. Still, the combination of Airbyte and Milvus is a powerful one: it makes it easy to load text from a wide variety of sources (databases like Postgres, APIs like Zendesk or GitHub, and completely custom sources built using Airbyte’s SDK or visual connector builder) and index it in embedded form in Milvus, a flexible and robust vector search engine that can scale to huge amounts of data.
Airbyte and Milvus are open source and completely free to use on your infrastructure, with cloud offerings to offload operations if desired.
Beyond the classical semantic search use case illustrated in this article, the general setup can also be used to build a question-answering chat bot using the RAG method (Retrieval Augmented Generation), to power recommender systems, or to help make advertising more relevant and efficient.