Our new vector database destination allows you to create vector embeddings from any Airbyte-supported source and load them directly into vector databases. This unlocks retrieval-augmented generation and powerful similarity search use cases on your unstructured data. The destination is currently in alpha and available on Airbyte Cloud and Open Source. Pinecone, Chroma and the embedded DocArray database are supported today, with more options on the way.
In a previous article, we explained how Dagster and Airbyte can be leveraged to power LLM-supported use cases. The vector database destination makes this even easier as it takes care of chunking and embedding, allowing you to directly connect sources to the vector database through an Airbyte connection.
What are vector databases and how can they be used?
Modern companies produce large amounts of data. Some of this data (for example financial data or user analytics data) is nicely structured and can already be leveraged effectively with established data warehouses and BI dashboarding solutions.
However, there are also large amounts of data without machine-readable structure that contain important information: documents, emails, wiki pages, chat messages and more. This data is unstructured in the sense that interesting insights can’t simply be extracted using SQL queries and dashboards. LLMs and vector databases can help bridge this gap by extracting semantic information, or meaning, from regular prose.
Vector databases come into play by storing this unstructured data as mathematical vectors (embeddings) computed by an embedding model. Unlike traditional databases that rely on exact matches, vector databases measure how close vectors are in a high-dimensional space, so a query can return texts with a similar meaning even when they share no keywords. This means more relevant search results and deeper insights from stored information, paving the way for smarter data-driven decisions and experiences.
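To make the idea of similarity in vector space more concrete, here is a minimal sketch in plain Python. It is not how any particular vector database is implemented: the three-dimensional vectors are hand-made stand-ins for real embeddings (which typically have hundreds or thousands of dimensions), and the names `cosine_similarity`, `nearest` and `store` are made up for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made vectors standing in for real embeddings (assumption for illustration).
store = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to recover account access": [0.8, 0.2, 0.1],
    "Quarterly revenue report": [0.0, 0.1, 0.9],
}

# A query vector that lies close to the two account-related texts.
query = [0.85, 0.15, 0.05]
results = nearest(query, store, k=2)
print(results)
```

The two account-related texts rank highest even though they share almost no words, because their vectors point in nearly the same direction as the query. A real vector database applies the same principle with approximate nearest-neighbor indexes so that it scales beyond a brute-force scan.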
LLMs, or Large Language Models, are experts at understanding and generating text. When teamed with vector databases, which store and search data based on its deeper meaning rather than just exact words, they create a powerful partnership. By leveraging a vector database, an LLM can access continuously updated information that it wasn’t trained on, and pull in content that matches the intent behind your query rather than just surface-level keywords. This duo promises a smarter and more intuitive way to tap into vast data resources.
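The interplay between retrieval and generation can be sketched as follows. This is a simplified illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the documents are invented, and the final call to an LLM is deliberately left out. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def similarity(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    # Rank documents by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    # Put the retrieved text in front of the question so the LLM can use it.
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


documents = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
question = "What is the refund policy?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)
```

The resulting prompt would then be sent to the LLM of your choice. Because the context is fetched at query time, updating the documents in the vector database changes the answers without retraining the model.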
How can Airbyte help you leverage vector databases?
The newly introduced destination allows you to configure the full pipeline from a single UI: extracting records from a large variety of sources, separating unstructured from structured data, preparing and embedding the text contents of records, and loading them into vector databases where they can be accessed by LLMs.
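Conceptually, the stages of such a pipeline can be sketched like this. This is not the destination's actual implementation: the character-based chunking, the hash-based `fake_embed` function and the in-memory list used as an index are all assumptions chosen to keep the example self-contained, and every name is hypothetical.

```python
import hashlib


def split_record(record, text_fields):
    # Separate unstructured text from structured fields, which become metadata.
    text = "\n".join(str(record[f]) for f in text_fields if f in record)
    metadata = {k: v for k, v in record.items() if k not in text_fields}
    return text, metadata


def chunk_text(text, chunk_size=40, overlap=10):
    # Split text into overlapping character windows (real pipelines often count tokens).
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunk, dimensions=4):
    # Deterministic placeholder for an embedding model call; carries no semantic meaning.
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dimensions]]


def sync(records, text_fields, index):
    # Extract -> split -> chunk -> embed -> load, one record at a time.
    for record in records:
        text, metadata = split_record(record, text_fields)
        for chunk in chunk_text(text):
            index.append({"vector": fake_embed(chunk), "text": chunk, "metadata": metadata})
    return index


records = [
    {
        "id": 1,
        "author": "ana",
        "body": "Vector databases store embeddings so that similar texts can be found quickly.",
    }
]
index = sync(records, text_fields=["body"], index=[])
for entry in index:
    print(entry["metadata"], repr(entry["text"]))
```

Each entry keeps the structured fields as metadata next to the vector, which is what makes it possible to filter search results later, for example by author.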
This means all the existing advantages of the Airbyte platform are extended to vector databases:
- Large catalog of sources that can be connected in minutes
- The ability to quickly integrate long-tail sources using the connector builder interface
- Incremental syncs to avoid costly and long-running jobs
- A centralized overview of the current state of all replications
To get started with sending data to vector stores, follow these steps:
- Make sure your local Airbyte instance is up to date or sign up for an account on cloud.airbyte.com
- Sign up for an OpenAI account to calculate embeddings
- Create a new destination by following the destination documentation
If you are interested in leveraging Airbyte to ship data to your LLM-based applications, please take a moment to fill out our survey so we can make sure to prioritize the most important features.