Ingesting Data Into Vectara with Airbyte

Ofer Mendelevitch

•

January 16, 2024

•

4 mins

Introduction

When building a GenAI or semantic search application with Vectara, one of the most important considerations is how to architect the data ingestion pipeline that transfers data from your systems into Vectara. That pipeline needs to scale well, be robust to system failures, and provide a mechanism for incremental updates.

We recently added a Vectara “destination connector” to Airbyte’s list of destination connectors, which makes this process quick and easy.

In this blog post we will review this new Vectara connector, explain how to set up an ETL flow from one of Airbyte’s source connectors, and provide an end-to-end example for ingesting text from documents in a Google Drive into Vectara using Airbyte.

Vectara and Airbyte - Better Together

Vectara offers powerful generative AI capabilities for developers via an easy-to-use API. Often referred to as “RAG-in-a-box”, Vectara’s platform simplifies the development of GenAI applications by taking care of the heavy lifting required for building RAG (retrieval augmented generation) applications: document chunking, embedding, vector storage, state-of-the-art retrieval and summarization are all handled behind the scenes in a scalable and secure fashion.

When building a GenAI application, there are two main data flows:

The “ingest” flow where data is processed from its source and indexed into Vectara.
The “query” flow, invoked when a user issues a query and Vectara responds to that query with a highly accurate list of matching results and a generative summary to the user query.

Vectara’s API provides native support for indexing text and uploading files, but it is often up to the user to take care of the details. In particular, the application developer is required to build the code that crawls the source data, converts it into text, and ingests it into Vectara using the API. With the variety of data sources in the enterprise, this can easily become more complex than originally anticipated and hard to maintain over time. Furthermore, as the source data changes, the developer needs to ensure incremental updates are performed properly.

Enter Airbyte, an open source tool for data movement, with 350+ connectors designed specifically to help address the challenges faced while designing a data ingestion pipeline. Besides offering a large breadth of connectors for APIs and databases that work out-of-the-box, Airbyte solves common data integration problems like incremental syncs, schema evolution and sync observability in a single place in a consistent manner for all your data integration needs.

Combining Airbyte’s capabilities for data movement at scale with the Vectara destination connector makes it easier for GenAI developers to build scalable, enterprise-grade GenAI applications, with immediate access to a growing list of data connectors that cover various enterprise needs.

Example: RAG based on Documents in Google Drive

Google Drive is pretty ubiquitous (although other competing solutions like Microsoft’s OneDrive, Box.com and Dropbox are also pretty good), and especially for those using Google Suite - it’s common to have all your documents hosted there.

But how many times did you want to find that document you know exists in your Google Drive but you just can’t remember where it is?

Or maybe you want to ask a question and have a response based on all the relevant documents hosted on your Google Drive?

I know I have.

So let’s build an example of a GenAI application that allows you to do just that, using Airbyte to index your Google Drive into Vectara and build a question-answering application using that data.

Setting up Vectara

Our first step is to sign up for an account on Vectara (if you don’t already have one).

You can create a corpus for your application but the Vectara connector also includes the capability to generate this target corpus for you, and all you need is a corpus name. Note that the Vectara destination adds a special metadata field to the Vectara corpus called _ab_stream, which is used internally in the ingest flow.

Vectara’s free account can include up to 50MB of text extracted from your Google Drive documents. If you need to index larger amounts of text, you can add a credit card to increase your Vectara account data quota to the size you need, or contact sales for the Vectara Scale plan.

Setting up an Airbyte pipeline: Google Drive to Vectara

For this blog post, I’ll use a publicly available Google Drive folder that includes seven of William Shakespeare’s texts. However, in order to simulate a real-life situation where the folder is private, I’ve made a copy of it in my own Google Drive under the name “shakespeare”.

To install Airbyte on an EC2 instance, we follow the instructions in this guide. We also need to follow this step-by-step guide to set up the right permissions for the Google Drive folder, so that we have the credentials the Google Drive source connector needs to work properly.

To set up the connection, we go to the main Airbyte dashboard screen at `localhost:8000`:

Now let’s set up the connection between our Google Drive and Vectara, following these 3 simple steps:

Pick the “Google Drive” Airbyte source and configure it to access the source folder:

Enter the Google Drive folder URL
Add the default stream type, and make sure to pick “document file type” as the file type
Pick service-account authentication and paste the GCP credentials JSON (in its entirety) that you set up in the authentication step above into the field provided

Select the Vectara destination and set it up as follows:

Under “customer ID”, we enter our Vectara account ID
For “corpus name” - we can pick a name for our target Vectara corpus. The Vectara destination connector automatically generates this corpus for you with that name.
The “parallelize” button, if turned on, would parallelize the Vectara ingest process using multi-threading, which can improve the speed of the overall ingest process.
Under “Authentication”, enter your Vectara OAuth2 client ID and client secret
We specify under “fields to store as metadata” the data fields from the source that we would like the Vectara connector to use as metadata. In this case we choose “_ab_source_file_url” and “_ab_source_file_last_modified” which are fields available from the Google Drive source connector.
Under “text fields to index with Vectara” we simply choose the field “content” which includes the text content of each document on the drive.
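
To make the “text fields” and “metadata fields” settings more concrete, here is a minimal Python sketch of how a single source record could be split into indexable text and metadata. It is only an illustration of the idea, not the connector’s actual code: the function name `split_record`, the sample record, the placeholder URL and the shape of the returned dictionary are all assumptions made for this example.

```python
from typing import Any

# Hypothetical settings mirroring the choices made in the connector UI above.
TEXT_FIELDS = ["content"]
METADATA_FIELDS = ["_ab_source_file_url", "_ab_source_file_last_modified"]


def split_record(record: dict[str, Any], stream: str) -> dict[str, Any]:
    """Split one source record into indexable text and metadata (illustrative only)."""
    # Concatenate the configured text fields that are present and non-empty.
    text = "\n".join(str(record[f]) for f in TEXT_FIELDS if record.get(f))
    # Keep only the configured metadata fields; everything else is dropped.
    metadata = {f: record[f] for f in METADATA_FIELDS if f in record}
    # The post notes that the destination adds an _ab_stream metadata field.
    metadata["_ab_stream"] = stream
    return {"text": text, "metadata": metadata}


# A made-up record, shaped like what a file-based source might emit.
record = {
    "content": "Two households, both alike in dignity...",
    "_ab_source_file_url": "https://example.com/romeo_and_juliet",
    "_ab_source_file_last_modified": "2024-01-10T12:00:00Z",
    "document_key": "romeo_and_juliet.pdf",
}
doc = split_record(record, stream="shakespeare")
print(doc["metadata"])
```

In this sketch, fields that are not listed in either setting (such as the made-up `document_key`) are simply ignored, which is why it pays to choose the metadata fields deliberately.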

Configure the connection:

We will keep the default “scheduled” sync with “daily” update periods.
For the stream, choose “incremental / append” for sync-mode. This choice is suitable for most scenarios involving Google Drive where you want the connection to only sync updates from Google Drive, and not have to reindex the whole drive on every change.
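
The idea behind an incremental sync can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Airbyte’s implementation: the `select_changed` helper, the file dictionaries and the use of a last-modified timestamp as the cursor are all hypothetical.

```python
from datetime import datetime


def select_changed(files, cursor):
    """Return the files modified after the cursor, plus the advanced cursor."""
    changed = [f for f in files if f["last_modified"] > cursor]
    # If nothing changed, the cursor stays where it was.
    new_cursor = max((f["last_modified"] for f in changed), default=cursor)
    return changed, new_cursor


# Made-up file listing; only king_lear.pdf changed since the last sync.
files = [
    {"name": "hamlet.pdf", "last_modified": datetime(2024, 1, 5)},
    {"name": "king_lear.pdf", "last_modified": datetime(2024, 1, 12)},
]
cursor = datetime(2024, 1, 10)  # timestamp saved by the previous sync
changed, cursor = select_changed(files, cursor)
print([f["name"] for f in changed])  # ['king_lear.pdf']
```

Running the same function again with the advanced cursor returns an empty list, which is what keeps a daily sync cheap when nothing in the folder has changed.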

That’s it. Once you finish configuring the connection and it’s enabled, Airbyte will automatically sync the contents of the source Google folder with Vectara.

Airbyte will make sure your Vectara corpus stays in sync with your source data with minimal overhead and notify you in case the data replication can’t be completed.

Querying the data

Now that all the content of these seven wonderful works of Shakespeare is ingested into Vectara, let’s do some querying.

In Vectara’s Console the “query” tab can be used to run some sample queries:

Let’s try “Who is Juliet?”

Or we can ask “Who is King Lear?” to get the following:

Of course you can use Vectara’s query API to run your own queries and integrate them into your GenAI app. Or use the vectara-answer tool to build a fully functional question-answering application.
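
As a rough sketch of what a programmatic query could look like, the Python snippet below builds (but does not send) an HTTP request for the question “Who is Juliet?”. Everything specific here is an assumption to be checked against Vectara’s API reference: the endpoint URL, the header names, the payload field names, API-key authentication, and the placeholder customer ID, corpus ID and key. The helper name `build_query_request` is made up for this example.

```python
import json
import urllib.request


def build_query_request(customer_id, corpus_id, api_key, question):
    """Build a query request; endpoint, headers and payload shape are assumptions."""
    payload = {
        "query": [
            {
                "query": question,
                "numResults": 10,
                "corpusKey": [{"customerId": customer_id, "corpusId": corpus_id}],
                # Ask for a generative summary over the top results.
                "summary": [{"maxSummarizedResults": 5, "responseLang": "en"}],
            }
        ]
    }
    return urllib.request.Request(
        "https://api.vectara.io/v1/query",  # assumed endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "customer-id": str(customer_id),  # assumed header name
            "x-api-key": api_key,  # assumed header name
        },
        method="POST",
    )


# Placeholder credentials; replace with your own before sending.
request = build_query_request(123456, 1, "hypothetical-api-key", "Who is Juliet?")
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload first, and to swap in whichever HTTP client your application already uses.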

Summary

In this blog post we’ve seen how Airbyte’s data movement platform can be combined with Vectara’s trusted platform to build GenAI applications.

We’ve provided an example for moving data from Google Drive to Vectara, but of course many other data sources are available via Airbyte, such as Salesforce, Airtable, Asana, BigQuery or Elastic, just to name a few. For more complex setups, it can make sense to first move raw data into a warehouse and perform transformations there before eventually loading it into Vectara. This post about ELTP architectures goes into greater detail on such a setup.

We encourage you to try this new exciting integration with your own data, and let us know how it works - we’re always excited to hear about your GenAI projects.

And as always if you have any questions, please feel free to join the Vectara Discord server or Airbyte’s Slack and let us know.

About the Author

Ofer Mendelevitch leads developer relations at Vectara. He has extensive hands-on experience in machine learning, data science and big data systems across multiple industries, and has focused on developing products using large language models since 2019. Prior to Vectara he built and led data science teams at Syntegra, Helix, Lendup, Hortonworks and Yahoo! Ofer holds a B.Sc. in computer science from Technion and an M.Sc. in EE from Tel Aviv University, and is the author of "Practical data science with Hadoop".