
End-to-end RAG using Airbyte Cloud, S3 and Vectara

Learn how to build an end-to-end RAG pipeline: extract data from S3 using Airbyte Cloud, load it into Vectara, and set up RAG on top of it.


In this blog post, we'll walk you through setting up an end-to-end Retrieval-Augmented Generation (RAG) pipeline using Airbyte Cloud, Amazon S3, and Vectara.

We'll show you how to effortlessly load vector data into Vectara using an Airbyte connection and then leverage OpenAI to perform RAG.

Prerequisites

  • Airbyte Cloud Account: If you are new to Airbyte Cloud, sign up here and get the ball rolling.
  • AWS S3 Bucket: Create an S3 bucket and upload the data files you want to load into Vectara (a short boto3 upload sketch follows this list). Ensure you have the necessary AWS credentials to access this bucket.
  • Vectara Account: Have your account's customer ID, corpus ID, and API key at your fingertips.
  • OpenAI API Key:
    • Create an OpenAI Account: Sign up for an account on OpenAI.
    • Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.
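
If you prefer to stage files from a script rather than the AWS console, here is a minimal sketch that uses boto3 to upload a local document into the bucket Airbyte will read from. The bucket name, file paths, and object key are placeholders; boto3 picks up your AWS credentials from the usual environment variables or ~/.aws/credentials.

# Minimal sketch: stage a local file in the S3 bucket the Airbyte source will read.
# "my-rag-source-bucket" and the file paths are placeholders -- use your own values.
import boto3

s3 = boto3.client("s3")  # credentials come from env vars or ~/.aws/credentials

s3.upload_file(
    Filename="data/company_faq.csv",   # local file to stage
    Bucket="my-rag-source-bucket",     # bucket the Airbyte S3 source will point at
    Key="documents/company_faq.csv",   # object key; keep it consistent with your glob later
)

# Quick check that the object landed where expected
listing = s3.list_objects_v2(Bucket="my-rag-source-bucket", Prefix="documents/")
for obj in listing.get("Contents", []):
    print(obj["Key"])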

Setup the AWS S3 Source

To set up the S3 source in Airbyte Cloud, follow these steps and you're good to go:

In the left sidebar, click on Sources.

On the top right, click on + New source.

Now search for S3 and select it.

Follow the instructions in the AWS S3 Source Connector Documentation to set up your S3 bucket and obtain the necessary access keys.

  • Enter Bucket Name: Provide the name of the bucket containing files to replicate.
  • Add a Stream:
    • File Format: Select from CSV, Parquet, Avro, or JSONL. Toggle Optional fields for more configurations.
    • Stream Name: Give a name to the stream.
    • Globs (Optional): Use a pattern (e.g., **) to match files. For specific patterns, refer to the Globs section.
    • Days To Sync (Optional): Set the lookback window for file sync.
    • Input Schema (Optional): Define a custom schema or use default ({}).
    • Validation Policy (Optional): Choose how to handle records not matching the schema (emit, skip, or wait for discovery).
    • Schemaless Option (Optional): Skip schema validation.

To authenticate your private bucket:

  • If using an IAM role, enter the AWS Role ARN.
  • If using IAM user credentials, fill the AWS Access Key ID and AWS Secret Access Key fields with the appropriate credentials.

For more details about each field in the S3 source setup, visit here.

All other fields are optional and can be left empty.

After this, click on Set up source. Once the setup is successful, we are ready to use S3 as a source.
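
The steps above use the Airbyte Cloud UI. If you would rather script the source creation, the sketch below calls the Airbyte API with requests. Treat it as an illustration only: the API key, workspace ID, bucket, and stream settings are placeholders, and the exact configuration keys should be double-checked against the S3 connector specification.

# Hedged sketch: create the S3 source through the Airbyte API instead of the UI.
# AIRBYTE_API_KEY, the workspace ID, and all configuration values are placeholders.
import requests

resp = requests.post(
    "https://api.airbyte.com/v1/sources",
    headers={
        "Authorization": "Bearer AIRBYTE_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "name": "s3-rag-source",
        "workspaceId": "YOUR_WORKSPACE_ID",
        "configuration": {
            "sourceType": "s3",
            "bucket": "my-rag-source-bucket",
            "aws_access_key_id": "YOUR_ACCESS_KEY_ID",          # or use a Role ARN instead
            "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
            "streams": [
                {
                    "name": "documents",
                    "globs": ["documents/**/*.csv"],
                    "format": {"filetype": "csv"},
                    "validation_policy": "Emit Record",
                }
            ],
        },
    },
)
resp.raise_for_status()
print("Created source:", resp.json()["sourceId"])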

Set up the Vectara destination

To set up Vectara as a destination in Airbyte Cloud, follow these steps and you'll be hitting the ground running:

In the left sidebar, click on Destinations.

On the top right, click on + New destination.

Now search for Vectara and select it.

Start configuring the Vectara destination in Airbyte:

  • Destination name: Provide a friendly name.
  • Customer ID: Enter your customer ID here.
  • Corpus Name: Enter your corpus name here.
  • OAuth Client ID: Enter your OAuth Client ID here.
  • OAuth Client Secret: Enter your OAuth Client Secret here.

To get a more detailed overview of the Vectara destination, visit this page.
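
As with the source, the destination can be created programmatically through the Airbyte API. The sketch below is only illustrative: the field names inside the configuration block are assumptions based on the Vectara connector and should be verified against its spec, and all IDs and secrets are placeholders.

# Hedged sketch: create the Vectara destination via the Airbyte API.
# Configuration field names are assumptions -- verify them against the Vectara
# destination connector spec. All IDs and secrets below are placeholders.
import requests

resp = requests.post(
    "https://api.airbyte.com/v1/destinations",
    headers={
        "Authorization": "Bearer AIRBYTE_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "name": "vectara-rag-destination",
        "workspaceId": "YOUR_WORKSPACE_ID",
        "configuration": {
            "destinationType": "vectara",
            "customer_id": "YOUR_VECTARA_CUSTOMER_ID",
            "corpus_name": "airbyte_rag_corpus",
            "oauth2": {
                "client_id": "YOUR_OAUTH_CLIENT_ID",
                "client_secret": "YOUR_OAUTH_CLIENT_SECRET",
            },
        },
    },
)
resp.raise_for_status()
print("Created destination:", resp.json()["destinationId"])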

Set up the connection

In the left sidebar, click on Connections.

On the top right, click on + New connection.

Define source: Select S3.

Define destination: Select Vectara.

Select streams: You will now see all the streams you created in the S3 source. Activate the stream and click Next in the bottom right corner.

Now select the sync schedule and click Set up connection.

Now we can successfully sync data from S3 to Vectara.
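
Besides the built-in scheduler, you can also kick off a sync on demand through the Airbyte API. A minimal sketch, assuming you have the connection ID from the UI (the API key and connection ID below are placeholders):

# Minimal sketch: trigger a manual sync for the S3 -> Vectara connection.
# AIRBYTE_API_KEY and YOUR_CONNECTION_ID are placeholders.
import requests

resp = requests.post(
    "https://api.airbyte.com/v1/jobs",
    headers={
        "Authorization": "Bearer AIRBYTE_API_KEY",
        "Content-Type": "application/json",
    },
    json={"connectionId": "YOUR_CONNECTION_ID", "jobType": "sync"},
)
resp.raise_for_status()
print("Started job:", resp.json()["jobId"])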

Retrieval-Augmented Generation (RAG) with Vectara

RAG takes language models to the next level by pulling relevant information from a database, allowing them to craft spot-on and contextually rich responses. In this segment, we'll guide you through the process of setting up RAG with Vectara.

For your convenience and quick reference, we've supplied a Google Colab notebook. Feel free to tinker with and delve into the fully operational RAG code in Google Colab.

from rich.console import Console

def get_response(query, model_name="gpt-3.5-turbo"):
    # Get similar chunks from sources/tables in Vectara
    chunks = get_similar_chunks_from_vectara(query)

    if len(chunks) == 0:
        return "I am sorry, I do not have the context to answer your question."
    else:
        # Send chunks to the LLM for completion
        return get_completion_from_openai(query, chunks, model_name)

query = 'What data do you have?'
response = get_response(query)

Console().print(f"\n\nResponse from LLM:\n\n[blue]{response}[/blue]")
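
The snippet above relies on two helpers that live in the Colab notebook: get_similar_chunks_from_vectara and get_completion_from_openai. Here is a minimal sketch of what they could look like, assuming Vectara's v1 query REST endpoint and the OpenAI Python client (openai>=1.0); the environment variable names and the prompt wording are illustrative, not taken from the notebook.

# Hedged sketches of the two helpers used above. The Vectara payload follows the
# v1 query API; adjust field values to your account and to the notebook's version.
import os
import requests
from openai import OpenAI

def get_similar_chunks_from_vectara(query, num_results=5):
    # Retrieve the corpus chunks most similar to the user question.
    response = requests.post(
        "https://api.vectara.io/v1/query",
        headers={
            "customer-id": os.environ["VECTARA_CUSTOMER_ID"],
            "x-api-key": os.environ["VECTARA_API_KEY"],
            "Content-Type": "application/json",
        },
        json={
            "query": [
                {
                    "query": query,
                    "numResults": num_results,
                    "corpusKey": [{"corpusId": int(os.environ["VECTARA_CORPUS_ID"])}],
                }
            ]
        },
    )
    response.raise_for_status()
    return [hit["text"] for hit in response.json()["responseSet"][0]["response"]]

def get_completion_from_openai(query, chunks, model_name="gpt-3.5-turbo"):
    # Ask the LLM to answer using only the retrieved context.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    context = "\n\n".join(chunks)
    completion = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content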

Conclusion

In this tutorial, we illustrated how to harness Vectara and OpenAI for Retrieval-Augmented Generation (RAG), demonstrating the seamless integration of data from Vectara and the power of OpenAI's language models. This dynamic duo allows you to build intelligent AI-driven applications, such as chatbots, which can tackle complex questions with ease. Vectara takes the hassle out of managing and retrieving vector data, making it an indispensable tool for efficient and scalable data integration. This, in turn, supercharges your AI solutions, enabling them to deliver top-notch, context-aware responses based on thorough data analysis.

