Learn how to build a full data stack using Airbyte Cloud, Terraform, and dbt to move data from S3 -> BigQuery -> Pinecone for interacting with fetched data through an LLM and form a full fledged RAG.
Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.
Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.
Airbyte S3 Pinecone RAG repo provides a quickstart template for building a full data stack using Airbyte cloud, Terraform, and dbt to move data from S3 -> BigQuery -> Pinecone for interacting with fetched data through an LLM and form a full fledged Retrieval-Augmented Generation.
This quickstart is designed to minimize setup hassles and propel you forward.
The source to fetch data is from s3 to make the content searchable in Pinecone. Follow the S3 source docs for information on configuring a S3 source.
BigQuery will store the raw API data (csv in example) from our sources and also the transformed data from dbt. You'll need a BigQuery project and a dataset with a service account that can control the dataset. Airbyte's BigQuery destination docs lists the requirements and links describing how to configure.
Pinecone is the vector database we will use to index documents and their metadata, and also for finding documents that provide context for a query. You'll need a Pinecone account, an API key, and an index created with 1536 dimensions, as OpenAI returns vectors of 1536 dimensions. See the Pinecone docs for more information.
OpenAI is used both in processing the query and also provides the LLM for generating a response. The query is vectorized so it can be used to identify relevant items in the Pinecone index, and these items are provided to the LLM as context to better respond to the query. You'll need an OpenAI account with credits and an API key. If you already have an account with OpenAI, you can generate a new API key by visiting this link: platform.openai.com/api-keys
Get the project up and running on your local machine by following these steps:
2. Navigate to the directory:
3. Set Up a Virtual Environment:
4. Install Dependencies:
raw_data
for Airbyte and transformed_data
for dbt.How to create a dataset:
raw_data
or transformed_data
).airbyte-service-account
).dbt-service-account
) and assign the roles.How to create a service account and assign roles:
How to generate JSON key:
The following steps will execute Terraform and dbt workflows to create the necessary resources for the integration. To do this, you'll need to provide some configuration values. Copy the provided .env.template
file to .env
and set its values. Then run the following command to source the environment variables into your shell so they are available when running Terraform and dbt:
Don't forget to re-run the above command after making any changes to the .env
file.
Create the sources and destinations within your Airbyte environment, you can follow the create connections to define connections between them to control how and when the data will sync.
You can find the Airbyte workspace ID from the URL, e.g. https://cloud.airbyte.com/workspaces/{workspace-id}/connections
.
Airbyte allows you to create connectors for sources and destinations, facilitating data synchronization between various platforms. In this project, we're harnessing the power of Terraform to automate the creation of these connectors and the connections between them. Here's how you can set it up:
1. Navigate to the Airbyte Configuration Directory:
Change to the relevant directory containing the Terraform configuration for Airbyte:
2. Modify Configuration Files:
Within the infra/airbyte
directory, you'll find three crucial Terraform files:
provider.tf
: Defines the Airbyte provider.main.tf
: Contains the main configuration for creating Airbyte resources.variables.tf
: Defines variables whose values are populated from the .env
file.If you're using Airbyte Cloud instead of a local deployment you will need to update the Airbyte provider configuration in infra/airbyte/provider.tf, setting the bearer_auth
to an API key generated at https://portal.airbyte.com/.
3. Initialize Terraform:
This step prepares Terraform to create the resources defined in your configuration files.
4. Review the Plan:
Before applying any changes, review the plan to understand what Terraform will do.
5. Apply Configuration:
After reviewing and confirming the plan, apply the Terraform configurations to create the necessary Airbyte resources.
6. Verify in Airbyte UI:
Once Terraform completes its tasks, navigate to the Airbyte UI. Here, you should see your source and destination connectors, as well as the connection between them, set up and ready to go.
If you're using Airbyte Cloud, we configured Terraform to assemble the URL for you:
will print something like:
Before building the dbt project, transforming the raw s3 data, the source tables must exist in the BigQuery dataset. Open the Airbyte UI and navigate to the Connections page. Click the Sync Now button for S3 to BigQuery
to start the sync.
Once the sync is complete you can inspect the tables in BigQuery.
dbt (data build tool) allows you to transform your data by writing, documenting, and executing SQL workflows. Setting up the dbt project requires specifying connection details for your data platform, in this case, BigQuery. Here’s a step-by-step guide to help you set this up:
1. Navigate to the dbt Project Directory:
Change to the directory containing the dbt configuration:
You'll find a profiles.yml
file within the directory. This file contains configurations for dbt to connect with your data platform, and is preconfigured to pull the connection details from environment variables.
You can use the dbt_project/example/schema.yml
to make your own dbt_model and create your own sql files corresponding to the dbt_model names
2. Utilize Environment Variables (Optional but Recommended):
To keep your credentials secure, you can leverage environment variables. An example is provided as the dev
config in profiles.yml
.
3. Test the Connection:
Once you’ve updated the connection details, you can test the connection to your BigQuery instance using:
If everything is set up correctly, this command should report a successful connection to BigQuery.
4. Run the Model
With the connection in place, you can now build the model to create the S3
view in BigQuery, which the configured Airbyte connection will use to sync S3 data into Pinecone.
You can inspect the provided dbt_project/models/marts/purchase_data.sql to see removing null values after querying for purchase patterns.
You should now see the s3_data
view in BigQuery.
With the source data transformed it is now ready to publish into the Pinecone index. Head back to the Connections and start a sync for Publish BigQuery Data to Pinecone
.
After Airbyte has published the S3 data text into your Pinecone index, it is ready to be interacted with via an LLM model. After providing a couple more environment variables, the provided query.py will provide an interactive session to ask questions about your data.
Once you've set up and launched this initial integration, the real power lies in its adaptability and extensibility. Here’s a roadmap to help you customize and harness this project tailored to your specific data needs:
stuff
chain type requires the query and discovered documents all fit within a single tokenized input to the LLM. This is fine for short queries and documents, but if you use longer sources you will need to use another chain type such as map_reduce
.