Learn how to build a RAG pipeline, extracting data from a file source using PyAirbyte, storing it in a Pinecone vector store, and then using LangChain to perform RAG on the stored data.
Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.
This notebook demonstrates a simple RAG (Retrieval-Augmented Generation) pipeline with Pinecone and PyAirbyte. The focus is to showcase how to use source-file with PyAirbyte.
Install the dependencies and import them.
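As a hedged sketch of the setup step, the snippet below checks which modules are importable before the rest of the notebook runs. The module list is an assumption based on the tools named in this notebook (PyAirbyte, Pinecone, LangChain, with OpenAI assumed as the embedding and chat provider); it is not the notebook's original install cell, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed module names; the notebook's own install cell may differ.
REQUIRED_MODULES = [
    "airbyte",             # PyAirbyte
    "pinecone",            # Pinecone client
    "langchain_core",      # LangChain building blocks
    "langchain_pinecone",  # LangChain vector store integration for Pinecone
    "langchain_openai",    # assumed embedding and chat model provider
]


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Install these before continuing:", ", ".join(missing))
    else:
        print("All assumed dependencies are importable.")
```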
For this quickstart, we extract review data for a clothing brand. The file is hosted publicly, so there is no need to create any credentials. Find more details about the data: https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf.
In this quickstart, the data is in JSONL format and compressed; the config below shows how both are reflected. The source-file specification is described in the Airbyte documentation.
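A minimal sketch of what such a config could look like for a compressed JSONL file served over HTTPS. The URL and dataset name are placeholders, not the notebook's actual values, and `build_file_source_config` and `connect` are hypothetical helpers. Passing the compression through `reader_options` is an assumption about how the connector forwards options to its reader.

```python
import json


def build_file_source_config(url, dataset_name, file_format="jsonl", compression="gzip"):
    """Build a source-file config for a publicly hosted, compressed file."""
    return {
        "dataset_name": dataset_name,      # assumed to become the stream name
        "format": file_format,
        "url": url,
        "provider": {"storage": "HTTPS"},  # public file, so no credentials
        # reader_options is passed as a JSON-encoded string, not a dict
        "reader_options": json.dumps({"compression": compression}),
    }


def connect(config):
    """Create the source and verify it; requires PyAirbyte to be installed."""
    import airbyte as ab  # imported lazily so the config helper works without it

    source = ab.get_source("source-file", config=config, install_if_missing=True)
    source.check()
    return source


# Placeholder URL and dataset name, for illustration only.
config = build_file_source_config(
    url="https://example.com/reviews.json.gz",
    dataset_name="reviews",
)
```

Calling `connect(config)` runs the connection check against the configured file.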
Connection check succeeded for `source-file`.
As we can see here, the reader_options field controls how the data is read. Options exist for each file format and cover the common configurations. Check the documentation for more details.
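Because reader_options travels as a JSON string, changing one option means decoding, updating, and re-encoding it. The sketch below shows one way to do that; `with_reader_options` is a hypothetical helper, and the example options (`sep`, `header`) are assumed to be forwarded to the underlying CSV reader.

```python
import json


def with_reader_options(config, **options):
    """Return a copy of config with extra reader options merged in."""
    current = json.loads(config.get("reader_options") or "{}")
    current.update(options)
    updated = dict(config)  # shallow copy, so the caller's config is untouched
    updated["reader_options"] = json.dumps(current)
    return updated


# Hypothetical CSV config: a tab-separated file with a header row
csv_config = with_reader_options(
    {"dataset_name": "example", "format": "csv"},
    sep="\t",
    header=0,
)
print(csv_config["reader_options"])
```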
Started reading at 22:15:59.
Read 82,790 records over 14 seconds (5,913.6 records / second).
Wrote 82,790 records over 9 batches.
Finished reading at 22:16:14.
Started finalizing streams at 22:16:14.
Finalized 9 batches over 0 seconds.
Completed 1 out of 1 streams:
Completed writing at 22:16:14. Total time elapsed: 15 seconds
Completed `source-file` read operation at 05:16:14.
Here, we are only interested in the reviews.
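One way to pull just the review text out of the read result is sketched here. The column name `review_text` and the idea that the stream is named after the configured dataset are assumptions, and both helpers are hypothetical; adjust them to the actual schema.

```python
def clean_reviews(values):
    """Keep non-empty strings, stripped of surrounding whitespace."""
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]


def read_reviews(source, stream_name, column="review_text"):
    """Read every stream from a PyAirbyte source and return one column as a list."""
    source.select_all_streams()
    result = source.read()                   # records land in PyAirbyte's local cache
    frame = result[stream_name].to_pandas()  # cached stream as a pandas DataFrame
    return clean_reviews(frame[column].dropna().tolist())
```

With a connected source, `read_reviews(source, "reviews")[:5]` would return the first five review strings.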
['I liked the color, the silhouette, and the fabric of this dress. But the ruching just looked bunchy and ruined the whole thing. I was so disappointed, I really waned to like this dress. Runs a little small; I would need to size up to make it workappropriate.',
"From the other reviews it seems like this dress either works for your body type or it doesn't. I have a small waist but flabby tummy and this dress is perfect for me! The detail around the front hides everything and the clingyness of the dress makes me look curvier than usual. The material is thick but clings to your bum (enough that when you walk the bum jiggle shows through!) and the slit is a bit high so it's not necessarily office appropriate without tights, but it's a good dress with tights or for an occasion.",
"I love the design and fit of this dress! I wore it to a wedding and was comfortable all evening. The color is really pretty in person too! The fabric quality seems decent but not great so I'm not sure how many washes it will make it through.",
"I bought this dress for work it is flattering and office appropriate. It hits just above my knees and I am pretty short at 5'1. Depending on how you adjust the top it can be a little low cut in the front, especially if you have a short torso. The material is on the thinner side, so should be great for summer/early fall and will work with tights underneath as well. I love it!",
'This is a very professional look. It is Great for work !']
Populate the vector store. For this demo, we load only the first 100 reviews.
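The loading step might be sketched as follows, assuming OpenAI embeddings and an existing Pinecone index whose dimension matches the embedding model. The helper names are hypothetical, and the notebook's own cell may use a different embedding model or loading call.

```python
from itertools import islice


def take_first(texts, limit=100):
    """Return at most `limit` items from any iterable of texts."""
    return list(islice(texts, limit))


def populate_index(texts, index_name, limit=100):
    """Embed the first `limit` texts and upsert them into a Pinecone index."""
    # Lazy imports: these need langchain-openai and langchain-pinecone installed,
    # plus OPENAI_API_KEY and PINECONE_API_KEY set in the environment.
    from langchain_openai import OpenAIEmbeddings
    from langchain_pinecone import PineconeVectorStore

    return PineconeVectorStore.from_texts(
        take_first(texts, limit),
        embedding=OpenAIEmbeddings(),
        index_name=index_name,  # assumed to exist already
    )
```

Capping the input before embedding keeps the demo cheap; raising `limit` only changes how many texts are embedded and upserted.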
For the block of code below, you can refer to the LangChain documentation. We use it as is here:
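A hedged sketch of a retrieval chain in the style of the LangChain documentation. The prompt wording and the model name are assumptions rather than the notebook's exact cell, and `build_rag_chain` is a hypothetical helper.

```python
def format_docs(docs):
    """Join retrieved documents into one context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


# Assumed prompt wording; the notebook may use a different prompt.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)


def build_rag_chain(vectorstore, model="gpt-4o-mini"):
    """Compose retriever -> prompt -> chat model -> string output."""
    # Lazy imports so format_docs can be used without LangChain installed.
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_openai import ChatOpenAI

    retriever = vectorstore.as_retriever()
    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model=model)  # assumed model name
        | StrOutputParser()
    )
```

The chain takes a plain question string, so `build_rag_chain(vectorstore).invoke(question)` returns the model's answer as text.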
1. "Cute dress! Very comfy"
2. "Cute dress! Very comfy"
3. "This was a great dress"
PyAirbyte source-file provides an easy way for us to extract data from file systems in varied formats. It also offers flexibility and options for how we extract the data, which is convenient.