Automatically Create AI Embeddings using the PGVector Destination Connector

Alex Cuoci
January 9, 2025 • 8 min read
If you’ve dabbled at all in building AI applications, embeddings are a core concept. Computers don’t “get” words or pictures the way we do; they need everything boiled down to numbers. That’s where embeddings come in. They take complex stuff, like the meaning of a word or the vibe of a song, and map it to a series of numbers in a “vector space” (basically, a fancy math grid). An embedding might represent a row of data in your database, data inside an LLM, or the question a user types into a tool like ChatGPT.
Once you have an embedding representing a piece of data, services such as OpenAI can perform what is called a similarity search by comparing how “close” two embeddings are numerically. The closer two embeddings are, the more likely the result is related to your question. If you want to understand more about how embeddings and similarity search work, check out Susan Chang's presentation from a recent community event we held.
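To make “closeness” concrete, here is a small illustrative sketch (the vectors and values are made up) that scores toy embeddings with cosine similarity, the same measure behind pgvector's `<=>` cosine-distance operator:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: closer to 1.0 means more semantically similar."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real models produce hundreds or
# thousands of dimensions, but the math is identical.
dog = [0.9, 0.1, 0.2]
puppy = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.7]

print(cosine_similarity(dog, puppy))    # high score -- related concepts
print(cosine_similarity(dog, invoice))  # low score  -- unrelated concepts
```

The exact numbers don't matter; what matters is that related concepts land closer together in the vector space than unrelated ones.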
So how do you create embeddings in the first place? A lot depends on how you are developing your app. If you have a small set of data, you could write a Python script to create embeddings when you load the data into a vector-enabled datastore, something like PGVector.
```python
import openai
import psycopg2
from psycopg2.extras import execute_values

# Generate embeddings using OpenAI
def generate_embeddings(texts):
    embeddings = []
    for text in texts:
        response = openai.Embedding.create(
            input=text,
            model="text-embedding-ada-002"  # Replace with your desired model
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings

# Insert embeddings into the database
def insert_embeddings_to_db(data, embeddings, table_name="embeddings_table"):
    conn = psycopg2.connect(
        host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
        user=DB_USER, password=DB_PASSWORD
    )
    cursor = conn.cursor()

    # Ensure your table has columns like: id, text, and embedding
    sql = f"""
        INSERT INTO {table_name} (text, embedding)
        VALUES %s
    """
    # pgvector accepts the '[x, y, z]' text representation of a vector
    values = [
        (row['text'], str(embedding))
        for row, embedding in zip(data.to_dict(orient='records'), embeddings)
    ]

    # Use execute_values for bulk insertion
    execute_values(cursor, sql, values)
    conn.commit()
    cursor.close()
    conn.close()
```
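Before running a script like this, the target table needs the pgvector extension enabled and an embedding column sized to the model's output. Here is a minimal one-time setup sketch, assuming text-embedding-ada-002's 1,536 dimensions and hypothetical connection placeholders (adjust both for your environment):

```python
import psycopg2

# Placeholder credentials -- substitute your own Postgres/Supabase settings
DB_HOST, DB_PORT, DB_NAME = "localhost", 5432, "postgres"
DB_USER, DB_PASSWORD = "postgres", "postgres"

conn = psycopg2.connect(
    host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
    user=DB_USER, password=DB_PASSWORD
)
cursor = conn.cursor()

# Enable pgvector, then create a table whose vector size matches the model
cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS embeddings_table (
        id        BIGSERIAL PRIMARY KEY,
        text      TEXT NOT NULL,
        embedding VECTOR(1536)  -- 1,536 dims for text-embedding-ada-002
    );
""")
conn.commit()
cursor.close()
conn.close()
```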
However, if you are loading larger volumes of data, or frequently refreshing and updating it, a growing collection of Python scripts quickly becomes a maintenance nightmare. Thankfully, the Airbyte PGVector Destination Connector can generate embeddings without a single line of code, and since you are building on the Airbyte platform, all of the complex retry and sync logic is taken care of automatically.
Let’s say I am moving data from Stripe into a Postgres database with the PGVector extension, which, in my example, is running on Supabase. The goal is to build an AI bot that provides a chat interface for searching Customers, Invoices, and Products. I’ve already created my Stripe Source connector in Airbyte; now I just need to move the data to Postgres.
To create the embeddings and move the data to Supabase, all you need to do is create a new Destination connector, add your hosting details, and provide an API key for your preferred AI provider. I’m using OpenAI, but Airbyte also supports Cohere, Fake, Azure OpenAI, and any OpenAI-compatible source, with Hugging Face coming soon. Check out the docs for more info.
Now that Airbyte has moved the data and created the embeddings, I can jump back to my Python app and write a bot using the OpenAI Chat Completions API on top of the data in Postgres, using my newly created embeddings to perform the similarity search.
```python
import openai
import psycopg2

def generate_embedding(query):
    """Generate an embedding for the input query using OpenAI."""
    response = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def search_database(query, table_name, embedding_column, text_columns, top_n=5):
    """Search the database for similar items based on the query embedding."""
    conn = psycopg2.connect(
        host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
        user=DB_USER, password=DB_PASSWORD
    )
    cursor = conn.cursor()

    # Generate query embedding; pgvector accepts the '[x, y, z]' text form
    query_embedding = str(generate_embedding(query))

    # Prepare SQL for vector similarity search.
    # <=> is pgvector's cosine distance operator, so 1 - distance = similarity.
    sql = f"""
        SELECT id, {', '.join(text_columns)},
               1 - ({embedding_column} <=> %s::vector) AS similarity
        FROM {table_name}
        ORDER BY {embedding_column} <=> %s::vector
        LIMIT {top_n}
    """
    cursor.execute(sql, (query_embedding, query_embedding))

    # Fetch and return results
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results
```

I skimmed over a lot of the detail around the AI bot. Don't worry, we are putting the finishing touches on an AI course which will be available shortly. If you want to learn more about how to build end-to-end AI apps with Airbyte, make sure you sign up for the Developer Newsletter and check out the online tutorials.
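As a rough sketch of that final chat step (the model name, prompt wording, and wiring here are illustrative assumptions, not the course material), the retrieved rows can be passed as context to a chat completion call, reusing `search_database` and the pre-1.0 `openai` client style from the snippets above:

```python
def answer_question(query):
    """Retrieve similar rows, then ask the model to answer from them."""
    rows = search_database(
        query,
        table_name="embeddings_table",  # hypothetical table name
        embedding_column="embedding",
        text_columns=["text"],
        top_n=5,
    )
    # row layout is (id, text, similarity); join the text columns as context
    context = "\n".join(str(row[1]) for row in rows)

    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response['choices'][0]['message']['content']

print(answer_question("Which customers have unpaid invoices?"))
```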