Automatically Create AI Embeddings using the PGVector Destination Connector

Alex Cuoci
January 9, 2025 • 8 min read
If you’ve dabbled at all in building AI applications, embeddings are a core concept. Computers don’t “get” words or pictures the way we do; they need everything boiled down to numbers. That’s where embeddings come in. They take complex stuff, like the meaning of a word or the vibe of a song, and map it to a series of numbers in a “vector space” (basically, a fancy math grid). An embedding might represent a row of data in your database, data inside an LLM, or the question a user types into a tool like ChatGPT.
Once you have an embedding representing a piece of data, services such as OpenAI can perform what is called a similarity search by comparing how “close” two embeddings are numerically. The closer two embeddings are, the more likely the result is related to your question. If you want to understand more about how embeddings and similarity search work, check out Susan Chang's presentation from a recent community event we held.
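To make “closeness” concrete, here is a small illustrative sketch (the vectors and values are made up) that scores toy embeddings with cosine similarity, the same measure behind pgvector's `<=>` cosine-distance operator:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: closer to 1.0 means more semantically similar."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real models produce hundreds or
# thousands of dimensions, but the math is identical.
dog = [0.9, 0.1, 0.2]
puppy = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.7]

print(cosine_similarity(dog, puppy))    # high score -- related concepts
print(cosine_similarity(dog, invoice))  # low score  -- unrelated concepts
```

The exact numbers don't matter; what matters is that related concepts land closer together in the vector space than unrelated ones.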
So how do you create embeddings in the first place? A lot depends on how you are developing your app. If you have a small set of data, you could write a Python script to create embeddings when you load the data into a vector-enabled datastore, something like PGVector.
```python
import openai
import psycopg2
from psycopg2.extras import execute_values

# Generate embeddings using OpenAI
def generate_embeddings(texts):
    embeddings = []
    for text in texts:
        response = openai.Embedding.create(
            input=text,
            model="text-embedding-ada-002"  # Replace with your desired model
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings

# Insert embeddings into the database
def insert_embeddings_to_db(data, embeddings, table_name="embeddings_table"):
    conn = psycopg2.connect(
        host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
        user=DB_USER, password=DB_PASSWORD
    )
    cursor = conn.cursor()

    # Ensure your table has columns like: id, text, and embedding
    sql = f"""
        INSERT INTO {table_name} (text, embedding)
        VALUES %s
    """
    # pgvector accepts the '[x, y, z]' text representation of a vector
    values = [
        (row['text'], str(embedding))
        for row, embedding in zip(data.to_dict(orient='records'), embeddings)
    ]

    # Use execute_values for bulk insertion
    execute_values(cursor, sql, values)
    conn.commit()
    cursor.close()
    conn.close()
```
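Before running a script like this, the target table needs the pgvector extension enabled and an embedding column sized to the model's output. Here is a minimal one-time setup sketch, assuming text-embedding-ada-002's 1,536 dimensions and hypothetical connection placeholders (adjust both for your environment):

```python
import psycopg2

# Placeholder credentials -- substitute your own Postgres/Supabase settings
DB_HOST, DB_PORT, DB_NAME = "localhost", 5432, "postgres"
DB_USER, DB_PASSWORD = "postgres", "postgres"

conn = psycopg2.connect(
    host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
    user=DB_USER, password=DB_PASSWORD
)
cursor = conn.cursor()

# Enable pgvector, then create a table whose vector size matches the model
cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS embeddings_table (
        id        BIGSERIAL PRIMARY KEY,
        text      TEXT NOT NULL,
        embedding VECTOR(1536)  -- 1,536 dims for text-embedding-ada-002
    );
""")
conn.commit()
cursor.close()
conn.close()
```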
However, if you are loading larger volumes of data, or frequently refreshing and updating it, a growing collection of Python scripts quickly becomes a maintenance nightmare. Thankfully, the Airbyte PGVector Destination Connector can generate embeddings without a single line of code, and since you are building on the Airbyte platform, all of the complex retry and sync logic is taken care of automatically.
Let’s say I am moving data from Stripe into a Postgres database with the PGVector extension, which, in my example, is running on Supabase. The goal is to build an AI bot that provides a chat interface for searching Customers, Invoices, and Products. I’ve already created my Stripe Source connector in Airbyte; now I just need to move the data to Postgres.
To create the embeddings and move the data to Supabase, all you need to do is create a new Destination connector, add your hosting details, and provide an API key for your preferred AI provider. I’m using OpenAI, but Airbyte also supports Cohere, Fake, Azure OpenAI, and any OpenAI-compatible source, with Hugging Face coming soon. Check out the docs for more info.
Now that Airbyte has moved the data and created the embeddings, I can jump back to my Python app and write a bot using the OpenAI Chat Completions API on top of the data in Postgres, using my newly created embeddings to perform the similarity search.
```python
import openai
import psycopg2

def generate_embedding(query):
    """Generate an embedding for the input query using OpenAI."""
    response = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def search_database(query, table_name, embedding_column, text_columns, top_n=5):
    """Search the database for similar items based on the query embedding."""
    conn = psycopg2.connect(
        host=DB_HOST, port=DB_PORT, dbname=DB_NAME,
        user=DB_USER, password=DB_PASSWORD
    )
    cursor = conn.cursor()

    # Generate query embedding; pgvector accepts the '[x, y, z]' text form
    query_embedding = str(generate_embedding(query))

    # Prepare SQL for vector similarity search.
    # <=> is pgvector's cosine distance operator, so 1 - distance = similarity.
    sql = f"""
        SELECT id, {', '.join(text_columns)},
               1 - ({embedding_column} <=> %s::vector) AS similarity
        FROM {table_name}
        ORDER BY {embedding_column} <=> %s::vector
        LIMIT {top_n}
    """
    cursor.execute(sql, (query_embedding, query_embedding))

    # Fetch and return results
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results
```

I skimmed over a lot of the detail around the AI bot. Don't worry, we are putting the finishing touches on an AI course which will be available shortly. If you want to learn more about how to build end-to-end AI apps with Airbyte, make sure you sign up for the Developer Newsletter and check out the online tutorials.
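As a rough sketch of that final chat step (the model name, prompt wording, and wiring here are illustrative assumptions, not the course material), the retrieved rows can be passed as context to a chat completion call, reusing `search_database` and the pre-1.0 `openai` client style from the snippets above:

```python
def answer_question(query):
    """Retrieve similar rows, then ask the model to answer from them."""
    rows = search_database(
        query,
        table_name="embeddings_table",  # hypothetical table name
        embedding_column="embedding",
        text_columns=["text"],
        top_n=5,
    )
    # row layout is (id, text, similarity); join the text columns as context
    context = "\n".join(str(row[1]) for row in rows)

    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response['choices'][0]['message']['content']

print(answer_question("Which customers have unpaid invoices?"))
```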