PostgreSQL as a Vector Database: A Complete Guide

September 3, 2024
20 min read

Vector databases are data management systems that let you store and query vector data types. You can pair them with embedding models to generate vector embeddings, which are numerical representations of data records. Using these embeddings, your team can run semantic searches and ground LLM responses in relevant, accurate results.

According to a Markets and Markets report, the global vector database market will grow to USD 4.3 billion by 2028 at a CAGR of 23.3%. Several specialized vector databases can store and query vector data, but they add operational complexity and come with a steep learning curve.

To avoid these limitations, you can opt for PostgreSQL as a versatile and reliable vector database. It offers an extension, pgvector, that allows you to handle vector data types efficiently. In this article, you will learn how to generate and store Postgres vector embeddings for building robust AI- and ML-based applications.

What is PostgreSQL?


PostgreSQL is an open-source object-relational database that allows you to handle complex data workloads using SQL. It was developed in 1986 as part of the POSTGRES project at the University of California, Berkeley. Postgres, as it is commonly known, is a preferred database system among data scientists due to its robust architecture, reliable features, and active open-source community.

You can use Postgres to perform in-depth analysis across multiple data types, including booleans, dates, geometric types, and arrays. It is a highly flexible database that you can work with from various programming languages, such as Python, JavaScript, C/C++, and Ruby.

Postgres achieves this versatility through its support for numerous extensions, which are modules that add extra functionality to a database. You can load the required extensions into your database and use them like built-in features.

PostGIS, pg_stat_statements, pg_partman, pgcrypto, and postgres_fdw are some of the extensions supported by Postgres. Another is pgvector, which enables Postgres to act as a vector database. Let's look at it in more detail in the following section.

What is pgvector?

pgvector is an extension for PostgreSQL that allows you to store, query, and index vector data types. It also enables you to generate and store Postgres vector embeddings. You can use these embeddings to perform similarity and semantic searches in recommendation systems, content-based filtering software, or generative AI applications.

Postgres has become the preferred vector database in several organizations because teams can add vector search to a database they already operate instead of adopting a separate specialized system like Pinecone.

You can integrate pgvector into your Postgres database in the following way:

Step 1: Enabling pgvector Extension

You can use the following command to enable the pgvector extension:

CREATE EXTENSION vector;
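
To confirm that the extension is active, you can query Postgres's extension catalog. Here is a minimal sketch using psycopg2; the connection string is a placeholder you should adjust for your own setup:

import psycopg2

# Hypothetical connection string; replace with your own credentials.
conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/postgres")
cur = conn.cursor()

# Enable pgvector (idempotent) and report the installed version.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
print(cur.fetchone())  # e.g., ('0.7.4',)
conn.commit()
cur.close()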

Step 2: Create a Table in Postgres and Insert Data

Now, create a table in Postgres with a vector column and insert data records using the code below. Note that documents here is a list of sentences, and embeddings holds the vector embeddings for the corresponding sentences:

# DERIVED_EMB_SIZE is the dimensionality of your embeddings (e.g., 1536).
cur = conn.cursor()
cur.execute(f"CREATE TABLE items (id bigserial PRIMARY KEY, text TEXT, embedding vector({DERIVED_EMB_SIZE}));")
conn.commit()
cur.close()


cur = conn.cursor()
for index, item in enumerate(documents):
  # pgvector accepts a vector as its '[x1,x2,...]' string form.
  my_doc = {"id": index, "text": item, "embedding": str(embeddings[index].tolist())}
  cur.execute("INSERT INTO items (id, text, embedding) VALUES (%(id)s, %(text)s, %(embedding)s)", my_doc)
conn.commit()
cur.close()

Step 3: Retrieve Vector Data

This query computes the cosine similarity between each stored document and the user's query. Here, user_query_embedding is the vector embedding of the user query, which will be used to search the knowledge base.

To retrieve the data records in the Postgres vector database, you can use the following code:


cur = conn.cursor()
# <=> is pgvector's cosine-distance operator, so 1 - distance is the cosine similarity.
# Passing the embedding as a parameter (assumed to be a Python list) avoids string-built SQL.
cur.execute(
  "SELECT text, 1 - (embedding <=> %s::vector) AS cosine_similarity FROM items ORDER BY 2 DESC",
  (str(user_query_embedding),),
)
for r in cur.fetchall():
  print(r[0], r[1])
cur.close()
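
In practice, you usually want only the closest matches. Ordering by the distance operator directly lets pgvector use an approximate index if you have created one, and LIMIT caps the result set. A minimal variant of the query above:

cur = conn.cursor()
# Fetch the five most similar documents to the user query.
cur.execute(
  "SELECT text FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
  (str(user_query_embedding),),
)
top_matches = [row[0] for row in cur.fetchall()]
cur.close()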

How to Use pgvector for a PostgreSQL Vector Database?

If you are looking to create, store, and query vector embeddings using pgvector and OpenAI's models, you can follow the steps explained here. First, you will have to create an OpenAI API key by signing up for an OpenAI developer account. Then, you can store it as an environment variable and read it in your Python program:

import os
import openai
from dotenv import load_dotenv, find_dotenv

# Load OPENAI_API_KEY from a local .env file into the environment.
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

You can follow the steps given here to generate Postgres vector embeddings using pgvector:

Step 1: Install pgvector Extension

You should first install the pgvector extension into your self-hosted or managed PostgreSQL instance, whichever you are using. For Mac and Linux, run the following commands:

cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

For Windows, first install C++ support in Visual Studio, and then run the following command:

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

Then use the following commands to complete the installation:

set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install

Step 2: Create Embeddings for Your PostgreSQL Vector Database

Vector embeddings are what let you retrieve related records. To build them, you will have to create embeddings of your source text with the OpenAI API.

You can extract or scrape any data from blogs, documentation, or other information sources as per your requirements. To scrape data, you can take the help of a Python package like BeautifulSoup or an LLM framework like LangChain and store the results in a CSV file, as sketched below.
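
As an illustration, here is a minimal scraping sketch with requests and BeautifulSoup; the URL and the CSV layout are placeholders for your own sources:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical source URL; replace with the pages you want to index.
url = "https://example.com/blog/post-1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Keep the post title and the concatenated body paragraphs.
title = soup.find("h1").get_text(strip=True)
content = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

with open("blog_posts_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "content", "url"])
    writer.writerow([title, content, url])

You can then load the stored file with pandas: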

import pandas as pd

df = pd.read_csv('blog_posts_data.csv')
df.head()

OpenAI charges on a per-token basis for creating embeddings, so it is worth writing a small helper function that counts tokens and estimates the cost before you make any API calls; a sketch follows the next paragraph.

Tokens are a sequence of characters found in text, and one token is nearly three-fourths of a word. LLMs like GPT-3 and GPT-4 are trained to understand the statistical relation between tokens and can predict the next token in the series.
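
Here is a minimal cost estimator, assuming the tiktoken package; the per-1,000-token price is an illustrative placeholder, so check OpenAI's current pricing:

import tiktoken

# cl100k_base is the tokenizer used by text-embedding-ada-002.
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens(text):
    return len(encoding.encode(text))

def embedding_cost(texts, price_per_1k_tokens=0.0001):  # placeholder price
    total = sum(num_tokens(t) for t in texts)
    return total * price_per_1k_tokens / 1000

print(embedding_cost(df['content'].tolist()))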

OpenAI limits the number of tokens it can embed in a single request. To work within this limit, you can use the chunking technique, which breaks the text data down into smaller chunks. Each chunk holds a fixed number of tokens, and the right chunk size depends on your use case; a sketch of one approach follows.
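
Here is a simple token-based chunker, reusing the encoding and num_tokens helpers from above. The row layout of new_list ([title, content, url, tokens]) matches the dataframe columns used later in this section:

# Split a text into chunks of at most max_tokens tokens each.
def chunk_text(text, max_tokens=512):
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# One row per chunk: [title, content, url, tokens].
new_list = []
for _, row in df.iterrows():
    for chunk in chunk_text(row['content']):
        new_list.append([row['title'], chunk, row['url'], num_tokens(chunk)])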

Now, you can tokenize the information and generate embeddings using the following code:

def get_embeddings(text):
    # Calls the pre-1.0 OpenAI Python SDK; newer SDK versions use a different interface.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text.replace("\n", " ")  # newlines can degrade embedding quality
    )
    embedding = response['data'][0]['embedding']
    return embedding

You can create embeddings for each chunk of text using the given code:

# Embed each chunk and append the vector to its row.
for i in range(len(new_list)):
    text = new_list[i][1]
    embedding = get_embeddings(text)
    new_list[i].append(embedding)

Create a new dataframe from the list as shown below:

df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

You can save the original content and the embeddings associated with it in a CSV file for future reference.
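
For example (the output file name is just a placeholder):

# Persist content and embeddings so you can reload them later
# without re-calling the embeddings API.
df_new.to_csv('blog_data_and_embeddings.csv', index=False)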

Step 3: Use pgvector to Store Embeddings in a PostgreSQL Vector Database

In the next step, you can use pgvector to store the embeddings created earlier in the PostgreSQL vector database. To create a Postgres database on your local system, refer to the official PostgreSQL documentation. If you are looking to create a Postgres database using a cloud-based service, visit the official page of that cloud service.

After creating the Postgres database, you can export the connection string as an environment variable to be read in your Python program.

import os

# e.g., postgresql://user:password@host:port/dbname
connection_string = os.environ['POSTGRES_CONNECTION_STRING']

Connect to your Postgres database using the psycopg2 Python library:

import psycopg2

conn = psycopg2.connect(connection_string)
cur = conn.cursor()

You can use the register_vector() function from the pgvector-python package to register the vector type with your connection. This lets psycopg2 send and receive NumPy arrays and Python lists as pgvector values instead of plain strings.

from pgvector.psycopg2 import register_vector

register_vector(conn)

Proceed to create a table to store embeddings and metadata.

table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key, 
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
conn.commit()
cur.close()

You can now ingest the vector embeddings into the database using batch ingestion and then index the vector data for faster retrieval; an index example follows the insertion code below.

register_vector(conn)
cur = conn.cursor()

Prepare the list of tuples to add embeddings:

import numpy as np

# One tuple per row; np.array lets register_vector adapt the embedding column.
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]

Now, use execute_values to perform batch insertion:

from psycopg2.extras import execute_values

execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
conn.commit()
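
With the data in place, you can create an approximate nearest-neighbor index. This sketch uses pgvector's ivfflat index with the cosine-distance operator class; lists = 100 is just a common starting value to tune for your data size:

cur = conn.cursor()
# ivfflat trades a little recall for much faster similarity search.
cur.execute("CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);")
conn.commit()
cur.close()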

You can integrate this Postgres vector database with any LLM to enhance the accuracy of response generation.

Whenever you give a prompt to an LLM such as OpenAI's GPT models, the text in the prompt is first converted into a vector embedding. Your application then uses the Postgres vector database to find stored embeddings similar or related to the input embedding.

These embeddings provide the LLM with the context of the query and help it retrieve relevant information. Based on this information, the LLM generates the required outcome for the input query.
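
Putting the pieces together, here is a sketch of that flow, reusing the get_embeddings helper and the legacy pre-1.0 OpenAI SDK from earlier; the prompt wording and k = 3 are illustrative choices:

def answer_with_context(question, k=3):
    # 1. Embed the user's question.
    q_emb = get_embeddings(question)

    # 2. Retrieve the k most similar chunks from Postgres.
    cur = conn.cursor()
    cur.execute(
        "SELECT content FROM embeddings ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(q_emb), k),
    )
    context = "\n\n".join(row[0] for row in cur.fetchall())
    cur.close()

    # 3. Ask the model to answer from the retrieved context.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response['choices'][0]['message']['content']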

How Can Airbyte Help You Use PostgreSQL as a Vector Database?


Data integration is an important process when leveraging PostgreSQL as a vector database. It involves collecting and consolidating data from multiple sources into Postgres. After installing pgvector, your data can be stored as vector embeddings. You can then integrate Postgres with LLMs to build highly functional AI-based applications.

You can use Airbyte, a data integration platform, for efficient data integration. It offers an extensive library of 350+ connectors, which you can use to extract structured or unstructured data from any source. If the connector you wish to use is not in the existing set of connectors, you can build one using Airbyte’s Connector Development Kit (CDK).

You can convert the extracted data into vector embeddings using LLM providers like OpenAI or Cohere, which are supported by Airbyte. It also allows you to directly load unstructured data into vector databases like Pinecone, Milvus, or Weaviate. The data loaded to the vector database can be used to perform semantic and context-based searches in LLMs to optimize their outcomes.

Here are some key features of Airbyte:

  • Change Data Capture: The change data capture (CDC) feature of Airbyte enables you to instantly capture changes made in source data and reflect them in the vector database. You can leverage this feature to keep vector embeddings current, which helps LLMs generate more relevant responses.
  • RAG Operations: You can integrate Airbyte with an LLM framework like LangChain to perform RAG transformations like chunking to optimize information retrieval and streamline the outcomes of LLMs.
  • Data Security: Airbyte provides robust data security features such as single sign-on (SSO) and role-based access control mechanisms. The platform complies with data regulatory framework provisions like GDPR, CCPA, and HIPAA to ensure data privacy.

To leverage Airbyte for generating PostgreSQL vector embeddings, you can follow the steps below:

Step 1: Add a Source

  • Log in to your Airbyte account, and then click the Sources tab from the main navigation bar on the left.
  • In the Search box, enter the name of your source connector and click on its icon to go to the setup page. Here, enter the necessary information, including the source name and authentication method.
  • Then, click the Set up Source button to complete the Source configuration. 

Step 2: Add a Destination

  • After the source, set up PostgreSQL as the destination. To do this, click on the Destinations tab from the main navigation bar on the left.
  • In the Search box, enter PostgreSQL and click on its icon to go to the setup page and enter the necessary details.
  • Then click the Set up Destination button to finish configuring the destination.

Step 3: Set up and Configure the Connection

  • You can set up a new connection after setting up the source and destination.
  • Select the source and destination of your connection.
  • Now, set up the connection configuration. For instance, you can set up the replication frequency or schema change propagation.
  • Then, click on the Set Up Connection button to complete the configuration.

Now, you can extract the data from the source data system and load it to the Postgres database. After this, you can enable the pgvector extension in Postgres to store and query vector embeddings, which can then be used to produce accurate results in LLMs.

Conclusion

PostgreSQL is a reliable and high-performing database that can handle large volumes of data. You can integrate it easily with most other data systems. The database also supports a wide range of extensions that enhance its functionalities. One such extension is pgvector, which enables PostgreSQL to query vector data, making it a preferred choice for many AI and ML applications.

This blog gives a complete overview of the Postgres vector database. It explains how to install pgvector to generate, store, and query Postgres vector embeddings. You can harness these Postgres vector embeddings in recommendation systems, visual search software, applications generating text-based responses, and for personalized marketing in an enterprise.

FAQs

Is Postgres good for vector databases?

Postgres utilizes the pgvector extension to store, query, and retrieve vector data. The database can handle high volumes of data, making it suitable for large-scale AI applications. It also facilitates SQL-based querying, transactions, and security mechanisms.

Postgres also has active community support that regularly contributes to the development of its functionality. This makes Postgres a good vector database solution.

What is the best way to store vector embeddings?

Specialized vector databases like Pinecone, Milvus, or Weaviate are purpose-built for storing vector embeddings. However, you can also store vector embeddings in databases such as Postgres with the pgvector extension, MongoDB Atlas, or Apache Cassandra, as they support vector data types.

Is pgvector a vector database?

Yes, pgvector is an open-source extension for PostgreSQL that turns it into a vector database. It enables you to store and search for similar vector embeddings in Postgres, making the database suitable for use in search-based AI and ML applications.

What is the maximum size of a pgvector vector?

The vector type in pgvector can store up to 16,000 dimensions, although ivfflat and hnsw indexes support up to 2,000 dimensions. pgvector 0.7.0 introduced half-precision vectors (halfvec), which reduce the memory used to store vector data and can be indexed at up to 4,000 dimensions. The same release added binary vectors (bit) with up to 64,000 dimensions and sparse vectors (sparsevec) with up to 1,000 non-zero elements for indexing.
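
For example, the column type carries the dimension; a minimal sketch with illustrative dimension values:

cur = conn.cursor()
# Single-precision vectors (indexable up to 2,000 dimensions).
cur.execute("CREATE TABLE docs (id bigserial PRIMARY KEY, embedding vector(1536));")
# Half-precision vectors (pgvector >= 0.7.0, indexable up to 4,000 dimensions).
cur.execute("CREATE TABLE docs_half (id bigserial PRIMARY KEY, embedding halfvec(3072));")
conn.commit()
cur.close()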

Is Postgres faster than MongoDB?

MongoDB is a NoSQL database management system that usually performs better than Postgres for operations involving high write loads, horizontal scalability, and unstructured data. Postgres performs better in complex queries, transactions, and structured data operations.
