OpenAI Embeddings 101: A Perfect Guide For Data Engineers

September 4, 2024
25 min read

Organizations often need help understanding unstructured text data, such as customer feedback or documents. While this data is crucial, analyzing it using traditional methods can be challenging. OpenAI embeddings help convert unstructured text data into numerical representations, making it easier to process and analyze. 

With these embeddings, teams can enhance search capabilities, automate content categorization, and improve recommendations. This leads to smarter decisions and better customer experiences. OpenAI embeddings provide a powerful solution to unlock the potential of text data, driving more efficient and accurate data-driven results.

This article explains OpenAI embeddings, the available models, and their use cases in detail. Let’s explore!

What are Embeddings?

Embeddings are numerical representations of data that help machine learning models understand and compare different items. These embeddings convert raw data, such as images, text, videos, and audio, into vectors in a high-dimensional space where similar items are placed close to each other. This process simplifies the task of processing complex data, making it easier for ML models to handle tasks like recommendation systems or text analysis.

What are OpenAI Embeddings?

OpenAI embeddings are numerical representations of text produced by OpenAI's dedicated embedding models, such as text-embedding-3-small. Each vector captures the meaning of the text it represents. Because words and phrases become numbers, you can calculate how similar or different two pieces of text are. This is useful for tasks such as clustering, searching, and classification.

Beyond these applications, the models take the context of each word into account, not just the word itself. This results in more precise representations and helps you detect similar patterns and relationships in a large dataset, making the embeddings useful for semantic analysis.
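To make the idea of "calculating similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors are invented for illustration; real OpenAI embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated meaning
```

A score near 1 means the two vectors point in almost the same direction, which is how "similar meaning" shows up numerically.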

How Do Embeddings Work?

Understanding the workings of embeddings gives you insights into how the text is transformed into significant numerical data. Explore all the steps in detail:

Start With a Piece of Text

First, begin by selecting a piece of text, whether a phrase, sentence, or other fragment. This text will act as raw input for creating embeddings.

Break the Text Into Smaller Units

The text is then broken down into smaller units called tokens. Each token represents a word, part of a word, or a single character, depending on the tokenization method. Tokenization captures the essential elements of the text for further analysis.

Convert Each Token Into a Numeric Representation

Each token is converted into a numeric representation that can be processed by algorithms. These numeric values are initial embeddings that reflect the basic properties of each token.

Neural Network Processing

In this step, the numeric representation of each token is passed through a neural network. The network processes these tokens together, capturing the text's context and meaning through the deeper patterns and relationships between tokens.

Vector Generation for the Input

After processing, the neural network generates a vector. This vector contains the context and meaning of the input text. The vector or embedding can then be utilized in various applications, such as searching, clustering, and classification.
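The steps above can be sketched end to end in a few lines. This is a toy illustration, not how OpenAI's models work internally: the vocabulary and token vectors are invented, the tokenizer splits on words only, and a plain average stands in for the neural network.

```python
import re

# Toy vocabulary and embedding table. Both are invented for illustration;
# real models learn many thousands of sub-word tokens and much longer vectors.
VOCAB = {"the": 0, "guitar": 1, "sounds": 2, "great": 3, "<unk>": 4}
TOKEN_VECTORS = [
    [0.1, 0.0],  # the
    [0.9, 0.3],  # guitar
    [0.2, 0.7],  # sounds
    [0.4, 0.8],  # great
    [0.0, 0.0],  # <unk>
]

def tokenize(text):
    """Break the text into lowercase word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def to_ids(tokens):
    """Map each token to an integer ID, falling back to <unk>."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]

def embed(text):
    """Look up each token's vector and average them into one vector.

    The average is a stand-in for the neural network step.
    """
    vectors = [TOKEN_VECTORS[i] for i in to_ids(tokenize(text))]
    dims = len(TOKEN_VECTORS[0])
    if not vectors:
        return [0.0] * dims
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

print(tokenize("The guitar sounds great"))
print(embed("The guitar sounds great"))
```

The important property carries over to real models: whatever the length of the input, the result is a single fixed-length vector.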

OpenAI Embedding Models

OpenAI offers a range of embedding models designed for various text analysis tasks.

Let’s discuss some of the models here in detail.

| Model | Description | Output Size | Computational Efficiency | Use Cases |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | Third-generation embedding model that offers the greatest capability for both English and non-English text, making it ideal for complex text analysis tasks. | 3,072 dimensions | Lower (more resource-intensive) | Detailed tasks such as complex semantic analysis. |
| text-embedding-3-small | Third-generation model that improves on the ada-002 model and is built for faster processing. | 1,536 dimensions | Higher (less resource-intensive) | Tasks such as simple keyword searches and quick text classification. |
| text-embedding-ada-002 | Second-generation embedding model that replaces 16 first-generation models, optimizing performance for diverse text embedding tasks. | 1,536 dimensions | Moderate (balanced efficiency) | Versatile for a range of tasks such as content recommendations and general text analysis. |
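Output size has a direct effect on storage. The sketch below estimates the raw size of a collection of vectors for each model, assuming every value is stored as a 4-byte float32. That storage format is an assumption about your own database or file layout, not something the models dictate.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage format for each vector value

# Output sizes taken from the model comparison above.
DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

def storage_megabytes(model, num_vectors):
    """Raw size of num_vectors embeddings in megabytes (10^6 bytes)."""
    total_bytes = DIMENSIONS[model] * BYTES_PER_FLOAT32 * num_vectors
    return total_bytes / 1_000_000

for model in DIMENSIONS:
    print(model, storage_megabytes(model, 1_000_000), "MB per million vectors")
```

Under this assumption, a million text-embedding-3-large vectors need twice the space of a million text-embedding-3-small vectors, before any index overhead.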

OpenAI Embedding Use Cases For Data Engineers

In this section, you will explore how data engineers can utilize these embeddings to resolve real-world problems and enhance performance.

Semantic Search and Information Retrieval

OpenAI embeddings help you return more accurate search results by capturing the meaning and context of a query, even when the query and the documents use different words for the same thing. You can store these embeddings in a similarity index, leading to faster and more efficient retrieval of relevant information.

For example, if you search for “buy Bluetooth mouse,” embeddings provide results related to the electronic device instead of the animal. Embeddings capture the hidden semantics that help you retrieve contextually relevant information about your topic. 
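A minimal sketch of this kind of search follows. It ranks documents by cosine similarity to a query vector. The document IDs and vectors are hypothetical stand-ins. In practice both the documents and the query would be embedded by the same model, and a vector database would replace the brute-force loop.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vector, index, top_k=2):
    """Return the top_k (score, doc_id) pairs, best match first."""
    scored = [(cosine_similarity(query_vector, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical document vectors, invented for this example.
index = {
    "wireless-mouse": [0.9, 0.1, 0.1],
    "usb-keyboard": [0.7, 0.3, 0.0],
    "field-mouse-facts": [0.1, 0.9, 0.2],
}
# Stand-in for the embedding of the query "buy Bluetooth mouse".
query = [0.8, 0.2, 0.1]

for score, doc_id in search(query, index):
    print(doc_id, round(score, 3))
```

With these made-up vectors, the two electronics documents rank above the page about the animal, mirroring the Bluetooth mouse example.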

Text Classification and Clustering

OpenAI embeddings capture semantic nuances and enable text classification based on predefined categories. They also help you create accurate models for semantic analysis and topic identification tasks. For instance, customer reviews on a website can be classified as positive or negative using text classification techniques.

You can use embeddings to group texts that share similar concepts or themes. This is particularly useful in identifying underlying patterns, discovering hidden entity relationships, and sorting large datasets by topic.
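One simple way to classify with embeddings is a nearest-centroid rule: average the vectors of labeled examples per category, then assign new text to the closest average. The sketch below assumes two-dimensional vectors and labels that are invented for illustration, and it is only one of several techniques you could use here.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def classify(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Hypothetical embeddings of labeled reviews.
labeled = {
    "positive": [[0.9, 0.1], [0.8, 0.2]],
    "negative": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: centroid(vectors) for label, vectors in labeled.items()}

print(classify([0.7, 0.3], centroids))  # positive
print(classify([0.2, 0.8], centroids))  # negative
```

The same centroids double as cluster summaries: texts that land near the same centroid share a theme.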

Recommendation Systems

Embeddings enhance recommendation systems by analyzing your purchase history and browsing behavior and providing semantically related suggestions. By understanding the relationships between different items, embeddings match preferences with similar content. 

This approach provides more accurate and personalized recommendations. For example, if you enjoy a particular genre of movies, the system can recommend other films with similar themes, improving user satisfaction and engagement.
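As a rough sketch of this idea, the code below builds a taste profile by averaging the embeddings of items a user liked, then recommends the unseen item closest to that profile. The catalog and its vectors are hypothetical, and production systems typically combine this signal with others.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def recommend(liked_ids, catalog, top_k=1):
    """Rank unseen items by similarity to the average of the liked items."""
    liked = [normalize(catalog[item_id]) for item_id in liked_ids]
    dims = len(liked[0])
    profile = normalize([sum(v[d] for v in liked) / len(liked) for d in range(dims)])
    scores = {
        item_id: sum(p * x for p, x in zip(profile, normalize(vec)))
        for item_id, vec in catalog.items()
        if item_id not in liked_ids
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical movie embeddings, invented for this example.
catalog = {
    "space-opera": [0.9, 0.1, 0.0],
    "alien-invasion": [0.8, 0.2, 0.1],
    "time-travel-thriller": [0.7, 0.1, 0.3],
    "romantic-comedy": [0.1, 0.9, 0.1],
}

print(recommend(["space-opera", "alien-invasion"], catalog))
```

Here a user who liked two science-fiction titles is pointed to the third one rather than the romantic comedy.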

Anomaly Detection

You can employ OpenAI embeddings to analyze the underlying structure of the data and distinguish between genuine anomalies and normal variations. This helps reduce false positives and makes it easy to detect anomalies even in real-time applications.

For instance, embeddings can help you identify unusual transaction patterns that differ from usual behavior, enabling early detection of fraudulent activities. This approach can detect anomalies more accurately than conventional methods.
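One simple way to act on this is to flag vectors that sit unusually far from the center of the data. The sketch below uses a threshold of three times the median distance, which is an arbitrary choice for illustration and would need tuning on real data. The vectors are invented.

```python
import math
import statistics

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def find_anomalies(vectors, ratio=3.0):
    """Return indexes of vectors farther than ratio x the median distance from the centroid."""
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    threshold = ratio * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > threshold]

# Hypothetical embeddings: five similar records and one that stands apart.
vectors = [
    [0.50, 0.50],
    [0.52, 0.48],
    [0.49, 0.51],
    [0.51, 0.52],
    [0.48, 0.49],
    [0.95, 0.05],
]

print(find_anomalies(vectors))  # [5]
```

Using the median keeps the threshold stable even though the outlier itself pulls the centroid toward it.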

Natural Language Processing Tasks

OpenAI embeddings can serve as input features for machine-learning models, improving their efficacy on downstream tasks. These include text summarization, topic modeling, and machine translation.

For example, embeddings can help a machine translation system correctly translate a phrase like "It's raining cats and dogs" into a similar idiom of another language. This is despite the fact that the literal meaning of the words is nonsensical. 

How to Use OpenAI Embeddings?

To use OpenAI Embeddings, you can follow these steps.

Step 1: Set up the Python Environment

You can download and install Python from the official Python website. The venv module used below ships with Python 3, so no extra package is required. If you prefer the third-party virtualenv tool instead, install it by running the following command in your terminal:


pip install virtualenv

Navigate to your project folder and create a virtual environment to manage dependencies by running the command below in the command line interface (CLI):


python -m venv myenv

You can now activate the virtual environment.

On macOS or Linux, execute the command below:


source myenv/bin/activate 

On Windows CLI, use:


myenv\Scripts\activate.bat

Step 2: Import the OpenAI & Libraries

Before importing the necessary libraries, ensure you install each using the command mentioned below in your CLI:


pip install -U openai pandas

Create a Python file (.py extension) to import the required libraries and OpenAI API, or use Jupyter Notebook and execute all the code mentioned below.


import os
import openai
import pandas as pd    

Now, you can set up the OpenAI API key by replacing "YOUR_API_KEY" with your actual API key in the code below:


openai.api_key = "YOUR_API_KEY"

Step 3: Create a Function to Get Embeddings

Follow this section to build a function that can create embeddings from textual information. Here, you can use the ada version 2 model, text-embedding-ada-002, to generate embeddings cost-effectively.


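def get_embedding(text_to_embed):
    """Return the embedding vector for a single piece of text."""
    # openai>=1.0 exposes the endpoint as openai.embeddings.create;
    # the older openai.Embedding.create call was removed in that release.
    response = openai.embeddings.create(
        model="text-embedding-ada-002",
        input=[text_to_embed],
    )
    # The response holds one item per input; take the first (and only) vector.
    return response.data[0].embedding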

Using a Sample Dataset

This example uses a sample dataset of reviews that Amazon customers left for musical instruments, originally published on Kaggle. Analyzing these reviews can produce insights that help you understand customer behavior and expand business opportunities.

Load the CSV copy of the dataset hosted on GitHub into your notebook and print the first five rows:


data_URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/Musical_instruments_reviews.csv"

review_df = pd.read_csv(data_URL)
review_df.head()

Output:

For this example, the only column you need is reviewText, which contains the customers' reviews. To keep only that column in the DataFrame, execute the code below:


review_df = review_df[['reviewText']]
print("Data shape: {}".format(review_df.shape))
display(review_df.head())

Output:

Data shape: (10261, 1)

The output shows the first five rows of the new DataFrame, which contains only the reviewText column. The full dataset has 10,261 rows, and embedding every one of them adds API cost that this example does not need. You can sample 100 random rows instead:


review_df = review_df.sample(100)

Step 4: Call Your Function With the Text

After preparing the dataset, you can now use the get_embedding function to generate an embedding for each row of the DataFrame. Execute the code below in your notebook to create a new embedding column that holds the vector for each review.


review_df["embedding"] = review_df["reviewText"].astype(str).apply(get_embedding)
# reset_index returns a new DataFrame, so assign the result back.
review_df = review_df.reset_index(drop=True)

Print the top 10 rows of the new DataFrame:


review_df.head(10)

Output:

Beyond these steps, you can explore text similarity by clustering the embeddings. Visualizing each cluster can help you understand how OpenAI embeddings group related texts. You can follow this in-depth tutorial to learn more about text processing in LLMs.
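Calling the API once per row, as Step 4 does, becomes slow on larger datasets. Because the embeddings endpoint accepts a list of inputs, one possible improvement is to send texts in batches. The sketch below shows the batching logic only. The names embed_in_batches and embed_batch are hypothetical, and a fake embedding function stands in for the API call so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=50):
    """Embed texts with one call per batch instead of one call per text.

    embed_batch is any function that maps a list of strings to a list of
    vectors in the same order, for example a thin wrapper around the API.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for a real API call: returns a one-dimensional fake vector per text.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

To use it with the DataFrame, you could pass review_df["reviewText"].astype(str).tolist() as texts and assign the returned list to the embedding column. The batch size of 50 is an arbitrary starting point.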

How Airbyte Helps in Enhancing Embeddings?

In the previous section, you used a demo dataset with 100 rows. However, in real-world applications, you can have a huge amount of data to embed before you can build an accurate LLM application. This is especially applicable if your organization deals with data from different sources. Storing this data in a single destination is critical, as it makes the data easier to access downstream.

To integrate data from numerous sources into a single destination, you can leverage no-code tools like Airbyte.

Airbyte is a data replication tool that provides 350+ pre-built data connectors for extracting data and loading it into a destination of your choice. It allows you to load unstructured data into popular vector databases, including Pinecone, Weaviate, and Milvus, which LLM applications can then query.

With its support for RAG-specific transformations, such as LangChain chunking and OpenAI embeddings, you can transform and store data in a single operation. Airbyte extends its functionality by offering an extensive Python library, PyAirbyte, to perform data movement.
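To illustrate what chunking means in this context, here is a minimal sketch that splits a long text into overlapping word-based chunks before embedding. It is not Airbyte's or LangChain's implementation. The function name chunk_words and the default sizes are assumptions, and real pipelines usually count tokens, not words.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of chunk_size words that share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap repeats a few words across neighboring chunks so that a sentence cut at a boundary still appears intact in at least one chunk.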

Here’s how you can extract data using PyAirbyte. Run the code below in Jupyter Notebook to do so:


%pip install --quiet airbyte

Import the PyAirbyte library:


import airbyte as ab

To get all the available connectors, execute the code below:


ab.get_available_connectors()

You can now check the list of connectors and create and install the source of your choice:

source: ab.Source = ab.get_source("source-faker")

Configure the source by setting the count according to the size of the dataset in the code below:


source.set_config(
    config={
        "count": 50_000,
        "seed": 123,
    },
)

You can now verify the source connection setup:

source.check()

Finally, you can select all of the source's streams and read data into the internal cache:


source.select_all_streams()
read_result: ab.ReadResult = source.read()

Now, you can feed this data to your get_embedding function to turn textual information into a vector representation. Extracting data this way lets your LLM application draw on the data you work with every day, so you can cater to your customers' specific requirements.

OpenAI Embedding Pricing

You can evaluate the cost of using OpenAI embeddings by comparing the pricing of its embedding models. Below is the pricing for each model to help you select the most suitable option.

  • text-embedding-3-large: The standard rate is $0.130 per million tokens, and the batch API rate is $0.065 per million tokens.
  • text-embedding-3-small: The standard rate is $0.020 per million tokens, and the batch API rate is $0.010 per million tokens.
  • text-embedding-ada-002 (Ada v2): The standard rate is $0.100 per million tokens, and the batch API rate is $0.050 per million tokens.
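To turn these rates into a budget figure, you can multiply the expected token volume by the price per million tokens. The sketch below uses the standard rates listed above. The workload of 10,261 reviews at 100 tokens each is an assumed figure for illustration, and prices may change, so check the current pricing page before relying on the numbers.

```python
# Standard (non-batch) prices in USD per million tokens, from the list above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}

def estimate_cost(model, num_texts, avg_tokens_per_text):
    """Estimated cost in USD of embedding num_texts texts."""
    total_tokens = num_texts * avg_tokens_per_text
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Assumed workload: 10,261 reviews averaging 100 tokens each.
for model in PRICE_PER_MILLION_TOKENS:
    print(model, round(estimate_cost(model, 10_261, 100), 4), "USD")
```

Even on the largest model, this assumed workload costs well under a dollar, so the choice of model matters mostly at much larger volumes.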

A Few Alternatives to OpenAI Embeddings

Many alternatives to OpenAI embeddings are available in the market. Here are the top three options you can explore and decide if they suit your organization’s unique needs.

1. Cohere

Cohere’s embedding API is suitable for short texts up to 512 tokens. It uses a method inspired by Reimers and Gurevych to create detailed embeddings for each token and averages them to produce the text representation. The API truncates longer texts to stay within the 512-token limit but still provides robust embeddings.

2. Mistral

Mistral offers a powerful embedding API that helps you transform text into numerical data, which is useful for tasks such as sentiment analysis and text classification. Mistral’s embedding API is easy to use, reliable, and capable of handling large amounts of data. By utilizing it, you can build advanced AI models and gain better insights from your data.

3. NLP Cloud

NLP Cloud provides an embedding API using Multilingual Mpnet Base v2, which offers 768-dimensional embeddings. It offers fast response times and allows you to use a pre-trained model, create a custom model, or upload your own for a specific task. NLP Cloud makes it easy to test embeddings locally and use them reliably.

Conclusion

OpenAI embeddings are powerful tools for data engineers to maximize the potential applications of unstructured text data. By converting text into numerical representations, embeddings enable machines to understand and process text more effectively. This improves various tasks, such as anomaly detection, natural language processing, and personalized recommendations.

By understanding how embeddings work and weighing the alternatives offered by other vendors, data engineers can choose the right embedding model to streamline several NLP tasks. This also helps them solve complex problems and drive innovation for their organization's sustainable growth.

FAQs

How does ChatGPT create embeddings?

The models behind ChatGPT use neural networks trained on large amounts of text data. This training enables them to represent words and phrases as high-dimensional vectors (embeddings) based on their context.

How big are OpenAI embeddings?

The length of the embedding vector for text-embedding-3-small is 1536, and for text-embedding-3-large, it is 3072. This dimensionality reflects the amount of information captured about the text data.

Can I use OpenAI Embeddings for free?

OpenAI embeddings are not free to use. The cost varies depending on your usage, specific models, and the volume of tokens passed. For more information, visit the OpenAI pricing page or talk to their sales team.

What model does OpenAI use for embedding?

OpenAI uses various embedding models, with text-embedding-ada-002 being a popular choice. This model is known for its balance of performance and efficiency, making it popular for generating high-quality embeddings.

Are OpenAI embeddings better than BERT?

OpenAI embeddings capture subtle semantic relationships and are suitable for question answering, while BERT is more suited for text classification and named entity recognition.

Are OpenAI embeddings normalized?

Yes. OpenAI embeddings are normalized to unit length, so every vector has the same magnitude. This lets you compare embeddings quickly, because cosine similarity reduces to a simple dot product.
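The effect of normalization can be checked with a short sketch. The two-dimensional vectors below are invented for illustration: once both are scaled to length 1, their dot product is their cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(dot(a, a))  # squared length of a unit vector, approximately 1
print(dot(a, b))  # cosine similarity of the two vectors, approximately 0.96
```

Skipping the division by the vector lengths saves work when you compare one query against many stored vectors.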
