Explore the basics of image and multimodal embeddings in AI. Learn how embeddings capture data attributes and improve product recommendations and image searches.
In machine learning and artificial intelligence applications, it is necessary for algorithms to understand the "meaning" of text, images, sounds, and more. To achieve this, data—whether text, images, or audio—is transformed into multidimensional vectors, known as embeddings. These vectors capture various abstract attributes of the data. Each dimension of the vector represents a different attribute, and the value along each dimension indicates the importance of that attribute.
This article is the first part of a two-part series. Here, we introduce the concept of embeddings using an intuitive, fictional e-commerce example. We'll explore the basics of image embeddings and multimodal embeddings (combining image and text) and demonstrate their practical applications.
In part two, we discuss text-based embeddings, covering traditional word embeddings, contextualized word embeddings, and sentence embeddings.
Intuitive Introduction to Embeddings
Consider a situation where you want to discover (and recommend) products similar to other products.
Simplistically, to discover similar products, you might choose other products that are in a similar price range, in the same category, or have similar user ratings, and so on. However, such a system does not let you quantify how similar two products are. It also does not account for the fact that certain product attributes are more important than others.
To do this in a more structured and mathematical way, you use a multidimensional vector. The dimensions (axes) of this vector can represent attributes like:
- Value for money
- User ratings
- Ease of use
- And other relevant features
Such a vector is called an embedding. The multidimensional space in which these vectors exist is called the embedding space. Each of the above attributes is represented by an axis of this multidimensional space.
The value of the vector along each dimension is the relative significance of the attribute which that dimension represents.
- You can rank different products by computing a relevant metric, such as the sum of the squared attribute values, analogous to the squared distance from the origin in a Cartesian system.
- You can compute the relative similarity of different products by computing, for example, the dot product of their representative vectors.
- You can determine the products most similar to a given product by computing the "nearest neighbors" of its embedding vector.
Consider a rudimentary example where you have three attributes (say, "value for money", "average customer satisfaction", "environmentally friendly") to define a product. Each dimension can take a value between 0 and 99.
You can create three tensors in Python to represent the embeddings of three products with different values along each dimension:
import torch
product1 = torch.tensor([20., 30., 40.])
product2 = torch.tensor([21., 29., 42.])
product3 = torch.tensor([72., 99., 12.])
Notice that product1 and product2 have values close to each other along each dimension, while product3 is different.
If you plot these on a 3D graph, product1 and product2 will be positioned close together while product3 will be farther away. Later sections illustrate how to mathematically represent this intuitive understanding.
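As a quick check of this intuition, the straight-line (Euclidean) distance between the products can be computed directly. The following is a minimal sketch in plain Python rather than PyTorch, so it runs without extra packages. The `products` dictionary and the `nearest` helper are illustrative names, not part of the original example.

```python
import math

# The same three product vectors as above, as plain Python lists.
products = {
    "product1": [20.0, 30.0, 40.0],
    "product2": [21.0, 29.0, 42.0],
    "product3": [72.0, 99.0, 12.0],
}

def nearest(name):
    """Return the (other_name, distance) pair closest to the named product."""
    others = [
        (other, math.dist(products[name], vec))
        for other, vec in products.items()
        if other != name
    ]
    return min(others, key=lambda pair: pair[1])

d12 = math.dist(products["product1"], products["product2"])
d13 = math.dist(products["product1"], products["product3"])
print(f"product1 to product2: {d12:.2f}")
print(f"product1 to product3: {d13:.2f}")
print("nearest to product1:", nearest("product1")[0])
```

The distance from product1 to product2 comes out far smaller than the distance from product1 to product3, which matches what the 3D plot would show.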
In the above example, the dimensions of the embedding vector correspond to human-understandable subjective attributes. But in practical machine learning systems, such as those used for text and images, the vector dimensions represent abstract attributes.
You neither choose nor necessarily understand what these attributes signify. You just specify the number of dimensions you want your vectors to have. The training process then determines the appropriate values along each dimension, based on your training goals and input data.
In general, the more granular you want the system’s understanding to be and the more computing power you are prepared to expend, the higher the number of dimensions you choose.
Distance Metric
Now that you have represented products as vectors in attribute-space, you need a way to compute and compare their relative similarity. The quantity that expresses the difference between two vectors is called the distance metric. Some common distance and similarity metrics are:
- Euclidean Distance (L2 Norm)
- L2 Squared Distance
- Manhattan Distance (L1 Norm)
- Minkowski Distance (Lp Norm)
- Haversine Distance
- Hamming Distance
- Dot Product
- Cosine Similarity
- Chebyshev Distance
- Jaccard Index
- Sorensen-Dice Index
The details of the different metrics and their use cases are a broad topic, beyond the scope of this article.
As a rule of thumb, you use the same metric that was used to train the model used to generate the embeddings. AI applications frequently use cosine similarity, so the examples in this article use it too.
Vectors that are very similar to each other have a high cosine similarity score and dissimilar vectors have a low score.
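To make a few of the listed metrics concrete, here is a minimal sketch in plain Python, applied to product1 and product3 from the earlier example. The helper function names are illustrative and are not taken from any particular library.

```python
import math

a = [20.0, 30.0, 40.0]  # product1
b = [72.0, 99.0, 12.0]  # product3

def manhattan(u, v):
    # L1 norm of the difference: sum of the absolute per-dimension gaps
    return sum(abs(x - y) for x, y in zip(u, v))

def chebyshev(u, v):
    # Largest gap along any single dimension
    return max(abs(x - y) for x, y in zip(u, v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print("Manhattan:", manhattan(a, b))
print("Chebyshev:", chebyshev(a, b))
print("Dot product:", dot(a, b))
print("Cosine similarity:", round(cosine_similarity(a, b), 4))
```

Note that the distances grow with the size of the gap between vectors, whereas cosine similarity depends only on the angle between them, so scaling a vector by a positive constant leaves its cosine similarity to other vectors unchanged.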
Cosine Similarity
You can now compute the cosine similarities of the products you represented earlier as vectors like this in Python:
cosine_similarity = torch.nn.CosineSimilarity(dim=0)
cosine_similarity(product1, product2)
Output:
tensor(0.9993)
cosine_similarity(product1, product3)
Output:
tensor(0.7383)
The numerical values of the cosine similarity confirm the observation made earlier using human judgment: product1 and product2 are more similar to each other than product1 and product3 are.
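For reference, the first score follows from the standard definition of cosine similarity. Worked through for product1 and product2 (written here as p_1 and p_2, a notation introduced only for this derivation):

```latex
\cos(\theta)
  = \frac{\mathbf{p}_1 \cdot \mathbf{p}_2}{\lVert \mathbf{p}_1 \rVert \, \lVert \mathbf{p}_2 \rVert}
  = \frac{20 \cdot 21 + 30 \cdot 29 + 40 \cdot 42}{\sqrt{20^2 + 30^2 + 40^2} \, \sqrt{21^2 + 29^2 + 42^2}}
  = \frac{2970}{\sqrt{2900} \, \sqrt{3046}}
  \approx 0.9993
```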
If you’d like to run this code, this Google Colab notebook includes the above code as well as other examples.
Based on similar principles as the above simplistic example, the following sections illustrate the use of embeddings in practical applications such as determining the relative similarity of images with each other using image embeddings, and determining the similarity of image and text pairs using multimodal embeddings.
Image Embeddings
The semantic meaning of an image is captured as a whole by all the dimensions of its embedding vector. This sounds vague, but it is the reality. Individual dimensions of the embedding vector do not have a specific meaning by themselves. The training process and the training dataset determine the abstract semantic attributes of the image to encode along each vector dimension.
A crude (but incorrect) analogy is that one dimension captures the presence of certain patterns like concentric circles, another dimension captures the presence of certain types of angular shapes, and so on.
The attributes that need to be captured depend on the task at hand. A classification model might need to focus on different attributes compared to a recognition model. Thus, it is important to train the model on a dataset that sufficiently represents the real-world data on which the trained model is expected to work.
One of the key challenges in computer vision is to classify images and recognize similar images. Expressing images as vectors makes it easy to mathematically compute their similarity.
Consider, for example, Google Image Search, where you upload an image and it returns images that look similar to the one you uploaded. An easy way to do this is by returning images whose embedding vectors have high cosine similarity values to the uploaded image.
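The idea of returning the images with the highest cosine similarity can be sketched as a brute-force search over a small index. This is an illustration only: the file names and the 4-dimensional vectors are made up, and the `search` helper is a hypothetical name. It is not a description of how any real image search service is implemented.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def search(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 4-dimensional embeddings; real image embeddings are much larger.
index = {
    "cat_on_sofa.jpg": [0.9, 0.1, 0.0, 0.2],
    "kitten.jpg": [0.8, 0.2, 0.1, 0.1],
    "motorcycle.jpg": [0.0, 0.9, 0.8, 0.1],
}
query = [0.85, 0.15, 0.05, 0.15]  # stands in for the uploaded image's embedding

for name, score in search(query, index):
    print(f"{name}: {score:.3f}")
```

Comparing the query against every stored vector is fine for a handful of images. Large collections typically rely on approximate nearest-neighbor indexes instead of an exhaustive scan.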
Some use-cases of image embeddings are:
- Searching for similar images
- Determining the similarity of given images
- Classifying images
ResNet-50 is a well-known model used for image classification tasks. It is a convolutional neural network with 50 layers.
The default pre-trained model, which is available from PyTorch, was trained on the ImageNet-1k dataset. This dataset has over 1 million images split into 1000 classes. The image embeddings encode only those features of the images that are necessary to accurately bucket them into the limited number of labels.
To express images using embeddings, one needs to use tensors with a large number of dimensions. An image is worth a thousand words, the old saying goes. Even a rudimentary model, such as ResNet-50, uses embedding tensors with 2048 dimensions. Consider also that ResNet was trained on low resolution color images with only 224 x 224 pixels.
Image Embeddings in Practice
To get a hands-on understanding of image embeddings, let’s import the ResNet-50 model with the default (pre-trained) weights:
import torchvision.models as models
model_resnet = models.resnet50(weights="ResNet50_Weights.DEFAULT")
Then, inspect the layers of the model:
print(model_resnet)
Scroll to the bottom and notice that the description of the final layer looks like this:
(fc): Linear(in_features=2048, out_features=1000, bias=True)
This is a linear layer accepting a tensor with 2048 dimensions. This final layer uses the image-features encoded in the 2048 dimensions of the embedding vector to predict the class of the image. It outputs a tensor with 1000 dimensions. These correspond to the 1000 classes of images into which the input image is to be classified.
In this example, the goal is not to classify the input image, but to get its embedding vector. The embedding vector, with 2048 dimensions, is output by the pre-final layer. So, extract the layers of the model up to and including the pre-final layer, while excluding the final (classifier) layer:
model_resnet_em = torch.nn.Sequential(*(list(model_resnet.children())[:-1]))
model_resnet_em.eval()
The above network outputs the 2048-dimensional embedding vector of the image.
Manually get the URLs of some interesting images. In this example, the first two images are cats and the third is a motorcycle:
url1 = "https://images.pexels.com/photos/2071881/pexels-photo-2071881.jpeg"
url2 = "https://images.pexels.com/photos/2071882/pexels-photo-2071882.jpeg"
url3 = "https://images.pexels.com/photos/1715193/pexels-photo-1715193.jpeg"
[Figures: the images from url1, url2, and url3]
Before processing arbitrary images, it is necessary to standardize them. Define a preprocessing function to do this:
import torchvision.transforms as transforms
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.5, .5, .5], std=[0.5, .5, .5])
])
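This preprocessing function resizes input images so that their shorter side is 256 pixels (preserving the aspect ratio) and then keeps only the central 224 x 224 pixels, because the ResNet model was trained on images of this size. It then converts the image to a tensor and normalizes each color channel.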
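The arithmetic behind these steps can be sketched in plain Python. The three helper functions below are hypothetical stand-ins that mirror what Resize, CenterCrop, and Normalize do conceptually. The exact rounding inside torchvision may differ by a pixel, so treat the numbers as an approximation.

```python
def resize_shorter_side(width, height, target=256):
    """Scale (width, height) so the shorter side equals target, keeping aspect ratio."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target

def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def normalize(value, mean=0.5, std=0.5):
    """Shift and scale one channel value, as Normalize does per channel."""
    return (value - mean) / std

resized = resize_shorter_side(1920, 1080)  # a landscape photo
box = center_crop_box(*resized)
print(resized, box)
print(normalize(0.0), normalize(1.0))      # darkest and brightest pixel values
```

With a mean and standard deviation of 0.5, pixel values in the range 0 to 1 are mapped to the range -1 to 1.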
Let’s create another function to extract an image file from a URL and standardize the downloaded image for further processing:
from PIL import Image
import requests
from io import BytesIO
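def get_processed_image(my_url):
    # Download the image and decode it from the response bytes
    res = requests.get(my_url)
    # Convert to RGB so grayscale or RGBA files also have three channels
    image = Image.open(BytesIO(res.content)).convert("RGB")
    processed_image = preprocess(image)
    # Add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
    return processed_image.unsqueeze(0)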
Now, get the embeddings of all three images:
embedding_image1 = model_resnet_em(get_processed_image(url1)).squeeze()
embedding_image2 = model_resnet_em(get_processed_image(url2)).squeeze()
embedding_image3 = model_resnet_em(get_processed_image(url3)).squeeze()
Then, take a look at the shape and dimension of the embeddings:
print(embedding_image1.shape)
print(embedding_image1.dim_order())
Create a cosine similarity function, then evaluate and print the cosine similarities of the different image pairs:
cos = torch.nn.CosineSimilarity(dim=0)
print(cos(embedding_image1, embedding_image2))
print(cos(embedding_image1, embedding_image3))
print(cos(embedding_image2, embedding_image3))
The results are shown below:
tensor(0.8518, grad_fn=<SumBackward1>)
tensor(0.0650, grad_fn=<SumBackward1>)
tensor(0.0712, grad_fn=<SumBackward1>)
Not surprisingly, the two cat images (1 and 2) have a high similarity score, whereas the cat and motorcycle pairs (1 and 3, and 2 and 3) have very low similarity scores.
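The grad_fn entries in the printed tensors indicate that PyTorch was tracking gradients during the forward pass, which is not needed when you only want embeddings. The sketch below shows the effect of wrapping the call in torch.no_grad(). It uses a small linear layer as a hypothetical stand-in for the embedding network, so that it runs quickly and without downloads.

```python
import torch

layer = torch.nn.Linear(8, 4)  # hypothetical stand-in for the embedding network
x = torch.randn(1, 8)

tracked = layer(x)             # autograd records this operation
with torch.no_grad():
    untracked = layer(x)       # same computation, no gradient bookkeeping

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```

The values are identical in both cases; only the gradient bookkeeping differs.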
If you’d like to try this yourself, check this Google Colab notebook , which includes the above code as well as other examples.
Multimodal Embeddings
ResNet, discussed in the previous section, is trained to classify input images into the 1000 classes of its training dataset. It doesn't do well on other tasks, such as classifying images into classes that are synonyms of the classes in the training data, determining whether an image's text description is accurate, or finding images that match a text description.
Multimodal models address such challenges. As the name suggests, multimodal models are based on two different types of media. For example, images and text, or audio and text.
The CLIP (Contrastive Language Image Pre-training) model is trained jointly on images and text. Its training dataset consists of 400 million pairs of images and their associated text descriptions. This data was sourced from the open internet.
The text is fed through a text-model, such as a transformer, to generate an embedding. The image is fed through an image-model, such as ResNet or Vision Transformer (ViT), to generate an image embedding. The model is set up such that the generated text embedding and image embedding have the same number of dimensions.
Training is based on the cosine similarity of text embeddings and image embeddings. Suppose that {I, T} represent pairs of images and their corresponding text descriptions. The model is trained on pairs of image embeddings and text embeddings such that it assigns:
- A high similarity score to pairs consisting of an image's embedding and the embedding of its corresponding text. For example, pairs {I1, T1}, {I2, T2}, and so on have a high similarity score.
- A low similarity score to all other pairs, such as {I1, T2}, {I1, T3}, {I2, T1}, {I2, T3}, and so on.
Thus, the model embeds both text and images into the same latent space. Since it represents text using transformer-based models which use the attention mechanism, it is able to "comprehend" the meaning of new words and phrases. It is therefore able to use this comprehension to do tasks like:
- Find images matching a new text description: semantic search
- Estimate the accuracy of arbitrary text descriptions for a given image: zero-shot classification
- Answer questions about the contents of the image: visual search
Because of such capabilities, CLIP was also used in the training of the well-known Dall-E image generation models.
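The training objective described above (high scores for matching image-text pairs, low scores for all other pairs) can be sketched as a symmetric cross-entropy over a similarity matrix. This is a simplified illustration in plain Python, not the actual CLIP training code. The two matrices contain made-up scaled similarity values, and the function names are hypothetical.

```python
import math

def cross_entropy_on_diagonal(similarity):
    """Average of -log softmax(row)[i] over rows i of a square matrix."""
    total = 0.0
    for i, row in enumerate(similarity):
        log_sum_exp = math.log(sum(math.exp(s) for s in row))
        total += log_sum_exp - row[i]
    return total / len(similarity)

def contrastive_loss(similarity):
    """Symmetric loss: image-to-text rows plus text-to-image columns."""
    transposed = [list(col) for col in zip(*similarity)]
    return 0.5 * (
        cross_entropy_on_diagonal(similarity)
        + cross_entropy_on_diagonal(transposed)
    )

# Hypothetical scaled similarities: rows are images I1..I3, columns are texts T1..T3.
well_aligned = [
    [9.0, 1.0, 0.5],
    [0.8, 8.5, 1.2],
    [0.3, 1.1, 9.2],
]
poorly_aligned = [
    [1.0, 9.0, 0.5],
    [8.5, 0.8, 1.2],
    [0.3, 9.2, 1.1],
]

print(f"well aligned:   {contrastive_loss(well_aligned):.4f}")
print(f"poorly aligned: {contrastive_loss(poorly_aligned):.4f}")
```

The loss is small when the largest value in each row and column sits on the diagonal, that is, when each image scores highest with its own text. Minimizing this loss is what pushes matching pairs together and mismatched pairs apart in the shared embedding space.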
Multimodal Embeddings in Practice
To try a hands-on example to better understand multimodal embeddings using the CLIP model, start by importing the relevant packages:
from transformers import CLIPProcessor, CLIPModel
CLIPModel is the model class itself; CLIPProcessor is a wrapper class for the tokenizer and the image processor. Instantiate the model and the processor with the pre-trained checkpoint openai/clip-vit-base-patch32:
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_clip = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
Get the URLs of some interesting images and assign them to variables:
url_cat = "https://images.pexels.com/photos/2071881/pexels-photo-2071881.jpeg"
url_motorcycle = "https://images.pexels.com/photos/1715193/pexels-photo-1715193.jpeg"
In the above examples, the first image is a cat and the second is a motorcycle.
Let’s extract the motorcycle image:
from PIL import Image
import requests
image_motorcycle = Image.open(requests.get(url_motorcycle, stream=True).raw)
Now, create a pair of description texts:
text_motorcycle = "a photo of a motorcycle on the road"
text_cat = "a photo of a cat"
Pre-process and tokenize both the above texts together with the motorcycle image:
inputs = processor_clip(text=[text_motorcycle, text_cat], images=image_motorcycle, return_tensors="pt", padding=True)
Run the CLIP model on the tokenized input:
outputs = model_clip(**inputs)
The output logits contain the score of each image-text pair. Applying softmax converts these scores into probabilities, and you can see that the first pair (text_motorcycle and image_motorcycle) gets almost all of the probability:
outputs.logits_per_image.softmax(dim=1)
Output:
tensor([[9.9988e-01, 1.2363e-04]], grad_fn=<SoftmaxBackward0>)
Since we want to understand embeddings in greater detail, extract the text embedding and the image embedding into separate variables. You can use these embeddings to replicate the computations independently.
text_embedding_motorcycle = outputs["text_embeds"][0].squeeze()
text_embedding_cat = outputs["text_embeds"][1].squeeze()
image_embedding_motorcycle = outputs["image_embeds"].squeeze()
Notice that both the text and image embeddings have the same number of dimensions (512 for this checkpoint):
print(text_embedding_cat.shape)
print(image_embedding_motorcycle.shape)
Evaluate the cosine similarity of text_embedding_motorcycle with image_embedding_motorcycle and of text_embedding_cat with image_embedding_motorcycle :
import torch
cos = torch.nn.CosineSimilarity(dim=0)
print(cos(text_embedding_motorcycle, image_embedding_motorcycle))
print(cos(text_embedding_cat, image_embedding_motorcycle))
The output is shown below:
tensor(0.2706, grad_fn=<SumBackward1>)
tensor(0.1806, grad_fn=<SumBackward1>)
Notice that the motorcycle image with the cat text has a lower similarity score than the motorcycle image with the motorcycle text.
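The gap between these two cosine similarities looks small, yet the softmax output shown earlier assigned almost all of the probability to the motorcycle text. The sketch below reconciles the two. It assumes that CLIP multiplies the cosine similarities by a learned scale before the softmax, and that this scale is roughly 100 for the released checkpoints (in the Hugging Face implementation it is stored in the model's logit_scale parameter). The value 100.0 is an approximation used for illustration.

```python
import math

def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Cosine similarities printed above: motorcycle text first, then cat text
cosine_scores = [0.2706, 0.1806]

# Assumption: the logits are the cosine similarities times a learned scale,
# taken here to be about 100.
logit_scale = 100.0
logits = [logit_scale * s for s in cosine_scores]

probabilities = softmax(logits)
print([f"{p:.5f}" for p in probabilities])
```

Under this assumption, a difference of 0.09 in cosine similarity becomes a difference of about 9 in the logits, which is enough for the softmax to put nearly all of the probability on the first text.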
This Google Colab notebook includes the above code as well as other examples. As an exercise, also try both texts with the image of the cat and check their similarities.
Note that in practice, this is not how to use the CLIP model. It is typically used with arrays of images and text descriptions to find the most accurate pair, using the Softmax function. But in this example, you do things manually because the goal is to study embeddings and their similarity scores.
You also take this approach if the goal is to extend the model’s functionality to tasks other than those it was originally packaged for.
Other Types of Embeddings
Many other AI and related applications use embeddings. A couple of common examples are presented below without going into any detail.
Audio Embeddings
Tools like OpenAI's Whisper speech-to-text model internally use embeddings. To compute the embeddings, the input is based on the short-term power spectrum of the audio. By using a sliding window to analyze consecutive intervals of the audio signal, the model is able to treat it as a sequence of audio snippets. The model is then able to combine embeddings of audio sequences with text embeddings for corresponding time intervals.
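The sliding-window step can be sketched in a few lines of plain Python. The window and hop sizes below are illustrative values chosen for the example, and `frame_signal` is a hypothetical helper. This shows only how a signal is split into overlapping snippets, not how any particular model computes its spectrum or embeddings.

```python
def frame_signal(samples, window, hop):
    """Split a sequence into overlapping frames of length window, hop apart."""
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, hop)
    ]

# Illustrative values: 1 second of silence at 16 kHz,
# 25 ms windows (400 samples) that advance by 10 ms (160 samples).
sample_rate = 16_000
signal = [0.0] * sample_rate
frames = frame_signal(signal, window=400, hop=160)

print(len(frames), len(frames[0]))
```

Each frame would then be turned into a short-term spectrum, giving the model a sequence of snippets to embed.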
Video Embeddings
To analyze video, you can consider each video as a sequence of images or keyframes. The contents of each image are analyzed using an image-text model like CLIP. This approach is useful for simple applications like video similarity search, where you try to find videos that are similar to an uploaded image.
However, for more sophisticated applications, such as those using an AI text to video tool, it is essential to model the temporal relationship between video frames. A video is not just a collection of pictures arranged randomly: it is the specific sequence of those pictures that gives the video meaning.
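A small sketch shows why frame order needs to be modeled explicitly. If a video is summarized by simply averaging its per-frame embeddings (one naive option, assumed here purely for illustration), the result is identical for a video and its reversed copy. The 3-dimensional frame vectors and the `mean_pool` helper are hypothetical.

```python
def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video-level vector."""
    count = len(frame_embeddings)
    return [sum(dim) / count for dim in zip(*frame_embeddings)]

# Hypothetical 3-dimensional embeddings for three keyframes
frames = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
forward = mean_pool(frames)
backward = mean_pool(list(reversed(frames)))  # same video played in reverse

print(forward == backward)
```

Because averaging discards the order of the frames, any approach that needs to distinguish these two videos has to encode position or sequence information in some other way.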
Older models such as video2vec (from around 2016) use RNN-based techniques to do this. Modern tools like Video Transformers and Video Vision Transformers use positional embeddings and semantic embeddings to understand videos.
Graph Embeddings
Graphs are data structures made up of nodes and the edges that connect pairs of nodes. Many types of information, such as social networks, are best represented as a graph structure. Graphs can get large and complex, and difficult to analyze.
Graph embeddings help to represent nodes of the graph in an embedding space. This makes it easier to, for example, discover similar nodes, for applications like recommendation systems.
Similarly, knowledge graphs store facts as triplets, each consisting of two entities and the relationship between them. They are used to represent and organize knowledge bases, which consist of descriptions of different related topics.
Knowledge graph embeddings help to identify hidden patterns like information that is not explicitly specified, but can be implied given other adjacent relationships. They are also useful as an alternative to manual labeling of datasets for downstream applications like training other machine learning models.
Text Embeddings
Text embeddings represent in vector-form the semantic meaning of words and sentences. They are widely used for tasks like named entity recognition, text generation, sentiment analysis, and more. The second part of this article discusses text embeddings in detail and with examples.
Conclusion
In this deep dive into embeddings, we've covered the foundational concepts and practical applications of image and multimodal embeddings. By transforming data into multidimensional vectors, we can capture abstract attributes and mathematically quantify similarities between different entities, whether they are products in an e-commerce store, images in a gallery, or pairs of images and text descriptions.
We've explored how embeddings allow for more accurate product recommendations, image classification, and multimodal understanding, showcasing the power of this technique in various AI applications. From our hands-on examples with Python code, you can see how these embeddings are computed and utilized to achieve high similarity scores for related items and low scores for unrelated ones.
An important aspect of working with embeddings is managing and integrating them into your data infrastructure. Tools like Airbyte can streamline this process by integrating data to vector databases, which are essential for storing and querying embeddings effectively.
Airbyte's recent support for vector databases, powered by LangChain, makes it easier to manage embeddings for various AI applications. You can learn more about their AI capabilities and vector database integration on their product page and blog.
Looking ahead, part two of the series will delve into text-based embeddings, where we'll discuss how words, sentences, and documents can be represented in vector form.