Discover text-based embeddings in AI with practical examples on Word2Vec, BERT, and Sentence Transformers. Learn how Large Language Models use these techniques.
The first part of this two-part series introduced the concept of embeddings and demonstrated the use of image embeddings and multimodal embeddings. For an intuitive understanding of embeddings, it is helpful to revisit the introduction to embeddings in part one.
This article discusses text-based embeddings. It includes practical examples to illustrate:
- Traditional word embeddings using Word2Vec
- Contextualized word embeddings using BERT
- Sentence embeddings using sentence transformer models

You'll also learn about Large Language Models (LLMs), such as Falcon and Mistral, which use text embeddings based on the transformer architecture.
Word Embeddings

To encode words as numbers for computational purposes, one might naively use one-hot vectors, where each dimension represents one word. This is suboptimal because:
- A large vocabulary leads to high-dimensional one-hot vectors.
- These sparse vectors are computationally inefficient.
- The approach does not capture the relationships between words or their meanings, so the resulting vectors cannot be used for higher-order tasks like Natural Language Processing (NLP), text comprehension, or text generation.

The solution is to use embeddings: dense multidimensional vectors that also capture the semantic relationships between words.
Two words with similar meanings are also close to each other in vector space. The goal in training these models is to express (encode) words as vectors in such a way that semantically close words are mapped to mathematically similar vectors. These models are based on neural networks.
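For intuition, here is a toy sketch (not from any library; the dense values are made up purely for illustration) contrasting a one-hot encoding with a dense embedding that places "mango" and "guava" close together:

import numpy as np

vocabulary = ["mango", "guava", "america", "russia"]

# One-hot: one dimension per vocabulary word, a single 1, and no notion of similarity.
one_hot_mango = np.eye(len(vocabulary))[vocabulary.index("mango")]   # [1. 0. 0. 0.]

# Dense embeddings (made-up values, for illustration only): similar words
# get similar vectors, so "mango" and "guava" end up close in vector space.
dense = {
    "mango":   np.array([0.9, 0.1, 0.0]),
    "guava":   np.array([0.8, 0.2, 0.1]),
    "america": np.array([0.0, 0.9, 0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense["mango"], dense["guava"]))     # high similarity
print(cosine(dense["mango"], dense["america"]))   # low similarity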
The Word2Vec Model

The Word2Vec model is based on a shallow neural network which has only one hidden layer. The size of the hidden layer is the dimension of the embedding vector.
The general idea is that words that are similar in meaning are used in similar contexts; as a corollary, words that are used in similar contexts tend to have similar meanings. This is the distributional hypothesis.

The model iteratively analyzes each word in the training dataset. The word being analyzed in the current iteration is called the central word. The size of the context is determined by the window size parameter. With a window size of, say, 2, the model considers neighboring words that are up to 2 words away from the central word.
As an example, in the sentence below, consider the iteration when “brown” is the central word:
The quick brown fox jumps over the lazy dog.

There are two modeling approaches in Word2Vec:
- Skip-gram model: The model tries to predict the surrounding words (context) given a central word. This is expressed as the conditional probability P( {"the", "quick", "fox", "jumps"} | "brown" ).
- Continuous bag of words (CBOW) model: The model tries to predict the probability of a central word (target), given its surrounding words (context): P( "brown" | {"the", "quick", "fox", "jumps"} ).

With the skip-gram model, for example, the output layer of the neural network applies the softmax function to the embeddings in the hidden layer to predict the context words given the central word. The training process therefore tries to represent the words as vectors in such a way that it maximizes the probability of predicting the context given the central word.
Assuming independent probabilities, for every word in the training dataset (corpus of text), the model tries to maximize the probability of predicting the context given that word. For the entire training corpus, this probability can be expressed as:

\prod_{w \in \text{corpus}} \; \prod_{c \in \text{window}(w)} P(W_c \mid W_w)

In the above expression, W_c represents an individual context word and W_w represents the center word under consideration. The first (inner) product, over the window size, is the probability for one iteration. The second (outer) product runs over the entire corpus.

The conditional probability of a particular context word W_c given a center word W_w can be expressed using the softmax function:

P(W_c \mid W_w) = \frac{\exp(V_c \cdot V_w)}{\sum_{c' \in C} \exp(V_{c'} \cdot V_w)}

In the above expression, V_c is the vector representing an individual context word and V_w is the vector representing an individual central word. The sum in the denominator runs over the vectors V_{c'} of all the other context words, where C is the set of all available contexts. The vectors V_w are the individual word embeddings.
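Before moving to a pre-trained model, here is a minimal sketch (plain Python, not part of gensim) of how (center, context) training pairs are generated with a window size of 2, using the example sentence above:

sentence = "the quick brown fox jumps over the lazy dog".split()
window_size = 2

pairs = []   # (center word, context word) training pairs
for i, center in enumerate(sentence):
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Context of the central word "brown": "the", "quick", "fox", "jumps"
print([context for center, context in pairs if center == "brown"])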
The Word2Vec Model in Practice

To understand how to use Word2Vec embeddings in Python, import the prerequisite packages:
import torch   # used below to compute cosine similarity
import gensim
from gensim.models import Word2Vec
import gensim.downloader as api
It is common practice to train the model using your own dataset. For training models like Word2Vec, the training dataset is typically just a large corpus of text. Using a domain-specific corpus of text helps the model to account for domain-specific terminology and jargon words.
In this case however, it is easier to use a pre-trained model. The example model used below is pre-trained on a dataset of Google News articles:
model = api.load("word2vec-google-news-300")
Use the model to generate embeddings for a few example words. By default the model returns the embeddings as NumPy arrays; you will convert them to tensors later to make further computations easier:
america = model.get_vector("america")
russia = model.get_vector("russia")
mango = model.get_vector("mango")
guava = model.get_vector("guava")
By default the Word2Vec model generates embeddings in a 300-dimensional space. Check the shape of an embedding:
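print(mango.shape)   # (300,) for the 300-dimensional google-news model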
The typical way to use Word2Vec is to generate a list of words "most-similar" to a given word:
model.most_similar(mango)
You can also use the word directly:
model.most_similar("mango")
The output looks like the sample below. It shows the words most similar to the input word as well as the similarity score. Notice that guava has a similarity score of 0.7192.
[('mango', 1.0),
('mangoes', 0.7850891351699829),
('mangos', 0.7233096957206726),
('guava', 0.7192398905754089),
…
Compute the cosine similarity for pairs of words, for example, “mango” and “america”, “america” and “russia”, and so on:
cos = torch.nn.CosineSimilarity(dim=0)
print(cos(torch.tensor(mango), torch.tensor(america)))
print(cos(torch.tensor(america), torch.tensor(russia)))
print(cos(torch.tensor(guava), torch.tensor(mango)))
The output looks like the example shown below:
tensor(0.0701)
tensor(0.5405)
tensor(0.7192)
In the output, notice that “mango” and “guava” have a much higher similarity score than “mango” and “america”.
Notice also that the cosine similarity which you manually computed for “guava” and “mango” above matches the score generated earlier.
You can find the full code in this Google Colab notebook, as well as other examples.
Word Embeddings Limitations

The biggest limitation of traditional word embeddings is that they are static. They are not context-aware.
This means that in a given model, a word always maps to the same vector, regardless of the context. The same word used in two different sentences has the same embedding. This is a problem because the same word can have different meanings in different contexts.
Also, these models cannot automatically comprehend the meaning of new words or even new compound words which combine known words.
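A quick sketch of the first limitation, reusing the pre-trained model loaded above: Word2Vec returns the same vector for "train" no matter which sentence the word appears in.

# Word2Vec is a static lookup: the surrounding sentence plays no role.
v1 = model.get_vector("train")   # as in "I want to train for the marathon"
v2 = model.get_vector("train")   # as in "The train is running late"
print((v1 == v2).all())          # True: the embeddings are identical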
Word Embeddings Use Cases

In applications such as Named Entity Recognition (NER), the context of the word is not very relevant. "General Motors" is the name of a company entity, regardless of the context.
In such cases, a simple tool like Word2Vec is the right choice. It runs as a lookup function in constant time and is highly performant.
Contextualized Word Embeddings

The solution to the shortcomings of models like Word2Vec is the attention mechanism, which is used in transformer-based language models.
Transformers are used in specialized models like BERT and Sentence Transformers, as well as in large general-purpose models like Mistral's and OpenAI's LLMs. These models decompose text not into words, but into tokens. A token is analogous to a sub-word or a syllable.
The tokenization process breaks down a block of text into smaller units called tokens. Larger words are broken down into smaller sub-words and then decomposed into multiple tokens.
Each token is encoded not as text but as a single integer ID (not a vector). Byte Pair Encoding (BPE) is an algorithm commonly used for tokenization.
Using the attention mechanism, the vector representation of each token is based on the tokens before and after it. This allows the model to account for the context of each token. Thus, the same word (or token) occurring in different sentences in the text is represented using different vectors.
Furthermore, because these models encode text at the sub-word level, they can tease out compound words into individual components. Because they account for the word's context, they can also try to comprehend (represent as a vector) the meaning of previously unseen words.
A traditional word embedding denotes the individual word in relation to all the other words in the entire corpus (vocabulary). It encodes the dictionary meaning of the word. A contextualized word embedding denotes the word in relation to its context. It is the meaning of the word (token), as it is used in a particular context.
Operationally, a trained Word2Vec model can be thought of as a lookup table: each word is mapped to a static embedding vector. In contrast, transformer-based models can be thought of as a set of matrices or weight vectors.
The training process optimizes the value of these matrices. These matrices are then used to explicitly compute the embedding vector of each token based on the given input sentence. So, the same token, as part of different input sentences (sequences), will have different embedding vectors.
The BERT Model in Practice

This section illustrates contextual word embeddings by showing that:
- The same word used in similar contexts has the same meaning.
- When used in different contexts, the same word can have different meanings.
- The same word used in different contexts is represented using different vectors.
- Embeddings of a word used in similar contexts have a high similarity score.
- Conversely, embeddings of a word used in different contexts have a low score.

To do this, this example uses the word "train" in different contexts, with different meanings.
To start with, declare a few example sentences, each featuring the same word, “train”, with different meanings:
sentence1 = 'I want to train to run in the New York marathon.'
sentence2 = 'I am sitting in the running train.'
sentence3 = 'The train is running late today.'
BERT is a transformer-based language model that generates contextual embeddings. Because it is an older, smaller model that predates LLMs like GPT and Mistral, it is a good learning tool for word embeddings.
To try BERT in Python, import the prerequisite packages:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Tokenization

To start with, take a closer look at how tokenizing works in contextual models. The BERT model has a limited vocabulary. When it encounters a new word, it decomposes it into subwords. Try to tokenize a few test words:
print(tokenizer.tokenize("superman"))
print(tokenizer.tokenize("running"))
The word "running" is part of its vocabulary and is tokenized as is. "Superman" is not in the vocabulary and is broken into subwords, "super" and "man", each of which is part of the vocabulary. This Hugging Face article on tokenizers discusses this concept in greater depth.
In the previous example, each of the sentences has simple words that are part of the BERT vocabulary. If your sentences contain complex or new words that are not in the vocabulary, some words will map to more than one token.

In such situations, you need to map each word to all of its token positions, as shown in the sketch below. In this example, however, each word corresponds to a single token.
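As a sketch of how you could handle that case: fast tokenizers (which AutoTokenizer returns by default) expose a word_ids() mapping from token positions back to word indices, which you can use to collect all token positions belonging to a word. The sentence below is just an illustrative example.

# Sketch: map tokens back to words for a sentence with an out-of-vocabulary word.
enc = tokenizer("Superman is training hard.", return_tensors='pt')

print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
print(enc.word_ids())   # word index per token; None marks special tokens like [CLS]/[SEP]

# All token positions belonging to the first word ("Superman", word index 0):
positions = [i for i, w in enumerate(enc.word_ids()) if w == 0]
print(positions)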
Tokenize the three sentences you declared earlier:
tokens1 = tokenizer(sentence1, return_tensors='pt')
tokens2 = tokenizer(sentence2, return_tensors='pt')
tokens3 = tokenizer(sentence3, return_tensors='pt')
Take a closer look at the contents of the tokens:
print(tokens1.input_ids)
print(tokens2.input_ids)
print(tokens3.input_ids)
The output looks like the example below:
tensor([[ 101, 146, 1328, 1106, 2669, 1106, 1576, 1107, 1103, 1203, 1365, 14147, 119, 102]])
tensor([[ 101, 146, 1821, 2807, 1107, 1103, 1919, 2669, 119, 102]])
tensor([[ 101, 1109, 2669, 1110, 1919, 1523, 2052, 119, 102]])
Notice that each sentence starts and ends with the same pair of tokens: 101 and 102. These are BERT's special tokens marking the beginning and end of a sequence (in BERT's vocabulary they are the [CLS] and [SEP] tokens); this article refers to them as the BOS (beginning of sentence) and EOS (end of sentence) tags.
For example, consider the sentence:

The train is running late today.

After including the BOS and EOS tags, this is equivalent to:

/BOS The train is running late today. /EOS
Compare the words in the sentences with the numbers in the tokenized tensors. If you observe closely, you notice that the word “train” is represented by the number 2669 in each of the above tokenizations.
In general, language models are not used to explicitly compute embeddings for individual words. Their most common use case is to work on entire sequences, and there are no built-in functions to evaluate word embeddings. So, you need to extract the embeddings of the tokens corresponding to the words you are interested in.
Declare an index number for the word of interest, in this case “train”, in each of the sentences.
'I want to train to run in the New York marathon.'
'I am sitting in the running train.'
'The train is running late today.'

In the first sentence, "train" is the fourth word, with an index position of 3 (counting starts from 0). However, you also need to account for the BOS token at the start, so increase the word index by 1. Thus, the index of "train" in the first sentence is 4. In the second sentence, "train" is the seventh word, with a word index of 6; after including the BOS token, its index is 7. Similarly, in the third sentence, the index of "train" is 2.

word_index1 = 4
word_index2 = 7
word_index3 = 2
You will use these indices in the next section to extract the word embeddings from the model outputs.
Generating Contextualized Embeddings

To generate the embeddings for each sentence, run the model on each tokenized sentence:
with torch.no_grad():
output1 = model(**tokens1)
output2 = model(**tokens2)
output3 = model(**tokens3)
The final layer of a full language model is often a softmax (prediction) layer that makes a prediction based on the values of the embeddings. Thus, the actual embedding vectors are contained in the output of the pre-final layer, that is, the last hidden layer.
Extract the values of the pre-final layer:
prefinal_layer_output1 = output1.hidden_states[-1].squeeze()
prefinal_layer_output2 = output2.hidden_states[-1].squeeze()
prefinal_layer_output3 = output3.hidden_states[-1].squeeze()
Take a look at the shape of the tensors generated by the pre-final layer:
print(prefinal_layer_output1.shape)
print(prefinal_layer_output2.shape)
print(prefinal_layer_output3.shape)
The output looks like the example below:
torch.Size([14, 768])
torch.Size([10, 768])
torch.Size([9, 768])
For each input sentence, notice that the number of vectors is the same as the number of tokens (BOS + words of the sentence + period + EOS). Notice also that each embedding vector has 768 dimensions.
To get the embedding of the relevant word, extract the vector corresponding to the word position of “train”. Use the word_index variable declared in the previous section:
word_embeddings1 = prefinal_layer_output1[word_index1].squeeze()
word_embeddings2 = prefinal_layer_output2[word_index2].squeeze()
word_embeddings3 = prefinal_layer_output3[word_index3].squeeze()
word_embeddings1.shape
You now have the embedding for the word “train” in each of the sentences. Each embedding vector has 768 dimensions.
Cosine Similarity

Because BERT is a contextual model, the embedding of the same word, "train", is going to be different in the context of different sentences. This means:
- Each of the embeddings of the same word you extracted in the previous section is different.
- Furthermore, the same word, used in different contexts, has different meanings.
- As a corollary, the same word, used in similar contexts, must have similar meanings.

To demonstrate this numerically, use cosine similarity. As a reminder, print out the three input sentences:
print("1 - word - train, in - ", sentence1)
print("2 - word - train, in - ", sentence2)
print("3 - word - train, in - ", sentence3)
The output is shown below:
1 - word - train, in -  I want to train to run in the New York marathon.
2 - word - train, in -  I am sitting in the running train.
3 - word - train, in -  The train is running late today.
Calculate and print the cosine similarity of different pairs of embeddings of the same word (if you haven't already, define cos = torch.nn.CosineSimilarity(dim=0) as in the Word2Vec example):

print("cosine similarity of 1 and 2 - ", cos(word_embeddings1, word_embeddings2))
print("cosine similarity of 1 and 3 - ", cos(word_embeddings1, word_embeddings3))
print("cosine similarity of 2 and 3 - ", cos(word_embeddings2, word_embeddings3))
The output shows the cosine similarity of the word “train” used in different contexts:
cosine similarity of 1 and 2 -  tensor(0.6910)
cosine similarity of 1 and 3 -  tensor(0.6959)
cosine similarity of 2 and 3 -  tensor(0.8268)
In sentences 2 and 3, the word "train" is used with the same meaning: a vehicle. In sentence 1, "train" means to exercise, or to prepare for the marathon.

So, the embeddings of the word "train" in sentences 2 and 3 have a high similarity. For sentence pairs 1 and 2, and 1 and 3, the corresponding embeddings of "train" have a lower similarity value.
This Google Colab notebook includes the above code as well as other examples.
Sentence Embeddings

Just like a word embedding is a vector representation of a word in multidimensional vector space, a sentence embedding is a vector representation of an entire sentence.
A naive approach to generating sentence embeddings is to use a model like BERT and take the sum or average of the embeddings of the words in the sentence, as sketched below. However, you get better results by using a model that was trained specifically to work with sentences.
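As a quick sketch of the naive approach, reusing prefinal_layer_output1 from the BERT example above, you could simply average the token embeddings of a sentence:

# Naive sentence embedding: mean of BERT's token embeddings for sentence 1.
naive_sentence_embedding = prefinal_layer_output1.mean(dim=0)
print(naive_sentence_embedding.shape)   # torch.Size([768])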
The goal of sentence transformer models is to express variable length sentences as a fixed-length vector. Sentences that are closely related must be close to each other in vector space. Thus, the training data, whether labeled or unlabeled, should include groups of related sentences.
Examples of labeled data include:

- Datasets consisting of pairs of the form {{S1, S2}, L}, where S1 and S2 are sentences and L is a label denoting their similarity to each other.
- Datasets consisting of pairs of the form {S, C}, where S is a sentence and C is a label (one or more tags) denoting the category of the sentence.

Examples of unlabeled data, used for unsupervised learning, include:

- Datasets consisting of pairs of the form {S1, S2}, where S1 and S2 are pairs of related texts.

In many cases, it is useful to train or fine-tune models on domain-specific datasets, such as the Medical Question Pairs dataset.
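For illustration only, here is how such training examples might look as Python data structures (the sentences, categories, and similarity scores below are hypothetical):

# {{S1, S2}, L}: sentence pairs with a similarity score (hypothetical values)
labeled_pairs = [
    (("A man is eating food.", "A man is eating a meal."), 0.9),
    (("A man is eating food.", "A girl is playing guitar."), 0.1),
]

# {S, C}: sentences with a category label (hypothetical)
categorized_sentences = [
    ("The stock market fell sharply today.", "finance"),
]

# {S1, S2}: unlabeled pairs of related texts, e.g. a question and its answer (hypothetical)
unlabeled_pairs = [
    ("How do I reset my password?", "Open Settings and choose 'Reset password'."),
]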
Sentence Transformers in Practice

The Sentence Transformers Python package is a convenient way to get embeddings for sentences. To get an idea of embeddings using Sentence Transformers, import the necessary packages:
import torch
from sentence_transformers import SentenceTransformer

cos = torch.nn.CosineSimilarity(dim=0)   # same similarity function as in the earlier examples
Instantiate a SentenceTransformer with a specific model, in this case, the MiniLM model:
model = SentenceTransformer("all-MiniLM-L6-v2")
Declare a few example sentences. Notice that the first two sentences are semantically similar while the third is unrelated:
sentence1 = ["I like running"]
sentence2 = ["I ran a marathon"]
sentence3 = ["Black cats are lucky"]
Generate the embeddings of these sentences:
embeddings1 = model.encode(sentence1, convert_to_tensor=True).squeeze()
embeddings2 = model.encode(sentence2, convert_to_tensor=True).squeeze()
embeddings3 = model.encode(sentence3, convert_to_tensor=True).squeeze()
Compute the similarities of different pairs of sentences:
print(cos(embeddings1, embeddings2))
print(cos(embeddings1, embeddings3))
print(cos(embeddings2, embeddings3))
The output is shown in the example below:
tensor(0.5606)
tensor(0.1132)
tensor(0.1674)
In the output, notice that the embeddings of the first two sentences have a much higher similarity score compared to the other pairs.
This Colab notebook includes the above code as well as other examples.
Sentence Transformers have many use cases, such as paraphrase mining, summarizing blocks of text, image search (in multimodal models), sentiment detection, duplicate detection, and more. The sketch below shows a simple duplicate-detection example.
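This minimal sketch reuses the model loaded above together with the util.cos_sim helper from the sentence_transformers package; the candidate sentences are made up for illustration:

from sentence_transformers import util

candidates = [
    "How do I reset my password?",
    "What are the steps to reset my password?",
    "What is the weather today?",
]
embeddings = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)   # pairwise cosine similarities
print(scores)   # the first two sentences should score much higher with each other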
OpenAI Embeddings

As of 2024, OpenAI's models are among the best-performing general-purpose AI models. However, they are closed models: you access them via an API key and pay for every request. Pricing depends on usage; processing longer texts with more recent or advanced models is more expensive.
The API endpoint to generate embeddings allows you to use one of three models:
- text-embedding-3-large: This is the most expensive model, and it generates embeddings with 3072 dimensions.
- text-embedding-3-small: This model generates embeddings with 1536 dimensions.
- text-embedding-ada-002: This model also generates embeddings with 1536 dimensions.

To use OpenAI embeddings, import the OpenAI package:
from openai import OpenAI
Declare a client object using your API key:
client = OpenAI(api_key="YOUR_API_KEY")
Declare a function to generate and return embeddings:
def get_embedding(text, model="text-embedding-ada-002"):
return client.embeddings.create(input = [text], model = model).data[0].embedding
Declare a sample sentence, for example:
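s1 = "The quick brown fox jumps over the lazy dog."   # any sample sentence works here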
Use the function you declared to fetch the embeddings for the sentence:
m_s1 = torch.tensor(get_embedding(s1))
Check the shape of the embedding vector:
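print(m_s1.shape)   # torch.Size([1536]) for text-embedding-ada-002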
As an exercise, generate the embeddings of a few different sentences and compare their similarity. Use the cosine function as shown in previous examples.
Conclusion

Embeddings are the building blocks of modern AI tools. Computers and algorithms cannot understand text and images the same way that humans do. However, it is possible to numerically encode human understanding in the form of multidimensional vectors called embeddings.
Each of the dimensions represents an abstract attribute of the semantic understanding of the content.
This article series provided an intuitive motivation for the use of embeddings. It explained the most commonly used embedding types: image, text, and multimodal embeddings. It also illustrated, with simple code examples, how to obtain embedding vectors using different models and how to use them for practical purposes.
An important aspect of working with embeddings is managing and integrating them into your data infrastructure. Tools like Airbyte can streamline this process by integrating data to vector databases, which are essential for storing and querying embeddings effectively.
Airbyte's recent support for vector databases makes it easier to manage embeddings for various AI applications. You can learn more about these AI capabilities and vector database integration on our product page and blog.