How to Create BERT Vector Embeddings? A Comprehensive Tutorial

August 30, 2024
20 Mins Read

Vector embeddings, including word embeddings, are a powerful Natural Language Processing (NLP) technique that helps machines understand and interpret text more efficiently. The introduction of Bidirectional Encoder Representations from Transformers (BERT) has further enhanced this capability. 

BERT’s ability to interpret text bi-directionally allows it to grasp the full context of a sentence. So, it can capture even subtle differences in meaning and provide a deeper understanding of language. Creating BERT embeddings enables AI systems to handle complex aspects of language with high precision.  

This comprehensive tutorial will help you learn about word embeddings, BERT and its architecture, steps to create BERT embeddings, and practical use cases.  

What Is Word Embedding?

Word embedding is an NLP technique used for language modeling and feature learning. It can be unsupervised, supervised, or self-supervised, depending on the specific application. Before creating word embeddings, you first need to tokenize the text by breaking it down into individual words. Each word is then mapped to an index value in a pre-defined vocabulary. Once tokenized, you can move on to the embedding step, where these words are converted into dense, continuous vectors of real numbers. 

In word embedding space, words with similar meanings are positioned closer together. For example, the words “king” and “queen” would be similar in the vector space and reflect their related meanings. This ability helps AI models capture the semantic meanings of words based on their context and relationships in a large corpus of text. 

Embeddings can have varying numbers of dimensions depending on the model’s complexity. An 8-dimensional embedding may be sufficient for small datasets, while large datasets can benefit from embeddings of up to 1024 dimensions. A high-dimensional embedding captures more fine-grained relationships between words but requires more data to learn.  

A simple illustration of word embedding, which represents each word as a 4-dimensional vector:

[Image: Example of 4-dimensional word embeddings]
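To make the idea concrete, here is a minimal sketch in Python; the 4-dimensional vectors below are made-up values rather than learned embeddings:

import numpy as np

# Hypothetical 4-dimensional embeddings; real values would be learned from a text corpus.
embeddings = {
    "king":  np.array([0.92, 0.11, 0.45, 0.30]),
    "queen": np.array([0.89, 0.15, 0.48, 0.33]),
    "apple": np.array([0.05, 0.80, 0.10, 0.62]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words sit closer together in the vector space (higher cosine similarity).
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low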

Once you learn enough about word embeddings, you can easily use them for various NLP tasks such as sentiment analysis, machine translation, and named entity recognition. 

Popular Techniques to Create Word Embeddings

Let’s take a look at some popular techniques for creating word embeddings: 

Word2Vec

Word2Vec is a neural-network-based NLP model introduced by Google researchers in 2013 to create word embeddings. This model takes a large corpus of text as input and generates a vector space where each unique word is assigned a corresponding vector. The Word2Vec model uses two architectures:

  • Continuous Bag-Of-Words (CBOW): This architecture works like a fill-in-the-blank exercise. It predicts the current word from the surrounding context words within a specific window. 
[Image: Continuous Bag-Of-Words (CBOW) architecture]

In the above illustration, the input layer has the context words, and the output layer contains the current word. A hidden layer includes the dimensions needed to represent the words in the output layer.

  • Continuous Skip-Gram: In this model, the current word is used to predict the surrounding window of context words. The skip-gram architecture gives more weight to nearby context words than to those farther away. 
[Image: Skip-gram architecture]

In the given image, the input layer involves the current word, while the output layer has context words. The hidden layer contains the dimensions in which you want to represent the current word in the input layer. 
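Here is a minimal sketch of training both architectures with the gensim library, assuming gensim is installed; the tiny corpus is only illustrative:

from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

# sg=0 selects the CBOW architecture, sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv["king"].shape)         # (50,) -- the learned word vector
print(skipgram_model.wv.most_similar("king"))  # nearest neighbors in the toy vector space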

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how significant a word is within a collection of text documents. Each word is assigned a TF-IDF value, calculated by multiplying its TF and IDF values. 

Consider three documents containing a single sentence each. Let’s see how to find the TF-IDF values:

Document 1: [He is John]

Document 2: [He is Jacob]

Document 3: [He isn’t Paul or Sam]

In the above example, “He” is used in all three documents, “is” in two documents, and “or” is only in one document. Based on these frequencies, let’s calculate the TF and IDF values.

  • TF: It is calculated as the ratio of the number of times the target term appears in the document to the total number of words in that document. 

According to the above example, the TF values for each document are as follows:

Document 1: [0.33, 0.33, 0.33] 

Document 2: [0.33, 0.33, 0.33] 

Document 3: [0.20, 0.20, 0.20, 0.20, 0.20] 

  • IDF: It is computed by taking the logarithm of the ratio of the total count of documents to the number of documents that contain the target term. In the given example, the IDF values are:

“He”: Log(3/3) = 0

“is”: Log (3/2) = 0.17

“John”: Log(3/1) = 0.47

“Jacob”: Log(3/1) = 0.47

“isn’t”: Log(3/1) = 0.47

“Paul”: Log(3/1) = 0.47

“or”: Log(3/1) = 0.47

“Sam”: Log(3/1) = 0.47

Once the TF and IDF values have been calculated, multiply them to produce TF-IDF values and create a vector for each document. Each vector has eight elements, one for each unique word across all documents, ordered here as [He, is, John, Jacob, isn’t, Paul, or, Sam]: 

Document 1: [0, 0.056, 0.155, 0, 0, 0, 0, 0]

Document 2: [0, 0.056, 0, 0.155, 0, 0, 0, 0]

Document 3: [0, 0, 0, 0, 0.094, 0.094, 0.094, 0.094]

A problem arises because a term like “He” has an IDF value of 0, so its TF-IDF value is also 0. At the same time, terms that do not appear in a document at all, for example, “Paul” in the first document, also receive a value of 0. To distinguish between these two cases, the TF-IDF values of terms that do appear in a document are often smoothed by adding 1 before vectorization. 

For the above example, the smoothed vectors look like this:

Document 1: [1, 1.056, 1.155, 0, 0, 0, 0, 0]

Document 2: [1, 1.056, 0, 1.155, 0, 0, 0, 0]

Document 3: [1, 0, 0, 0, 1.094, 1.094, 1.094, 1.094]
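In practice, you rarely compute these values by hand. Here is a short sketch using scikit-learn’s TfidfVectorizer; note that it uses natural logarithms, its own smoothing, and a slightly different tokenizer, so the numbers will not match the hand-worked example exactly:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "He is John",
    "He is Jacob",
    "He isn't Paul or Sam",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the documents
print(tfidf_matrix.toarray())              # one TF-IDF vector per document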

Bag of Words (BoW)

BoW is the most fundamental approach for converting text into vectors. In this method, each position in the vector represents a word from the vocabulary, and its value indicates how many times that word appears in a sentence. 

Consider a movie review example:

Review 1: This movie is great.

Review 2: The movie is not great.

Review 3: I love this movie. Watch it, you will love it too. 

As an initial step, you must find a vocabulary of unique words:

[This, movie, is, great, the, not, I, love, Watch, it, you, will, too] 

In this vocabulary list, there are 13 unique words. Next, you need to represent each movie review by a 13-dimensional vector. During vectorization, count how many times each vocabulary word appears in the review and place that count in the corresponding position of the vector. If the word is not present, place a 0 in that position.

Vectorization of Review 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Review 1 includes only four of the vocabulary words: “This,” “movie,” “is,” and “great,” each appearing once, so their corresponding positions in the vector are marked as 1. All other positions are marked as 0.   

Similarly, you can create the vector space for Reviews 2 and 3 as follows:

Vectorization of Review 2: [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Vectorization of Review 3: [1, 1, 0, 0, 0, 0, 1, 2, 1, 2, 1, 1, 1]

In this vector space for Review 3, the words “love” and “it” each appear twice, so their respective positions are marked as 2. 
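The same counting can be done with scikit-learn’s CountVectorizer; as a caveat, its defaults lowercase the text and drop one-character tokens such as “I”, so the resulting vocabulary differs slightly from the hand-built one:

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is great.",
    "The movie is not great.",
    "I love this movie. Watch it, you will love it too.",
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the reviews
print(bow_matrix.toarray())                # word counts for each review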

What Is BERT? 

BERT is one of the modern large language models developed by Google in late 2018. It is a machine-learning framework that helps machines understand the full context of words in a sentence. It does this by reading the text in both directions simultaneously. This bidirectional approach allows BERT to grasp the whole meaning of a sentence by looking at the words that come before and after each word. 

BERT is built on transformers, a deep learning architecture that processes all the words in a sentence in parallel. This speeds up training and allows BERT to handle large datasets more effectively. BERT is released under an open-source license, so you can freely download it and use it for various NLP tasks. Its pre-trained models allow you to extract high-quality language features from your text or fine-tune them on specific NLP tasks to achieve the predictions you expect. 

Google has also released 24 smaller pre-trained BERT models. Of these, BERT-Tiny is the most suitable model for low-latency, high-throughput applications.  

The graph below compares a few BERT models with respect to latency:

[Image: Latency comparison of BERT models]

The graph below compares the BERT models with respect to throughput:

[Image: Throughput comparison of BERT models]

From these graphs, you can see that the BERT-Tiny model provides 25 to 50 ms p95 latency and generates an 11GB catalog embedding file, a significant improvement over the other BERT models. 

Architecture of BERT

[Image: BERT model architecture]

Let’s discuss BERT model architecture in detail:

WordPiece Tokenization 

Once you input text into the BERT model, it first splits the words into sub-word tokens using WordPiece tokenization. Each token is then mapped to a unique ID from the model’s vocabulary. 
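Here is a quick sketch of what WordPiece tokenization looks like with the Hugging Face tokenizer; the exact splits depend on the model’s vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("BERT creates embeddings")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Rare words are split into sub-words, e.g. "embeddings" typically becomes "em", "##bed", "##ding", "##s".
print(tokens)
print(token_ids)  # each token maps to a unique ID in the vocabulary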

Bidirectional Encoding

Traditional models process text in one direction, either left to right or right to left. BERT, however, analyzes text in both directions at the same time. When predicting a word or analyzing its context, BERT considers the full range of surrounding words. This bidirectional encoding allows BERT to capture the complete meaning of a text, leading to more accurate and refined language understanding.   

Multi-Layered Transformers

BERT consists of multiple transformer layers stacked on top of each other: 12 layers in BERT-base and 24 layers in BERT-large. Each layer refines the token representations by applying a self-attention mechanism that weighs the importance of each token in relation to every other token in the sequence. 

Such a mechanism allows BERT to capture relationships between words, no matter their distance from each other in the text. Multiple self-attention mechanisms run in parallel, each focusing on different parts of the text. The outputs are then combined for a better understanding of context.

After self-attention, the output is processed by a feedforward neural network to capture more complex patterns and interactions in the text.
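To peek at these stacked layers and attention heads, you can ask the model to return all hidden states and attention weights. A minimal sketch, assuming the transformers and torch packages are installed:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("BERT stacks transformer layers", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))   # 13: the embedding layer plus 12 transformer layers in BERT-base
print(len(outputs.attentions))      # 12: one attention tensor per layer
print(outputs.attentions[0].shape)  # [batch, heads, tokens, tokens] -- 12 heads per layer in BERT-base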

Masked Language Modeling (MLM)

During pre-training, BERT randomly masks some of the tokens in the input. Then, it trains the model to predict these masked words based on the context provided by the other tokens. This helps the model learn to interpret and generate contextually relevant words. 
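You can see masked language modeling in action with the fill-mask pipeline; the exact predictions and scores will vary with the model:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from the surrounding context in both directions.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))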

Fine-tuning

BERT is pre-trained on massive amounts of text data and is capable of learning rich language representations. As a next step, you can perform fine-tuning, which involves training the pre-trained BERT model according to your needs. Since BERT is highly adaptable, you can adjust it to perform several NLP tasks like sentiment analysis, named entity recognition, or question answering.  
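As a rough sketch, this is how a pre-trained BERT model can be loaded with a task-specific classification head before fine-tuning on labeled data; the example text and label are made up for illustration:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical label: 1 = positive sentiment

outputs = model(**inputs, labels=labels)
print(outputs.loss)    # the loss you would backpropagate during fine-tuning
print(outputs.logits)  # raw scores for each class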

What Makes BERT Great for Embeddings?

An embedding is a numerical representation of a categorical feature, such as a movie genre, a dog's breed, or an employee's job designation. Each feature maps to a fixed list of real numbers, which are learned during training. This process transforms categorical features into dense, continuous vectors, extracting relationships and similarities between categories. 

BERT's bidirectional training approach helps generate high-quality embeddings. To represent text data, BERT uses an embedding layer that consists of three different types of embeddings as follows: 

  • Token Embeddings: Before a text is fed into the BERT model, the BERT tokenizer converts it into a list of integer token IDs. For each unique ID, BERT has a corresponding embedding specifically trained to represent that token. The model’s embedding layer is responsible for mapping these tokens to their corresponding embeddings. 
[Image: Token embeddings]
  • Position Embeddings: Position embeddings indicate the position of each token within the input text. Although there are 30,522 distinct token embeddings, only 512 position embeddings exist within BERT. This is because the BERT model supports a maximum input sequence length of 512 tokens. 
[Image: Position embeddings]
  • Token-type Embeddings: Token-type embeddings, also called segment embeddings, were initially trained for Next Sentence Prediction (NSP). In NSP, the model is given pairs of sentences A and B and must predict whether sentence B logically follows sentence A. There are only two token-type embeddings in BERT: one for tokens in the first sentence and another for those in the second sentence. 

The BERT embedding layer computes the final embedding for each token by summing the three embeddings and then applying layer normalization to the sum. 

A simple illustration of how the BERT embedding layer calculates the embeddings for the string “hello, world” is given below: 

[Image: How the BERT embedding layer computes embeddings for “hello, world”]
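You can reproduce this sum yourself from the model’s internals. The sketch below assumes the Hugging Face transformers implementation, whose embedding sub-modules are named word_embeddings, position_embeddings, token_type_embeddings, and LayerNorm; these attribute names may change between library versions:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # disable dropout so the comparison below is exact

input_ids = tokenizer("hello, world", return_tensors="pt")["input_ids"]
seq_len = input_ids.shape[1]

emb = model.embeddings
token_emb = emb.word_embeddings(input_ids)                                  # token embeddings
position_emb = emb.position_embeddings(torch.arange(seq_len).unsqueeze(0))  # position embeddings
segment_emb = emb.token_type_embeddings(torch.zeros_like(input_ids))        # token-type embeddings

manual = emb.LayerNorm(token_emb + position_emb + segment_emb)  # sum, then layer normalization
reference = emb(input_ids)                                      # what the embedding layer computes
print(torch.allclose(manual, reference, atol=1e-5))             # expected: True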

Generating High-Quality BERT Embeddings with Hugging Face

Hugging Face, an AI community, offers a way to work with BERT models such as RoBERTa, DistilBERT, BERT-Tiny, and many more to generate word embeddings. Here is a simple example that creates BERT embeddings for a given input sentence using Hugging Face’s bert-base-cased model:

[Image: Creating BERT embeddings with Hugging Face]
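Since the original code is shown only as an image, here is a minimal sketch along the same lines, assuming the transformers and torch packages are installed; mean pooling is just one simple way to turn per-token embeddings into a sentence vector:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

sentence = "BERT turns text into dense vectors."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state       # shape: [1, num_tokens, 768]
sentence_embedding = token_embeddings.mean(dim=1)  # simple mean pooling over tokens
print(token_embeddings.shape, sentence_embedding.shape)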

This selected model supports only the English language. However, Hugging Face also provides a multilingual, case-sensitive base model covering the top 104 languages with the most extensive Wikipedia datasets. This versatility makes the BERT models applicable globally.

Step-by-Step Implementation of BERT for Embeddings

Here are the detailed steps to implement the BERT model for creating embeddings: 

  1. Launch Google Colab and click on File > New notebook in Drive.
  2. Install the transformers module using the pip command:

!pip install transformers

  3. Import the following Python libraries:

import random
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

  4. Enter the code below to initialize the random seed for PyTorch, ensuring reproducibility and managing GPU randomness.

RandomSeed = 52
random.seed(RandomSeed)
torch.manual_seed(RandomSeed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RandomSeed)

  5. Then, load the pre-trained BERT model and its tokenizer. Here, let’s use the “bert-base-uncased” model, which converts all uppercase characters in the text to lowercase.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

  6. Now, consider an input sequence and tokenize it with the BERT tokenizer. batch_encode_plus() encodes the word and subword tokens into unique identifiers. You can also add special tokens such as [CLS] and [SEP] by setting the add_special_tokens parameter to True.

text = "This is a computer science portal"
encoding = tokenizer.batch_encode_plus(
    [text],                  # the input sentence(s) to encode
    padding=True,            # pad shorter sequences in the batch
    truncation=True,         # truncate sequences longer than the model limit
    return_tensors='pt',     # return PyTorch tensors
    add_special_tokens=True  # add the [CLS] and [SEP] tokens
)

token_ids = encoding['input_ids']  
print(f"Token ID: {token_ids}")
attentionMask = encoding['attention_mask']  
print(f"Attention mask: {attentionMask}")

After executing the above code, you will get the following output:

[Image: Token IDs and attention mask output]

For each token, the attention mask value is 1, indicating that the BERT model should attend to every position in the input sequence when generating embeddings. Here there is no padding, so all tokens receive full attention. If the batch contained shorter, padded sequences, the padded positions would receive a mask value of 0 and be ignored.

  7. In this step, forward the token IDs and attention mask through the BERT model to generate embeddings for each token.

with torch.no_grad():
    outputs = model(token_ids, attention_mask=attentionMask)
    word_embeddings = outputs.last_hidden_state  # one 768-dimensional vector per token
print(f"Word Embeddings Shape: {word_embeddings.shape}")

Once you run the above code, you can see the shape of the generated embedding space as below:

[Image: Word embeddings shape output]

The shape of the word embeddings is [1, 8, 768], where 768 is the dimensionality, or hidden size, of the embeddings generated by the BERT model: each token is encoded as a 768-dimensional vector. The number 8 corresponds to the total number of tokens in the input text after tokenization, including the [CLS] and [SEP] tokens. The 1 denotes the batch dimension, indicating the number of sentences processed. 
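If you need a single vector for the whole sentence rather than per-token embeddings, one common approach (though not the only one) is to mean-pool the token embeddings while masking out any padding. A short sketch reusing the word_embeddings and attentionMask variables from above:

# Mean-pool the token embeddings into one sentence vector, ignoring padded positions.
mask = attentionMask.unsqueeze(-1).float()                                  # [1, num_tokens, 1]
sentence_embedding = (word_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # [1, 768]
print(f"Sentence Embedding Shape: {sentence_embedding.shape}")

# Alternative: take the [CLS] token's embedding as the sentence representation.
cls_embedding = word_embeddings[:, 0, :]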

  8. You can verify the tokenization by decoding the token IDs back to text using the following code snippet:

decodedText = tokenizer.decode(token_ids[0], skip_special_tokens=True)
print(f"Decoded Text: {decodedText}")
tokenizedText = tokenizer.tokenize(decodedText)
print(f"Tokenized Text: {tokenizedText}")
encodedText = tokenizer.encode(text, return_tensors='pt')  
print(f"Encoded Text: {encodedText}")

Here is the decoded text: 

[Image: Decoded, tokenized, and encoded text output]
  9. Finally, you can extract and print the word embeddings generated by BERT using the Python code below:

# Skip the [CLS] and [SEP] embeddings so the tokens and vectors line up.
for token, embedding in zip(tokenizedText, word_embeddings[0][1:-1]):
    print(f"Word Embeddings are: {embedding}")
    print("\n")

Here is the output:

[Image: Word embeddings output (truncated)]

Since the output is very long, only a small portion of it has been shown for better understanding. 

Practical Use Cases of BERT Embeddings

Let’s take a look at a few practical use cases of BERT embeddings:

Text Classification

BERT embeddings will help you improve your text classification tasks by capturing the semantic meaning of words and phrases within context. Whether you are building sentiment analysis models, spam detection systems, or topic categorization algorithms, BERT provides rich language representations that enhance classification accuracy. 
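For instance, here is a hedged sketch of how BERT sentence embeddings could feed a simple scikit-learn classifier; the texts and labels are toy examples invented for illustration:

import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["I loved this movie", "Terrible acting and a boring plot"]  # toy training data
labels = [1, 0]                                                      # 1 = positive, 0 = negative

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze(0).numpy()  # [CLS] token embedding

features = [embed(t) for t in texts]
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict([embed("What a great film")]))  # expected to lean positive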

Named Entity Recognition (NER)

BERT embeddings allow you to identify and classify entities within a text, such as names, organizations, or locations. This is crucial in data science tasks like extracting relevant information from large, unstructured datasets. Such embeddings also help automate data labeling processes in natural language processing pipelines.  

Question Answering Systems

By using BERT embeddings, you can build complex question-answering (QA) systems that understand and process user queries more efficiently. The QA systems are beneficial in customer support, knowledge management, and any scenario where automated responses to challenging questions are needed. 

Recommendation Systems

You can employ BERT embeddings to improve your recommendation systems. This enables the systems to understand user preferences through text-based reviews. Your recommendation systems can create more personalized and relevant suggestions by analyzing the content and sentiment within user-generated feedback. 

How Can Airbyte Help Build a Data Pipeline to Store and Utilize Embeddings?

Training large language models like BERT requires a lot of relevant data. This data often comes from many sources, such as Kaggle, Hugging Face, Wikipedia, etc. As the first step in training an LLM is to integrate all this data and keep it up-to-date, you need a powerful, robust data pipeline in place. 

To help you with that, you can leverage Airbyte, a data integration and replication platform. It allows you to transfer large amounts of data from multiple systems to a destination of your choice using its 350+ pre-built connectors. If you cannot find a connector that meets your requirements, you can create one in minutes using the CDK feature. 

[Image: Airbyte]

Here are some of the standout features of Airbyte:

  • PyAirbyte Pipeline: If you are a Python developer, the PyAirbyte feature, an open-source, developer-friendly library, will help you build a custom pipeline according to your needs using Airbyte connectors.
  • Integration with dbt: Airbyte allows you to integrate with dbt to create custom transformations using SQL scripts. 
  • Data Security: Airbyte offers numerous security measures, including TLS or HTTPS encryption, SSH Tunneling, and access control methods, to ensure the safety of your data integration process. It also offers compliance certifications such as SOC Type II assessment or ISO standards. 

Apart from all these Airbyte features, one of the notable benefits is that it allows you to streamline your generative AI workflows by ingesting all the unstructured data into vector databases. During the integration process, Airbyte allows you to perform the following transformations that are essential for RAG-based workflows:

  • LangChain-powered Chunking: This processing step helps you break down large text records into smaller chunks that fit within the model’s context window. It also decides which fields to use as the main context and which ones are additional metadata. 
  • OpenAI or Cohere-enabled Embeddings: Airbyte allows you to convert text into vector representations using OpenAI's or Cohere's pre-trained model. 

After converting the data into vectors, you can store the embeddings in a Pinecone index and use them for similarity search.  
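As an illustration of what that similarity search might look like once the embeddings are indexed, here is a rough sketch using the Pinecone Python client; the API key, index name, and query vector are placeholders, and the exact client API may differ by version:

from pinecone import Pinecone

# Placeholders: replace with your real API key and index name.
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("my-embeddings-index")

# In practice, embed the user's query with the same OpenAI or Cohere model used during
# ingestion; a zero vector of the right dimensionality is used here only as a placeholder.
query_vector = [0.0] * 1536

results = index.query(vector=query_vector, top_k=3, include_metadata=True)
for match in results.matches:
    print(match.id, match.score, match.metadata)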

Steps to Build a Pipeline to Store Embeddings in Pinecone

Consider a scenario where you have stored massive amounts of data in a local JSON file. You may want to load all of that data into a vector database, such as Pinecone, so your LLMs can readily use it to generate human-like responses. 

Here is a quick guide to this integration process:

Step 1: Launch Airbyte 

  • Log in to Airbyte Cloud or use its open-source version for free. 
  • If you have chosen Airbyte Cloud, create your Airbyte account with Google, GitHub, or SSO. 
  • Once navigated to Airbyte’s homepage, you can create your data pipeline to start the data integration process.
[Image: Airbyte homepage]

Step 2: Configure JSON as Your Source

Set up your data source using the following steps: 

  • Click on Sources from the left side of your Airbyte dashboard.
  • Search for the JSON connector and click on the File (CSV, JSON, Excel, Feather, Parquet) connector.
[Image: Selecting the File connector as the source]
  • On the source page, specify the dataset name, choose the File Format as JSON, and provide the URL path to access the file. The setup guide on the right side of the page provides more information. 
[Image: Source configuration with File Format set to JSON]
  • Once you provide all the necessary fields, click on the Set up source button. 

Step 3: Configure Pinecone as Your Destination

After configuring your JSON source, you must configure Pinecone as the destination. Before you start, ensure the following prerequisites are in place:

  • Create an OpenAI or Cohere secret API key for embeddings.
  • Create a Pinecone API key for indexing.

Now, use the following steps to complete the destination configuration:

  • Select the Destination option from the dashboard.
  • Search for the Pinecone connector and select it.
[Image: Selecting the Pinecone destination connector]
  • Once you are redirected to the destination page, complete three sections. Under the Processing section, specify the Chunk size, the Fields to store as metadata, and the Text fields to embed. In the Embedding section, choose an OpenAI or Cohere pre-trained model for creating embeddings and provide the corresponding API key. In the Indexing section, enter the Pinecone Index to load data into, the Pinecone Environment to use, and the Pinecone API key that matches that environment.
[Image: Pinecone destination configuration]
  • After specifying all the required details, click on the Set up destination button. 

Step 4: Set Up Your First Sync

After your source and destination are defined, set up your first sync using the following steps:

  • Click on the Connection option and choose Select an existing source.
  • Select JSON as your source and Pinecone as your destination. 
  • Then, choose the streams that you need to sync.

Airbyte’s CDC approach will help you track the latest modifications in the JSON file and copy them into Pinecone. Once the sync has finished, you can verify that the data has been loaded into your target system. 

Conclusion

BERT embeddings are a powerful technique that allows machines to understand, interpret, and process natural language. Thanks to its bidirectional training, BERT’s contextual embeddings capture complex meanings and relationships between words. By leveraging BERT’s pre-trained models and fine-tuning them on specific tasks, you can achieve significant improvements in AI-driven applications. Hugging Face offers various BERT base and multilingual models for generating embeddings. To try it yourself in Python, follow the step-by-step guide outlined in this article. 

FAQs

Can BERT be used for embedding?

Yes, BERT can be used to create embeddings. This will allow the model to capture context-rich representations of words, facilitating deeper language understanding and learning. 

What is the size of the BERT embedding vector?

The size of BERT embeddings varies from model to model. The BERT-base model produces 768-dimensional embeddings, while the BERT-large model produces 1024-dimensional embeddings. 

What is the difference between BERT and Word2Vec?

BERT produces contextualized embeddings using a deep transformer architecture, so a word’s vector changes with its context. In contrast, Word2Vec generates static embeddings, where each word has a fixed vector representation regardless of context. Another difference is that BERT’s deep, multi-layer transformer, with self-attention and feedforward sub-layers, can capture word meanings that shift with context, whereas Word2Vec is based on a shallow neural network and cannot. 

What are the advantages of BERT embeddings?

Here are the benefits of BERT embeddings: 

  • Unlike traditional models that read text unidirectionally, BERT reads in both directions.
  • BERT embeddings can be fine-tuned for a variety of NLP tasks, making them adaptable to specific needs.
  • BERT word embeddings help the machines capture nuances and different meanings of words depending on their context. 

Can BERT be used for topic modeling?

Yes. BERTopic is a topic modeling technique that leverages BERT to create document embeddings and improve clustering and thematic analysis. 

What is the BERT embedding layer?

The BERT embedding layer refers to the initial layer of the BERT model responsible for converting input tokens into continuous vector representations. This layer includes token embeddings, positional embeddings, and token-type embeddings. These three are summed, normalized, and fed into the subsequent layers of the BERT model to generate the final representations. 
