Postgres Full Text Search: 3 Critical Aspects
Data professionals face a critical challenge: organizations generate massive volumes of textual data daily, yet traditional search methods fail to deliver the speed and accuracy that modern analytics demands. Poor data quality erodes decision-making confidence, and many teams struggle to extract meaningful insights from their text-heavy datasets. The result is a bottleneck that prevents organizations from leveraging one of their most valuable assets: unstructured text data.
PostgreSQL's full-text search capabilities offer a powerful solution to this challenge. Unlike basic string matching operations that scan entire datasets character by character, PostgreSQL full-text search provides linguistic intelligence, relevance ranking, and optimized indexing that transforms how organizations handle textual data. Rather than relying on external search engines or costly third-party solutions, you can implement sophisticated text search functionality directly within your existing PostgreSQL infrastructure.
This comprehensive guide explores PostgreSQL's full-text search capabilities, covering everything from fundamental concepts to advanced optimization techniques that enable enterprise-scale performance. You'll discover how to implement efficient indexing strategies, handle multilingual content, and integrate modern extensions that enhance search relevance and performance.
What Is PostgreSQL Full-Text Search and How Does It Work?
PostgreSQL full-text search is a sophisticated technique for searching and analyzing text-based documents stored in databases. Unlike simple pattern matching, it provides intelligent indexing of natural-language documents and identifies matches based on linguistic analysis rather than exact character sequences.
The system operates by preprocessing documents into normalized tokens, storing them in optimized data structures, and enabling complex queries that support Boolean logic, phrase matching, and relevance ranking. This approach transforms raw text into searchable lexemes while maintaining positional information for accurate phrase matching and proximity searches.
PostgreSQL's full-text search functionality extends far beyond basic keyword matching. It provides comprehensive support for multiple languages, custom dictionaries, and advanced ranking algorithms that deliver results comparable to dedicated search engines. The system integrates seamlessly with PostgreSQL's existing SQL infrastructure, allowing you to combine text search with relational queries, transactions, and all other database features.
Key Components of Full-Text Search
Document
In full-text search context, a document represents the fundamental unit for text analysis. This can be a blog post, product description, email content, or any textual field within your database tables. Documents can span multiple columns or even multiple tables, giving you flexibility in defining what constitutes a searchable unit. The text search engine parses these documents, extracting meaningful terms and establishing relationships between lexemes and their source documents for efficient querying.
Dictionary
Dictionaries serve dual purposes in text processing: filtering out irrelevant words and normalizing terms to ensure consistent matching. They handle word stemming, synonym mapping, and language-specific transformations that improve search accuracy. PostgreSQL ships with predefined dictionaries for major languages and provides templates for creating custom dictionaries tailored to specific domains or terminology.
Stop Words
These common words carry little semantic value and are typically excluded from search operations to improve both performance and relevance. Words like "the," "and," "is," and "of" appear frequently but rarely help distinguish between documents. Dictionary configurations determine stop word handling, with different approaches suitable for different languages and search contexts.
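You can observe stop-word removal directly by running text through to_tsvector with the built-in english configuration:

```sql
SELECT to_tsvector('english', 'The cat is on the mat');
-- Result: 'cat':2 'mat':6
-- "the", "is", and "on" are dropped as stop words,
-- while the surviving lexemes keep their original positions
```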
How Does PostgreSQL Full-Text Search Differ from Traditional Text Operators?
Traditional PostgreSQL text operators like LIKE, ILIKE, ~, and ~* provide basic pattern matching but lack the sophisticated features required for modern information retrieval systems. These operators perform character-by-character comparisons without understanding linguistic relationships or providing any mechanism for result ranking.
Full-text search addresses these fundamental limitations through several key innovations. First, it provides comprehensive linguistic support including stemming, which recognizes that "running," "runs," and "ran" all relate to the base concept "run." This eliminates the need for complex regular expressions or multiple query variations to capture related terms.
Second, the system includes built-in ranking mechanisms that order results by relevance rather than arbitrary database ordering. This ranking considers factors like term frequency, document length, and positional information to surface the most relevant matches first.
Third, full-text search supports specialized indexing structures that dramatically improve performance on large text collections. While traditional operators require full table scans for text matching, full-text indexes enable rapid lookup of documents containing specific terms or combinations of terms.
The preprocessing approach transforms documents into optimized representations during storage rather than at query time. This front-loaded processing investment pays dividends through faster query execution and more sophisticated search capabilities including phrase matching, proximity searches, and Boolean logic combinations.
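A quick comparison makes the stemming difference concrete:

```sql
-- Pattern matching sees only literal characters:
-- 'running' does not contain the substring 'run ' (with a trailing space)
SELECT 'he was running' LIKE '%run %';          -- false

-- Full-text search stems 'running' to the lexeme 'run'
SELECT to_tsvector('english', 'he was running')
       @@ to_tsquery('english', 'run');         -- true
```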
What Are the Core Data Types for PostgreSQL Full-Text Search?
PostgreSQL introduces two specialized data types that form the foundation of its full-text search system: tsvector and tsquery. These types work together to provide efficient storage and querying of text data.
Understanding tsvector
A tsvector value represents a document as a sorted list of distinct lexemes, automatically eliminating duplicates and maintaining consistent ordering. This normalization occurs during input processing, ensuring optimal storage and query performance.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- Result: 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
Lexemes within a tsvector can include positional information indicating where terms appear in the original document. These positions range from 1 to 16,383 and support optional weights (A, B, C, or D) that indicate the relative importance of terms in different document sections.
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
-- Result: 'a':1A 'cat':5 'fat':2B,4C
This positional data enables sophisticated queries including phrase matching and proximity searches, where you can find documents containing terms within specific distances of each other.
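A brief illustration of positional matching with the FOLLOWED BY operator:

```sql
-- 'cat' immediately follows 'fat' (positions 2 and 3), so this matches
SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'fat <-> cat');   -- true

-- 'mat' is five positions after 'fat', so adjacency fails
SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'fat <-> mat');   -- false

-- phraseto_tsquery builds the same positional query from plain text
SELECT phraseto_tsquery('english', 'fat cat');
-- Result: 'fat' <-> 'cat'
```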
Working with tsquery
The tsquery type stores search conditions, supporting complex Boolean logic through operators that combine individual terms. You can construct queries using AND (&), OR (|), NOT (!), and FOLLOWED BY (<->) operators to express precise search requirements.
SELECT 'fat & (rat | cat)'::tsquery;
-- Result: 'fat' & ('rat' | 'cat')
The FOLLOWED BY operator supports distance specifications, allowing you to require that one term appear at an exact word distance after another:
SELECT 'fat <2> cat'::tsquery;
-- Matches documents where 'cat' appears exactly 2 positions after 'fat'
Weight restrictions can be applied to limit matches to specific document sections, while prefix matching enables partial word searches:
SELECT 'super:*'::tsquery;
-- Matches 'super', 'supervisor', 'supernatural', etc.
For optimal results, raw query text should be run through a normalization function such as to_tsquery(), which applies the same dictionary processing used when indexing documents:
SELECT to_tsquery('Fat:ab & Cats');
-- Result: 'fat':AB & 'cat'
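Note that to_tsquery() expects operator syntax and raises an error on malformed input; for user-supplied text, the gentler parsers plainto_tsquery() and websearch_to_tsquery() (PostgreSQL 11+) are often safer:

```sql
-- plainto_tsquery ANDs all words together, ignoring punctuation
SELECT plainto_tsquery('english', 'fat cats, thin rats');
-- Result: 'fat' & 'cat' & 'thin' & 'rat'

-- websearch_to_tsquery understands quoted phrases and minus signs
SELECT websearch_to_tsquery('english', '"fat cat" -rat');
-- Result: 'fat' <-> 'cat' & !'rat'
```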
What Advanced Indexing Strategies Optimize Full-Text Search Performance?
PostgreSQL offers multiple indexing approaches for full-text search, each optimized for different query patterns and performance requirements. Understanding these options enables you to choose the most appropriate strategy for your specific use case and data characteristics.
GIN Index Optimization
Generalized Inverted Indexes (GIN) provide the most efficient approach for frequent keyword searches and Boolean queries. These indexes store inverted lists for each lexeme, enabling rapid identification of documents containing specific terms or term combinations.
CREATE INDEX idx_gin_documents ON articles
USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, '')));
GIN indexes excel when your workload involves many search queries relative to document updates. They provide superior query performance but require more maintenance overhead during frequent insertions or updates. For read-heavy workloads, configure fastupdate = off to keep the pending list empty, which reduces index bloat and makes query times more predictable.
The gin_pending_list_limit parameter controls how PostgreSQL batches index updates. Larger values reduce update overhead but may increase query latency until pending updates are processed. Monitor your workload characteristics to optimize this balance.
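As a sketch of these two knobs (the idx_gin_documents_ro index name is hypothetical, and gin_pending_list_limit is specified in kilobytes):

```sql
-- Read-heavy workload: write entries directly into the index,
-- bypassing the pending list entirely
CREATE INDEX idx_gin_documents_ro ON articles
USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, '')))
WITH (fastupdate = off);

-- Mixed workload: keep fastupdate but tune the pending-list
-- threshold (in kB) on an existing index
ALTER INDEX idx_gin_documents_ro SET (fastupdate = on, gin_pending_list_limit = 4096);
```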
GiST Index Considerations
Generalized Search Tree (GiST) indexes offer advantages for specific query patterns, particularly those involving range searches or when combining full-text search with other spatial or temporal criteria.
CREATE INDEX idx_gist_documents ON articles
USING GIST (to_tsvector('english', content));
GiST indexes typically require less storage space than GIN indexes and handle frequent updates more efficiently. However, they generally provide slower query performance for pure text searches. Consider GiST indexes when your queries frequently combine text search with other conditions or when update frequency significantly exceeds query frequency.
Precomputation Strategies
Generated columns provide significant performance improvements by storing precomputed tsvector values, eliminating the need for runtime text processing during queries.
ALTER TABLE articles
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))) STORED;
CREATE INDEX idx_search_vector ON articles USING GIN (search_vector);
This approach moves processing overhead from query execution to document insertion, dramatically improving search performance. The storage cost of additional tsvector columns is typically justified by the query performance gains, especially for frequently searched content.
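On PostgreSQL versions before 12, where generated columns are unavailable, the built-in tsvector_update_trigger function achieves the same precomputation with a plain column (the trigger name here is illustrative):

```sql
-- Plain column instead of a generated one
ALTER TABLE articles ADD COLUMN search_vector tsvector;

-- Recompute the column on every insert or update
CREATE TRIGGER trg_articles_search_vector
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(search_vector, 'pg_catalog.english', title, content);
```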
Materialized views extend this concept to complex multi-table scenarios:
CREATE MATERIALIZED VIEW article_search AS
SELECT a.id, a.title,
setweight(to_tsvector('english', a.title), 'A') ||
setweight(to_tsvector('english', a.content), 'B') ||
setweight(to_tsvector('english', coalesce(string_agg(t.name, ' '), '')), 'C') as search_vector
FROM articles a
LEFT JOIN article_tags at ON a.id = at.article_id
LEFT JOIN tags t ON at.tag_id = t.id
GROUP BY a.id, a.title, a.content;
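A materialized view is a snapshot, so it must be refreshed as source rows change; adding a unique index permits REFRESH ... CONCURRENTLY, which rebuilds the view without blocking readers (index names here are illustrative):

```sql
-- A unique index is required for a non-blocking refresh
CREATE UNIQUE INDEX idx_article_search_id ON article_search (id);
CREATE INDEX idx_article_search_vector ON article_search USING GIN (search_vector);

-- Rebuild without locking out concurrent SELECTs
REFRESH MATERIALIZED VIEW CONCURRENTLY article_search;
```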
Weighted Ranking Implementation
Document sections carry different semantic importance, and PostgreSQL's weighting system enables you to reflect these distinctions in search results. Assign higher weights to titles, headers, and other prominent content areas:
SELECT a.id, a.title,
ts_rank(
setweight(to_tsvector('english', a.title), 'A') ||
setweight(to_tsvector('english', a.content), 'B'),
to_tsquery('english', 'postgresql & search')
) as rank
FROM articles a
WHERE (setweight(to_tsvector('english', a.title), 'A') ||
setweight(to_tsvector('english', a.content), 'B'))
@@ to_tsquery('english', 'postgresql & search')
ORDER BY rank DESC;
This approach ensures that documents with query terms in titles rank higher than those with terms only in body content, improving result relevance for end users.
How Can You Implement Multilingual Support and Advanced Extensions?
Modern applications often require text search capabilities across multiple languages and content types that extend beyond PostgreSQL's native full-text search functionality. Advanced extensions and specialized techniques address these requirements while maintaining integration with your existing PostgreSQL infrastructure.
PGroonga for Comprehensive Language Support
The PGroonga extension provides robust support for languages that lack effective tokenization in PostgreSQL's standard full-text search, including Chinese, Japanese, Korean, and other languages without clear word boundaries.
CREATE EXTENSION pgroonga;
CREATE TABLE multilingual_docs (
id serial PRIMARY KEY,
title text,
content text,
language text
);
CREATE INDEX idx_pgroonga_search ON multilingual_docs
USING pgroonga (title, content);
PGroonga enables searches across multiple languages simultaneously without requiring separate language-specific configurations:
SELECT * FROM multilingual_docs
WHERE title &@~ 'データベース' OR content &@~ 'database'
ORDER BY pgroonga_score(tableoid, ctid) DESC;
This approach handles complex scripts, right-to-left text, and languages with rich morphological systems that challenge standard tokenization approaches.
JSON and XML Content Integration
Structured data within documents requires specialized indexing approaches that preserve both textual content and structural relationships. PGroonga provides native support for searching within JSON and XML fields:
CREATE TABLE structured_docs (
id serial PRIMARY KEY,
metadata jsonb,
content text
);
CREATE INDEX idx_json_search ON structured_docs
USING pgroonga (metadata, content);
SELECT * FROM structured_docs
WHERE metadata &@~ 'author.name:@"John Smith"'
AND content &@~ 'postgresql';
This capability enables complex queries that combine structural conditions with full-text search requirements, eliminating the need for separate processing steps.
Fuzzy Matching with pg_trgm
The pg_trgm extension complements full-text search by providing similarity-based matching that handles misspellings, partial matches, and fuzzy searches:
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_trgm_content ON articles
USING GIST (content gist_trgm_ops);
SELECT title, similarity(content, 'postgresql database') as sim
FROM articles
WHERE content % 'postgresql database'
ORDER BY sim DESC;
Combining trigram similarity with full-text search provides comprehensive coverage for both exact matches and approximate string matching:
SELECT a.title,
CASE
WHEN search_vector @@ to_tsquery('english', 'postgres')
THEN ts_rank(search_vector, to_tsquery('english', 'postgres'))
ELSE similarity(content, 'postgres') * 0.5
END as relevance_score
FROM articles a
WHERE search_vector @@ to_tsquery('english', 'postgres')
OR content % 'postgres'
ORDER BY relevance_score DESC;
Hybrid Search with Vector Extensions
Modern applications increasingly require semantic search capabilities that understand conceptual relationships beyond keyword matching. The pgvector extension enables hybrid approaches combining traditional full-text search with vector similarity:
CREATE EXTENSION vector;
ALTER TABLE articles ADD COLUMN embedding vector(384);
-- Combine keyword and semantic search
WITH keyword_results AS (
SELECT id, title, ts_rank(search_vector, query) as text_score
FROM articles, to_tsquery('english', 'machine learning') query
WHERE search_vector @@ query
),
semantic_results AS (
SELECT id, title, 1 - (embedding <=> query_vector) as vector_score
FROM articles, '[0.1, 0.2, ...]'::vector query_vector
ORDER BY embedding <=> query_vector
LIMIT 100
)
SELECT k.id, k.title,
COALESCE(k.text_score, 0) * 0.7 + COALESCE(s.vector_score, 0) * 0.3 as combined_score
FROM keyword_results k
FULL OUTER JOIN semantic_results s ON k.id = s.id
ORDER BY combined_score DESC;
This hybrid approach leverages the precision of keyword matching for exact terms while using vector similarity to capture semantic relationships and handle synonyms or related concepts.
ICU Collation for International Content
PostgreSQL's ICU support provides sophisticated collation rules for international content, ensuring accurate sorting and comparison across different languages and locales:
CREATE COLLATION multilingual (provider = icu, locale = 'und-u-co-emoji');
CREATE TABLE international_content (
id serial PRIMARY KEY,
title text COLLATE multilingual,
content text,
search_vector tsvector
);
ICU collation handles complex sorting requirements including accent-insensitive comparisons, numeric sorting, and cultural-specific ordering rules that improve user experience in international applications.
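One widely used pattern in this family is an accent- and case-insensitive comparison via a nondeterministic ICU collation (supported since PostgreSQL 12); a minimal sketch:

```sql
-- Compare at primary strength: differences in case and accents are ignored
CREATE COLLATION ignore_accents (
  provider = icu,
  locale = 'und-u-ks-level1',
  deterministic = false
);

SELECT 'Résumé' = 'resume' COLLATE ignore_accents;  -- true
```

Nondeterministic collations cannot be used with LIKE or pattern matching, so reserve them for equality comparisons and sorting.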
What Are the Technical Limitations and Constraints?
Understanding PostgreSQL's full-text search limitations helps you design systems that operate within supported parameters while avoiding performance degradation or functional restrictions.
Size and Capacity Constraints
Several hard limits govern full-text search operations:
- Individual lexemes cannot exceed 2 KB in length
- A tsvector value can contain fewer than 2⁶⁴ lexemes (approximately 18 quintillion)
- Complete tsvector values must remain under 1 MB
- Position values range from 1 to 16,383
- Each lexeme supports a maximum of 256 positional entries
- tsquery expressions cannot exceed 32,768 nodes (lexemes plus operators)
- FOLLOWED BY operations support a maximum distance of 16,384 positions
These constraints rarely impact typical applications but become relevant when processing extremely large documents or constructing complex queries programmatically.
Performance Considerations
Index maintenance overhead increases with document update frequency. GIN indexes require periodic maintenance to prevent bloat:
REINDEX INDEX idx_gin_documents;
ANALYZE articles;
Monitor index usage patterns and adjust maintenance schedules based on your specific workload characteristics. High-frequency update scenarios may benefit from GiST indexes or carefully tuned gin_pending_list_limit settings.
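A lighter-weight maintenance option than a full REINDEX is gin_clean_pending_list() (PostgreSQL 9.6+), which flushes the pending list on demand; the index name here matches the earlier example:

```sql
-- Merge the GIN pending list into the main index structure
SELECT gin_clean_pending_list('idx_gin_documents'::regclass);

-- Track index size over time to spot bloat
SELECT pg_size_pretty(pg_relation_size('idx_gin_documents'));
```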
Query complexity affects performance predictably. Complex Boolean queries with many terms or nested conditions require more processing time and memory. Consider breaking extremely complex queries into simpler components or using materialized views for frequently accessed complex searches.
Language and Dictionary Limitations
Standard PostgreSQL installations include dictionaries for major European languages but may lack comprehensive support for other language families. Custom dictionary creation requires linguistic expertise and substantial configuration effort.
Stemming algorithms vary in effectiveness across languages. English stemming generally produces good results, while other languages may require specialized extensions or custom dictionary configurations for optimal performance.
Character encoding issues can affect search accuracy, particularly when processing content from multiple sources or legacy systems. Ensure consistent UTF-8 encoding throughout your text processing pipeline to avoid indexing and query problems.
How Can You Enhance PostgreSQL Full-Text Search with Modern Data Pipeline Integration?
While PostgreSQL's full-text search capabilities provide powerful text analysis functionality, the effectiveness of any search system depends fundamentally on data quality, freshness, and comprehensive coverage. Modern data pipeline solutions enhance full-text search implementations by ensuring consistent data flow, maintaining index currency, and enabling sophisticated analytics workflows.
Real-time data synchronization becomes critical when full-text search supports customer-facing features or business-critical analytics. Traditional batch processing approaches create gaps where newly created content remains unsearchable until the next processing cycle completes. Stream processing capabilities ensure that new documents become searchable within minutes or seconds of creation.
Platforms like Airbyte provide extensive connector libraries that enable unified text search across diverse data sources. By establishing PostgreSQL as your central destination, you can aggregate content from customer support systems, marketing platforms, product catalogs, and internal documentation into a single searchable repository. This consolidation eliminates data silos while providing comprehensive search capabilities across your entire organizational content.
Change data capture functionality ensures that full-text indexes remain synchronized with source system updates. Rather than periodic full refreshes that consume significant resources, incremental updates maintain search currency while minimizing processing overhead. This approach proves particularly valuable for high-volume content systems where complete reindexing would impact system performance.
Data quality monitoring becomes essential when combining multiple content sources. Automated validation ensures that text encoding remains consistent, prevents duplicate content from skewing search results, and identifies content that requires preprocessing before indexing. Quality checks can detect language inconsistencies, malformed text, or content that lacks sufficient semantic value for meaningful search results.
Pipeline orchestration enables sophisticated preprocessing workflows that enhance search effectiveness. Content can be processed through language detection, sentiment analysis, entity extraction, and other enrichment steps before indexing. These preprocessing steps improve search relevance while enabling advanced analytics capabilities that extend beyond simple text matching.
Frequently Asked Questions
What is the difference between GIN and GiST indexes for full-text search?
GIN indexes provide superior query performance for keyword searches and are optimized for read-heavy workloads. They store complete inverted lists for each lexeme, enabling rapid document identification. GiST indexes offer better performance for mixed workloads with frequent updates and require less storage space, but typically provide slower query execution for pure text searches.
How can I search across multiple languages in a single query?
Use the PGroonga extension for comprehensive multilingual support, or configure multiple text search configurations for different languages within standard PostgreSQL. PGroonga enables simultaneous searches across languages without separate configurations, while standard approaches require union queries across language-specific indexed columns.
What are the best practices for handling large documents in full-text search?
Break large documents into logical sections and index them separately, use generated columns to precompute tsvector values, implement appropriate chunking strategies that preserve semantic context, and consider using materialized views for complex multi-table document compositions. Monitor memory usage during indexing operations for very large text collections.
How do I implement fuzzy matching and handle misspellings?
Combine PostgreSQL's full-text search with the pg_trgm extension for similarity-based matching. Use similarity functions alongside traditional text search operators, implement hybrid scoring that weights exact matches higher than similarity matches, and consider phonetic matching algorithms for names and specialized terminology.
Can I use PostgreSQL full-text search for real-time applications?
Yes, PostgreSQL full-text search supports real-time applications through properly configured indexes and efficient query patterns. Use generated columns for precomputed search vectors, implement appropriate caching strategies, monitor query performance under load, and consider read replicas for high-query-volume scenarios while maintaining write performance on primary instances.