How to perform full text search in PostgreSQL

Whenever you have textual data in your database and want to query through it using natural language - be it a search bar in a web application or a complex query in a data analysis tool - you need to use full-text search capabilities. PostgreSQL offers several methods for implementing full-text search, each suited to different requirements and complexity levels. In this article, we will explore three primary methods: Basic Search using the LIKE and ILIKE operators, Advanced Search using tsvector casting combined with to_tsquery, and a more advanced method involving a generated tsvector column.

Basic Search: LIKE and ILIKE Operators

The LIKE operator in PostgreSQL is used for pattern matching within strings. It allows you to search for a specified pattern in a column. For example, to find movies with a title containing the word 'Star', you can use:


SELECT * FROM movies WHERE title LIKE '%Star%';

The percent signs (%) serve as wildcards, representing any sequence of characters. In this case, the query will match titles containing 'Star', 'Star Wars', 'Star Trek', 'Ice Star' etc.

The ILIKE operator is similar to LIKE but performs a case-insensitive search. This is particularly useful when you don't know the exact casing of the text you're searching for. For example:


SELECT * FROM movies WHERE title ILIKE '%star%';

This query will match titles containing 'star', 'Star', 'STAR', etc.

Creating a case insensitive search can be naturally achieved with ILIKE, as demonstrated above, without needing additional steps to manage text casing.

These simple pattern matching operators are useful for basic search requirements, and are particularly useful for doing live search when you want to search parts of words (so as users type, you can fetch results that match the partial word they've typed).

Advanced Search: Using tsvector and to_tsquery

For more advanced full-text searches, PostgreSQL provides text search vectors (tsvector) and text search queries (to_tsquery). A tsvector is a sorted list of distinct lexemes, which are words that have been normalized to merge different forms of the same word (e.g., singular and plural forms).

To use tsvector, you cast a text column into a tsvector and then compare it with a tsquery. For example:

SELECT * FROM movies WHERE to_tsvector('english', overview) @@ to_tsquery('english', 'Star');

This query searches for movies where the 'overview' column contains the word 'Star'. The '@@' operator is used to perform the match. This approach is more powerful and flexible than LIKE/ILIKE, as it supports ranking and relevance, and can efficiently handle noise words and synonyms.

Note how in both the to_tsvector and to_tsquery functions, we specify the language as 'english'. This is important because it tells PostgreSQL which language to use for stemming and stop words. If you're working with text in a different language, you should specify the appropriate language code. This helps PostgreSQL to normalize the text and remove common words that are not useful for search purposes. You can view the list of supported languages by querying the pg_ts_config table.

SELECT cfgname FROM pg_ts_config;

The PostgreSQL documentation has a comprehensive list of functions and operators for working with tsvector and to_tsquery.

More Advanced: Generated tsvector Column

While the previous approach is powerful, it can be inefficient for large datasets, especially when performing full-text searches across multiple columns. This is because the tsvector casting needs to be done on the fly for each row, which can be slow especially for large data sets. For the best performance in full-text search, especially with large datasets, you can add a generated tsvector column to your table and index it. This method involves adding a new column that automatically generates a tsvector from one or more existing text columns. For example:


ALTER TABLE movies

ADD COLUMN fts_doc_en

GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || original_title || ' ' || overview))

STORED;

This essentially tells PostgreSQL to compute the tsvector for the column(s) every time a new row is inserted or updated, and store the result in the fts_doc_en column. This means when you query this table PostgreSQL doesnt need to compute the tsvector on the fly, it can just use the precomputed tsvector.

To farther imporve performance, you can create an index on this column:


CREATE INDEX idx_fts_doc_en ON movies USING GIN (fts_doc_en);

This setup significantly improves search performance because the index allows PostgreSQL to quickly find relevant rows without scanning the entire table.

Comparison and Summary

  • LIKE and ILIKE Operators: Best for simple pattern matching when full-text search capabilities are not required. It's straightforward but can be inefficient for large datasets, especially when the pattern does not start with a wildcard.

  • tsvector and to_tsquery: Suited for applications requiring more sophisticated search capabilities, such as relevance ranking and noise word filtering. It's more powerful than LIKE/ILIKE but requires more complex queries and might still be inefficient for large datasets without proper indexing.

  • Generated tsvector Column: Ideal for large datasets and applications where search performance is critical. By storing a pre-computed tsvector and indexing it, you significantly reduce query times for full-text searches. This method requires additional storage for the generated column and the index but offers the best performance for full-text search operations.

In summary, the choice between LIKE/ILIKE, tsvector/to_tsquery, and a generated tsvector column depends on your application's specific needs. For simple use cases, LIKE or ILIKE might suffice. However, for applications requiring advanced search capabilities and high performance, utilizing tsvector with to_tsquery or a generated tsvector column with indexing is more appropriate.