Split string in PostgreSQL using regexp_split_to_array

PostgreSQL offers a variety of functions for splitting strings. For more complex cases, regexp_split_to_array is a versatile function that allows for splitting a string into an array of elements based on a specified pattern, which can be a regular expression. This capability makes it particularly useful for dealing with complex splitting criteria that go beyond simple delimiters like commas or spaces.

Basic Example

Before diving into more complex use cases, let's start with a basic example to understand how regexp_split_to_array works.


SELECT regexp_split_to_array('OpenAI,GPT-4,ChatGPT', ',');

This query splits the string 'OpenAI,GPT-4,ChatGPT' into an array of strings using , as the delimiter. The result is an array: {'OpenAI', 'GPT-4', 'ChatGPT'}.

Methodology

The function regexp_split_to_array takes two primary arguments:

1. The string to split: The target string you want to divide into parts.

2. The regular expression pattern: A pattern defining the delimiter or the rule for splitting the string. This pattern can be as simple as a single character or as complex as needed.

Optionally, you can also specify flags to alter the behavior of the pattern matching, such as case sensitivity.

Advanced Use Cases

Moving beyond basic delimiters, let's explore some complex scenarios where regexp_split_to_array shines.

Splitting Strings with Multiple Delimiters

Consider a scenario where your data is inconsistently delimited using commas, semicolons, or spaces. Using a regular expression, you can split the string by any of these characters.


SELECT regexp_split_to_array('Data Science;AI,ML Deep Learning', '[,; ]+');

In this query, [,; ]+ is a regular expression that matches any occurrence of a comma, semicolon, or space. The + ensures that multiple consecutive delimiters are treated as a single delimiter, preventing empty strings in the output array.

Splitting Strings While Ignoring Delimiters in Quoted Strings

Sometimes, delimiters within quoted strings should be ignored. For example, splitting a CSV line where commas within quotes are not delimiters.


SELECT regexp_split_to_array('OpenAI,"GPT, 4",ChatGPT', ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)');

This query splits the string by commas not enclosed in double quotes. The regular expression ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$) looks for commas that are followed by an even number of quotes, ensuring that commas inside quotes are not considered as delimiters.

Smart Splitting Based on Word Boundaries

In cases where you need to split text based on word boundaries (e.g., to extract words), regular expressions allow for sophisticated patterns.


SELECT regexp_split_to_array('OpenAI develops GPT-4.', '\\W+');

Here, \\W+ matches any sequence of non-word characters as delimiters, effectively splitting the string into words. This pattern is useful for text processing where punctuation marks, spaces, or other separators need to be ignored.