Common Table Expressions (CTEs) in PostgreSQL

In the realm of SQL databases, managing complex queries efficiently and maintaining readability can be quite challenging, especially when dealing with large datasets and intricate relationships. PostgreSQL, a powerful open-source relational database, offers a feature called Common Table Expressions (CTEs) to address this issue. CTEs, introduced through the WITH clause, not only enhance the readability and maintainability of SQL code but also enable the execution of recursive queries, which are essential for dealing with hierarchical or tree-structured data. This blog post delves into the concept of CTEs in PostgreSQL, illustrating their utility in simplifying complex queries and executing recursive operations with practical examples.

What is a CTE?

A Common Table Expression (CTE) provides a way to write auxiliary statements for use in a larger SQL query. It's akin to defining a temporary result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. A CTE is defined using the WITH clause followed by a CTE name and an AS clause, which contains a subquery.

You can think of a CTE as an ad-hoc view or a temporary table that exists only for the duration of the query. It's particularly useful for breaking down complex queries into smaller, more manageable parts, thereby improving the readability and maintainability of your SQL code.

Simple Use Case: Simplifying Complex Queries

CTEs shine in scenarios where queries are complex, involving multiple subqueries, joins, and aggregations. By breaking down these complex operations into smaller, more manageable parts, CTEs enhance the readability and maintainability of your SQL code.

Example 1: Organizing Multiple Aggregations

Imagine you have a sales database with tables for sales and products. You want to find the total sales per product category.

Without CTEs, your query might involve nested subqueries, which can quickly become unwieldy. With CTEs, you can simplify this process:


WITH product_sales AS (

    SELECT

        p.category,

        SUM(s.amount) AS total_sales

    FROM sales s

    JOIN products p ON s.product_id = p.id

    GROUP BY p.category

)

SELECT category, total_sales

FROM product_sales

ORDER BY total_sales DESC;

This CTE, product_sales, calculates the total sales per category. The main query then selects from this CTE, producing a clean and understandable result set.

Of course, this is a relatively simple example, but the benefits of CTEs become more apparent as queries grow in complexity...

Recursive Queries: Handling Hierarchical Data

One of the most powerful features of CTEs is their ability to execute recursive queries. This is invaluable when dealing with hierarchical or tree-structured data, such as organizational charts, directory structures, or any data that has a parent-child relationship.

Example 2: Employee Hierarchy

Consider an employees table with an id, name, and manager_id column, where manager_id refers to the id of the employee's manager. To retrieve the hierarchy of employees under a specific manager, a recursive CTE can be used.


WITH RECURSIVE employee_hierarchy AS (

    SELECT

        id,

        name,

        manager_id,

        1 AS level

    FROM employees

    WHERE manager_id IS NULL -- Assuming top-level managers have NULL manager_id

    UNION ALL

    SELECT

        e.id,

        e.name,

        e.manager_id,

        eh.level + 1

    FROM employees e

    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.id

)

SELECT *

FROM employee_hierarchy

ORDER BY level, name;

This recursive CTE starts with the top-level managers (where manager_id is NULL) and recursively joins the employees table to itself, traversing down the hierarchy. The level column is incremented at each step to indicate the depth of the hierarchy.

This query will return a result set showing the entire employee hierarchy, ordered by level and name. The recursive CTE allows you to navigate the hierarchical data structure without needing to know the depth of the hierarchy in advance.

This type of recursive query is extremene useful for handling hierarchical / nested data, for instance:

Organizational charts
Directory structures
Folder hierarchies

Benefits of Using CTEs

Readability and Maintenance: CTEs make complex queries more readable and easier to maintain by breaking them down into simpler parts.
Debugging Ease: You can test parts of a complex query independently, making debugging easier.
Performance: While CTEs may not always lead to performance improvements, they can make query optimization by the developer more straightforward.

Materialized and Non-Materialized CTEs

Common Table Expressions (CTEs) not only serve as a method to enhance the structure and readability of SQL queries but also introduce an intriguing aspect of query execution related to materialization. The concept of materialized and non-materialized CTEs revolves around how PostgreSQL executes these expressions, with significant implications for performance and optimization.

Essentially - there are two different ways that PostgreSQL can handle CTEs - materialized and non-materialized, which we will cover below:

Materialized CTEs: A Double-Edged Sword

A materialized CTE in PostgreSQL is one where the result of the CTE is computed and stored temporarily at the beginning of the query execution. This stored result is then used throughout the execution of the entire parent query. Materializing a CTE can be beneficial, particularly when the CTE contains expensive calculations that are used multiple times in the parent query. By computing these results once and reusing them, PostgreSQL avoids redundant work, potentially saving significant resources and time.

However, materialization comes with its drawbacks. When a CTE is materialized, PostgreSQL cannot leverage certain optimizations, such as pushing restrictions from the parent query down into the CTE. This limitation arises because materializing a CTE essentially creates a fixed temporary dataset, disconnecting it from any indexes or optimizations that could have been applied if the data were accessed directly from the underlying tables.

Consider the following example where a CTE is used twice in a query:

WITH w AS (
    SELECT * FROM big_table  -- Assume big_table has an INDEX on a field called key
)
SELECT * FROM w AS w1 
  JOIN w AS w2 ON w1.key = w2.ref  -- CTE w is called twice, so it is materialized by default
WHERE w2.key = 123;

In this scenario, big_table is materialized into a temporary copy, and the join operation occurs without the benefit of the index on the key field, potentially leading to suboptimal performance.

Non-Materialized CTEs: Optimizing with Flexibility

A non-materialized CTE, on the other hand, is not stored temporarily. Instead, it's integrated or "folded" into the main query, allowing PostgreSQL to optimize the entire query more effectively, including the use of indexes and pushing down restrictions.

To ensure that a CTE is not materialized and thus allows for such optimizations, one can explicitly declare it as NOT MATERIALIZED:

WITH w AS NOT MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

This declaration instructs PostgreSQL to treat the CTE as an integral part of the main query, leveraging indexes and achieving potentially significant performance improvements.

Default Behaviour: PostgreSQL's Choice

By default, the decision to materialize a CTE is made by PostgreSQL based on the characteristics and usage of the CTE within the parent query. If the CTE is non-recursive, side-effect-free (i.e., it contains no volatile functions), and referenced only once in the parent query, PostgreSQL will not materialize it. Otherwise, it will be materialized. The exact quote from the PostgreSQL manual is as follows:

However, if a WITH query is non-recursive and side-effect-free (that is, it is a SELECT containing no volatile functions) then it can be folded into the parent query, allowing joint optimization of the two query levels. By default, this happens if the parent query references the WITH query just once, but not if it references the WITH query more than once.

When to Materialize (or Not)

Deciding between materialized and non-materialized CTEs requires a nuanced understanding of your query and data. Materialization might hinder performance if it prevents the use of indexes or if the temporary dataset becomes large. Conversely, in cases where a CTE's result is used multiple times and contains heavy computations, materialization could be beneficial by avoiding repeated calculations.

There's no one-size-fits-all rule for when to use materialized versus non-materialized CTEs. The decision heavily depends on various factors, including the size of the tables involved, the fields selected, and the presence (or absence) of indexes that could be leveraged. As such, DBAs and developers are encouraged to experiment with both approaches, benchmarking their queries to identify which option yields better performance for their specific use case.

In essence, the choice between materializing a CTE or not illustrates the balance between computational redundancy and query optimization potential. While PostgreSQL provides the tools to control this aspect of query execution, it ultimately falls to the database professionals to harness their knowledge and experience to make informed decisions, ensuring queries are both efficient and effective.

Limitations and Considerations

Performance Overheads: In some cases, particularly with recursive CTEs, there can be performance overheads. It's essential to analyze and optimize recursive CTEs carefully, especially with large datasets.
Materialization: PostgreSQL might materialize the results of a CTE, storing them in a temporary space, which can sometimes lead to less-than-optimal performance. However, this behavior allows for complex calculations and operations that would otherwise be challenging.

Conclusion

Common Table Expressions in PostgreSQL offer a robust solution for simplifying complex SQL queries and performing recursive operations. By enhancing the readability and maintainability of SQL code, CTEs make it easier for developers to write, debug, and optimize their database queries. Whether you're dealing with complicated join operations or navigating hierarchical data structures, CTEs can significantly streamline your SQL query writing process. Like any powerful tool, however, it's important to use CTEs judiciously and be mindful of potential performance implications. With practice and careful consideration, CTEs can become an invaluable part of your PostgreSQL toolkit.

Understanding Common Table Expressions (CTEs) in PostgreSQL