Skip to main content

Outer to Inner Join Conversion

Channel of advanced SQL tuning

Copyright © 2022 PawSQL

Types of Joins

The types of SQL joins mainly include inner join, outer join (left outer join, right outer join, full outer join), and cross join. C.L. Moffatt explains them very nicely in his article Visual Representation of SQL Joins in a visual way. You will find the information you need in following diagram he presented in his article.

How does Join Type affect Query Execution

As we know, the database optimizer uses dynamic programming algorithm to estimate the cost of possible execution paths and select the execution plan with the smallest cost for execution. When the optimizer plans for table joins, it mainly performs three types of planning: access mode of data tables, order of table joins, and method of table joins.

When the optimizer plans for table joins, for Nested Loop Join method, it only considers the plan that the outer table is accessed before the inner table for each join operation. Therefore, when using outer joins, the outer and inner tables are explicitly defined in SQL (the outer table is the left table of left join/right table of right join), and the optimizer's selection is limited, which may lead to a lower-performing execution plan.

Let's use the example from the MySQL official document to illustrate:

Consider a query of this form, where R(T2) greatly narrows the number of matching rows from table T2:

SELECT * T1 FROM T1 LEFT JOIN T2 ON P1(T1,T2) WHERE P(T1,T2) AND R(T2) 

If the query is executed as written, the optimizer has no choice but to access the less-restricted table T1 before the more-restricted table T2, which may produce a very inefficient execution plan.

Outer Join to Inner Join Conversion

If R(T2) is a NULL Filtered Condition (NFC), the outer join above can be transformed into an inner join, i.e.,

SELECT * T1 FROM T1 JOIN T2 ON P1(T1,T2) WHERE P(T1,T2) AND R(T2) 

In this way, the optimizer can first apply R(T2) to obtain a very small result set, and then associate it with T1.

NFC Condition (NULL Filtered Condition)

A NULL Filtered Condition refers to a condition that filters out a row if the input is NULL and its evaluation result is FALSE or UNKNOWN. The rules for determining whether a condition is an NFC of an outer join operation are simple:

  • It is of the form A IS NOT NULL, where A is any column of the inner table.
  • It is a predicate that contains a reference to the inner table, and when one of its parameters is NULL, it evaluates to UNKNOWN.
  • It is a "AND" combination of NFCs.
  • It is a "OR" combination of NFCs.

Therefore, for this outer join:

T1 LEFT JOIN T2 ON T1.A=T2.A 

the following conditions are considered as NFCs:

  • T2.B IS NOT NULL
  • T2.B > 3
  • T2.C <= T1.C
  • T1.B < 3 AND T2.B IS NOT NULL
  • T2.B < 2 OR T2.C > 1

The following conditions are not considered as NFCs:

  • T2.B IS NULL
  • T1.B < 3 OR T2.B IS NOT NULL
  • T1.B < 3 OR T2.B > 3
  • T2.B in (1,2, NULL)

Outer Join Optimization in DBMS

Most of the optimizer in relational databases can provide the above outer join simplification rewrites. Here is an example of outer join simplification rewrites in MySQL and PostgreSQL.

  • Example SQL statement:
select c_custkey from orders left join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
  • Execution Plan in MySQL:
-> Inner hash join (orders.O_CUSTKEY = customer.C_CUSTKEY)  (cost=20541.08 rows=20013)
-> Table scan on orders (cost=2529.21 rows=200128)
-> Hash
-> Filter: (customer.C_NATIONKEY < 20) (cost=0.35 rows=1)
-> Table scan on customer (cost=0.35 rows=1)
  • Execution Plan in PostgreSQL:
Hash Join  (cost=100.19..410.47 rows=33 width=4)
Hash Cond: (orders.o_custkey = customer.c_custkey)
-> Seq Scan on orders (cost=0.00..284.01 rows=10001 width=4)
-> Hash (cost=99.78..99.78 rows=33 width=4)
-> Bitmap Heap Scan on customer (cost=4.54..99.78 rows=33 width=4)
Recheck Cond: (c_nationkey < 20)
-> Bitmap Index Scan on c_nationkey_idx (cost=0.00..4.53 rows=33 width=0)
Index Cond: (c_nationkey < 20)

As seen from the execution plan, both MySQL and PostgreSQL support the rewrite optimization of outer join to inner join.

Outer Join Optimization in PawSQL

PawSQL implements similar rewrite optimization through the Outer2InnerConversionRewrite rule. It supports the rewrite optimization of left outer join, right outer join to inner join, and the rewrite optimization of full outer join to left outer join, right outer join, or inner join.

Note: The purpose of implementing the Outer2InnerConversionRewrite optimization in PawSQL is not to reproduce the rewrite optimization logic of the database optimizer. PawSQL implements it because it can trigger index recommendation functions that the database optimizer cannot achieve.

  1. Input SQL statement
select c_custkey from orders left join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
  1. After Outer2InnerConversionRewrite applied, left join is converted to inner join
select c_custkey from orders inner join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
  1. Recommended Index

The rewritten SQL statement recommends the following two indexes through the PawSQL index recommendation engine.

CREATE INDEX CONCURRENTLY PAW_IDX1028958902 ON PUBLIC.ORDERS(O_CUSTKEY);
CREATE INDEX CONCURRENTLY PAW_IDX1196677611 ON PUBLIC.CUSTOMER(C_NATIONKEY,C_CUSTKEY);
  1. Execution Plan:

After incorporating the recommended indexes into the optimizer planning scope, the execution plan is:

  • Execution Plan in PostgreSQL:
Nested Loop  (cost=0.57..85.53 rows=33 width=4)
-> Index Only Scan using paw_idx1196677611 on customer (cost=0.29..8.86 rows=33 width=4)
Index Cond: (c_nationkey < 20)
-> Index Only Scan using paw
  • Execution Plan in MySQL
-> Nested loop inner join  (cost=1.91 rows=6)
-> Filter: (customer.C_NATIONKEY < 20) (cost=0.35 rows=1)
-> Index scan on customer using PAW_IDX1360881332 (cost=0.35 rows=1)
-> Covering index lookup on orders using PAW_IDX1028958902 (O_CUSTKEY=customer.C_CUSTKEY) (cost=1.56 rows=6)

As we can see, through the entire optimization process of PawSQL, two indexes recommended by the index engine reduced the query cost from 410.47 to 85.53 on a PostgreSQL database with small size of data, resulting in a 379.91% performance improvement. Similarly, on a MySQL database server with a big size of data, the query cost decreased from 20541.08 to 1.91, resulting in a 553161.15% performance improvement.

About PawSQL

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

  • PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
  • PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
  • PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, can be installed and deployed independently, and provides SQL optimization services through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.

Contact Us

Email: service@pawsql.com

Twitter: https://twitter.com/pawsql