IN Subquery Optimization
Problem Definition
An IN subquery is a type of subquery that takes the following form.
(expr1, expr2...) [NOT] IN (SELECT expr3, expr4, ...)
An IN subquery can be rewritten as an equivalent correlated EXISTS subquery or inner join, which can create a extra filtering condition. If the filtering condition has an appropriate index or is recommended by the PawSQL index recommendation engine, better performance can be achieved.
- IN Subquery to EXISTS conversion
The following IN subquery is used to retrieve user information for users who have placed orders within the past year:
select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)
It can be rewritten as an EXISTS subquery, which creates a extra filtering condition (c_custkey = o_custkey
):
select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)
- IN Subquery to INNER JOIN conversion
If the query result of a subquery is distinct, then the IN subquery can be rewritten as a join between two tables. This allows the database optimizer to plan a better table join sequence and also enables PawSQL to recommend better optimization methods.
For example, consider the following SQL query where c_custkey
is the primary key of the customer
table:
SELECT * FROM orders WHERE o_custkey IN (SELECT c_custkey FROM customer);
Here, orders
and customer
are two related tables, where c_custkey
is the primary key of the customer
table and o_custkey
is the foreign key in the orders
table that relates to the customer
table.
If the query result of the subquery is distinct, we can rewrite the query as a JOIN query, as shown below:
SELECT * FROM orders,customer where orders.o_custkey = customer.c_custkey;
By rewriting the IN subquery as a JOIN query, the database optimizer can plan a better table join sequence and choose a better execution plan while executing the query. In addition, such a query is also easier to understand and maintain.
Optimization of IN Subqueries in Databases
Both MySQL and PostgreSQL can adopt the optimal execution plan for IN and EXISTS.
- If there is no index on O_ORDERDATE, the execution plans of Query1 and Query2 on MySQL will both use pseudo-code implementation logic of IN subquery:
-> Nested loop inner join (cost=19847117.66 rows=198449671)
-> Table scan on customer (cost=1155.80 rows=9948)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
-> Materialize with deduplication (cost=22471.48..22471.48 rows=19949)
-> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year))) (cost=20476.61 rows=19949)
-> Table scan on orders (cost=20476.61 rows=199487)
- If an index is built on O_ORDERDATE, then their execution plans on MySQL will both use pseudo-code implementation logic of EXISTS subquery:
-> Nested loop semijoin (cost=22777.29 rows=5705)
-> Table scan on customer (cost=1155.80 rows=9948)
-> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year))) (cost=0.92 rows=1)
-> Index lookup on orders using o_idx_key (O_CUSTKEY=customer.C_CUSTKEY) (cost=0.92 rows=6)
- If the queried columns in the subquery are unique, the database will convert it to an inner join.
For the following SQL as an example
select * from orders where o_custkey in (select c_custkey from customer where c_phone like '139%')
The execution plan on MySQL is as follows (PostgreSQL is similar):
-> Nested loop inner join (cost=3541.61 rows=6313)
-> Filter: (customer.C_PHONE like '139%') (cost=1148.89 rows=1099)
-> Table scan on customer (cost=1148.89 rows=9888)
-> Index lookup on orders using idx_orders_ckey (O_CUSTKEY=customer.C_CUSTKEY) (cost=1.60 rows=6)
Note: In MySQL and PostgreSQL databases, IN or EXISTS are equivalent, and the database always try to choose the optimal execution plan based on indexes and statistical information.
Optimization of IN Subqueries in PawSQL
PawSQL will rewrite the IN subquery to EXISTS subquery or inner join query to help the index recommendation engine recommend suitable indexes, prompting the optimizer to adopt the optimal execution plan.
IN subquery to EXISTS conversion
- Original Query
select *
from tpch.customer
where customer.c_custkey in (
select orders.o_custkey
from tpch.orders
where orders.O_ORDERDATE >= current_date - interval '1' YEAR)
- Rewritten Query
select /*QB_1*/ *
from tpch.customer
where exists (select /*QB_2*/ orders.o_custkey
from tpch.orders
where orders.O_ORDERDATE >= current_date - interval '1' YEAR
and orders.o_custkey = customer.c_custkey)
- Index Recommended
CREATE INDEX PAW_IDX1072908633 ON tpch.ORDERS(O_ORDERDATE,O_CUSTKEY);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVE table in the join planning, PAW_IDX1072908633 can be used to do a COVERING index RANGE SCAN with condition(orders.O_ORDERDATE >= current_date - interval '1' YEAR).
CREATE INDEX PAW_IDX0031767282 ON tpch.ORDERS(O_CUSTKEY,O_ORDERDATE);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVEN table in the join planning, PAW_IDX0031767282 can be used to do a COVERING index RANGE SCAN with condition(orders.o_custkey = customer.c_custkey and orders.O_ORDERDATE >= current_date - interval '1' YEAR).
- Validation Details
- Query Plan(before)
-> Nested loop inner join (cost=65987720.69 rows=659855821)
-> Table scan on customer (cost=1149.80 rows=9888)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
-> Materialize with deduplication (cost=13874.51..13874.51 rows=66733)
-> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year))) (cost=7201.21 rows=66733)
-> Table scan on orders (cost=7201.21 rows=200219)
- Query Plan(after)
-> Nested loop inner join (cost=3771444.20 rows=37693056)
-> Table scan on customer (cost=1149.80 rows=9888)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
-> Materialize with deduplication (cost=1150.65..1150.65 rows=3812)
-> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year))) (cost=769.45 rows=3812)
-> Covering index range scan on orders using PAW_IDX1072908633 over ('2022-03-28' <= O_ORDERDATE) (cost=769.45 rows=3812)
Performance improves by 1648.67% after PawSQL optimization.
IN subquery to INNER JOIN conversion
- Original Query,
c_custkey
is the primary key of tablecustomer
.
select *
from tpch.orders
where orders.o_custkey in (
select customer.c_custkey
from tpch.customer)
- Rewritten Query
select *
from tpch.orders, tpch.customer
where customer.c_custkey = orders.o_custkey
- Index Recommended
CREATE INDEX PAW_IDX0455857015 ON tpch.ORDERS(O_CUSTKEY,O_CLERK);
-- When table ORDERS serves as a DRIVEN table in the join planning, PAW_IDX0455857015 can be used to do a index LOOKUP with condition(customer.c_custkey = orders.o_custkey).
- Validation Details
- Query Plan(before)
-> Nested loop inner join (cost=240790.71 rows=200219)
-> Table scan on orders (cost=20549.81 rows=200219)
-> Single-row covering index lookup on customer using key_idx (C_CUSTKEY=orders.O_CUSTKEY) (cost=1.00 rows=1)
- Query Plan(after)
-> Nested loop inner join (cost=21289.23 rows=53135)
-> Index scan on customer using key_idx (cost=1149.80 rows=9888)
-> Index lookup on orders using PAW_IDX0455857015 (O_CUSTKEY=customer.C_CUSTKEY) (cost=1.50 rows=5)
Performance improves by 1064.60% after PawSQL optimization.
🌐 About PawSQL
PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:
- PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
- PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
Contact Us
Email: service@pawsql.com
Website: https://www.pawsql.com