SQL优化技巧 - IN子查询优化

2024年1月15日 · 阅读需 11 分钟

Optimize your SQL Queries by Clicks!

问题定义

为了获取最近一年内有订单的用户信息，可以使用以下的三种写法去实现，它们在语义上是等价的。那它们的性能如何？如何对它们进行优化？这是本文讨论的问题。

Query1 - IN子查询(= ANY)

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

Query2 - EXISTS子查询

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)

Query3 - JOIN方式

select c.* from customer c join (select distinct o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year) o where o.o_custkey = c.custkey

IN子查询

IN子查询并不一定是非相关子查询，但是为了讨论方便，本文所述的IN子查询为非相关子查询。

Query1 - IN子查询(= ANY)

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

IN子查询的伪代码实现逻辑：

执行子查询语句，并得到结果集并去重，并将结果集存储在临时表中。
将主查询中的值逐一与子查询结果集中的值进行比较，如果匹配成功，则返回该行数据。
在第二步的比较时，
- 可以将子查询的结果集转化为一个哈希表，然后对于主查询中的每一行，都在哈希表中查找该行的值是否存在。
- 可以在上面建立一个唯一性索引，通过此索引和外表进行关联。不论适用哪一种方式，它的实际复杂度都为O(1)

时间复杂度

它的时间复杂度为**O(max(m,n)) + nlogn**，其中，m是外表的记录数，n为子查询的记录数。

可以看到，如果子查询的记录数比较大时，其时间复杂度较大，性能较差。

EXISTS子查询

Query2 - EXISTS子查询

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)

实现逻辑如下：

对于主查询中的每一行，都执行一次子查询。
如果子查询返回的结果集不为空，则保留该行数据。

时间复杂度

因此它的时间复杂度为O(m*n), 其中m为外表的记录数，n为子查询的访问的记录数

如果子查询中的orders没有索引，则n为orders表的行数，
如果orders上有筛选率比较大的索引，则n为索引所筛选出的记录数。

可以看出，如果EXISTS的子查询中有筛选率非常高的索引，使用EXISTS子查询的性能比较好。

Join方式

为了保证语义一致性，使用join方式需要先进行去重操作。

Query3 - JOIN方式

select c.* from customer c join (select distinct o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year) o where o.o_custkey = c.custkey

对比IN子查询的执行计划，可以看到Join方式就是IN子查询的执行计划的SQL化表达。

如果如果子查询中的查询列是唯一的，那么可以将其转换为内连接，从而获得更好的性能。

数据库中的IN子查询优化

事实上，MySQL和PostgreSQL都可以对IN和EXISTS采取最优的执行计划，

如果没有O_ORDERDATE上的索引，Query1和Query2在MySQL上的执行计划都是采用IN子查询的伪代码实现逻辑：

-> Nested loop inner join  (cost=19847117.66 rows=198449671)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
        -> Materialize with deduplication  (cost=22471.48..22471.48 rows=19949)
            -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=20476.61 rows=19949)
                -> Table scan on orders  (cost=20476.61 rows=199487)

如果在O_ORDERDATE建立一个索引，那么它们的执行计划都是采用EXISTS子查询的伪代码实现逻辑：

-> Nested loop semijoin  (cost=22777.29 rows=5705)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=0.92 rows=1)
        -> Index lookup on orders using o_idx_key (O_CUSTKEY=customer.C_CUSTKEY)  (cost=0.92 rows=6)

如果子查询中的查询列是唯一的，那么数据库会将其转换为内连接

譬如对于下面的SQL，

select * from orders where o_custkey in (select c_custkey from customer where c_phone like '139%')

MySQL的执行计划是这样的（PostgreSQL也是类似的）：

-> Nested loop inner join  (cost=3541.61 rows=6313)
    -> Filter: (customer.C_PHONE like '139%')  (cost=1148.89 rows=1099)
        -> Table scan on customer  (cost=1148.89 rows=9888)
    -> Index lookup on orders using idx_orders_ckey (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.60 rows=6)

可以看出，在MySQL和PostgreSQL数据库中，使用IN或是EXISTS的写法是等价的，数据库总是可以根据索引和统计信息采用最优的执行计划。

PawSQL中的IN子查询优化

PawSQL中会将IN子查询重写为EXISTS子查询或是内连接查询，从而帮助索引推荐引擎推荐合适的索引，促使优化器采用最优的执行计划。

IN子查询转换为EXISTS

原SQL

select *
from tpch.customer
where customer.c_custkey in (
           select orders.o_custkey
           from tpch.orders
           where orders.O_ORDERDATE >= current_date - interval '1' YEAR)

应用重写优化，转换为

select /*QB_1*/ *
from tpch.customer
where exists (select /*QB_2*/ orders.o_custkey
             from tpch.orders
             where orders.O_ORDERDATE >= current_date - interval '1' YEAR
         and orders.o_custkey = customer.c_custkey)

基于转换后的SQL，推荐索引

CREATE INDEX PAW_IDX1072908633 ON tpch.ORDERS(O_ORDERDATE,O_CUSTKEY);
-- 当QB_2中引用的表ORDERS作为驱动表时, 索引PAW_IDX1072908633可以被用来进行索引范围查找,过滤条件为(orders.O_ORDERDATE >= current_date - interval '1' YEAR); 该索引是个覆盖索引，可以避免回表.

CREATE INDEX PAW_IDX0031767282 ON tpch.ORDERS(O_CUSTKEY,O_ORDERDATE);
-- 当QB_2中引用的表ORDERS作为被驱动表时, 索引PAW_IDX0031767282可以被用来进行索引范围查找,过滤条件为(orders.o_custkey = customer.c_custkey and orders.O_ORDERDATE >= current_date - interval '1' YEAR); 该索引是个覆盖索引，可以避免回表.

性能验证

执行计划（优化前）

-> Nested loop inner join  (cost=65987720.69 rows=659855821)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=13874.51..13874.51 rows=66733)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=7201.21 rows=66733)
              -> Table scan on orders  (cost=7201.21 rows=200219)

执行计划（优化后）

-> Nested loop inner join  (cost=3771444.20 rows=37693056)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=1150.65..1150.65 rows=3812)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=769.45 rows=3812)
              -> Covering index range scan on orders using PAW_IDX1072908633 over ('2022-03-28' <= O_ORDERDATE)  (cost=769.45 rows=3812)

本次优化实施后，预计本SQL的性能将提升 1648.67%

IN子查询转换为内连接

条件：

子查询没有limit子句

子查询没有groupby子句

In子查询不是OR条件的一部分

IN列上无计算

对于外表的一条记录，子查询返回不超过一条数据

原SQL，c_custkey是customer表的主键，

   select *
   from tpch.orders
   where orders.o_custkey in (
              select customer.c_custkey
              from tpch.customer)

应用重写优化，转化为内连接

   select orders.*
   from tpch.orders, tpch.customer
   where customer.c_custkey = orders.o_custkey

基于转换后的SQL，推荐索引

   CREATE INDEX PAW_IDX0455857015 ON tpch.ORDERS(O_CUSTKEY,O_CLERK);
   -- 当ORDERS作为被驱动表时, 索引PAW_IDX0455857015可以被用来进行索引查找, 过滤条件为(customer.c_custkey = orders.o_custkey).

性能验证

执行计划（优化前）

-> Nested loop inner join  (cost=240790.71 rows=200219)
  -> Table scan on orders  (cost=20549.81 rows=200219)
  -> Single-row covering index lookup on customer using key_idx (C_CUSTKEY=orders.O_CUSTKEY)  (cost=1.00 rows=1)

执行计划（优化后）

-> Nested loop inner join  (cost=21289.23 rows=53135)
  -> Index scan on customer using key_idx  (cost=1149.80 rows=9888)
  -> Index lookup on orders using PAW_IDX0455857015 (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.50 rows=5)

本次优化实施后，预计本SQL的性能将提升 1064.60%

关于PawSQL

PawSQL专注数据库性能优化的自动化和智能化，支持MySQL，PostgreSQL，Opengauss等，提供的SQL优化产品包括

PawSQL Cloud，在线自动化SQL优化工具，支持SQL审查，智能查询重写、基于代价的索引推荐，适用于数据库管理员及数据应用开发人员，
PawSQL Advisor，IntelliJ 插件, 适用于数据应用开发人员，可以IDEA/DataGrip应用市场通过名称搜索“PawSQL Advisor”安装。
PawSQL Engine, 是PawSQL系列产品的后端优化引擎，可以独立安装部署，并通过http/json的接口提供SQL优化服务。PawSQL Engine以docker镜像的方式提供部署安装。

联系我们

邮件：service@pawsql.com
Twitter: https://twitter.com/pawsql
扫描关注PawSQL公众号

问题定义​

IN子查询​

IN子查询的伪代码实现逻辑：​

时间复杂度​

EXISTS子查询​

实现逻辑如下：​

时间复杂度​

Join方式​

数据库中的IN子查询优化​

PawSQL中的IN子查询优化​

IN子查询转换为EXISTS​

IN子查询转换为内连接​

关于PawSQL​

联系我们​

问题定义

IN子查询

IN子查询的伪代码实现逻辑：

时间复杂度

EXISTS子查询

实现逻辑如下：

时间复杂度

Join方式

数据库中的IN子查询优化

PawSQL中的IN子查询优化

IN子查询转换为EXISTS

IN子查询转换为内连接

关于PawSQL

联系我们