避免采用随机函数排序

问题定义

我们有时候会使用以下查询语句获取数据集的随机样本。

select * from orders order by rand() limit 10; 

MySQL的函数rand或PostgreSQL的函数random会返回一个在范围0到1.0之间的随机浮点数。

它的执行计划如下，

MySQL Plan

-> Limit: 1 row(s)  (actual time=203.465..203.465 rows=1 loops=1)
    -> Sort: `rand()`, limit input to 1 row(s) per chunk  (actual time=203.464..203.464 rows=1 loops=1)
        -> Stream results  (cost=20540.71 rows=200128) (actual time=0.035..173.789 rows=201800 loops=1)
            -> Table scan on l  (cost=20540.71 rows=200128) (actual time=0.030..121.293 rows=201800 loops=1)

PostgreSQL Plan

Limit  (cost=4962.22..4962.22 rows=1 width=66) (actual time=58.080..58.082 rows=1 loops=1)
  ->  Sort  (cost=4962.22..5391.96 rows=171898 width=66) (actual time=58.078..58.079 rows=1 loops=1)
        Sort Key: (random())
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on orders l  (cost=0.00..4102.73 rows=171898 width=66) (actual time=0.010..26.818 rows=171898 loops=1)

如果customer表少于10,000行，则此方法效果很好。但是当您有1,000,000行时，排序的开销变得不可接受。原因很明显：我们将所有行排序，但只保留其中的一行。其实有更高效的方法来实现此需求。

解决方案

方案1 ：

如果一个数值类型的列上存在唯一索引,且该列的值分布均匀,可以通过查询重写使用更高效的方式来避免全表扫描和排序全部行。

   create unique index ord_idx_key on orders(o_orderkey)

重写后的SQL

select
    *
from
    orders
where
    o_orderkey >= (
    select
        floor( RAND() * ((select MAX(o_orderkey) from orders)-(select MIN(o_orderkey) from orders)) + (select MIN(o_orderkey) from orders)))
order by
    o_orderkey
limit 10;

MySQL的执行计划

-> Limit: 10 row(s)  (cost=0.03 rows=3) (actual time=76.877..124.157 rows=10 loops=1)
  -> Filter: (orders.O_ORDERKEY >= floor(((rand() * <cache>(((select #3) - (select #4)))) + (select #5))))  (cost=0.03 rows=3) (actual time=76.876..124.155 rows=10 loops=1)
      -> Index scan on orders using PAW_IDX0334337551  (cost=0.03 rows=10) (actual time=0.015..111.731 rows=106410 loops=1)
      -> Select #3 (subquery in condition; run only once)
          -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
      -> Select #4 (subquery in condition; run only once)
          -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
      -> Select #5 (subquery in condition; run only once)
          -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)

PostgreSQL的执行计划

Limit  (cost=1.27..1.44 rows=1 width=66) (actual time=27.840..27.847 rows=1 loops=1)
->  Nested Loop  (cost=1.27..9724.17 rows=57299 width=66) (actual time=27.838..27.845 rows=1 loops=1)
    Join Filter: ((t1.o_orderkey)::double precision >= (floor(((random() * (($1 - $3))::double precision) + ($5)::double precision))))
    Rows Removed by Join Filter: 91974
    ->  Index Scan using ord_idx_key on orders t1  (cost=0.29..6714.94 rows=171898 width=58) (actual time=0.022..9.406 rows=91975 loops=1)
    ->  Materialize  (cost=0.98..1.02 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=91975)
          ->  Result  (cost=0.98..1.01 rows=1 width=8) (actual time=0.112..0.117 rows=1 loops=1)
                InitPlan 2 (returns $1)
                  ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.080..0.082 rows=1 loops=1)
                        InitPlan 1 (returns $0)
                          ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.079..0.079 rows=1 loops=1)
                                ->  Index Only Scan Backward using ord_idx_key on orders  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.078..0.079 rows=1 loops=1)
                                      Index Cond: (o_orderkey IS NOT NULL)
                                      Heap Fetches: 0
                InitPlan 4 (returns $3)
                  ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=1)
                        InitPlan 3 (returns $2)
                          ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                ->  Index Only Scan using ord_idx_key on orders orders_1  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                      Index Cond: (o_orderkey IS NOT NULL)
                                      Heap Fetches: 0
                InitPlan 6 (returns $5)
                  ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                        InitPlan 5 (returns $4)
                          ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                ->  Index Only Scan using ord_idx_key on orders orders_2  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                      Index Cond: (o_orderkey IS NOT NULL)
                                      Heap Fetches: 0

方案2：

如果上述条件不满足, 我们需要创建一个映射表来进行优化

create table orders_key_map ( row_id int not NULL primary key, o_orderkey int not null);
SET @id = 0;
INSERT INTO orders_key_map SELECT @id := @id + 1, o_orderkey FROM orders;

> select * from orders_key_map;
+--------+-----------+
| row_id | o_orderkey |
+--------+-----------+
|      1 |         100 |
|      2 |         102 |
|      3 |         300 |
|      4 |         833 |
|      5 |        1116 |
+--------+-----------+

之后我们可以使用如下SQL来获取 orders 表的随机样本:

select
    *
from
    orders o, orders_key_map m
where
    o.o_orderkey = m.o_orderkey
    m.row_id >= (
    select
        floor( RAND() * ((select MAX(row_id) from orders_key_map)-(select MIN(row_id) from orders_key_map)) + (select MIN(row_id) from orders_key_map)))
order by
    row_id
limit 1;

关于PawSQL

PawSQL专注数据库性能优化的自动化和智能化，支持MySQL，PostgreSQL，openGauss，Oracle等，提供的SQL优化产品包括

PawSQL Cloud，在线自动化SQL优化工具，支持SQL审查，智能查询重写、基于代价的索引推荐，适用于数据库管理员及数据应用开发人员，
PawSQL Advisor，IntelliJ 插件, 适用于数据应用开发人员，可以IDEA/DataGrip应用市场通过名称搜索“PawSQL Advisor”安装。
PawSQL Engine, 是PawSQL系列产品的后端优化引擎，可以独立安装部署，并通过http/json的接口提供SQL优化服务。PawSQL Engine以docker镜像的方式提供部署安装。

联系我们

邮件：service@pawsql.com
Twitter: https://twitter.com/pawsql
扫描关注PawSQL公众号

避免采用随机函数排序

问题定义​

解决方案​

方案1 ：​

方案2：​

关于PawSQL​

联系我们​