SQL实战经验 - SQL查询中关于NULL的4个陷阱

2024年1月15日 · 阅读需 13 分钟

Optimize your SQL Queries by Clicks!

NULL值处理是应用开发人员最容易出错误的地方，主要的原因是大家习惯使用二元的布尔逻辑来思考判断，而数据库对于NULL值的处理逻辑是三值逻辑。事实上，数据库优化器中缺陷最多的其实也是和NULL值处理相关的逻辑。即使是有着几十年历史的DB2/Teradata等成熟的数据库软件，仍然有超过20%的缺陷和NULL处理相关。

本文深度解析NULL值陷阱出现的根本原因，总结了简单有效的判断逻辑；同时针对日常开发中四种常见陷阱，解释其适用条件及解决方案；最后介绍了PawSQL中和NULL值处理相关的优化规则及其实现原理。看完本篇文章，解决关于NULL值处理的所有疑问。

先说结论

下面的判断逻辑覆盖关系数据库中对于NULL值的所有场景，理解下面的逻辑，你就可以避免NULL陷阱。

NULL参与的所有的比较和算术运算符(>,=,<,!=,<=,>=,+,-,*,/)结果为unknown
unknown的逻辑运算(AND、OR、NOT）遵循三值运算的真值表
如果运算结果直接返回用户，使用NULL来标识unknown
如果运算结果是作为条件判断真假，那么需要通过三值逻辑进行运算，并最终通过以下映射逻辑确定整体判定
1. {false、unknown} -> false
2. {true} ->true
在UNION 或 INTERSECT等集合运算中，NULL 被视为彼此相等。

三值逻辑

在逻辑学中的三值逻辑（three-valued，也称为三元，或三价逻辑，有时缩写为3VL）是几个多值逻辑系统中的其中之一。有三种状态来表示真、假和一个表示不确定的第三值；这相对于基础的二元逻辑（比如布尔逻辑，它只提供真假两种状态）。

三值逻辑有三个真值（true、false、unknown），它的AND、OR、NOT运算的真值表如下：

点击并拖拽以移动

关于NULL的4个陷阱

1. 比较谓词与NULL

结论：=null并不能判断表达式为空, 判断表达式为空应该使用is null

假设有一张客户表，记录客户的编号，姓名和电话，其中电话和国家代码可以为空。

CREATE TABLE customer (
	c_custkey int4 NOT NULL,
	c_name varchar(25) NOT NULL,
	c_nationcode char(8) NULL,
	c_phone varchar(15) NULL,
    c_regdate date NULL,
	CONSTRAINT customer_pkey PRIMARY KEY (c_custkey)
);
insert into customer values(1, 'Randy', 'en', '13910010010', '20210911');
insert into customer values(2, 'Mandy', null, '13910010012', '20200211');
insert into customer values(3, 'Ray', 'us', null, '20180902');

通过下面的语句获取电话为空的客户是得不到的，它的结果永远为空

select * from customer where c_phone = null;

正确的写法应该是：

select * from customer where c_phone is null;

原因是：根据第一章节里的结论第一条c_phone = null的结果是unknown；然后根据结论里的4.1条unknown判断真假为false.

c_phone=null -> unknown -> false;

2. Case When与NULL

结论：case expr when nulll并不能判断表达式为空, 判断表达式为空应该case when expr is null 在where/having的筛选条件的错误写法还比较容易发现并纠正，而在藏在case 语句里使用null值判断就比较难以被发现，一般的SQL审核工具也难以处理。

譬如我们想要获取客户名称和国家的名称, 如果为空，默认是China：

select c_name, case c_nationcode 
				when 'us' then 'USA' 
				when 'cn' then 'China'
                when null then 'China' 
                else 'Others' end 
from customer

以上的语句并不会把国家代码为null转换为China；因为when null实际上是安c_nationcode=null进行运算的；正确的写法应该是：

select c_name, case when c_nationcode = 'us' then 'USA' 
					when c_nationcode = 'cn' then 'China'
                	when c_nationcode is null then 'China' 
                	else 'Others' end 
from customer

Note: PawSQL Advisor会分析输入SQL所有的case expr when null的条件，并重写为 case expr is null。

3. NOT IN 与NULL

结论：NOT IN 子查询谓词，如果子查询结果集有空值，NOT IN谓词总为假假设我们有一张订单表，其中客户编号和订单日期由于数据缺失可以为空

CREATE TABLE orders (
	o_orderkey int4 NOT NULL,
	o_custkey int4 NULL,
	o_orderdate date NULL,
	CONSTRAINT orders_pkey PRIMARY KEY (o_orderkey)
);
insert into orders values(1, 1, '2021-01-01');
insert into orders values(2, null, '2020-09-01');
insert into orders values(3, 3, null);

现在我们想要获取没有订单的客户（编号为2）进行营销，我们通常的写法可能是这样的，

select * from customer where c_custkey not in (select o_custkey from orders)

而事实上，上面的sql并没有返回我们预期的结果。原因就是子查询里的o_custkey有空值，而NOT IN的处理逻辑是这样的

c_custkey not in (1，3,null) → c_custkey<>1 and c_custkey<>3 and c_custkey<>null → c_custkey<>1 and c_custkey<>3 and unknown → unknown -> false

事实上，如果子查询的结果集里有空值，这个SQL永远返回为空。

正确的写法有两种：

在子查询里加上非空限制

select * from customer where c_custkey not in (select o_custkey from orders where o_custkey is not null)

将NOT IN子查询改写为not exists子查询

select * from customer where not exists (select o_custkey from orders where o_custkey=c_custkey)

Note: PawSQL的采用第一种方式来进行重写优化，但是它的功能更强大，PawSQL先判定子查询里的列是否可能为空的,如果可能为空，它才会向用户推荐重写后的SQL。

4. 修饰符ALL与NULL

ALL修饰的子查询谓词，如果子查询的结果集中有空值，则该谓词总为false 假设通过下面的sql来获取订单系统关闭后注册的用户

select * from customer where c_regdate > all(select o_orderdate from orders)

和上面的NOT IN类似的，由于子查询的结果中存在NULL，这个sql不会返回预期的结果。ALL 运算实际执行时也是与返回的结果集一一比较，然后进行AND的运算，最终结果unknown。而unknown作为条件进行评估是，结果为false.

正确的写法有两种：

在子查询里加上非空限制

select * from customer where c_regdate > all(select o_orderdate from orders where o_orderdate is not null)

将expr > all或expr >= all改写为聚集函数 expr > (select max()...)（如果expr < all或expr <= all，则改写为expr < (select min() ...)
```
select * from customer where c_regdate > (select max(o_custkey) from orders)
```

Note: PawSQL 的采用第二种方式来进行重写优化，原因是PawSQL对第二种重写后的sql还可以进一步通过重写（max/min子查询重写规则）进行性能优化。

PawSQL关于NULL的重写优化

PawSQL中关于NULL处理的重写优化规则有三个，对应与上面的四种情况。

规则编码	规则描述
UseEqual4NullRewrite	`=null`或是`case when null`l并不能判断表达式为空, 判断表达式为空应该使用`is null`
NotInNullableSubQueryRewrite	NOT IN子查询且选择列可空，这种情况不会返回任何记录
AllQualifierSubQueryRewrite	ALL修饰的子查询选择列可空，这种情况不会返回任何记录

PawSQL的功能更加强大，它会根据DDL中列的定义是否为nullable,以及作用在列上的运算是否会产生nullable的结果，来确定子查询里的查询列是否nullable的,如果可能为空，它才会向用户推荐重写后的SQL。我们通过下面7个案例来展示PawSQL是如何处理NULL相关的陷阱的。

customer表定义中列c_phone及c_nationkey可空。

案例 1: `= null` 重写为 `is null`

原SQL

select count(*) from customer where c_phone=null; 

重写后的SQL

select count(*) from customer where customer.c_phone is null;

案例2：`case expr when null` 重写为 `case when expr is null`

原SQL

select case c_phone when null then 1 when '139%' then 0 else -1 end from customer;

重写后的SQL

select case
         when c_phone is null then 1 
         when c_phone = '139%' then 0 
         else -1 
       end
  from customer;

案例 3: `c_nationkey`可空，增加 `c_nationkey is not null`条件

原SQL

select count(*) from nation where n_nationkey not in (select c_nationkey from customer);

重写后的SQL

select count(*) from nation where n_nationkey not in ( select c_nationkey from customer where c_nationkey is not null)

案例 4: `c_nationkey`可空，所以`max(c_nationkey)`可空, 增加 `c_nationkey is not null`条件

原SQL

select count(*) from nation where n_nationkey not in (select max(c_nationkey) from customer group by c_mktsegment);

重写后的SQL

select count(*) from nation where n_nationkey not in (
             select max(customer.c_nationkey)
               from customer
               where c_nationkey is not null
               group by c_mktsegment)

案例 5: `count(c_nationkey )`永远不为空，所以无需重写

select count(*) from nation where n_nationkey not in (select count(c_nationkey) from customer group by c_mktsegment);

案例 6: `c_name`不为空，但是`c_nationkey`可空，所以需增加 `c_nationkey is not null` 条件

原SQL

  select count(*) from nation where (n_name,n_nationkey) not in (select 'China',c_nationkey from customer);

重写后的SQL

select count(*) from nation where (n_name,n_nationkey) not in ( select 'China', c_nationkey from customer where customer.c_nationkey is not null)

案例 7: `c_nationkey`可空，所以重写为 `> (select min(c_nationkey) from customer)`

原SQL

select count(*) from customer where n_nationkey > all(select c_nationkey from customer);

重写后的SQL

select count(*) from customer where n_nationkey > (select min(c_nationkey) from customer)

关于PawSQL

PawSQL专注数据库性能优化的自动化和智能化，支持MySQL，PostgreSQL，Opengauss等，提供的SQL优化产品包括

PawSQL Cloud，在线自动化SQL优化工具，支持SQL审查，智能查询重写、基于代价的索引推荐，适用于数据库管理员及数据应用开发人员，
PawSQL Advisor IntelliJ 插件, 适用于数据应用开发人员，可以IDEA/DataGrip应用市场通过名称搜索“PawSQL Advisor”安装。

联系我们

邮件：service@pawsql.com

Twitter: https://twitter.com/pawsql

扫描关注PawSQL公众号

PawSQL

先说结论​

三值逻辑​

关于NULL的4个陷阱​

1. 比较谓词与NULL​

2. Case When与NULL​

3. NOT IN 与NULL​

4. 修饰符ALL与NULL​

PawSQL关于NULL的重写优化​

案例 1: = null 重写为 is null​

案例2：case expr when null 重写为 case when expr is null​

案例 3: c_nationkey可空，增加 c_nationkey is not null条件​

案例 4: c_nationkey可空，所以max(c_nationkey)可空, 增加 c_nationkey is not null条件​

案例 5: count(c_nationkey )永远不为空，所以无需重写​

案例 6: c_name不为空，但是c_nationkey可空，所以需增加 c_nationkey is not null 条件​

案例 7: c_nationkey可空，所以重写为 > (select min(c_nationkey) from customer)​

关于PawSQL​

联系我们​