Implicit Data Type Conversion
Copyright © 2024 PawSQL
Problem
Let's say the O_ORDERDATE
column in TPCH.ORDERS
table is defined as VARCHAR(16)
, and there is an index ORDDATE_IDX
on ORDERS(O_ORDERDATE)
to accelerate the queries against this table if there is a condition on O_ORDERDATE
. If we want to count the number of orders of today using following SQL query.
select count(*) from ORDERS
where O_ORDERDATE = current_date();
Use explain
to get the query plan:
-> Filter: (orders.O_ORDERDATE = <cache>(curdate())) (cost=2621.05 rows=19938)
-> Table scan on ORDERS (cost=2621.05 rows=199378)
From the query plan, we can see the database optimizer uses a table scan instead of the index lookup as we expected. The query cost is high.
Diagnosis
Let's replace the function call current_date()
with a string value '2022-09-08'
,
select count(*) from ORDERS
where O_ORDERDATE = '2022-09-08';
Use explain
to get the query plan:
-> Aggregate: count(0) (cost=0.50 rows=1)
-> Filter: (orders.O_ORDERDATE = '2022-09-08') (cost=0.40 rows=1)
-> Covering index lookup on ORDERS using PAW_IDX1264424863 (O_ORDERDATE='2022-09-08') (cost=0.40 rows=1)
From the query plan , we can see the database optimizer chooses index lookup
using index ORDDATE_IDX
and the cost is less than 1% of the previous SQL query.
Diagnosis: The problem is because of the implicit data type conversion applied on O_ORDERDATE
and this conversion disables the index application. When the data types of expressions in a predicate are different, there will be some implicit data type convention operations during query execution (MySQL) 1. For implicit conventions, some times it happens to the constants in the predicate, sometimes it happens to the column in the predicate. When the convention is applied on the column, the candidate indexes will not be considered during query execution and leading to serious performance problems.
Solution
The straightful solution is to ensure the expressions used in filtering conditions have the same data type to the column's. For the problematic SQL query,
-
we can either define the
O_ORDERDATE
asDATE
type, or -
we can use type conversion function
cast
to make sure the filtering constant with the same data type of column.
But both solutions are very prone to mis-usage and time consuming, because the developers need to have the knowledge and check the table definitions from time to time. That is where the query tuning tool, such as PawSQL, fills in.
PawSQL auto-rewrites
There is a rewrite optimization rule [DiffDataTypeInPredicateRewrite] in PawSQL to automatically detect the data type mis-matching in SQL queries and provide automatic rewrite optmization.
Conditions
- The predicate is between a table column and simple value expressions2
- The table column has a different data type from the simple value expressions.
- Support following three type of predicates
- column operator expr
- column between expr1 and expr2
- column in (expr1, expr2, )
Example Cases - MySQL
Case 1: column operator expr
-
Input Query
select count(*) from ORDERS where O_ORDERDATE = current_date();
-
Query Plan for input query
-> Aggregate: count(orders.O_CUSTKEY) (cost=4614.83 rows=1)
-> Filter: (orders.O_ORDERDATE = <cache>(curdate())) (cost=2621.05 rows=19938)
-> Table scan on ORDERS (cost=2621.05 rows=199378) -
Optimized Query
select count(*)
from TPCH.ORDERS
where ORDERS.O_ORDERDATE = cast(current_date() as CHAR(16)); -
Query Plan for the optimized query
-> Aggregate: count(0) (cost=1.20 rows=1)
-> Filter: (orders.O_ORDERDATE = <cache>(cast(curdate() as char(16) charset utf8mb4))) (cost=1.10 rows=1)
-> Covering index lookup on ORDERS using o_orderdate_idx (O_ORDERDATE=cast(curdate() as char(16) charset utf8mb4)) (cost=1.10 rows=1)Optimization Explanation:
O_ORDERDATE
is type ofCHAR(16)
in table definition, andcurrent_date()
is type ofDATE
, so index onO_ORDERDATE
is not applicable, and the estimated cost is 4614.83. PawSQL rewritescurrent_date()
tocast(current_date() as CHAR(16))
, so the index onO_ORDERDATE
is chosen by the database optimizer and the cost is reduce to 1.20, which is less than 0.1%.
Case 2: column between expr1 and expr2
-
Input query
select count(*) from ORDERS where O_ORDERDATE between current_date() and DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY);
-
Query plan for input query
-> Aggregate: count(orders.O_ORDERKEY) (cost=5057.45 rows=1)
-> Filter: (orders.O_ORDERDATE between <cache>(curdate()) and <cache>((curdate() + interval 1 day))) (cost=2842.36 rows=22151)
-> Table scan on ORDERS (cost=2842.36 rows=199378) -
Optimized query
select count(*)
from TPCH.ORDERS
where ORDERS.O_ORDERDATE between cast(current_date() as CHAR(16)) and cast(DATE_ADD(CURRENT_DATE(), interval '1' DAY) as CHAR(16)) -
Query plan for optimized query
-> Aggregate: count(0) (cost=1.31 rows=1)
-> Filter: (orders.O_ORDERDATE between <cache>(cast(curdate() as char(16) charset utf8mb4)) and <cache>(cast((curdate() + interval '1' day) as char(16) charset utf8mb4))) (cost=1.21 rows=1)
-> Covering index range scan on ORDERS using o_orderdate_idx over ('2022-09-14' <= O_ORDERDATE <= '2022-09-15') (cost=1.21 rows=1)Optimization Explanation:
DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
is a nested function and it's final result is type ofDATE
, which is not compatible withCHAR(16)
. So it's rewritten to cast(DATE_ADD(CURRENT_DATE(), interval '1' DAY) as CHAR(16)). And the cost is reduced to 1.31 from 5057.45.
Case 3: column in (expr1, expr2, ...)
-
Input query
select count(*) from ORDERS where O_ORDERDATE in (current_date(), '2022-10-01');
-
Query Plan for input query
-> Aggregate: count(0) (cost=8602.39 rows=1)
-> Filter: (orders.O_ORDERDATE in (<cache>(curdate()),'2022-10-01')) (cost=4614.83 rows=39876)
-> Index scan on ORDERS using o_orderdate_idx (cost=4614.83 rows=199378) -
Optimized query
select count(*)
from TPCH.ORDERS
where ORDERS.O_ORDERDATE in (cast(current_date() as CHAR(16)), '2022-10-01'); -
Query plan for optimized query
-> Aggregate: count(0) (cost=1.62 rows=1)
-> Filter: (orders.O_ORDERDATE in (<cache>(cast(curdate() as char(16) charset utf8mb4)),'2022-10-01')) (cost=1.42 rows=2)
-> Covering index range scan on ORDERS using o_orderdate_idx over (O_ORDERDATE = '2022-09-14') OR (O_ORDERDATE = '2022-10-01') (cost=1.42 rows=2)Optimization Explanation:
current_date()
is written tocast(current_date() as CHAR(16))
, and'2022-10-01'
is type of character string, which is compatible with the type of columnO_ORDERDATE
. It is no need to rewrite it. And the cost is reduced to 1.62 from 8602.39.
Examples - PostgreSQL
select *
from customer
where c_custkey = 2.0;
Seq Scan on customer (cost=0.00..418.00 rows=50 width=181)
Filter: ((c_custkey)::numeric = 2.0)
select *
from customer
where c_custkey = 2;
Index Scan using customer_pkey on customer (cost=0.29..8.30 rows=1 width=181)
Index Cond: (c_custkey = 2)
About PawSQL
PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:
- PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
- PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
- PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, can be installed and deployed independently, and provides SQL optimization services through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.
Contact Us
Email: service@pawsql.com
Twitter: https://twitter.com/pawsql
Footnotes
-
For PostgreSQL, instead of applying data type conversion implicitly, there will be an error thrown by the database to notify users there should be some explicit type conventions to add. ↩
-
Simple value expressions refer to constants or expressions/functions which are evaluated to a constant, such as
current_date()
,1 + 2
,concate('first name', 'last name')
. ↩