Skip to main content

Implicit Data Type Conversion

· 7 min read
PawSQL Team
Optimize your SQL Queries by Clicks!

Copyright © 2024 PawSQL

Problem

Let's say the O_ORDERDATE column in TPCH.ORDERS table is defined as VARCHAR(16), and there is an index ORDDATE_IDX on ORDERS(O_ORDERDATE) to accelerate the queries against this table if there is a condition on O_ORDERDATE. If we want to count the number of orders of today using following SQL query.

select count(*) from ORDERS 
where O_ORDERDATE = current_date();

Use explain to get the query plan:

-> Filter: (orders.O_ORDERDATE = <cache>(curdate()))  (cost=2621.05 rows=19938)
-> Table scan on ORDERS (cost=2621.05 rows=199378)

From the query plan, we can see the database optimizer uses a table scan instead of the index lookup as we expected. The query cost is high.

Diagnosis

Let's replace the function call current_date() with a string value '2022-09-08',

select count(*) from ORDERS 
where O_ORDERDATE = '2022-09-08';

Use explain to get the query plan:

-> Aggregate: count(0)  (cost=0.50 rows=1)
-> Filter: (orders.O_ORDERDATE = '2022-09-08') (cost=0.40 rows=1)
-> Covering index lookup on ORDERS using PAW_IDX1264424863 (O_ORDERDATE='2022-09-08') (cost=0.40 rows=1)

From the query plan , we can see the database optimizer chooses index lookup using index ORDDATE_IDX and the cost is less than 1% of the previous SQL query.

Diagnosis: The problem is because of the implicit data type conversion applied on O_ORDERDATE and this conversion disables the index application. When the data types of expressions in a predicate are different, there will be some implicit data type convention operations during query execution (MySQL) 1. For implicit conventions, some times it happens to the constants in the predicate, sometimes it happens to the column in the predicate. When the convention is applied on the column, the candidate indexes will not be considered during query execution and leading to serious performance problems.

Solution

The straightful solution is to ensure the expressions used in filtering conditions have the same data type to the column's. For the problematic SQL query,

  1. we can either define the O_ORDERDATE as DATE type, or

  2. we can use type conversion function cast to make sure the filtering constant with the same data type of column.

But both solutions are very prone to mis-usage and time consuming, because the developers need to have the knowledge and check the table definitions from time to time. That is where the query tuning tool, such as PawSQL, fills in.

PawSQL auto-rewrites

There is a rewrite optimization rule [DiffDataTypeInPredicateRewrite] in PawSQL to automatically detect the data type mis-matching in SQL queries and provide automatic rewrite optmization.

Conditions

  1. The predicate is between a table column and simple value expressions2
  2. The table column has a different data type from the simple value expressions.
  3. Support following three type of predicates
    • column operator expr
    • column between expr1 and expr2
    • column in (expr1, expr2, )

Example Cases - MySQL

Case 1: column operator expr

  • Input Query

     select count(*) from ORDERS where O_ORDERDATE = current_date();
  • Query Plan for input query

     -> Aggregate: count(orders.O_CUSTKEY)  (cost=4614.83 rows=1)
    -> Filter: (orders.O_ORDERDATE = <cache>(curdate())) (cost=2621.05 rows=19938)
    -> Table scan on ORDERS (cost=2621.05 rows=199378)
  • Optimized Query

     select count(*)
    from TPCH.ORDERS
    where ORDERS.O_ORDERDATE = cast(current_date() as CHAR(16));
  • Query Plan for the optimized query

     -> Aggregate: count(0)  (cost=1.20 rows=1)
    -> Filter: (orders.O_ORDERDATE = <cache>(cast(curdate() as char(16) charset utf8mb4))) (cost=1.10 rows=1)
    -> Covering index lookup on ORDERS using o_orderdate_idx (O_ORDERDATE=cast(curdate() as char(16) charset utf8mb4)) (cost=1.10 rows=1)

    Optimization Explanation: O_ORDERDATE is type of CHAR(16) in table definition, and current_date() is type of DATE, so index on O_ORDERDATE is not applicable, and the estimated cost is 4614.83. PawSQL rewrites current_date() to cast(current_date() as CHAR(16)), so the index on O_ORDERDATE is chosen by the database optimizer and the cost is reduce to 1.20, which is less than 0.1%.

Case 2: column between expr1 and expr2

  • Input query

     select count(*) from ORDERS where O_ORDERDATE between current_date() and DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY);
  • Query plan for input query

     -> Aggregate: count(orders.O_ORDERKEY)  (cost=5057.45 rows=1)
    -> Filter: (orders.O_ORDERDATE between <cache>(curdate()) and <cache>((curdate() + interval 1 day))) (cost=2842.36 rows=22151)
    -> Table scan on ORDERS (cost=2842.36 rows=199378)
  • Optimized query

       select count(*)
    from TPCH.ORDERS
    where ORDERS.O_ORDERDATE between cast(current_date() as CHAR(16)) and cast(DATE_ADD(CURRENT_DATE(), interval '1' DAY) as CHAR(16))
  • Query plan for optimized query

     -> Aggregate: count(0)  (cost=1.31 rows=1)
    -> Filter: (orders.O_ORDERDATE between <cache>(cast(curdate() as char(16) charset utf8mb4)) and <cache>(cast((curdate() + interval '1' day) as char(16) charset utf8mb4))) (cost=1.21 rows=1)
    -> Covering index range scan on ORDERS using o_orderdate_idx over ('2022-09-14' <= O_ORDERDATE <= '2022-09-15') (cost=1.21 rows=1)

    Optimization Explanation: DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY) is a nested function and it's final result is type of DATE, which is not compatible with CHAR(16). So it's rewritten to cast(DATE_ADD(CURRENT_DATE(), interval '1' DAY) as CHAR(16)). And the cost is reduced to 1.31 from 5057.45.

Case 3: column in (expr1, expr2, ...)

  • Input query

     select count(*) from ORDERS where O_ORDERDATE in (current_date(), '2022-10-01');
  • Query Plan for input query

     -> Aggregate: count(0)  (cost=8602.39 rows=1)
    -> Filter: (orders.O_ORDERDATE in (<cache>(curdate()),'2022-10-01')) (cost=4614.83 rows=39876)
    -> Index scan on ORDERS using o_orderdate_idx (cost=4614.83 rows=199378)
  • Optimized query

     select count(*)
    from TPCH.ORDERS
    where ORDERS.O_ORDERDATE in (cast(current_date() as CHAR(16)), '2022-10-01');
  • Query plan for optimized query

     -> Aggregate: count(0)  (cost=1.62 rows=1)
    -> Filter: (orders.O_ORDERDATE in (<cache>(cast(curdate() as char(16) charset utf8mb4)),'2022-10-01')) (cost=1.42 rows=2)
    -> Covering index range scan on ORDERS using o_orderdate_idx over (O_ORDERDATE = '2022-09-14') OR (O_ORDERDATE = '2022-10-01') (cost=1.42 rows=2)

    Optimization Explanation: current_date() is written to cast(current_date() as CHAR(16)), and '2022-10-01' is type of character string, which is compatible with the type of column O_ORDERDATE. It is no need to rewrite it. And the cost is reduced to 1.62 from 8602.39.

Examples - PostgreSQL

select *
from customer
where c_custkey = 2.0;
Seq Scan on customer  (cost=0.00..418.00 rows=50 width=181)
Filter: ((c_custkey)::numeric = 2.0)
select *
from customer
where c_custkey = 2;
Index Scan using customer_pkey on customer  (cost=0.29..8.30 rows=1 width=181)
Index Cond: (c_custkey = 2)

About PawSQL

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

  • PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
  • PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
  • PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, can be installed and deployed independently, and provides SQL optimization services through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.

Contact Us

Email: service@pawsql.com

Twitter: https://twitter.com/pawsql

Footnotes

  1. For PostgreSQL, instead of applying data type conversion implicitly, there will be an error thrown by the database to notify users there should be some explicit type conventions to add.

  2. Simple value expressions refer to constants or expressions/functions which are evaluated to a constant, such as current_date(), 1 + 2, concate('first name', 'last name').