Skip to main content

Optimize TPCH Queries with PawSQL

Overview

TPCH is a database test benchmark for decision support systems,developed by TPC (Transaction Processing Performance Council) in 1994. TPCH tests consist of a set of 22 complex business queries and some data maintenance queries to measure the performance of relational database management systems under complex analytical queries. These queries simulate real queries in a data warehouse of a wholesale supplier, covering aspects such as order analysis, supply chain management, and partial sales analysis.

Database vendors use TPCH tests to evaluate their products, assessing the impact of different servers, storage configurations, parallel processing, and resource scheduling on database performance.

In this article, we will evaluate the effectiveness of PawSQL's optimizations on the TPCH queries by only rewriting queries and recommending indexes without changing the hardware configuration and database parameters.

Benchmark Setup

To better simulate real scenarios, we have established benchmarks on a relatively high basis.

  1. We use the official TPC tool to generate test data;
  2. We use the official TPC tool to generate query statements, and ensure that each filtering condition can generate a large amount of data for subsequent aggregation operations;
  3. We have created indexes for primary keys and foreign keys to ensure the basic optimization of table joins.

Environment Config

Since this evaluation focuses on PawSQL's optimization recommendations, we reduce the hardware configuration and use the default configuration for database parameters.

  • Hardware: 2C4G
  • MySQL 8.0.17
  • Database Parameters (defaults)
  • PawSQL settings(defaults)

Data Volume

Table NameRow CountEngine
customer15000INNODB
lineitem600572INNODB
nation25INNODB
orders150000INNODB
part20000INNODB
partsupp80000INNODB
region5INNODB
supplier1000INNODB

Indexes Predefined

image-20231104122321645

Optimization Summary

  • For the 22 query statements in TPCH, 21 have optimization recommendations; recommended 8 query rewrites, recommended 21 indexes;
  • 18 queries with performance improvements from optimization recommendations, 86% of the total; 12 queries with performance improvements over 50%, 60% of the total;
  • Average performance improvement 276.40%, the maximum improvement for a single query is nearly 15 times.

Metrics of 22 Queries

The optimization results for each query can be viewed through the links. We will take Query-19 as an example to illustrate the capability of PawSQL's automatic optimization.

Query NumberBefore Opt (ms)After Opt (ms)RewriteIndexesPlan ChangedPerf Improved
Query-9385.40824.31201Yes1484.69%
Query-11345.28426.99401Yes1178.71%
Query-1927.8943.2612Yes753.64%
Query-431.4516.47202Yes385.51%
Query-17297.86575.64701Yes293.73%
Query-1425.0829.35412Yes168.07%
Query-1814.6765.76613Yes154.43%
Query-1183.74976.81312Yes139.21%
Query-6131.09564.1901Yes104.23%
Query-2076.32639.92202Yes91.19%
Query-5894.588468.44902Yes90.97%
Query-13154.0398.64513Yes56.15%
Query-1554.33542.19401Yes28.79%
Query-8161.492138.44601Yes16.65%
Query-1077.35570.00502Yes10.51%
Query-12480.841440.79512Yes9.09%
Query-7506.672477.10801Yes6.2%
Query-22369.674364.28901Yes1.48%
Query-3445.12445.95800No-0.19%
Query-16156.083184.7612Yes-15.51%
Query-2669.726855.37222Yes-21.7%
Query-211085.1563466.5701Yes-68.7%

Analysis of Q19 Optimization

Q19 in the TPCH test set queries the total discount revenue for all orders of 3 different types of parts shipped by air or carrier. The part selection considers specific brands, packaging and size ranges. This query exemplifies a report generated by a data mining tool. The characteristics of Q19 are that it has grouping, sorting, aggregation, and IN subquery operations combined with a 3-table join operation.

Original Q19 SQL:

select sum(lineitem.l_extendedprice * (1 - lineitem.l_discount)) as revenue
from lineitem, part
where part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#52'
and part.p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and lineitem.l_quantity >= 4
and lineitem.l_quantity <= 4 + 10
and part.p_size between 1 and 5
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'
or part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#11'
and part.p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and lineitem.l_quantity >= 18
and lineitem.l_quantity <= 18 + 10
and part.p_size between 1 and 10
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'
or part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#51'
and part.p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and lineitem.l_quantity >= 29
and lineitem.l_quantity <= 29 + 10
and part.p_size between 1 and 15
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'

PawSQL's Automatic Optimization for Q19

PawSQL's automated optimization is divided into two main parts:

  • Step 1: Apply the "Rewrite OR conditions to UNION" rule to rewrite it to UNION ALL

Note that PawSQL can determine if each OR branch is mutually exclusive and rewrite to UNION ALL to avoid a deduplication operation when they are. This is the case in this example.

The rewritten SQL is:

select sum(PawDT_1698819584185.l_extendedprice * (1 - PawDT_1698819584185.l_discount)) as revenue
from (
select /*QB_3*/ lineitem.l_discount, lineitem.l_extendedprice
from lineitem, part
where part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#51'
and part.p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and lineitem.l_quantity >= 29
and lineitem.l_quantity <= 29 + 10
and part.p_size between 1 and 15
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'

union all
select /*QB_2*/ lineitem.l_discount, lineitem.l_extendedprice
from lineitem, part
where part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#52'
and part.p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and lineitem.l_quantity >= 4
and lineitem.l_quantity <= 4 + 10
and part.p_size between 1 and 5
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'

union all
select /*QB_1*/ lineitem.l_discount, lineitem.l_extendedprice
from lineitem, part
where part.p_partkey = lineitem.l_partkey
and part.p_brand = 'Brand#11'
and part.p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and lineitem.l_quantity >= 18
and lineitem.l_quantity <= 18 + 10
and part.p_size between 1 and 10
and lineitem.l_shipmode in ('AIR', 'AIR REG')
and lineitem.l_shipinstruct = 'DELIVER IN PERSON'
) as PawDT_1698819584185
  • Step 2: Recommend indexes based on the rewritten SQL:
CREATE INDEX PAWSQL_IDX2001589724 ON tpch.part(P_BRAND,P_CONTAINER,P_PARTKEY);
-- PART in block QB_1 (DRIVE): MATCHING(part.p_brand = 'Brand#11'), RANGE_SCAN(part.p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK'))

CREATE INDEX PAWSQL_IDX1523603249 ON tpch.lineitem(L_PARTKEY,L_SHIPINSTRUCT,L_QUANTITY);
-- LINEITEM in block QB_1 (DRIVEN): MATCHING(part.p_partkey = lineitem.l_partkey and lineitem.l_shipinstruct = 'DELIVER IN PERSON'), RANGE_SCAN(lineitem.l_quantity >= 18)

Q19 Performance Verification

From verification output, the execution time decreased from 27.9ms to 2.26ms, and performance improved by 753%.

image-20231102151754468

About PawSQL

PawSQL focuses on automating and intelligencing database performance optimization, supporting MySQL, PostgreSQL, openGauss, Oracle, etc. The SQL optimization products provided include:

  • PawSQL Cloud, an online automated SQL optimization tool that supports SQL review, intelligent query rewriting, cost-based index recommendation, suitable for DBAs and data application developers.
  • PawSQL Advisor, an IntelliJ plugin suitable for data application developers, can be installed by searching for "PawSQL Advisor" in the IDEA/DataGrip marketplace.
  • PawSQL Engine is the backend optimization engine of PawSQL products, which can be installed and deployed independently as a docker image, and provide SQL optimization services through http/json interfaces.

Contact Us

Website: https://docs.pawsql.com

Email: service@pawsql.com

Twitter: https://twitter.com/pawsql

Scan to follow PawSQL on Wechat PawSQL