Skip to main content

Reasoning LLM Evaluation in SQL Semantics:A Comparative Analysis of DeepSeek r1, GPT-4o, and Claude 3.7 Sonnet

· 5 min read

Introduction

As Large Language Models (LLMs) become increasingly prevalent in technical domains, assessing their capabilities in specific technical tasks grows more important. This research focuses on evaluating three leading reasoning models—DeepSeek r1, GPT-4o, and Claude 3.7 Sonnet—specifically examining their abilities in SQL comprehension and analysis, particularly in determining SQL query equivalence.

Evaluation Methodology

We designed a challenging test case [note 1]: using an original SQL query from the TPC-H benchmark, alongside a rewritten version optimized through the professional SQL optimization tool PawSQL. While these queries exhibit significant syntactic differences, they are semantically equivalent. We tasked the four models with analyzing whether these queries are equivalent and evaluated their reasoning processes and conclusions.

Test Case Link:https://www.pawsql.com/statement/1897947325217640449(https://www.pawsql.com/statement/1897947325217640449)

Test Queries

Original SQL Query:

select ps.ps_partkey,
sum(ps.ps_supplycost * ps.ps_availqty) as value
from partsupp as ps, supplier, nation
where ps.ps_suppkey = supplier.s_suppkey
and supplier.s_nationkey = nation.n_nationkey
and nation.n_name = 'JAPAN'
group by ps.ps_partkey
having sum(ps.ps_supplycost * ps.ps_availqty)
> (select sum(ps_supplycost * ps_availqty) * 0.0001
from partsupp, supplier, nation
where ps_partkey = ps.ps_partkey
and ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'JAPAN')
order by value desc

Optimized SQL Query: Scalar Subquery Decorrelation

select ps.ps_partkey,
sum(ps.ps_supplycost * ps.ps_availqty) as value
from partsupp as ps, supplier, nation, (
select ps_partkey, sum(ps_supplycost * ps_availqty) * 0.0001 as null_
from partsupp, supplier, nation
where ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'JAPAN'
group by ps_partkey) as SQ
where ps.ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'JAPAN'
and SQ.ps_partkey = ps.ps_partkey
group by ps.ps_partkey
having sum(ps.ps_supplycost * ps.ps_availqty) > sum(SQ.null_)
order by value desc

Model Analysis Results

DeepSeek r1 (Reasoning Time: 59 seconds)

DeepSeek r1 took approximately 59 seconds for reasoning and correctly determined that the two queries are logically equivalent.

DeepSeek r1

Key points in its analysis include:

  1. Query Logic Comparison: Distinguished between the first query using a correlated subquery and the second query using an inline view for pre-calculation.
  2. Equivalence Verification: Provided detailed analysis of subquery scopes, calculation consistency, and filtering conditions, noting that sum(SQ.null_) in the second query is actually equivalent to using SQ.null_ directly since each ps_partkey is unique within SQ.
  3. Performance Differences: Identified that the second query avoids multiple scans through pre-calculation and join optimization, likely resulting in improved execution efficiency. PawSQL's optimization results indeed confirm that overall performance improved from 192 seconds to 0.065 seconds, representing a performance boost of over 1000 times [1].

GPT-4o (Reasoning Time: 26 seconds)

GPT-4o took 26 seconds for reasoning and also correctly concluded that the two queries are equivalent.

GPT-4o

Its analysis focused on:

  1. Subquery Approach Differences: Identified that the first SQL uses a correlated subquery while the second SQL places the calculation logic in an inline subquery.
  2. Execution Method Differences: Pointed out that despite differences in execution plans, the semantics and final results of both queries are identical.

GPT-4o's analysis was concise and direct, capturing the essence of query optimization: improving performance by pre-calculating aggregate values in the derived table while maintaining result consistency.

Claude 3.7 Sonnet

Claude 3.7 Sonnet incorrectly concluded that the queries are not equivalent.

Claude 3.7 Sonnet

Its analysis focused on:

  1. Subquery Processing Methods: Correctly identified differences in subquery implementation between the two queries.
  2. HAVING Clause Condition Comparison: Mistakenly believed that sum(SQ.null_) in the second query would reaggregate the derived table results, changing the comparison semantics.
  3. Modification Suggestions: Proposed modifying the HAVING condition in the second query to directly use SQ.null_ rather than sum(SQ.null_).

Claude's analysis revealed limitations in understanding SQL aggregate function behavior in different contexts.

Evaluation Conclusions

  1. DeepSeek r1 and GPT-4o demonstrated strong SQL semantic comprehension, not only providing correct conclusions but also accurately explaining semantic preservation during optimization.
  2. Claude 3.7 Sonnet showed notable limitations in handling complex SQL transformations and subquery optimizations, particularly in understanding interactions between aggregate functions and JOIN operations.
  3. Model performance differences likely reflect variations in the quality and quantity of SQL-related content in training data, as well as differences in model understanding of database query execution mechanisms.

Implications

Despite significant progress in LLMs' SQL understanding and analysis capabilities, this research highlights the irreplaceable nature of professional SQL optimization tools (like PawSQL) in practical database optimization work.

  1. Optimization Accuracy and Reliability: Professional tools based on database theory and practice ensure semantic equivalence in query transformations, while LLMs still exhibit uncertainty in understanding complex SQL queries.
  2. Systematic and Comprehensive Performance Optimization: Professional tools can optimize based on database engine characteristics and statistics, considering index usage, join order, predicate pushdown, and other multi-dimensional optimizations to generate predictable, consistent optimization results.
  3. Stability and Controllability in Production Environments: In production environments, SQL query optimization must consider predictability and consistency, compatibility with existing application systems, and robust handling of edge cases. Professional tools offer higher reliability and controllability in these aspects.
  4. Tool Collaboration: Developing methods for LLMs to work collaboratively with professional SQL optimization tools may be an effective approach to improving database optimization efficiency and reliability.

Summary

Through SQL equivalence analysis tasks, this study reveals capability differences among current top-tier reasoning models when addressing problems in specialized technical domains. While some models demonstrate SQL semantic understanding abilities, the value of professional SQL optimization tools in actual production environments remains irreplaceable. As technology evolves, combining LLMs with professional tools may become the best practice for future database optimization.