Mechanising the metatheory of SQL with nulls

PI: James Cheney, Co-I: Wilmer Ricciotti, University of Edinburgh

SQL is the standard query language used by the multi-billion-dollar relational database industry.
SQL semantics is notoriously subtle, written in natural language, and inconsistent across implementations.
Previous attempts to verify SQL transformations have ignored widely-used features, such as null values.
We present the first mechanised semantics that models these features, making it possible to formally verify that real query optimisers are correct for real-world databases.

The Structured Query Language, SQL, is by far the most common language used by relational databases, which are the basis of a multi-billion-dollar industry. The SQL standard is described by a large and comprehensive definition (ISO/IEC 9075:2016), based on natural language rather than a formal specification; due to the lack of an agreed-on formal semantics, commercial SQL implementations interpret the standard in different ways, so that the same query can yield different results on the same input data depending on the SQL system it is run on.

SQL systems first run a query optimiser which applies a set of rewrite rules to obtain an equivalent query that can be processed more efficiently. However, due to the lack of a well-understood formal semantics, it is very difficult to validate the soundness of such rewrite rules, and incorrect implementations are known in the literature. Bugs in query optimisers could lead to corruption or errors in critical data.

Among SQL’s features, its ability to deal with incomplete information, in the form of null values, accounts for a great deal of semantic complexity. To express uncertainty, logical predicates on tuples containing null values employ three truth values: true, false, and unknown. As a consequence, queries equivalent in the absence of null values can produce different results when applied to tables with incomplete data, as illustrated in the diagram below.

Although when this project was carried out there were some previous formalizations of SQL or relational query languages, all of them ignored null values, so they “proved” query equivalences that are unsound in the presence of these features. Our project built on a recent (on-paper) formal semantics for SQL with nulls by Guagliardo and Libkin, providing the validation of key metatheoretic properties in the Coq proof assistant. We view this as a first step towards a future in which query optimizers are certified. Our development, which can be publicly accessed at its GitHub repository provided us with a reliable reference which has guided us in our further work on querying databases in functional programming languages, published in leading conferences on programming languages and formal methods.

PUBLICATIONS.[1] Wilmer Ricciotti and James Cheney, “Strongly Normalizing Higher-Order Relational Queries.”, FSCD 2020. [2] Wilmer Ricciotti and James Cheney, “Query Lifting: Language-integrated query for heterogeneous nested collections.”, ESOP 2021. [3] Wilmer Ricciotti and James Cheney, Formalization of SQL with Nulls. J Autom Reasoning (2022).

RELATED GRANTS. Dr James Cheney, ERC Consolidator Grant: “Skye: Bridging theory and practice for scientific data curation”, 2016-2021, £1.75M.

IMPACT. “Database queries and query languages are widely used in industry, yet their implementations and optimisation rules are error-prone due to complications, such as the semantics of nulls. This can easily lead to subtle bugs in relational database engines or incorrect queries, and work on formalising the semantics of existing query languages, including the real-world semantics of nulls, is very important and likely to have a tangible impact on making systems more reliable. For example, optimisation rules proposed in Kim’s seminal work on query un-nesting contained the famous count bug, which led to incorrect query results in the presence of null values and could have been prevented if formal verification techniques were used.” – Matthias Brantner, Oracle –