Client SQL building library for Trino #10410
OneRaynyDay
Treeno [1] - a tree representation of a Trino SQL query in Python. Link to project
Problem Statement
ANSI SQL is a large DSL that lets users express a wide variety of transformations over existing tables and subqueries. As a result, every large company has its own "standard library" of SQL functionality specialized for its use cases. Some functions in said standard library may include real-valued histograms using Trino's width_bucket, quantile cuts, random sampling (beyond TABLESAMPLE BERNOULLI), various aggregation and summarization utilities, and much more. This not only reduces duplication of code, but also allows data scientists to focus more on the analysis and less on the SQL query writing, which may include complicated CTEs, subqueries, and (potentially correlated) joins.
I strongly believe in Trino's success, so I made a project that allows flexible SQL building in Python. I hope the project will not only help existing power users compose more complicated queries, but also encourage more users to use Trino.
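To make the "standard library" idea concrete, here is a rough sketch (not taken from any existing codebase) of the kind of helper such a library might contain, built on Trino's width_bucket function; the table and column names are placeholders.

```python
# Hypothetical helper, shown only to illustrate the kind of "standard library"
# function described above: a real-valued histogram over a column, built on
# Trino's width_bucket. Table/column names are placeholders.
def histogram_sql(table: str, column: str, low: float, high: float, buckets: int) -> str:
    return f"""
    SELECT width_bucket({column}, {low}, {high}, {buckets}) AS bucket,
           COUNT(*) AS freq
      FROM {table}
     GROUP BY 1
     ORDER BY 1
    """


print(histogram_sql("events", "latency_ms", 0.0, 1000.0, 20))
```

Maintaining many such helpers as raw string templates gets unwieldy quickly, which motivates the query builder approach discussed below.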
Existing Solutions
There have been three main ways (that I know of) in which this problem has been tackled: SQL templating, ORMs, and programmatic query builders.
Arguments against templating
The main argument against basic SQL templating is its lack of flexibility compared to a programmatic query builder. For every slightly different piece of functionality that can't easily be expressed by adding a new template argument, we would need to create a new template. Empirically, the variety, maturity, and popularity (using GitHub stars as a heuristic) of query building interfaces surpasses that of templating.
Arguments against ORM
There are many problems with traditional ORMs, such as the N+1 problem and overfetching, which are especially bad at the scale of data that Trino can crunch compared to typical OLTP databases such as MySQL and PostgreSQL.
We should separate the core query building not only from the actual execution but also from metadata retrieval. As an example, many companies have suffered performance hits from the Hive metastore eagerly locking on metadata fetches, which would be necessary in an ORM model (other pain points of the metastore can be found here). We want the option to supply the query builder with such metadata, but it shouldn't be a required component because of the variety of connectors we support, each of which has different performance characteristics for both queries and metadata retrieval.
As a result, I decided to make treeno, a query building library aimed at compatibility ONLY with Trino that provides a suite of helpful utilities.
Why Treeno?
At work, we use pypika for its query building functionality, and have had success using it to construct a rudimentary standard library for data analysis. However, we've found several reasons to discontinue our investment in it [2]:
From all of these points, the most concerning ones are (6) and (7), which are also present in other query building libraries that I have found. I will outline some examples of each issue in the appendix if you're not yet convinced.
Treeno attempts to address all of the issues above, and more.
Non-goals of Treeno
Treeno is just a query building interface with some static analysis. It's not meant to do anything smarter than that. Here is a partial list of non-goals:
Design
Treeno is intentionally written in Python. Python is the most common language for data analysis, with a large ecosystem involving pandas, ipython, jupyter, numpy, and the rest of the scientific suite. I have not yet considered support for other languages such as Julia and R.
Treeno is built with a focus on interoperability with existing patterns. This means it was built around the idea that users should not have to learn how to construct Treeno objects in order to construct their queries - they should be able to write raw SQL and have it be convertible to Treeno nodes if they wish.
Treeno is (mostly) faithful to the Trino grammar. Treeno uses the Python 3 ANTLR runtime to parse Trino's grammar and create visitors that build a representation of the query similar (but not strictly adhering) to that of Trino's execution engine. The representation is meant to be partial in that we don't fetch metadata immediately. I also aim to simplify the user experience by merging some of the ANTLR grammar rules into a single object (i.e. windowSpecification, windowFrame, and frameExtent are all coalesced into a single Window object). For the complicated yet commonly used parts of SQL queries (such as CTEs, subqueries, joins, etc.), I will provide an even higher level abstraction for users to build their queries with (this is a TODO). I also make slight modifications to the grammar, which are outlined in the appendix.
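To give a feel for what "coalescing grammar rules" means, here is an illustrative sketch (not Treeno's actual class; the names and fields are assumptions) of a single Window object covering windowSpecification, windowFrame, and frameExtent:

```python
# Illustrative sketch only - not Treeno's actual class definition. It shows
# the idea of folding the windowSpecification, windowFrame and frameExtent
# grammar rules into one object; the names and fields here are assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FrameType(Enum):
    RANGE = "RANGE"
    ROWS = "ROWS"
    GROUPS = "GROUPS"


@dataclass
class Window:
    partition_by: List[str] = field(default_factory=list)  # from windowSpecification
    order_by: List[str] = field(default_factory=list)      # from windowSpecification
    frame_type: Optional[FrameType] = None                 # from windowFrame
    start_bound: str = "UNBOUNDED PRECEDING"               # from frameExtent
    end_bound: str = "CURRENT ROW"                         # from frameExtent
```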
Examples
Queries
A typical query can be printed either in a formatted manner or in a compressed default form:
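The sketch below shows the intended round trip; the entry point and method names are assumptions rather than confirmed Treeno API.

```python
# Hypothetical sketch of the round trip described above. The import path and
# the sql()/pretty argument are assumptions, not confirmed Treeno API.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql("SELECT a, b FROM t WHERE a > 1 AND c IS NOT NULL")
print(query.sql())             # compressed default, e.g. one line
print(query.sql(pretty=True))  # formatted output following the style guide below
```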
The style guide I adhered to is: https://www.sqlstyle.guide/, but this can be easily changed later.
Support for basic window functions:
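A hypothetical snippet, reusing the assumed helper from above:

```python
# Hypothetical sketch; query_from_sql is an assumed entry point.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql(
    "SELECT a, SUM(b) OVER "
    "(PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "
    "FROM t"
)
print(query.sql())
```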
Support for subqueries:
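Another hypothetical snippet, same assumptions as above:

```python
# Hypothetical sketch: subqueries in both the FROM clause and the WHERE clause.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql(
    "SELECT x FROM (SELECT a AS x FROM t) WHERE x IN (SELECT a FROM u)"
)
print(query.sql())
```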
Support for all operator precedence (PEMDAS) according to the grammar:
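A hypothetical snippet for precedence, same assumptions as above:

```python
# Hypothetical sketch: arithmetic/boolean precedence follows the grammar, so
# parentheses only need to appear where they change the meaning.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql("SELECT (a + b) * c - d / e FROM t WHERE NOT f OR g AND h")
print(query.sql())
```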
Treeno is also type-aware. I plan to use this to do partial type-check validation on the tree so we don't have to hit the coordinator as often with failed queries. This is a TODO quality-of-life feature, and only Literals have a known type for now:
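A hypothetical snippet; the expression-level helper and the data_type attribute are assumed names (a standalone-expression entry point is described in the next section):

```python
# Hypothetical sketch: only literals carry a known type right now; other
# expressions would be typed as UNKNOWN until partial type checking lands.
# Both expression_from_sql and data_type are assumed names.
from treeno.builder.convert import expression_from_sql  # assumed import path

lit = expression_from_sql("3.14")
print(lit.data_type)   # some DECIMAL type for this literal
```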
Standalone expressions and types
Treeno also supports standalone expressions:
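A hypothetical snippet, assuming an expression-level entry point analogous to the query-level one above:

```python
# Hypothetical sketch; expression_from_sql is an assumed name.
from treeno.builder.convert import expression_from_sql  # assumed import path

expr = expression_from_sql("x BETWEEN 1 AND 10 AND y LIKE 'foo%'")
print(expr.sql())
```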
... and data types:
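And a hypothetical snippet for standalone data types:

```python
# Hypothetical sketch; type_from_sql is an assumed name.
from treeno.builder.convert import type_from_sql  # assumed import path

dtype = type_from_sql("ROW(x BIGINT, y ARRAY(VARCHAR))")
print(dtype.sql())
```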
Other nice-to-haves
Treeno also has some educational tools to better understand the structure of queries:
What needs to be done?
In order of significance:
User experience: Currently the library is not friendly for programmatic SQL building. I plan to add a layer of helper builder functions to make this much simpler. Whether it should mostly follow sqlalchemy's or pypika's syntax is still up for debate.
Feature-completeness: I still need to finish a few features of Trino's grammar, namely patternRecognition, primaryExpression, table union/intersect, and a lot of built-in functions. On top of that, pretty SQL printing still needs to be added for a couple of complex expressions.
Additional cool features: I would also like to perform partial type checking as we build up the tree, using the UNKNOWN type where we're unsure, and erroring with an intuitive diagnostic message when the input type is defined and doesn't comply.
Some caveats of the design
The biggest expected caveat is outlined here. The Trino SQL grammar file is constantly being updated, and we would like to add new features as they come in. This means some parts of the visitor pattern need to be rewritten as grammar changes land in the main Trino repository. On one hand, we all love new SQL features being implemented; on the other hand, that makes this project a moving target [3].
Another concern with the design is its opinionated SQL representation, which can be fixed later on. For example, we always double-quote column names that contain . and other non-alphanumeric characters (e.g. a column named a.b is rendered as "a.b"). Note that this particular "feature" is also enforced in Pypika. You also cannot perfectly recover the original SQL string character-by-character, but the semantic definition is still the same. Note that by using the ANTLR parser directly, we don't have a way of recovering SQL comments either.
Future Work
In the long term, I would also like to contribute/upstream a set of common functions built on this library for other Trino users and help speed up their data analysis. With a static SQL query builder, we can also push more verification to the client side and, as a result, decrease the load on the coordinator (which is currently designed as a single point of failure). With the query builder we can now represent SQL as Python objects. This means Treeno can be used for data lineage tools, which allow infra engineers to better understand the relative priority/popularity of different data sources.
Human concerns
Technical design aside, I'm sure whoever is reading this also considers the human side of the problem: what is a reasonable timeline for this library to be feature-complete?
I don't have an answer for this, as the target SQL grammar is always moving. I also have a day job that requires my full attention on weekdays (but it would be cool if I could also do this at my day job!), and it doesn't only involve working with Trino. I plan to get an acceptable version of Treeno out by early next year and will spend time over weekends adding new features and fixing old bugs. I would also love pull requests for any feature others are particularly passionate about, as my decision to implement features depends on the use cases I see in my day-to-day work.
Appendix
Outlining Pypika issues
Pypika doesn't understand/incorrectly interprets basic SQL constructs
Pypika is missing the following functionalities (and a lot more):
Pypika's orderby accepts a string which automatically gets converted to an object representing an input table's column name (Field), but it should actually refer to the OUTPUT's column name, since ORDER BY, OFFSET, LIMIT, etc. operate on the output of a query.
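A minimal pypika snippet illustrating the point (the behavior is as described in this post; worth re-verifying against your pypika version):

```python
# The string passed to orderby is wrapped as a Field, i.e. treated like an
# input column of "events", even though event_time only exists as an output
# alias of the SELECT list. Behavior as described above; verify per version.
from pypika import Query, Table

t = Table("events")
q = Query.from_(t).select(t.ts.as_("event_time")).orderby("event_time")
print(q)
```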
Pypika is no longer being developed
At the time of writing: 95 issues, 10 pull requests (one of them being mine), and no new commits in the past few months.
Pypika has incorrect Python typing, which makes it hard to work with a partially typed codebase (checked by mypy)
Pypika's inheritance is broken, for one: it should be using TypeVars bound to generic types in order to prevent type erasure in the inheritance hierarchy. See here as an example. This should be -> TypeVar("GenericSelectable", bound="Selectable").
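A generic illustration (not pypika's actual code) of why the bound TypeVar matters:

```python
# Not pypika's actual code - a minimal illustration of the bound-TypeVar
# pattern. Returning the TypeVar preserves the concrete subclass type, so
# chaining off a Table still type-checks as a Table instead of being erased
# to the base Selectable.
from typing import TypeVar

GenericSelectable = TypeVar("GenericSelectable", bound="Selectable")


class Selectable:
    alias: str = ""

    def as_(self: GenericSelectable, alias: str) -> GenericSelectable:
        self.alias = alias
        return self


class Table(Selectable):
    def get_sql(self) -> str:
        return f'"{self.alias or "t"}"'


t: Table = Table().as_("t1")  # mypy keeps the Table type thanks to the bound
```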
Pypika has dubious implementations for some edge cases
In pypika, Criterion is essentially the same as booleanExpression in the Trino ANTLR grammar. Criterion.all([]) ANDs together zero elements, which creates an EmptyCriterion that has no string representation and raises an exception when converted to a string. An empty criterion means there should be no booleanExpression present.
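A minimal reproduction of the behavior described above (whether and how it fails may vary by pypika version):

```python
# Minimal reproduction as described above; the exact failure mode may vary
# by pypika version.
from pypika import Criterion

empty = Criterion.all([])      # AND of zero terms -> EmptyCriterion
print(type(empty).__name__)    # EmptyCriterion
print(str(empty))              # per the post, stringifying this is what fails
```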
Pypika's objects are stateful and cause subtle bugs
Refer to: kayak/pypika#633
Pypika is not Trino-compatible
It only has partial support for ClickHouse, and point (1) already shows that it's probably not "fully compatible" with any particular SQL dialect.
Pypika can convert to string, but can't parse from string to recover itself
This is extremely important for data analysts. Those who are comfortable writing raw SQL would much rather express their subqueries and other fragments as strings and have the standard library implementation parse them into a tree. I've found this to be crucial for adoption.
Small modifications to the grammar
As discussed in the design section, the SqlBase.g4 grammar needs to be transformed slightly to work in Python. For example, there are terms in SqlBase.g4 that collide with Python keywords such as filter, from, and type. The grammar is also built on fragment LETTER, which only matches upper case, but the SQL string could be composed of lower case identifiers [4]. I have a script in Treeno that downloads the grammar file and performs a series of transformations to fix these problems. Currently, the parser fails to parse SQL that uses lowercase keywords, such as select instead of SELECT, so I plan on modifying the grammar to accept lower case versions of the keywords as well (EDIT: this is done).
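As a rough illustration (not the actual script in the repository), such a transformation pass might look like the following; the grammar URL, the keyword list, and the regexes are all assumptions:

```python
# Rough illustration only - not the actual Treeno script. It sketches the two
# kinds of transformations described above: renaming grammar identifiers that
# collide with Python keywords, and letting the LETTER fragment match lower
# case. The URL and the patterns here are assumptions.
import re
import urllib.request

# Assumed location; the real script may pin a specific Trino version/commit.
GRAMMAR_URL = (
    "https://raw.githubusercontent.com/trinodb/trino/master/"
    "core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4"
)


def transform_grammar(text: str) -> str:
    # Append an underscore to identifiers that collide with Python keywords so
    # the generated parser/visitor code stays importable.
    for kw in ("filter", "from", "type"):
        text = re.sub(rf"\b{kw}\b(?=\s*[=:])", f"{kw}_", text)
    # Allow the LETTER fragment to match lower case letters as well.
    text = re.sub(
        r"fragment LETTER\s*:\s*\[A-Z\]",
        "fragment LETTER\n    : [A-Za-z]",
        text,
    )
    return text


raw = urllib.request.urlopen(GRAMMAR_URL).read().decode("utf-8")
with open("SqlBase.g4", "w") as f:
    f.write(transform_grammar(raw))
```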
P.S. Merry Christmas everyone :)

Footnotes
1. This name isn't set in stone. Let me know if you have strong opinions for/against it!
2. I still think pypika is a great library, and it could fix these problems relatively easily, but the lack of maintenance makes this hard.
3. I don't think the above problem is that bad. I picked an old grammar file from a year ago when I first started development on Treeno, and planned to "rebase" on the new grammar file. There was very little that had to be changed in the current implementation, and whatever changes were needed were relatively simple as long as I read the PRs relating to those changes. Being behind a few versions also doesn't hurt because it doesn't change the end-user experience at all. The only concern is with large-scale refactors of the grammar file, which might force me to add a new layer of tree nodes.
4. I haven't done an investigation, but I assume the query string gets uppercased prior to parsing on the execution engine side.