Client SQL building library for Trino #10410
OneRaynyDay
Treeno [1] - a tree representation of a Trino SQL query in Python. Link to project
Problem Statement
ANSI SQL is a large DSL that lets users express a wide variety of transformations over existing tables and subqueries. As a result, every large company has its own "standard library" of SQL functionality specialized for its use cases. Some functions in said standard library may include real-valued histograms using Trino's width_bucket, quantile cuts, random sampling (beyond TABLESAMPLE BERNOULLI), various aggregation and summarization utilities, and much more. This not only reduces duplication of code, but also allows data scientists to focus more on the analysis and less on the SQL query writing, which may include complicated CTEs, subqueries, and (potentially correlated) joins.
I strongly believe in Trino's success, so I made a project that allows flexible SQL building in Python. I hope the project will not only help existing power users compose more complicated queries, but also encourage more users to use Trino.
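To make the "standard library" idea concrete, here is a rough sketch (not taken from any existing codebase) of the kind of helper such a library might contain, built on Trino's width_bucket function; the table and column names are placeholders.

```python
# Hypothetical helper, shown only to illustrate the kind of "standard library"
# function described above: a real-valued histogram over a column, built on
# Trino's width_bucket. Table/column names are placeholders.
def histogram_sql(table: str, column: str, low: float, high: float, buckets: int) -> str:
    return f"""
    SELECT width_bucket({column}, {low}, {high}, {buckets}) AS bucket,
           COUNT(*) AS freq
      FROM {table}
     GROUP BY 1
     ORDER BY 1
    """


print(histogram_sql("events", "latency_ms", 0.0, 1000.0, 20))
```

Maintaining many such helpers as raw string templates gets unwieldy quickly, which motivates the query builder approach discussed below.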
Existing Solutions
There have been three main ways (that I know of) in which this problem has been tackled: SQL templating, ORMs, and programmatic query builders.
Arguments against templating
The main argument against basic SQL templating is its lack of flexibility compared to a programmatic query builder. For every slightly different piece of functionality that can't easily be expressed by adding a new template argument, we would need to create a new template. Empirically, the variety, maturity, and popularity (using GitHub stars as a heuristic) of query building interfaces surpasses that of templating.
Arguments against ORM
There are many problems with traditional ORMs, such as the N+1 problem and overfetching, which are especially bad at the scale of data that Trino can crunch compared to typical OLTP databases such as MySQL and PostgreSQL.
We should separate the core query building not only from the actual execution but also from metadata retrieval. As an example, many companies have suffered performance hits from the Hive metastore eagerly locking on metadata fetches, which would be necessary in an ORM model (other pain points of the metastore can be found here). We want the option to supply the query builder with such metadata, but it shouldn't be a required component because of the variety of connectors we support, each of which has different performance characteristics for both queries and metadata retrieval.
As a result, I decided to make treeno, a query building library aimed at compatibility ONLY with Trino that provides a suite of helpful utilities.
Why Treeno?
At work, we use pypika for its query building functionality, and have had success using it to construct a rudimentary standard library for data analysis. However, we've found several reasons to discontinue our investment in it [2]:
From all of these points, the most concerning ones are (6) and (7), which are also present in other query building libraries that I have found. I will outline some examples of each issue in the appendix if you're not yet convinced.
Treeno attempts to address all of the issues above, and more.
Non-goals of Treeno
Treeno is just a query building interface with some static analysis. It's not meant to do anything smarter than that. Here is a partial list of non-goals:
Design
Treeno is intentionally written in Python. Python is the most common language for data analysis, with a large ecosystem involving pandas, ipython, jupyter, numpy, and the rest of the scientific suite. I have not yet considered support for other languages such as Julia and R.
Treeno is built with a focus on interoperability with existing patterns. This means it was built around the idea that users should not have to learn how to construct Treeno objects in order to construct their queries - they should be able to write raw SQL and have it be convertible to Treeno nodes if they wish.
Treeno is (mostly) faithful to the Trino grammar. Treeno uses the Python 3 ANTLR runtime to parse Trino's grammar and create visitors that build a representation of the query similar (but not strictly adhering) to that of Trino's execution engine. The representation is meant to be partial in that we don't fetch metadata immediately. I also aim to simplify the user experience by merging some of the ANTLR grammar rules into a single object (i.e. windowSpecification, windowFrame, and frameExtent are all coalesced into a single Window object). For the complicated yet commonly used parts of SQL queries (such as CTEs, subqueries, joins, etc.), I will provide an even higher level abstraction for users to build their queries with (this is a TODO). I also make slight modifications to the grammar, which are outlined in the appendix.
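To give a feel for what "coalescing grammar rules" means, here is an illustrative sketch (not Treeno's actual class; the names and fields are assumptions) of a single Window object covering windowSpecification, windowFrame, and frameExtent:

```python
# Illustrative sketch only - not Treeno's actual class definition. It shows
# the idea of folding the windowSpecification, windowFrame and frameExtent
# grammar rules into one object; the names and fields here are assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FrameType(Enum):
    RANGE = "RANGE"
    ROWS = "ROWS"
    GROUPS = "GROUPS"


@dataclass
class Window:
    partition_by: List[str] = field(default_factory=list)  # from windowSpecification
    order_by: List[str] = field(default_factory=list)      # from windowSpecification
    frame_type: Optional[FrameType] = None                 # from windowFrame
    start_bound: str = "UNBOUNDED PRECEDING"               # from frameExtent
    end_bound: str = "CURRENT ROW"                         # from frameExtent
```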
Examples
Queries
A typical query can be printed either in a formatted manner or in a compressed default form:
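The sketch below shows the intended round trip; the entry point and method names are assumptions rather than confirmed Treeno API.

```python
# Hypothetical sketch of the round trip described above. The import path and
# the sql()/pretty argument are assumptions, not confirmed Treeno API.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql("SELECT a, b FROM t WHERE a > 1 AND c IS NOT NULL")
print(query.sql())             # compressed default, e.g. one line
print(query.sql(pretty=True))  # formatted output following the style guide below
```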
The style guide I adhered to is: https://www.sqlstyle.guide/, but this can be easily changed later.
Support for basic window functions:
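A hypothetical snippet, reusing the assumed helper from above:

```python
# Hypothetical sketch; query_from_sql is an assumed entry point.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql(
    "SELECT a, SUM(b) OVER "
    "(PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "
    "FROM t"
)
print(query.sql())
```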
Support for subqueries:
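Another hypothetical snippet, same assumptions as above:

```python
# Hypothetical sketch: subqueries in both the FROM clause and the WHERE clause.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql(
    "SELECT x FROM (SELECT a AS x FROM t) WHERE x IN (SELECT a FROM u)"
)
print(query.sql())
```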
Support for all operator precedence (PEMDAS) according to the grammar:
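A hypothetical snippet for precedence, same assumptions as above:

```python
# Hypothetical sketch: arithmetic/boolean precedence follows the grammar, so
# parentheses only need to appear where they change the meaning.
from treeno.builder.convert import query_from_sql  # assumed import path

query = query_from_sql("SELECT (a + b) * c - d / e FROM t WHERE NOT f OR g AND h")
print(query.sql())
```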
Treeno is also type-aware. I plan to use this to do partial type-check validation on the tree so we don't have to hit the coordinator as often with failed queries. This is a TODO quality-of-life feature, and only Literals have a known type for now:
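A hypothetical snippet; the expression-level helper and the data_type attribute are assumed names (a standalone-expression entry point is described in the next section):

```python
# Hypothetical sketch: only literals carry a known type right now; other
# expressions would be typed as UNKNOWN until partial type checking lands.
# Both expression_from_sql and data_type are assumed names.
from treeno.builder.convert import expression_from_sql  # assumed import path

lit = expression_from_sql("3.14")
print(lit.data_type)   # some DECIMAL type for this literal
```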
Standalone expressions and types
Treeno also supports standalone expressions:
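A hypothetical snippet, assuming an expression-level entry point analogous to the query-level one above:

```python
# Hypothetical sketch; expression_from_sql is an assumed name.
from treeno.builder.convert import expression_from_sql  # assumed import path

expr = expression_from_sql("x BETWEEN 1 AND 10 AND y LIKE 'foo%'")
print(expr.sql())
```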
... and data types:
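And a hypothetical snippet for standalone data types:

```python
# Hypothetical sketch; type_from_sql is an assumed name.
from treeno.builder.convert import type_from_sql  # assumed import path

dtype = type_from_sql("ROW(x BIGINT, y ARRAY(VARCHAR))")
print(dtype.sql())
```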
Other nice-to-haves
Treeno also has some educational tools to better understand the structure of queries:
What needs to be done?
In order of significance:
User experience: Currently the library is not friendly for programmatic SQL building. I plan to add a layer of helper builder functions to make this much simpler. Whether it should mostly follow sqlalchemy's or pypika's syntax is still up for debate.
Feature-completeness: I still need to finish a few features of Trino's grammar, namely patternRecognition, primaryExpression, table union/intersect, and a lot of built-in functions. On top of that, pretty SQL printing still needs to be added for a couple of complex expressions.
Additional cool features: I would also like to perform partial type checking as we build up the tree, using the UNKNOWN type where we're unsure, and erroring with an intuitive diagnostic message when the input type is defined and doesn't comply.
Some caveats of the design
The biggest expected caveat is outlined here. The Trino SQL grammar file is constantly being updated, and we would like to add new features as they come in. This means some parts of the visitor pattern need to be rewritten as grammar changes land in the main Trino repository. On one hand, we all love new SQL features being implemented; on the other hand, that makes this project a moving target [3].
Another concern with the design is its opinionated SQL representation, which can be fixed later on. For example, we always double-quote column names that contain . and other non-alphanumeric characters (e.g. a column named a.b is rendered as "a.b"). Note that this particular "feature" is also enforced in Pypika. You also cannot perfectly recover the original SQL string character-by-character, but the semantic definition is still the same. Note that by using the ANTLR parser directly, we don't have a way of recovering SQL comments either.
Future Work
In the long term, I would also like to contribute/upstream a set of common functions built on this library for other Trino users and help speed up their data analysis. With a static SQL query builder, we can also push more verification to the client side and, as a result, decrease the load on the coordinator (which is currently designed as a single point of failure). With the query builder we can now represent SQL as Python objects. This means Treeno can be used for data lineage tools, which allow infra engineers to better understand the relative priority/popularity of different data sources.
Human concerns
Technical design aside, I'm sure whoever is reading this also considers the human side of the problem: what is a reasonable timeline for this library to be feature-complete?
I don't have an answer for this, as the target SQL grammar is always moving. I also have a day job that requires my full attention on weekdays (but it would be cool if I could also do this at my day job!), and it doesn't only involve working with Trino. I plan to get an acceptable version of Treeno out by early next year and will spend time over weekends adding new features and fixing old bugs. I would also love pull requests for any feature others are particularly passionate about, as my decision to implement features depends on the use cases I see in my day-to-day work.
Appendix
Outlining Pypika issues
Pypika doesn't understand/incorrectly interprets basic SQL constructs
Pypika is missing the following functionalities (and a lot more):
Pypika's orderby accepts a string which automatically gets converted to an object representing an input table's column name (Field), but it should actually refer to the OUTPUT's column name, since ORDER BY, OFFSET, LIMIT, etc. operate on the output of a query.
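A minimal pypika snippet illustrating the point (the behavior is as described in this post; worth re-verifying against your pypika version):

```python
# The string passed to orderby is wrapped as a Field, i.e. treated like an
# input column of "events", even though event_time only exists as an output
# alias of the SELECT list. Behavior as described above; verify per version.
from pypika import Query, Table

t = Table("events")
q = Query.from_(t).select(t.ts.as_("event_time")).orderby("event_time")
print(q)
```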
Pypika is no longer being developed
At the time of writing: 95 issues, 10 pull requests (one of them being mine), and no new commits in the past few months.
Pypika has incorrect Python typing, which makes it hard to work with a partially typed codebase (checked by mypy)
Pypika's inheritance is broken, for one: it should be using TypeVars bound to generic types in order to prevent type erasure in the inheritance hierarchy. See here as an example. This should be -> TypeVar("GenericSelectable", bound="Selectable").
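A generic illustration (not pypika's actual code) of why the bound TypeVar matters:

```python
# Not pypika's actual code - a minimal illustration of the bound-TypeVar
# pattern. Returning the TypeVar preserves the concrete subclass type, so
# chaining off a Table still type-checks as a Table instead of being erased
# to the base Selectable.
from typing import TypeVar

GenericSelectable = TypeVar("GenericSelectable", bound="Selectable")


class Selectable:
    alias: str = ""

    def as_(self: GenericSelectable, alias: str) -> GenericSelectable:
        self.alias = alias
        return self


class Table(Selectable):
    def get_sql(self) -> str:
        return f'"{self.alias or "t"}"'


t: Table = Table().as_("t1")  # mypy keeps the Table type thanks to the bound
```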
Pypika has dubious implementations for some edge cases
In pypika, Criterion is essentially the same as booleanExpression in the Trino ANTLR grammar. Criterion.all([]) ANDs together zero elements, which creates an EmptyCriterion that has no string representation and raises an exception when converted to a string. An empty criterion means there should be no booleanExpression present.
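A minimal reproduction of the behavior described above (whether and how it fails may vary by pypika version):

```python
# Minimal reproduction as described above; the exact failure mode may vary
# by pypika version.
from pypika import Criterion

empty = Criterion.all([])      # AND of zero terms -> EmptyCriterion
print(type(empty).__name__)    # EmptyCriterion
print(str(empty))              # per the post, stringifying this is what fails
```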
Pypika's objects are stateful and cause subtle bugs
Refer to: kayak/pypika#633
Pypika is not Trino-compatible
It only has partial support for ClickHouse, and point (1) already shows that it's probably not "fully compatible" with any particular SQL dialect.
Pypika can convert to string, but can't parse from string to recover itself
This is extremely important for data analysts. Those who are comfortable writing raw SQL would much rather express their subqueries and other fragments as strings and have the standard library implementation parse them into a tree. I've found this to be crucial for adoption.
Small modifications to the grammar
As discussed in the design section, the SqlBase.g4 grammar needs to be transformed slightly to work in Python. For example, there are terms in SqlBase.g4 that collide with Python keywords such as filter, from, and type. The grammar is also built on fragment LETTER, which only matches upper case, but the SQL string could be composed of lower case identifiers [4]. I have a script in Treeno that downloads the grammar file and performs a series of transformations to fix these problems. Currently, the parser fails to parse SQL that uses lowercase keywords, such as select instead of SELECT, so I plan on modifying the grammar to accept lower case versions of the keywords as well (EDIT: this is done).
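As a rough illustration (not the actual script in the repository), such a transformation pass might look like the following; the grammar URL, the keyword list, and the regexes are all assumptions:

```python
# Rough illustration only - not the actual Treeno script. It sketches the two
# kinds of transformations described above: renaming grammar identifiers that
# collide with Python keywords, and letting the LETTER fragment match lower
# case. The URL and the patterns here are assumptions.
import re
import urllib.request

# Assumed location; the real script may pin a specific Trino version/commit.
GRAMMAR_URL = (
    "https://raw.githubusercontent.com/trinodb/trino/master/"
    "core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4"
)


def transform_grammar(text: str) -> str:
    # Append an underscore to identifiers that collide with Python keywords so
    # the generated parser/visitor code stays importable.
    for kw in ("filter", "from", "type"):
        text = re.sub(rf"\b{kw}\b(?=\s*[=:])", f"{kw}_", text)
    # Allow the LETTER fragment to match lower case letters as well.
    text = re.sub(
        r"fragment LETTER\s*:\s*\[A-Z\]",
        "fragment LETTER\n    : [A-Za-z]",
        text,
    )
    return text


raw = urllib.request.urlopen(GRAMMAR_URL).read().decode("utf-8")
with open("SqlBase.g4", "w") as f:
    f.write(transform_grammar(raw))
```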
P.S. Merry Christmas everyone :)

Footnotes
1. This name isn't set in stone. Let me know if you have strong opinions for/against it!
2. I still think pypika is a great library, and it could fix these problems relatively easily, but the lack of maintenance makes this hard.
3. I don't think the above problem is that bad. I picked an old grammar file from a year ago when I first started development on Treeno, and planned to "rebase" on the new grammar file. There was very little that had to be changed in the current implementation, and whatever changes were needed were relatively simple as long as I read the PRs relating to those changes. Being behind a few versions also doesn't hurt because it doesn't change the end-user experience at all. The only concern is with large-scale refactors of the grammar file, which might force me to add a new layer of tree nodes.
4. I haven't done an investigation, but I assume the query string gets uppercased prior to parsing on the execution engine side.