perf(drop): speed up performance of drop #9440
Benchmarks:
New benchmark results (note these aren't comparing to
Construction
Still scaling with the number of columns, but overall doing fewer full scans of all columns.
Compilation
This now scales with
Here are the comparison benchmarks for benchmarks that passed:
Compilation
Construction
For construction, things are better across the board. For compilation, it's kind of a wash, except for the important fact that some cases of drop simply didn't work because constructing the expression overflowed the Python stack. I think this is then a net improvement overall.
Clouds are passing:
lovely, looks like a win to me
This PR speeds up drop construction and compilation in cases where the number
of dropped columns is small relative to the total number of parent-table
columns.
There are two ways this is done:

1. `drop` construction is sped up by reducing the number of iterations over the full set of columns when constructing an output schema. This is where the bulk of the improvement is.
2. Compilation of the `drop` operation is also a bit faster for smaller sets of dropped columns on some backends due to use of `* EXCLUDE` syntax.

Since the optimization is done in the `schema` property, adding a new `DropColumns` relation IR seemed like the lightest-weight approach, given that it also enables compilers to use `EXCLUDE` syntax, which will produce a far smaller query than the project-without-the-dropped-columns approach.
Partially addresses the `drop` performance seen in #9111. To address this for all backends, they either need to all support `SELECT * EXCLUDE(col1, ..., colN)` syntax or we need to implement column pruning.
Follow-ups could include applying a similar approach to `rename` (using `REPLACE` syntax for compilation).
It might be possible to reduce the overhead of `relocate` as well, but I haven't explored that.