-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved performance of RANGE preceding window functions #4904
Comments
cc @ozankabak I thought you might be interested in this (not to implement but as some context / possible future work) |
Table 1 in this paper gives a good summary of complexities involved. Datafusion currently employs what the author calls "Removable Cumulative" mechanism. However, the So, for the first three frame types in Table 1, what we have is better complexity-wise. Certain implementation improvements can still be made (e.g. in binary searching) obviously, but I'd say we are doing a good job there. If I'm not mistaken we don't support the fourth frame type; i.e. frames with variable boundaries. This is the use case where the segment tree shines. When we add support for that, we should definitely consider segment tree as the algorithm of choice. To summarize, I think the segment tree approach and our current approach will coexist and just apply to different use cases, it seems like. It was a good read, thank you for putting this on the radar. |
This is what I thought too at first -- and then I tried it. I believe they are referring to window functions like However, after more reading, I think they are referring to something like an expression rather than constant in a ❯ select column1, first_value(column1) OVER (ROWS column1/5 PRECEDING) from foo;
Internal("Window frame bound cannot be BinaryOp { left: Identifier(Ident { value: \"column1\", quote_style: None }), op: Divide, right: Value(Number(\"5\", false)) }") |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am not sure this is an important feature to add, but I wanted to write it down.
Basically there is an interesting algorithm called
Segment Tree
which is described in :https://dl.acm.org/doi/10.14778/2794367.2794375
p1058-leis.pdf
This algorithm handles window functions with
RANGE
window functions well, at least that is the claim. It might be a reasonable structure to implement instead of the "MovingMinMax" added in #4675 by @berkaycpp and @mustafasrepo .Describe the solution you'd like
If we hit unacceptable performance of window functions (especially with largely varying
RANGE
), this might an algorithm worth looking into.As a reminder a
RANGE
window frame is determined in terms of the values of the partition, not the number of rows:Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: