-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Support nullable timestamp columns in time range window functions #6557
[REVIEW] Support nullable timestamp columns in time range window functions #6557
Conversation
1. Initial commit. Fix for non-grouped time-ranges with null timestamps, in ASC order. 2. Fixed non-grouped time-ranges with null timestampw, in DESC. 3. Fixed grouped time-ranges with null timestamp, in ASC. 4. Fixed grouped time-ranges with null timestamp, in DESC. 5. Fixed grouped time-ranges for ASC (following) 6. Tests for different combinations of timestamp null grouping, timestamp ordering, and key grouping. 7. Refactor: Common code gathered to utility function. 8. Code formatting
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
This change stomps all over the work in #6512. I realize that I'd be mixing concerns here, but would anyone be opposed to my including that refactor in this PR?
|
Is #6512 ready for review? |
Added to CHANGELOG.md.
…mestamp-branch-0.17
Now I understand the question:) I think the diff would not work well if the code has been moved between files, so these two PRs combined would be really hard to review. I would prefer to keep them separate, mostly for this reason. |
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #6557 +/- ##
============================================
Coverage 82.54% 82.54%
============================================
Files 90 90
Lines 14928 14928
============================================
Hits 12322 12322
Misses 2606 2606 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review
…mestamp-branch-0.17
1. Non-groupby case: Use null-count to calculate null bounds 2. Groupby case: Use partition_point()
Hmm. There are test failures that don't seem related to this change, at least on the face of it. From the test logs:
|
Yeah, this is a known CI issue. Started happening two days ago. |
Thanks, @vuule. Phew. As an aside, I think I've incorporated your suggestions in the current version of the code. Please let me know if this is more palatable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestions, requesting changes primarily for of test coverage.
1. Short circuit eval for all-null/no-null cases. 2. Test cases for all-null/no-null. 3. Code formatting
…mestamp-branch-0.17
…mestamp-branch-0.17
rerun tests |
@vuule, thank you for your review and advice. Hey, @davidwendt. I was wondering if you might take a gander at this change. Thanks. |
1. Fixed angle-brackets for header inclusion. 2. Switched to doxygen-style function documentation
…mestamp-branch-0.17
3. Code formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Got an optional suggestion.
1. Streamline null_bound calculation, with fewer comparisons.
2. Fix botched null_bound calculation cleanup. 3. Remove errant println in tests.
Thanks for the reviews, all. I have just merged this now. |
Closes #6524.
Time-range based rolling window functions currently do not consider the possibility that the timestamp column might have nulls in them, thus muffing up the time-window calculation.
Note that
grouped_time_range_rolling_window()
requires that the timestamp column is sorted (ASC
orDESC
), and possibly grouped by other keys. TheNULL
values in the timestamp column will be clustered at the beginning or end of the column (if ungrouped), or beginning/end of each group.E.g. The following is an example of a table grouped by the first column, ordered by
ASCENDING
timestamp (within each group), withNULL
s appearing first.In this case, when doing a rolling-window
COUNT
on the third column with1 DAY
preceding and1 DAY
following, the rows withNULL
timestamps should not participate with those that have a valid timestamp. They will, however, participate in a window of their own. The output for rolling-windowCOUNT
should be:Note that the third row of the table must ignore the previous two rows, since their timestamps are
NULL
. The rolling-window for the null rows themselves must consider all rows in that group's null-range. This behaviour is congruent with ranged window operations in SQL systems (such as SparkSQL).The commits included herewith provide support for nullable timestamp columns in time-range window functions.