-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types #6261
[WIP] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types #6261
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
36b9bc0
to
a2b6ac6
Compare
a2b6ac6
to
1d78925
Compare
Initial implementation for LEAD/LAG window functions, for fixed-width types.
1d78925
to
db33cb1
Compare
Codecov Report
@@ Coverage Diff @@
## branch-0.16 #6261 +/- ##
===============================================
+ Coverage 84.45% 84.72% +0.27%
===============================================
Files 82 82
Lines 13846 14163 +317
===============================================
+ Hits 11694 12000 +306
- Misses 2152 2163 +11
Continue to review full report at Codecov.
|
* | ||
* @returns A nullable output column containing the results of the window operation. | ||
*/ | ||
std::unique_ptr<column> offset_rolling_window( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't think this would need it's own rolling_window API. I would hope that any aggregation specific info can be put into the aggregation
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is where I started, actually: putting this in grouped_rolling_window()
.
Lead/Lag are a little different from the other aggregations:
- They work with a single offset value, interpreted based on direction.
- They support default values where other operators don't.
- The window width calculation ended up differing from what we have in the
grouped_
case.
In the end, there was Lead
/Lag
-specific error-handling that obfuscated grouped_rolling_window()
. Moving it out to another function simplified everything.
An earlier version of this function was an overload of grouped_rolling_window
. It was renamed to avoid implying support for all window operators, and to reorder the parameters more appropriately for Lead
/Lag
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope that any aggregation specific info can be put into the aggregation object.
Oh, I think I see what you mean now. Let me try cleaning up the API, and putting the row_offset
into LEAD
/LAG
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been trying to evaluate whether it also makes sense to move the default outputs into the aggregation operator payload, as a scalar. I don't think that would be viable. The caller might use an expression as a default, thus yielding a default column.
I'll try moving the offset into the aggregation operator itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a crack at writing a LEAD
/LAG
aggregation that stores a row_offset
. I'm afraid it doesn't look doable, or will at the very least be hacky.
The back-end machinery in rolling_window()
seems to be geared towards aggregation-operators that have no state. E.g. agg_op
(used in rolling_window_launcher
, gpu_rolling
etc.) is assumed to be stateless and default-constructable.
It might've been possible to pass a more complex aggregation operator to the process_rolling_window
device function, if the row-offsets were compile-time constants, but alas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope that any aggregation specific info can be put into the aggregation object.
@jrhemstad, I have tried to accommodate your suggestion over at #6277. If that version seems more palatable, I will close this PR in favour of that approach.
* null, by default. | ||
* 2. LAG(N): returns the row from the input column at the specified offset preceding | ||
* the current row. If the offset crosses the grouping boundary or column boundary for | ||
* a given row, a "default" value is returned if available. The "default" value is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the default value is null by default, do we need to state "if available"? It seems like there will always be a default value available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can reword this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, @hyperbolic2346. I have accommodated your suggestion at #6277. Please note that in that approach, I've had to move this documentation into rolling.cu
, and removed offset_rolling_window()
entirely.
* [11, 12, 13, null, 21, 22, 23, null] | ||
* | ||
* LAG(input_col, 2) yields: | ||
* [null, null, 11, 12, null, null, 21, 22] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not understanding this correctly I don't think. It took me a couple of times reading this to believe that input_col here is the mutated column from the LEAD call above. Is that correct? If so, should we make that more explicit with input_col = LEAD(input_col, 1) yields:
or show the operation on the original input_col?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with all the rolling window functions, offset_rolling_window()
returns a unique_ptr<column>
. It doesn't mutate the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my assumption would be that LAG(input_col, 2)
would make [null, null, 10, 11, null, null, 20, 21]
. I'm not sure what happened to the 10 and 20 in the example unless we used the output of the above LEAD call which removed them as input for this LAG
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see what you're pointing out. The example is incorrect.
Thank you for the correction, @hyperbolic2346. :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have corrected this example, over at #6277.
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** | ||
* @brief Applies an offset-based window function (like LEAD/LAG) on specified column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the duplicates use @copydoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I'll change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now using @copydoc
, for #6277.
Closing in favour of #6277. |
Initial implementation for
LEAD()
/LAG()
offset window functions, supporting fixed-width data types.Offset-based window functions include LEAD and LAG:
LEAD(N)
: Returns the row from the input column, at the specified offset past thecurrent row. If the offset crosses the grouping boundary or column boundary for
a given row, a "default" value is returned if available. The "default" value is
null, by default.
LAG(N)
: returns the row from the input column at the specified offset precedingthe current row. If the offset crosses the grouping boundary or column boundary for
a given row, a "default" value is returned if available. The "default" value is
null, by default.
E.g. Consider an input column with the following values and grouping:
[10, 11, 12, 13, 20, 21, 22, 23]
<------G1-----> <------G2------>
LEAD(input_col, 1)
yields:[11, 12, 13, null, 21, 22, 23, null]
LAG(input_col, 2)
yields:[null, null, 11, 12, null, null, 21, 22]
LEAD(input_col, 1, 99)
(where 99 indicates the default) yields:[11, 12, 13, 99, 21, 22, 23, 99]