Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types #6261

Closed

Conversation

mythrocks
Copy link
Contributor

Initial implementation for LEAD()/LAG() offset window functions, supporting fixed-width data types.

Offset-based window functions include LEAD and LAG:

  1. LEAD(N): Returns the row from the input column, at the specified offset past the
    current row. If the offset crosses the grouping boundary or column boundary for
    a given row, a "default" value is returned if available. The "default" value is
    null, by default.
  2. LAG(N): returns the row from the input column at the specified offset preceding
    the current row. If the offset crosses the grouping boundary or column boundary for
    a given row, a "default" value is returned if available. The "default" value is
    null, by default.

E.g. Consider an input column with the following values and grouping:
[10, 11, 12, 13, 20, 21, 22, 23]
<------G1-----> <------G2------>

LEAD(input_col, 1) yields:
[11, 12, 13, null, 21, 22, 23, null]

LAG(input_col, 2) yields:
[null, null, 11, 12, null, null, 21, 22]

LEAD(input_col, 1, 99)(where 99 indicates the default) yields:
[11, 12, 13, 99, 21, 22, 23, 99]

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks mythrocks changed the title [WIP] [lead/lag] Implement LEAD/LAG window functions [WIP] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types Sep 18, 2020
Initial implementation for LEAD/LAG window functions,
for fixed-width types.
@codecov
Copy link

codecov bot commented Sep 18, 2020

Codecov Report

Merging #6261 into branch-0.16 will increase coverage by 0.27%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.16    #6261      +/-   ##
===============================================
+ Coverage        84.45%   84.72%   +0.27%     
===============================================
  Files               82       82              
  Lines            13846    14163     +317     
===============================================
+ Hits             11694    12000     +306     
- Misses            2152     2163      +11     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/io/parquet.py 91.35% <0.00%> (-0.85%) ⬇️
python/cudf/cudf/_version.py 44.80% <0.00%> (-0.72%) ⬇️
python/cudf/cudf/io/parquet.py 91.73% <0.00%> (-0.50%) ⬇️
python/cudf/cudf/core/series.py 90.96% <0.00%> (-0.37%) ⬇️
python/cudf/cudf/utils/applyutils.py 98.74% <0.00%> (-0.01%) ⬇️
python/cudf/cudf/utils/ioutils.py 85.84% <0.00%> (ø)
python/cudf/cudf/utils/docutils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/csv.py 95.52% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 86.05% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d509d0d...29c5611. Read the comment docs.

*
* @returns A nullable output column containing the results of the window operation.
*/
std::unique_ptr<column> offset_rolling_window(
Copy link
Contributor

@jrhemstad jrhemstad Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think this would need it's own rolling_window API. I would hope that any aggregation specific info can be put into the aggregation object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is where I started, actually: putting this in grouped_rolling_window().

Lead/Lag are a little different from the other aggregations:

  1. They work with a single offset value, interpreted based on direction.
  2. They support default values where other operators don't.
  3. The window width calculation ended up differing from what we have in the grouped_ case.

In the end, there was Lead/Lag-specific error-handling that obfuscated grouped_rolling_window(). Moving it out to another function simplified everything.

An earlier version of this function was an overload of grouped_rolling_window. It was renamed to avoid implying support for all window operators, and to reorder the parameters more appropriately for Lead/Lag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that any aggregation specific info can be put into the aggregation object.

Oh, I think I see what you mean now. Let me try cleaning up the API, and putting the row_offset into LEAD/LAG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been trying to evaluate whether it also makes sense to move the default outputs into the aggregation operator payload, as a scalar. I don't think that would be viable. The caller might use an expression as a default, thus yielding a default column.
I'll try moving the offset into the aggregation operator itself.

Copy link
Contributor Author

@mythrocks mythrocks Sep 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a crack at writing a LEAD/LAG aggregation that stores a row_offset. I'm afraid it doesn't look doable, or will at the very least be hacky.

The back-end machinery in rolling_window() seems to be geared towards aggregation-operators that have no state. E.g. agg_op (used in rolling_window_launcher, gpu_rolling etc.) is assumed to be stateless and default-constructable.

It might've been possible to pass a more complex aggregation operator to the process_rolling_window device function, if the row-offsets were compile-time constants, but alas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that any aggregation specific info can be put into the aggregation object.

@jrhemstad, I have tried to accommodate your suggestion over at #6277. If that version seems more palatable, I will close this PR in favour of that approach.

* null, by default.
* 2. LAG(N): returns the row from the input column at the specified offset preceding
* the current row. If the offset crosses the grouping boundary or column boundary for
* a given row, a "default" value is returned if available. The "default" value is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default value is null by default, do we need to state "if available"? It seems like there will always be a default value available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reword this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @hyperbolic2346. I have accommodated your suggestion at #6277. Please note that in that approach, I've had to move this documentation into rolling.cu, and removed offset_rolling_window() entirely.

* [11, 12, 13, null, 21, 22, 23, null]
*
* LAG(input_col, 2) yields:
* [null, null, 11, 12, null, null, 21, 22]
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding this correctly I don't think. It took me a couple of times reading this to believe that input_col here is the mutated column from the LEAD call above. Is that correct? If so, should we make that more explicit with input_col = LEAD(input_col, 1) yields: or show the operation on the original input_col?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with all the rolling window functions, offset_rolling_window() returns a unique_ptr<column>. It doesn't mutate the input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my assumption would be that LAG(input_col, 2) would make [null, null, 10, 11, null, null, 20, 21]. I'm not sure what happened to the 10 and 20 in the example unless we used the output of the above LEAD call which removed them as input for this LAG call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see what you're pointing out. The example is incorrect.

Thank you for the correction, @hyperbolic2346. :]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have corrected this example, over at #6277.

rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Applies an offset-based window function (like LEAD/LAG) on specified column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the duplicates use @copydoc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now using @copydoc, for #6277.

@mythrocks
Copy link
Contributor Author

Closing in favour of #6277.

@mythrocks mythrocks closed this Sep 30, 2020
@mythrocks mythrocks deleted the lead-lag-branch-0.16-review branch November 23, 2021 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants