Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Std dev samp for windowing [databricks] #3869

Merged
merged 10 commits into from
Nov 3, 2021

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Oct 20, 2021

This PR adds windowing support for stddev. Its built on top of @revans2 changes for adding support for GpuReplaceWindowFunction

  • Added changes to AggregateFunction.scala for the actual Expression changes. Had to add a CudfWindowStddev class to avoid stack overflow when replacing StddevSamp
  • Added tests in window_function_test.py.

depends on rapidsai/cudf#9527

closes #3814

Signed-off-by: Raza Jafri [email protected]

@razajafri razajafri self-assigned this Oct 20, 2021
@abellina
Copy link
Collaborator

There are known issues in this implementation.

Should we have it disabled by default for now? We should link the cuDF issue if there is one.

The NaNs and Nulls are not handled properly. The correct way to do this is to run two aggregations and then filter out the result.

Is this something you are planning to handle in the plugin? Or is that going to be handled under the hood by cuDF

@sameerz sameerz added the feature request New feature or request label Oct 20, 2021
Signed-off-by: Raza Jafri <[email protected]>
@razajafri razajafri changed the title Add Std dev for windowing Add Std dev for windowing [databricks] Oct 26, 2021
@razajafri razajafri marked this pull request as ready for review October 26, 2021 21:16
@razajafri
Copy link
Collaborator Author

razajafri commented Oct 26, 2021

Should we have it disabled by default for now? We should link the cuDF issue if there is one.

I kept it as a draft PR for this specific reason.

Is this something you are planning to handle in the plugin? Or is that going to be handled under the hood by cuDF

Yes, I have updated the PR to handle the Spark specific output in the plugin

@razajafri razajafri marked this pull request as draft October 26, 2021 21:19
@razajafri
Copy link
Collaborator Author

@abellina PTAL, I have cleaned up the PR and updated it against @revans2 Average PR

@razajafri razajafri marked this pull request as ready for review October 27, 2021 15:08
Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

build

@abellina abellina self-requested a review October 29, 2021 16:13
These changes were causing test_lead_lag_for_structs_with_arrays
test to fail. I am not sure why but seems to be happening only for
the case where int_gen is nested in the array_gen.

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@revans2 can take another look?

revans2
revans2 previously approved these changes Oct 29, 2021
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one minor nit


override lazy val evaluateExpression: Expression = {
override val evaluateExpression: Expression = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we add the lazy back in?

@razajafri
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Oct 29, 2021
Comment on lines 3094 to 3099
ExprChecks.groupByOnly(
TypeSig.DOUBLE, TypeSig.DOUBLE,
Seq(ParamCheck("input", TypeSig.DOUBLE,
TypeSig.DOUBLE))).asInstanceOf[ExprChecksImpl].contexts
++
ExprChecks.windowOnly(
Copy link
Collaborator

@ttnghia ttnghia Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just aggNotReduction?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about that ExprCheck. I think that's what we need, @revans2 do you see anything wrong with using aggNotReduction here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aggNotReduction should do the same thing, just be cleaner.

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@ttnghia @revans2 can you please take another look

@@ -1204,8 +1204,27 @@ case class GpuStddevPop(child: Expression, nullOnDivideByZero: Boolean)
override def prettyName: String = "stddev_pop"
}

case class WindowStddevSamp(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to support WindowStddevPop too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? The std family can be supported all at once with very little code added. Otherwise, we may have new requests from customers to go back, read the code, add more code + tests etc... in a significantly much more time.

IMO adding all stddev_pop/samp and var_pop/samp at once is the most optimal way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be very easy to do but that doesn't mean that we have to bundle them together. This PR is specific to a client's request and is scheduled for this release.

@razajafri razajafri changed the title Add Std dev for windowing [databricks] Add Std dev samp for windowing [databricks] Nov 2, 2021
@razajafri razajafri merged commit cebcc1f into NVIDIA:branch-21.12 Nov 3, 2021
@ttnghia ttnghia mentioned this pull request Mar 28, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] stddev stddev_samp and std should be supported over a window
5 participants