Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support window operations on Decimal #1333

Closed
6 tasks done
revans2 opened this issue Dec 8, 2020 · 15 comments
Closed
6 tasks done

[FEA] Support window operations on Decimal #1333

revans2 opened this issue Dec 8, 2020 · 15 comments
Assignees
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Dec 8, 2020

Is your feature request related to a problem? Please describe.
This is still blocked on CUDF adding in support for decimal window operations. The window operations currently supported include:

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Dec 8, 2020
@revans2 revans2 mentioned this issue Dec 8, 2020
27 tasks
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 15, 2020
@mythrocks
Copy link
Collaborator

CUDF Decimal support in window functions:
min, max, count, lead, lag: rapidsai/cudf#7037
row_number: rapidsai/cudf#7061

@sperlingxx
Copy link
Collaborator

sperlingxx commented Jan 11, 2021

I have no idea what does "collect list" point to.

@sperlingxx
Copy link
Collaborator

sperlingxx commented Jan 11, 2021

I did some tests with cuDF java tests. I found that all window operations required by current issue can run on decimal columns except sum.
sum on decimal is still unsupported by cuDF because rolling with sum operation is yet supported on decimal: https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/src/rolling/rolling_detail.hpp#L54

@revans2
Copy link
Collaborator Author

revans2 commented Jan 11, 2021

Thanks, then we should split this up into at least 2 issues. The first is for what we can support today. The second should be a follow on to implement sum, and then we can make sure that there is an issue in CUDF to track it. I asked @razajafri to take lead on making sure that all of the issues we need are in cudf so please coordinate on that.

The final issue is for collect_list which @mythrocks is working on a general cudf implementation for (we do not currently support it for any window operations).

@razajafri
Copy link
Collaborator

rapidsai/cudf#7061 and rapidsai/cudf#7037 are both merged in and can be supported by the plugin. I will create an issue for sum in cudf. As for collect_list @mythrocks is working on it. I spoke with him and he will create an issue in cudf to track that work

@razajafri
Copy link
Collaborator

I have opened a cudf issue to track sum in a window operation rapidsai/cudf#7117

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jan 12, 2021
This pull request is to verify window operations on decimal columns in java package, which is required by spark-rapids on [issue 1333](NVIDIA/spark-rapids#1333).

Authors:
  - sperlingxx <[email protected]>

Approvers:
  - Robert (Bobby) Evans (@revans2)

URL: #7120
@revans2
Copy link
Collaborator Author

revans2 commented Jan 19, 2021

Lead, Lag, Max, and Min were merged in as a part of #1512
Count was merged in as a part of #1476 but it was a mistake because it was intended to only go in for regular aggregate operations so tests for it were added as a part of #1512

Both sum and row_number are are merged into cudf so we should be able to get those merged in shortly.

@sperlingxx
Copy link
Collaborator

sperlingxx commented Jan 20, 2021

Hi @revans2, I met some problem on wrapping the sum support for decimal type. The problem is spark catalyst will conduct a hard-code precision promotion on the output decimal type of sum expression (code link), which leads to precision overflow any input decimal whose precision greater than 8.
Considering the output decimal type can pass to parent expressions recursively, I think perhaps we need create a catalyst rule to revert the precision promotion?

@revans2
Copy link
Collaborator Author

revans2 commented Jan 20, 2021

@sperlingxx We want to match what Spark does. We are just not going to be able to support sums greater than a precision of 8. This is the same limitation that we have for doing a sum with a group-by or a reduction in spark.

@sperlingxx
Copy link
Collaborator

Thanks @revans2. I created the pull request #1560 for sum. And I think the row_number is already supported for now?

@revans2
Copy link
Collaborator Author

revans2 commented Jan 21, 2021

@sperlingxx you are right row_number is done. I missed it because it really was a test only change.

@revans2
Copy link
Collaborator Author

revans2 commented Jan 21, 2021

@sameerz at this point we have everything done for this feature except collect_list. Do we want to keep this open for it, or should we split it off into a separate issue and close this?

@sameerz
Copy link
Collaborator

sameerz commented Jan 21, 2021

I am hoping we will see a collect_list PR in cudf soon, so let's keep this open. If it does not get raised in the next few days we can close this and open a new issue for collect_list.

@sameerz
Copy link
Collaborator

sameerz commented Jan 30, 2021

collect_list was merged in rapidsai/cudf#7189

@sameerz
Copy link
Collaborator

sameerz commented Feb 12, 2021

Follow up issue for supporting rank is in issue #1584 .

@sameerz sameerz closed this as completed Feb 12, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants