[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462
Comments
@revans2 I agree something is off here. The only difference between inclusive and exclusive scan should be that the output of the latter is shifted by one relative to the former, with a zero inserted at the start. Looking at the C++ code, the logic that generates the output null mask differs between the two paths.
Inclusive: cudf/cpp/src/reductions/scan/scan_inclusive.cu lines 159 to 163 in 90e29d9
Exclusive: cudf/cpp/src/reductions/scan/scan_exclusive.cu lines 94 to 96 in 90e29d9
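To make the expected relationship concrete, here is a minimal Python sketch (not cudf code; the function names are hypothetical) of sum scans over non-null data. The exclusive output should be the inclusive output shifted right by one, with the identity (0 for addition) placed at the front:

```python
def inclusive_sum(xs):
    """Inclusive sum scan: out[i] = xs[0] + ... + xs[i]."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def exclusive_sum(xs):
    """Exclusive sum scan: out[i] = xs[0] + ... + xs[i-1], with out[0] = 0."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

vals = [1, 2, 3, 5, 8]
inc = inclusive_sum(vals)  # [1, 3, 6, 11, 19]
exc = exclusive_sum(vals)  # [0, 1, 3, 6, 11]
# exclusive == inclusive shifted right by one, identity inserted at the start
assert exc == [0] + inc[:-1]
```

Under this definition there is no room for the two scans to treat null masks differently: the same mask logic should apply to both, shifted along with the values.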
As for naming and documentation, Pandas is slightly clearer in its intent with its parameter naming. Either way, I think this is a bug in the implementation, not the documentation. Note that the scan implementation was recently split into multiple files by @davidwendt, but that PR did not change this logic.
Interesting — the tests of this functionality do not exercise this case: cudf/cpp/tests/reductions/scan_tests.cpp lines 337 to 384 in 90e29d9
On your note that the first value in EXCLUSIVE should be 0: I noticed that for PRODUCT the first value is 1 and not 0, for MAX it is the minimum value for the number type being processed, and for MIN it is the maximum value for the number type being processed. Not sure if that is against the requirements or not either.
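Those first values are consistent with the first element of an exclusive scan being the identity element of the operation (a value `e` such that `op(e, x) == x`). A small Python sketch (hypothetical `exclusive_scan` helper, not cudf code; `±inf` stands in for the type's minimum/maximum):

```python
import math
import operator

def exclusive_scan(xs, op, identity):
    """Generic exclusive scan: out[0] is the identity of op."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

vals = [2, 3, 4]
# SUM: identity is 0
assert exclusive_scan(vals, operator.add, 0) == [0, 2, 5]
# PRODUCT: identity is 1
assert exclusive_scan(vals, operator.mul, 1) == [1, 2, 6]
# MAX: identity is the lowest representable value (-inf here)
assert exclusive_scan(vals, max, -math.inf) == [-math.inf, 2, 3]
# MIN: identity is the highest representable value (+inf here)
assert exclusive_scan(vals, min, math.inf) == [math.inf, 2, 2]
```

So 1 for PRODUCT, type-minimum for MAX, and type-maximum for MIN are the expected starting values, not a deviation from the requirements.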
I am fine with that. Part of what threw me was that the documentation did not include examples, as a lot of others do, which would have clarified any ambiguity in the English. Then, when I tried to come up with my own examples to see how it works, I ran into the inconsistency in the implementation, which threw me off further. I want to add that from the Spark side of things this is not something we care about. It is just a bug I found while evaluating scan for use with running window operations. If we do decide to go that route, there will be a separate feature request for it.
Yes, I meant that the first value has to be the identity (0 for addition). I think this bug exists because nobody is using exclusive scan yet. Inclusive scan is used to implement cuDF (Pandas)
Fixes #8462 by generalizing the existing `mask_inclusive_scan` used in `inclusive_scan` so it can be applied the same way in `exclusive_scan`. #8462 demonstrated holes in test coverage, so this PR also reorganizes the `scan_tests` test infrastructure to enable a single set of tests to be applied to all supported data types and parameters, correctly expecting exceptions in unsupported cases (e.g. no product scans on fixed-point, only min/max inclusive scans on strings).
Authors:
- Mark Harris (https://github.com/harrism)
Approvers:
- Devavret Makkar (https://github.com/devavret)
- MithunR (https://github.com/mythrocks)
URL: #8478
Describe the bug
`cudf::scan` contains documentation describing the operation, but it also documents the `null_handling` parameter, and the two appear to be inconsistent with each other. If I exclude nulls and there are nulls in my data, would that row show up as a null, or would it be excluded? If I include nulls, does that mean that all values after the null are null, because "any operation with a null results in a null"? Or does it mean that a null input row results in a null output row and "the null values are skipped"? But if they are skipped, then how is `INCLUDE` different from `EXCLUDE`? So I ran some tests to find out, and things get even worse. For example:
Aggregation: `SUM`
Input: `INT32` `[1, 2, null, 3, 5, 8, 10]`

| Results | `null_policy::INCLUDE` | `null_policy::EXCLUDE` |
| --- | --- | --- |
| `scan_type::INCLUSIVE` | `[1, 3, null, null, null, null, null]` | `[1, 3, null, 6, 11, 19, 29]` |
| `scan_type::EXCLUSIVE` | `[0, 1, 3, 3, 6, 11, 19]` | `[0, 1, null, 3, 6, 11, 19]` |
For some reason, when nulls are included on an inclusive scan, everything after the first null results in a null, but on an exclusive scan nulls are treated as if they were the default value.
When you exclude nulls, at least the two implementations are consistent, and I guess follow the documentation, but I am not really sure what excluding a null means if nulls still show up in the output.
This pattern applies to all of the other scan aggregations that are supported.
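The two null behaviors observed above can be paraphrased in a small Python simulation (hypothetical helpers, not cudf code; `None` plays the role of null). `EXCLUDE` skips nulls in the running total but keeps the null rows in the output; `INCLUDE` makes the first null poison every subsequent output:

```python
def inclusive_sum_exclude_nulls(xs):
    """Simulates null_policy::EXCLUDE on an inclusive sum scan:
    null inputs stay null in the output but are skipped in the accumulation."""
    out, acc = [], 0
    for x in xs:
        if x is None:
            out.append(None)
        else:
            acc += x
            out.append(acc)
    return out

def inclusive_sum_include_nulls(xs):
    """Simulates the observed null_policy::INCLUDE behavior:
    any operation with a null yields a null, so the first null
    makes every subsequent output null."""
    out, acc, seen_null = [], 0, False
    for x in xs:
        if x is None or seen_null:
            seen_null = True
            out.append(None)
        else:
            acc += x
            out.append(acc)
    return out

vals = [1, 2, None, 3, 5, 8, 10]
assert inclusive_sum_exclude_nulls(vals) == [1, 3, None, 6, 11, 19, 29]
assert inclusive_sum_include_nulls(vals) == [1, 3, None, None, None, None, None]
```

The reported bug is that the exclusive-scan path does not apply either of these mask rules: under `INCLUDE` it treats nulls as the default value instead of propagating them.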