Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

Closed
revans2 opened this issue Jun 8, 2021 · 5 comments · Fixed by #8478
Closed
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@revans2
Copy link
Contributor

revans2 commented Jun 8, 2021

Describe the bug
cudf::scan contains the following documentation

 * The null values are skipped for the operation, and if an input element
 * at `i` is null, then the output element at `i` will also be null.

But it also documents the null_handling parameter as

 * @param[in] null_handling Exclude null values when computing the result if
 * null_policy::EXCLUDE. Include nulls if null_policy::INCLUDE.
 * Any operation with a null results in a null.

The two appear to be inconsistent with each other. If I exclude nulls and there are nulls in my data would that row show up as a null or would it be excluded? If I include nulls does that mean that all values after the null are null? because Any operation with a null results in a null. or does it mean that a null input row results in a null output row and The null values are skipped. But if they are skipped then how in INCLUDE different from EXCLUDE? So I ran some tests to find out, and things get even worse.

For example:

Aggregation: SUM
Input: INT32 [1, 2, null, 3, 5, 8, 10]

Results

null_policy::INCLUDE null_policy::EXCLUDE
scan_type::INCLUSIVE [1, 3, null, null, null, null, null] [1, 3, null, 6, 11, 19, 29]
scan_type::EXCLUSIVE [0, 1, 3, 3, 6, 11, 19] [0, 1, null, 3, 6, 11, 19]

For some reason when null are included on an inclusive scan then everything after the first null results in a null, but on an exclusive scan nulls are treated as if they were the default value.

When you exclude nulls at least the two implementations are consistent, and I guess follows the documentation, but I am not really sure what excluding a null means if nulls still show up in the output.

This pattern applies for all of the other scan aggregations that are supported

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Jun 8, 2021
@harrism
Copy link
Member

harrism commented Jun 9, 2021

@revans2 I agree something is off here. The only difference between inclusive and exclusive scan should be that the output of the latter is shifted by one relative to the former, and a zero is inserted at the start. Looking at the C++ code, the logic that generates the output null mask is different between the two paths.

Inclusive:

if (null_handling == null_policy::EXCLUDE) {
output->set_null_mask(detail::copy_bitmask(input, stream, mr), input.null_count());
} else if (input.nullable()) {
output->set_null_mask(mask_inclusive_scan(input, stream, mr), cudf::UNKNOWN_NULL_COUNT);
}

Exclusive:
if (null_handling == null_policy::EXCLUDE) {
output->set_null_mask(detail::copy_bitmask(input, stream, mr), input.null_count());
}

As for naming and documentation, Pandas is slightly clearer in its intent: the parameter skipNA defaults to true, and is the same as null_policy::EXCLUDE. This means that while it propagates the input nulls to the output, nulls are not aggregated in the scan so they don't propagate. Setting it to false is the same as null_policy::INCLUDE, which means nulls are aggregated, and therefore they propagate.

Either way, I think this is a bug in implementation, not documentation.

CC @karthikeyann

Note the scan implementation was recently split into multiple files by @davidwendt , but that PR did not change this logic.

@harrism harrism added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 9, 2021
@harrism
Copy link
Member

harrism commented Jun 9, 2021

Interesting, the test of this functionality does not exercise scan_type::EXCLUSIVE.

TYPED_TEST(ScanTest, skip_nulls)
{
bool do_print = false;
auto const v = cudf::test::make_type_param_vector<TypeParam>({1, 2, 3, 4, 5, 6, 7, 8, 1, 1});
auto const b = std::vector<bool>{1, 1, 1, 1, 1, 0, 1, 0, 1, 1};
cudf::test::fixed_width_column_wrapper<TypeParam> const col_in(v.begin(), v.end(), b.begin());
const column_view input_view = col_in;
std::unique_ptr<cudf::column> col_out;
// test output calculation
std::vector<TypeParam> out_v(input_view.size());
std::vector<bool> out_b(input_view.size());
zip_scan(v.cbegin(),
v.cend(),
b.cbegin(),
out_v.begin(),
value_or<TypeParam>{0},
std::plus<TypeParam>{});
std::partial_sum(b.cbegin(), b.cend(), out_b.begin(), std::logical_and<bool>{});
// skipna=true (default)
CUDF_EXPECT_NO_THROW(
col_out = cudf::scan(
input_view, cudf::make_sum_aggregation(), scan_type::INCLUSIVE, null_policy::EXCLUDE));
cudf::test::fixed_width_column_wrapper<TypeParam> expected_col_out1(
out_v.begin(), out_v.end(), b.cbegin());
CUDF_TEST_EXPECT_COLUMN_PROPERTIES_EQUAL(expected_col_out1, col_out->view());
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_col_out1, col_out->view());
if (do_print) {
print_view(expected_col_out1, "expect = ");
print_view(col_out->view(), "result = ");
}
// skipna=false
CUDF_EXPECT_NO_THROW(
col_out = cudf::scan(
input_view, cudf::make_sum_aggregation(), scan_type::INCLUSIVE, null_policy::INCLUDE));
cudf::test::fixed_width_column_wrapper<TypeParam> expected_col_out2(
out_v.begin(), out_v.end(), out_b.begin());
if (do_print) {
print_view(expected_col_out2, "expect = ");
print_view(col_out->view(), "result = ");
}
CUDF_TEST_EXPECT_COLUMN_PROPERTIES_EQUAL(expected_col_out2, col_out->view());
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_col_out2, col_out->view());
}

@harrism harrism changed the title [BUG] scan null_policy is either documented wrong or is prodcuing incorrect results [BUG] scan null_policy is either documented wrong or is producing incorrect results Jun 9, 2021
@harrism harrism self-assigned this Jun 9, 2021
@revans2
Copy link
Contributor Author

revans2 commented Jun 9, 2021

On your note that the first value in EXCLUSIVE should be 0, I noticed that for PRODUCT the first value is 1 and not 0, for MAX it in the minimum value for the number type being processed, and for MIN it is the MAXIMUM value for the number type being processed. Not sure if that is against the requirements or not either.

@revans2
Copy link
Contributor Author

revans2 commented Jun 9, 2021

Either way, I think this is a bug in implementation, not documentation.

I am fine with that. I think part of what threw me was that the documentation did not include examples, like a lot of others do, which would have clarified any ambiguity in the English. Then when I tried to come up with my own "examples" to see how it works I ran into the inconsistency in the implementation which threw me off further.

I want to add that from the Spark side of things this is not something that we care about. It is just a bug that I found while evaluating scan for use with running window operations. If we do decide to go that route there will be a separate feature request for that.

@harrism
Copy link
Member

harrism commented Jun 9, 2021

Yes, I meant that the first value has to be the identity (0 for addition).

I think this bug exists because nobody is using exclusive scan yet. Inclusive scan is used to implement cuDF (Pandas) cumsum. Either way, the test coverage is poor (exclusive scan is not tested), and you are right about examples, we can add some.

rapids-bot bot pushed a commit that referenced this issue Jun 22, 2021
Fixes #8462 by generalizing the existing `mask_inclusive_scan` used in `inclusive_scan` so it can be applied the same way in `exclusive_scan`. #8462 demonstrated holes in test coverage, so this PR also reorganizes `scan_tests` test infrastructure to enable a single set of tests to be applied to all supported data types and parameters, correctly expecting exceptions in unsupported cases (e.g. no product scans on fixed-point, only min/max inclusive scans on strings).

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - MithunR (https://github.com/mythrocks)

URL: #8478
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants