[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

revans2 · 2021-06-08T21:33:55Z

Describe the bug
cudf::scan contains the following documentation

 * The null values are skipped for the operation, and if an input element
 * at `i` is null, then the output element at `i` will also be null.

But it also documents the null_handling parameter as

 * @param[in] null_handling Exclude null values when computing the result if
 * null_policy::EXCLUDE. Include nulls if null_policy::INCLUDE.
 * Any operation with a null results in a null.

The two appear to be inconsistent with each other. If I exclude nulls and there are nulls in my data would that row show up as a null or would it be excluded? If I include nulls does that mean that all values after the null are null? because Any operation with a null results in a null. or does it mean that a null input row results in a null output row and The null values are skipped. But if they are skipped then how in INCLUDE different from EXCLUDE? So I ran some tests to find out, and things get even worse.

For example:

Aggregation: SUM
Input: INT32 [1, 2, null, 3, 5, 8, 10]

Results

	`null_policy::INCLUDE`	`null_policy::EXCLUDE`
`scan_type::INCLUSIVE`	`[1, 3, null, null, null, null, null]`	`[1, 3, null, 6, 11, 19, 29]`
`scan_type::EXCLUSIVE`	`[0, 1, 3, 3, 6, 11, 19]`	`[0, 1, null, 3, 6, 11, 19]`

For some reason when null are included on an inclusive scan then everything after the first null results in a null, but on an exclusive scan nulls are treated as if they were the default value.

When you exclude nulls at least the two implementations are consistent, and I guess follows the documentation, but I am not really sure what excluding a null means if nulls still show up in the output.

This pattern applies for all of the other scan aggregations that are supported

The text was updated successfully, but these errors were encountered:

harrism · 2021-06-09T01:26:25Z

@revans2 I agree something is off here. The only difference between inclusive and exclusive scan should be that the output of the latter is shifted by one relative to the former, and a zero is inserted at the start. Looking at the C++ code, the logic that generates the output null mask is different between the two paths.

Inclusive:

cudf/cpp/src/reductions/scan/scan_inclusive.cu

Lines 159 to 163 in 90e29d9

    
           if (null_handling == null_policy::EXCLUDE) { 
        
             output->set_null_mask(detail::copy_bitmask(input, stream, mr), input.null_count()); 
        
           } else if (input.nullable()) { 
        
             output->set_null_mask(mask_inclusive_scan(input, stream, mr), cudf::UNKNOWN_NULL_COUNT); 
        
           }

Exclusive:

cudf/cpp/src/reductions/scan/scan_exclusive.cu

Lines 94 to 96 in 90e29d9

    
           if (null_handling == null_policy::EXCLUDE) { 
        
             output->set_null_mask(detail::copy_bitmask(input, stream, mr), input.null_count()); 
        
           }

As for naming and documentation, Pandas is slightly clearer in its intent: the parameter skipNA defaults to true, and is the same as null_policy::EXCLUDE. This means that while it propagates the input nulls to the output, nulls are not aggregated in the scan so they don't propagate. Setting it to false is the same as null_policy::INCLUDE, which means nulls are aggregated, and therefore they propagate.

Either way, I think this is a bug in implementation, not documentation.

CC @karthikeyann

Note the scan implementation was recently split into multiple files by @davidwendt , but that PR did not change this logic.

harrism · 2021-06-09T01:34:35Z

Interesting, the test of this functionality does not exercise scan_type::EXCLUSIVE.

cudf/cpp/tests/reductions/scan_tests.cpp

Lines 337 to 384 in 90e29d9

    
           TYPED_TEST(ScanTest, skip_nulls) 
        
           { 
        
             bool do_print = false; 
        
             auto const v  = cudf::test::make_type_param_vector<TypeParam>({1, 2, 3, 4, 5, 6, 7, 8, 1, 1}); 
        
             auto const b  = std::vector<bool>{1, 1, 1, 1, 1, 0, 1, 0, 1, 1}; 
        
             cudf::test::fixed_width_column_wrapper<TypeParam> const col_in(v.begin(), v.end(), b.begin()); 
        
             const column_view input_view = col_in; 
        
             std::unique_ptr<cudf::column> col_out; 
        
             // test output calculation 
        
             std::vector<TypeParam> out_v(input_view.size()); 
        
             std::vector<bool> out_b(input_view.size()); 
        
             zip_scan(v.cbegin(), 
        
                      v.cend(), 
        
                      b.cbegin(), 
        
                      out_v.begin(), 
        
                      value_or<TypeParam>{0}, 
        
                      std::plus<TypeParam>{}); 
        
             std::partial_sum(b.cbegin(), b.cend(), out_b.begin(), std::logical_and<bool>{}); 
        
             // skipna=true (default) 
        
             CUDF_EXPECT_NO_THROW( 
        
               col_out = cudf::scan( 
        
                 input_view, cudf::make_sum_aggregation(), scan_type::INCLUSIVE, null_policy::EXCLUDE)); 
        
             cudf::test::fixed_width_column_wrapper<TypeParam> expected_col_out1( 
        
               out_v.begin(), out_v.end(), b.cbegin()); 
        
             CUDF_TEST_EXPECT_COLUMN_PROPERTIES_EQUAL(expected_col_out1, col_out->view()); 
        
             CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_col_out1, col_out->view()); 
        
             if (do_print) { 
        
               print_view(expected_col_out1, "expect = "); 
        
               print_view(col_out->view(), "result = "); 
        
             } 
        
             // skipna=false 
        
             CUDF_EXPECT_NO_THROW( 
        
               col_out = cudf::scan( 
        
                 input_view, cudf::make_sum_aggregation(), scan_type::INCLUSIVE, null_policy::INCLUDE)); 
        
             cudf::test::fixed_width_column_wrapper<TypeParam> expected_col_out2( 
        
               out_v.begin(), out_v.end(), out_b.begin()); 
        
             if (do_print) { 
        
               print_view(expected_col_out2, "expect = "); 
        
               print_view(col_out->view(), "result = "); 
        
             } 
        
             CUDF_TEST_EXPECT_COLUMN_PROPERTIES_EQUAL(expected_col_out2, col_out->view()); 
        
             CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_col_out2, col_out->view()); 
        
           }

revans2 · 2021-06-09T11:29:52Z

On your note that the first value in EXCLUSIVE should be 0, I noticed that for PRODUCT the first value is 1 and not 0, for MAX it in the minimum value for the number type being processed, and for MIN it is the MAXIMUM value for the number type being processed. Not sure if that is against the requirements or not either.

revans2 · 2021-06-09T11:34:50Z

Either way, I think this is a bug in implementation, not documentation.

I am fine with that. I think part of what threw me was that the documentation did not include examples, like a lot of others do, which would have clarified any ambiguity in the English. Then when I tried to come up with my own "examples" to see how it works I ran into the inconsistency in the implementation which threw me off further.

I want to add that from the Spark side of things this is not something that we care about. It is just a bug that I found while evaluating scan for use with running window operations. If we do decide to go that route there will be a separate feature request for that.

harrism · 2021-06-09T23:05:14Z

Yes, I meant that the first value has to be the identity (0 for addition).

I think this bug exists because nobody is using exclusive scan yet. Inclusive scan is used to implement cuDF (Pandas) cumsum. Either way, the test coverage is poor (exclusive scan is not tested), and you are right about examples, we can add some.

Fixes #8462 by generalizing the existing `mask_inclusive_scan` used in `inclusive_scan` so it can be applied the same way in `exclusive_scan`. #8462 demonstrated holes in test coverage, so this PR also reorganizes `scan_tests` test infrastructure to enable a single set of tests to be applied to all supported data types and parameters, correctly expecting exceptions in unsupported cases (e.g. no product scans on fixed-point, only min/max inclusive scans on strings). Authors: - Mark Harris (https://github.com/harrism) Approvers: - Devavret Makkar (https://github.com/devavret) - MithunR (https://github.com/mythrocks) URL: #8478

revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Jun 8, 2021

revans2 mentioned this issue Jun 8, 2021

[FEA] support for running window aggregations #8440

Closed

harrism added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 9, 2021

harrism changed the title ~~[BUG] scan null_policy is either documented wrong or is prodcuing incorrect results~~ [BUG] scan null_policy is either documented wrong or is producing incorrect results Jun 9, 2021

harrism self-assigned this Jun 9, 2021

harrism mentioned this issue Jun 10, 2021

Fix exclusive scan when including nulls and improve testing #8478

Merged

rapids-bot bot closed this as completed in #8478 Jun 22, 2021

revans2 mentioned this issue Jul 12, 2021

Support for groupby/scan rank and dense_rank aggregations #8652

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

revans2 commented Jun 8, 2021

harrism commented Jun 9, 2021

harrism commented Jun 9, 2021

revans2 commented Jun 9, 2021

revans2 commented Jun 9, 2021

harrism commented Jun 9, 2021

[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

[BUG] scan null_policy is either documented wrong or is producing incorrect results #8462

Comments

revans2 commented Jun 8, 2021

harrism commented Jun 9, 2021

harrism commented Jun 9, 2021

revans2 commented Jun 9, 2021

revans2 commented Jun 9, 2021

harrism commented Jun 9, 2021