[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

jrhemstad · 2022-03-24T15:11:09Z

Is your feature request related to a problem? Please describe.

As @vyasr pointed out in #9666 (comment), row_equality_comparator and row_lexicographic_comparator are not commutative, i.e., op(i,j) !==> op(j,i). (Edit: Obviously < is not commutative, but it still has the same kind of problem because there is implicit meaning to the values that will cause failures if violated)

Both of these operators work on two tables, a lhs and rhs. For two ids i and j op(i,j) assumes i references into row i of lhs and j into rhs. Thus, op(i,j) is not the same as op(j,i).

For example, imagine doing a binary search for the rows of table B into table A

table_device_view A{...}, B{...};
row_lexicographic_comparator less_compare{A, B};
thrust::binary_search( counting_iterator{0}, counting_iterator{A.size()}
                      counting_iterator{0}, counting_iterator{B.size()}
                      less_compare);

This uses counting iterators in combination with the row_lexicographic_comparator to perform a binary search of row i from B into A.

Invoking less_compare(i,j) it implicitly assumes that i references table A (the lhs) and j references table B (the rhs). But there is absolutely zero guarantee from the binary_search algorithm that it would only use values from the first input range for the first argument and values from the second range for the second argument. In other words, less_compare(i,j) could be invoked where i indexes into B and j indexes into A. This would perform an incorrect comparison and silently produce incorrect results.

Concrete example of this in action with std::merge: https://godbolt.org/z/j5z5oG87j

Describe the solution you'd like

We should leverage strong typing to solve this problem.

In short, define a strong type for a lhs_index_t and rhs_index_t and provide two overloads of the row operators binary operator() that accept arguments of these types in swapped ordering. These strongly typed overloads can simply unwrap the strong type wrapper and call an internal overload that works on just size_t indices:

enum class lhs_index_t : int32_t{};
enum class rhs_index_t : int32_t{};

struct device_row_comparator{

   // Maps the strong type values to the correct order for the implicitly ordered size_t overload
   bool operator()(lhs_index_t l, rhs_index_t r){
       return this->operator()(static_cast<int32_t>(l), static_cast<int32_t>(r));
    }
   // Maps the strong type values to the correct order for the implicitly ordered size_t overload
    bool operator()(rhs_index_t r, lhs_index_t l){
       return this->operator()(static_cast<int32_t>(l), static_cast<int32_t>(r));
    }

private:
   // Assumes lhs_index references lhs, and likewise with rhs_index and rhs
   bool operator()(size_t lhs_index, size_t rhs_index){   }
   table_device_view lhs, rhs;
}

I don't believe a thrust::counting_iterator will work as-is with the lhs_index_t/rhs_index_t as defined above. We may need to provide a convenience wrapper for making counting iterators for these strong types, or making them actual structs with appropriate operator+ to make it work with counting_iterator by default.

Describe alternatives you've considered
There really isn't an alternative. We've only survived this far by relying on implementation defined behavior that algorithms aren't switching the order of arguments on us.

Additional Context

Using the strongly typed indices will require modifying algorithms to use these new strong types. This change should be made in conjunction with the updating to the new comparators added in #10164.

This issue will block updating an algorithm that expects to perform comparisons between two different tables, e.g., #9452.

The text was updated successfully, but these errors were encountered:

jrhemstad · 2022-03-24T15:46:12Z

I suggest alongside implementing this change we update an existing algorithm to use the new strongly typed logic.

cudf::merge would be an excellent candidate as it currently encodes the "side" information as runtime values:

cudf/cpp/src/merge/merge.cu

Lines 143 to 180 in 57ff6f5

    
           struct side_index_generator { 
        
             side _side; 
        
             __device__ index_type operator()(size_type i) const noexcept { return index_type{_side, i}; } 
        
           }; 
        
           /** 
        
            * @brief Generates the row indices and source side (left or right) in accordance with the index 
        
            * columns. 
        
            * 
        
            * 
        
            * @tparam index_type Indicates the type to be used to collect index and side information; 
        
            * @param[in] left_table The left table_view to be merged 
        
            * @param[in] right_table The right table_view to be merged 
        
            * @param[in] column_order Sort order types of index columns 
        
            * @param[in] null_precedence Array indicating the order of nulls with respect to non-nulls for the 
        
            * index columns 
        
            * @param[in] nullable Flag indicating if at least one of the table_view arguments has nulls 
        
            * (defaults to true) 
        
            * @param[in] stream CUDA stream used for device memory operations and kernel launches. 
        
            * 
        
            * @return A device_uvector of merged indices 
        
            */ 
        
           index_vector generate_merged_indices(table_view const& left_table, 
        
                                                table_view const& right_table, 
        
                                                std::vector<order> const& column_order, 
        
                                                std::vector<null_order> const& null_precedence, 
        
                                                bool nullable                = true, 
        
                                                rmm::cuda_stream_view stream = rmm::cuda_stream_default) 
        
           { 
        
             const size_type left_size  = left_table.num_rows(); 
        
             const size_type right_size = right_table.num_rows(); 
        
             const size_type total_size = left_size + right_size; 
        
             auto left_gen    = side_index_generator{side::LEFT}; 
        
             auto right_gen   = side_index_generator{side::RIGHT}; 
        
             auto left_begin  = cudf::detail::make_counting_transform_iterator(0, left_gen); 
        
             auto right_begin = cudf::detail::make_counting_transform_iterator(0, right_gen);

Using strong types for the input indices would be a good way to simplify the logic.

jrhemstad · 2022-03-24T15:58:44Z

cudf::search would be another good exemplar algorithm.

vyasr · 2022-03-24T23:56:16Z

Thanks for writing this issue up! I think I've left inline comments about this in a couple of places and we've discussed possible solutions before, but I hadn't gotten around to writing up a full proposal.

It may also be worthwhile to test with at least one algorithm using cuco. The changes associated with #10401 will likely run into these kinds noncommutativity bugs.

jrhemstad · 2022-04-14T14:49:20Z

Hey, turns out we had an issue for this a long time ago! #3257

I knew it sounded familiar :)

github-actions · 2022-05-14T15:03:09Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This PR resolves #10508. It introduces two-table lexicographic row comparators with strongly typed index types. Given tables `lhs` and `rhs`, the `two_table_comparator` can create a device comparator whose strongly typed call operator can compare bidirectionally: `lhs[i] < rhs[j]` and `rhs[i] < lhs[j]`. The strong typing indicates which index belongs to which table. This PR also contains a sample implementation in `search_ordered.cu`, which implements `lower_bound` and `upper_bound` algorithms. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Jake Hemstad (https://github.com/jrhemstad) URL: #10730

jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. tech debt labels Mar 24, 2022

jrhemstad assigned bdice Mar 24, 2022

GregoryKimball mentioned this issue Apr 9, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

This was referenced Apr 14, 2022

[FEA] Add left and right index iterators for row_operators #3257

Closed

Support equally comparing two different tables in the newly experimental row comparator #10667

Closed

This was referenced Apr 15, 2022

Support nested types in lists::contains #10548

Merged

Fully support nested types in cudf::contains #10656

Merged

bdice mentioned this issue Apr 25, 2022

Two-table comparators with strong index types #10730

Merged

github-actions bot added the inactive-30d label May 14, 2022

sameerz mentioned this issue May 17, 2022

[FEA] collect_set on struct[Array] NVIDIA/spark-rapids#5508

Closed

rapids-bot bot closed this as completed in #10730 May 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

jrhemstad commented Mar 24, 2022 •

edited

Loading

jrhemstad commented Mar 24, 2022

jrhemstad commented Mar 24, 2022

vyasr commented Mar 24, 2022

jrhemstad commented Apr 14, 2022

github-actions bot commented May 14, 2022

[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

Comments

jrhemstad commented Mar 24, 2022 • edited Loading

jrhemstad commented Mar 24, 2022

jrhemstad commented Mar 24, 2022

vyasr commented Mar 24, 2022

jrhemstad commented Apr 14, 2022

github-actions bot commented May 14, 2022

jrhemstad commented Mar 24, 2022 •

edited

Loading