Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10402: [Rust] Refactor array equality #8541

Closed
wants to merge 4 commits into from
Closed

ARROW-10402: [Rust] Refactor array equality #8541

wants to merge 4 commits into from

Conversation

jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Oct 27, 2020

This is a major refactor of the equal.rs module.

The rational for this change is many fold:

  • currently array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomics, as the user must "guess" what to downcast for comparison. We can see this in the hacks around sort, take and concatenate kernel's tests, and some of the tests of the builders.
  • the code in array comparison is difficult to follow given the amount of calls that they perform around offsets.
  • The implementation currently indirectly uses many of the unsafe APIs that we have (via pointer aritmetics), which makes it risky to operate and mutate.
  • Some code is being repeated.

This PR:

  1. adds impl PartialEq for dyn Array, to allow Array comparison based on Array::data (main change)
  2. Makes array equality to only depend on ArrayData, i.e. it no longer depends on concrete array types (such as PrimitiveArray and related API) to perform comparisons.
  3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on range comparison
  4. Makes array equality be statically dispatched, via match datatype.
  5. DRY the code around array equality
  6. Fixes an error in equality of dictionary with equal values
  7. Added tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
  8. splits equal.rs in smaller, more manageable files.
  9. Removes ArrayListOps, since it it no longer needed
  10. Moves Json equality to its own module, for clarity.
  11. removes the need to have two functions per type to compare arrays.
  12. Adds the number of buffers and their respective width to datatypes from the specification. This was backported from ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8 #8401
  13. adds a benchmark for array equality

Note that this does not implement PartialEq for ArrayData, only dyn Array, as different data does not imply a different array (due to nullability). That implementation is being worked on #8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

Benchmark results
Previous HEAD position was 3dd3c69d7 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]                       
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]                             
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]                              
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]                                    
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible array1.equals(&array2) no longer works: use array1 == array2 instead, which is the idiomatic way of comparing structs and trait objects in rust.

@github-actions
Copy link

rust/arrow/src/array/equal/null.rs Show resolved Hide resolved
rust/arrow/src/array/data.rs Show resolved Hide resolved
rust/arrow/src/array/data.rs Show resolved Hide resolved
rust/arrow/src/array/equal/variable_size.rs Outdated Show resolved Hide resolved
@jorgecarleitao
Copy link
Member Author

I have simplified this code further to allow the compiler to optimize out some functions. The code is now 10-40% faster compared to current master. I updated the description accordingly.

lhs_start: usize,
rhs_start: usize,
len: usize,
) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that we can address later, is that if a struct slot is null, we should carry that nullness across to its children. This is actually the problem that @carols10cents and I encountered on both parquet roundtrips and integration tests.

The spec (http://arrow.apache.org/docs/format/Columnar.html#struct-layout) says:

While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.

I think we can address this in a separate PR, if we still find that struct_equal doesn't cover that edge case.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorgecarleitao

I'm happy that this PR's been updated for logical equality, and I'd propose we merge it in, then look at #8590 and #8200 to address edge-cases where the IPC and Parquet tests still fail equality tests.

@carols10cents @alamb @paddyhoran @jhorstmann what's your opinion?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I made it through this entire PR carefully, but I reviewed about 50% of it carefully and skimmed the rest -- what I did see had a logical and understandable structure I would feel comfortable supporting / maintaining, FWIW.

I agree with @jorgecarleitao that having a uniform definition of array equality will solve many of the challenges we have seen (especially with parquet <--> arrow round tripping)

Thus my opinion is that we should merge this.

@@ -112,10 +112,10 @@ pub trait Array: fmt::Debug + Send + Sync + ArrayEqual + JsonEqual {
/// // Make slice over the values [2, 3, 4]
/// let array_slice = array.slice(1, 3);
///
/// assert!(array_slice.equals(&Int32Array::from(vec![2, 3, 4])));
/// assert_eq!(array_slice.as_ref(), &Int32Array::from(vec![2, 3, 4]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:heart

@@ -531,7 +531,7 @@ impl<T: ArrowPrimitiveType> ArrayBuilder for PrimitiveBuilder<T> {

for i in 0..len {
// account for offset as `ArrayData` does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment may need to be updated as the code seems to no longer account for offset

assert_eq!(finished.len(), expected.len());
assert_eq!(finished.null_count(), expected.null_count());
assert!(finished.equals(&(*expected)));
assert_eq!(finished, expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is certainly a lot nicer

@@ -3999,7 +3996,7 @@ mod tests {
let expected_list =
FixedSizeListArray::from(Arc::new(expected_list_data) as ArrayDataRef);
let expected_list = FixedSizeBinaryArray::from(expected_list);
// assert!(expected_list.values().equals(&*finished.values()));
// assert_eq!(expected_list.values(), finished.values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that you did it, but I wonder why this code is commented out -- maybe your changes will have fixed it

pub(super) fn buffer<T: ArrowNativeType>(&self, buffer: usize) -> &[T] {
let values = unsafe { self.buffers[buffer].data().align_to::<T>() };
if values.0.len() != 0 || values.2.len() != 0 {
panic!("The buffer is not byte-aligned with its interpretation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the panic

new_data
}

/// Returns the buffer `buffer` as a slice of type `T`. The slice is already offset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the buffer `buffer` as a slice of type `T`. The slice is already offset.
/// Returns self.buffers at index `buffer` as a slice of type `T` starting at self.offset

}

#[inline]
fn equal_values(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn equal_values(
/// Compares two arrays for equality starting at `lhs_start` and `rhs_start`
/// `lhs` and `rhs` *must* have the same data type
fn equal_values(

@@ -121,7 +120,7 @@ impl ArrayData {
/// Returns whether the element at index `i` is null
pub fn is_null(&self, i: usize) -> bool {
if let Some(ref b) = self.null_bitmap {
return !b.is_set(i);
return !b.is_set(self.offset + i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, I think I have another ticket that would be fixed by this change

@jorgecarleitao
Copy link
Member Author

@alamb , @nevi-me and @jhorstmann , thanks a lot for taking a close look into this, also for the amazing work yesterday and today on reviewing and merging stuff. Really impressive! 💯

I rebased this and resolved conflicts.

@nevi-me
Copy link
Contributor

nevi-me commented Nov 7, 2020

@alamb , @nevi-me and @jhorstmann , thanks a lot for taking a close look into this, also for the amazing work yesterday and today on reviewing and merging stuff. Really impressive! 💯

I rebased this and resolved conflicts.

I'm super-excited, because this likely unlocks us on the failures that I've been getting on #8200.

@nevi-me nevi-me closed this in a04a15a Nov 7, 2020
@jorgecarleitao jorgecarleitao deleted the equal branch December 4, 2020 07:40
@jorgecarleitao jorgecarleitao restored the equal branch December 4, 2020 07:41
@jorgecarleitao jorgecarleitao deleted the equal branch December 14, 2020 07:35
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This is a major refactor of the `equal.rs` module.

The rational for this change is many fold:

* currently array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomics, as the user must "guess" what to downcast for comparison. We can see this in the hacks around `sort`, `take` and `concatenate` kernel's tests, and some of the tests of the builders.
* the code in array comparison is difficult to follow given the amount of calls that they perform around offsets.
* The implementation currently indirectly uses many of the `unsafe` APIs that we have (via pointer aritmetics), which makes it risky to operate and mutate.
* Some code is being repeated.

This PR:

1. adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality to only depend on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality be statically dispatched, via `match datatype`.
5. DRY the code around array equality
6. Fixes an error in equality of dictionary with equal values
7. Added tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. splits `equal.rs` in smaller, more manageable files.
9. Removes `ArrayListOps`, since it it no longer needed
10. Moves Json equality to its own module, for clarity.
11. removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective width to datatypes from the specification. This was backported from apache#8401
13. adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on apache#8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.

This also improves performance by 10-40%.

<details>
 <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible `array1.equals(&array2)` no longer works: use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in rust.

Closes apache#8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants