Remove Binary Dictionary Arithmetic Support #4407

Merged (2 commits) on Jun 30, 2023

@tustvold (Contributor) commented Jun 13, 2023

Which issue does this PR close?

Relates to #3999

Rationale for this change

As part of #3999 I'm trying to improve the consistency and correctness of the arithmetic kernels. However, I am repeatedly bashing my head against the dictionary support, and therefore wanted to float this idea to see what people think.

My major rationale is:

  • Currently, calling a kernel with a DictionaryArray and a scalar returns a DictionaryArray, whereas calling a kernel with two DictionaryArrays returns a PrimitiveArray. The latter feels strange to me
  • A huge amount of code complexity and code generation is needed to support this use-case
  • It is difficult to keep the arithmetic logic for PrimitiveArray values and DictionaryArray values consistent
  • We currently don't support operations between a PrimitiveArray and a DictionaryArray. Should we?
  • The performance of operating directly on the dictionaries versus casting first is broadly in the same ballpark, at ~10s of ns per row
  • I honestly don't really understand the use-case for a DictionaryArray of primitives: they will be significantly slower to process than the corresponding PrimitiveArray, by orders of magnitude in some cases, and will likely take up more memory (especially given #3837, "Concating dictionary array leads to duplicated dict values", and similar)

I think what would help me be less frustrated bashing my head against this would be some motivating use-case for this functionality. Currently I can't see a compelling reason to ever use a DictionaryArray of primitives for query computation; they're almost always just worse.
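For readers unfamiliar with the "cast first" alternative discussed above, here is a minimal std-only sketch of the idea. It does not use the arrow crate: `DictArray`, `materialize`, `add_primitive`, and `add_via_cast` are hypothetical names introduced purely for illustration, and nulls are ignored.

```rust
/// Hypothetical, simplified stand-in for a dictionary-encoded array of
/// primitives: `keys[i]` is an index into `values`. Nulls are not modelled.
#[derive(Debug, Clone, PartialEq)]
struct DictArray {
    keys: Vec<usize>,
    values: Vec<i32>,
}

impl DictArray {
    /// "Cast" to the primitive representation by looking up every key.
    fn materialize(&self) -> Vec<i32> {
        self.keys.iter().map(|&k| self.values[k]).collect()
    }
}

/// Element-wise wrapping add over two primitive slices of equal length.
fn add_primitive(a: &[i32], b: &[i32]) -> Vec<i32> {
    assert_eq!(a.len(), b.len(), "inputs must have the same length");
    a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect()
}

/// Add two dictionary arrays by materializing both sides first, so only
/// the primitive add needs to exist.
fn add_via_cast(a: &DictArray, b: &DictArray) -> Vec<i32> {
    add_primitive(&a.materialize(), &b.materialize())
}
```

With this shape, only one arithmetic implementation has to be kept correct, and dictionary inputs pay for a single lookup pass before the operation.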

Performance of arithmetic using this feature versus just casting first, measured using the benchmarks from #4405:

dict_add(0)             time:   [354.31 µs 354.55 µs 354.84 µs]
                        change: [-1.1077% -0.7157% -0.2919%] (p = 0.00 < 0.05)
                        Change within noise threshold.

dict_add_checked(0)     time:   [31.384 µs 31.392 µs 31.401 µs]
                        change: [-1.3918% -0.7952% -0.4529%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

dict_add_cast(0)        time:   [44.593 µs 44.622 µs 44.657 µs]
                        change: [-3.3883% -3.3001% -3.2035%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

dict_add_cast_checked(0)
                        time:   [44.130 µs 44.160 µs 44.192 µs]
                        change: [-1.6532% -1.2188% -0.8736%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

dict_add(0.1)           time:   [411.69 µs 411.94 µs 412.28 µs]
                        change: [-0.8818% -0.7008% -0.5335%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

dict_add_checked(0.1)   time:   [19.859 µs 19.872 µs 19.885 µs]
                        change: [+3.1645% +3.8146% +4.4437%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe

dict_add_cast(0.1)      time:   [67.510 µs 67.682 µs 67.866 µs]
                        change: [-32.706% -32.439% -32.166%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low severe
  9 (9.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

dict_add_cast_checked(0.1)
                        time:   [78.234 µs 78.265 µs 78.299 µs]
                        change: [-1.1505% -1.0897% -1.0254%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

dict_add(0.5)           time:   [687.92 µs 688.56 µs 689.45 µs]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

dict_add_checked(0.5)   time:   [72.906 µs 72.921 µs 72.939 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

dict_add_cast(0.5)      time:   [68.336 µs 68.367 µs 68.399 µs]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

dict_add_cast_checked(0.5)
                        time:   [126.81 µs 126.89 µs 126.97 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

dict_add(0.9)           time:   [498.11 µs 498.35 µs 498.60 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

dict_add_checked(0.9)   time:   [92.705 µs 95.673 µs 99.419 µs]

dict_add_cast(0.9)      time:   [69.080 µs 69.248 µs 69.383 µs]

dict_add_cast_checked(0.9)
                        time:   [171.86 µs 171.97 µs 172.08 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

dict_add(1)             time:   [370.66 µs 370.83 µs 371.02 µs]
Found 21 outliers among 100 measurements (21.00%)
  12 (12.00%) low severe
  9 (9.00%) high mild

dict_add_checked(1)     time:   [31.390 µs 31.402 µs 31.414 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

dict_add_cast(1)        time:   [43.996 µs 44.022 µs 44.048 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

dict_add_cast_checked(1)
                        time:   [45.406 µs 45.439 µs 45.476 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

What changes are included in this PR?

Are there any user-facing changes?

Yes, I suspect this will have downstream implications. Tagging @alamb @viirya @wjones127 @jhorstmann

@tustvold added the api-change label (Changes to the arrow API) Jun 13, 2023
@github-actions bot added the arrow label (Changes to the arrow crate) Jun 13, 2023

@jhorstmann (Contributor)

Agree that the performance benefit of specialized kernels is probably not worth the complexity and added code.

> Currently calling a kernel with a DictionaryArray and a scalar returns a DictionaryArray, however, calling a kernel with two DictionaryArray returns a PrimitiveArray, the latter feels strange to me

This kind of makes sense to me. For many operations involving scalars, the dictionary would still be unique afterwards, while an operation on two dictionaries would lead to a combinatorial explosion, so it is no longer beneficial to dictionary-encode the results. Operations like array * 0 would of course lead to all-duplicate values in the dictionary, so always returning a PrimitiveArray could be more consistent.
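The distinction above can be sketched with a small std-only model. Nothing here is arrow-rs API: `DictArray`, `mul_scalar`, `add_dicts`, and `distinct_count` are hypothetical helpers, and nulls are ignored.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a dictionary array of primitives (no nulls).
struct DictArray {
    keys: Vec<usize>,
    values: Vec<i64>,
}

/// Array-scalar multiply: only the dictionary values are rewritten and the
/// keys are reused, so the result is still dictionary-encoded.
fn mul_scalar(a: &DictArray, scalar: i64) -> DictArray {
    DictArray {
        keys: a.keys.clone(),
        values: a.values.iter().map(|v| v.wrapping_mul(scalar)).collect(),
    }
}

/// Array-array add: each row can pair a different combination of values,
/// so the result is returned as a plain (primitive) vector.
fn add_dicts(a: &DictArray, b: &DictArray) -> Vec<i64> {
    assert_eq!(a.keys.len(), b.keys.len(), "inputs must have the same length");
    a.keys
        .iter()
        .zip(&b.keys)
        .map(|(&ka, &kb)| a.values[ka].wrapping_add(b.values[kb]))
        .collect()
}

/// Number of distinct values in a slice.
fn distinct_count(values: &[i64]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}
```

In this model, multiplying a three-entry dictionary by 2 keeps three distinct entries, multiplying by 0 collapses them to one duplicated value, and adding two three-entry dictionaries can produce up to nine distinct results.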

In our engine we had a similar issue with string replace or concat operations, where we decided that such operations on two dictionary arrays would always result in a string array, but with dictionary array and literal string it would be beneficial to build a new dictionary.

I did not review the code in detail, maybe this is already happening, but could the dyn kernels automatically downcast/materialize dictionary arrays so that dictionary arrays are still supported as inputs?

@tustvold (Contributor, Author)

> Could the dyn kernels automatically downcast/materialize dictionary arrays so that dictionary arrays are still supported as inputs?

I think I would prefer that this be delegated to the query engine's type coercion machinery, to ensure it is visible to the planner and to avoid unnecessary casts back and forth. A PR to do this in DataFusion can be found at apache/datafusion#6785
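As a rough illustration of what "delegating to type coercion" could look like, the sketch below uses a toy expression tree in which the planner wraps dictionary-typed operands in an explicit cast. It is not DataFusion's or arrow-rs's API: `DataType`, `Expr`, `unpack_dictionary`, and `plan_add` are hypothetical and exist only for this example.

```rust
/// Toy type system: a primitive type and a dictionary wrapping a value type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Dictionary(Box<DataType>),
}

/// Toy logical expression tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, data_type: DataType },
    Cast { input: Box<Expr>, to: DataType },
    Add(Box<Expr>, Box<Expr>),
}

/// If the operand is a dictionary-typed column, wrap it in an explicit cast
/// to its value type; any other expression is returned unchanged.
fn unpack_dictionary(expr: Expr) -> Expr {
    match expr {
        Expr::Column { name, data_type: DataType::Dictionary(value_type) } => {
            let to = (*value_type).clone();
            let column = Expr::Column { name, data_type: DataType::Dictionary(value_type) };
            Expr::Cast { input: Box::new(column), to }
        }
        other => other,
    }
}

/// Plan `left + right`, making any dictionary unpacking visible in the plan.
fn plan_add(left: Expr, right: Expr) -> Expr {
    Expr::Add(
        Box::new(unpack_dictionary(left)),
        Box::new(unpack_dictionary(right)),
    )
}
```

Because the cast appears as a node in the plan, an optimizer could in principle see it and avoid casting the same column back and forth, which is the visibility argument made above.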

@tustvold marked this pull request as ready for review June 28, 2023 09:40

@alamb (Contributor) left a comment

Maybe we could soften the blow of removing this functionality by adding some documentation somewhere that explains how to compute on dictionaries (namely, cast them to the underlying type).

Perhaps some overview discussion on https://docs.rs/arrow/latest/arrow/compute/kernels/index.html or https://docs.rs/arrow/latest/arrow/compute/kernels/arithmetic/index.html?

Or perhaps on the kernels themselves like https://docs.rs/arrow/latest/arrow/compute/kernels/arithmetic/fn.add_dyn.html 🤔

@alamb (Contributor) commented Jun 29, 2023

I think @viirya should also have a chance to review / comment on this prior to merge.

The justification, as I understand it, is that the primitive kernels are much faster anyway for this kind of operation, so including native dictionary creation is slower, requires more code, and is harder to work with.

@viirya (Member) commented Jun 29, 2023

I will find some time to look at this soon, if you are not in a hurry to merge it.

@viirya (Member) left a comment

As a kernel-level change, this looks good to me, as the performance numbers don't show a regression.

I'm also thinking about the question around dictionaries of primitives, but that is not an issue for this change: even if they still have advantages in other respects, for arithmetic operations we can still cast them to primitives without a performance regression.

@tustvold merged commit a11b975 into apache:master Jun 30, 2023
4 participants