Fix: is_unique() for IndexedArray #1429

ioanaif · 2022-04-19T18:01:55Z

No description provided.

ioanaif · 2022-04-19T18:08:08Z

This PR aims to fix the following failling test:

  array = ak._v2.highlevel.Array(["1chchc", "1chchc", "2sss", "3", "4", "5"])
  categorical = ak._v2.behaviors.categorical.to_categorical(array)
  assert categorical.layout.is_unique() is False

The error comes from the awkward_unique_copy kernel in IndexedArray. This kernel removes duplicates of indexes if present which makes this assert to return True.

In the previous approach, the IndexedArray_ranges_carry_next_64 kernel was selecting indexes based on a start and stop, without removing duplicates.

My fix of the kernel, although it passes the tests, it basically renders the kernel useless as if no range is selected, then it's just a copy of the existing index array.

@ianna Could please have a look as well

codecov · 2022-04-19T18:54:06Z

Codecov Report

Merging #1429 (45ec4f7) into main (edfce38) will increase coverage by 0.88%.
The diff coverage is 62.44%.

Impacted Files	Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/_connect/jax/__init__.py	`0.00% <0.00%> (ø)`
src/awkward/_v2/operations/convert/ak_from_cupy.py	`50.00% <0.00%> (+23.33%)`	⬆️
src/awkward/_v2/operations/convert/ak_from_iter.py	`93.75% <ø> (ø)`
src/awkward/_v2/operations/convert/ak_from_jax.py	`50.00% <0.00%> (-25.00%)`	⬇️
src/awkward/_v2/operations/convert/ak_to_cupy.py	`33.33% <0.00%> (+23.95%)`	⬆️
src/awkward/_v2/operations/convert/ak_to_jax.py	`33.33% <0.00%> (-41.67%)`	⬇️
src/awkward/_v2/operations/describe/ak_backend.py	`10.00% <0.00%> (-2.50%)`	⬇️
...wkward/_v2/operations/structure/ak_argcartesian.py	`78.94% <ø> (ø)`
...c/awkward/_v2/operations/structure/ak_cartesian.py	`89.68% <ø> (ø)`
... and 90 more

jpivarski · 2022-04-21T13:08:44Z

If @ianna signs off on this, then I do too. It does change awkward_unique_copy.cpp and I don't know what the consequences of that are, but @ianna would.

Oh, and if you change the kernel definition, change kernel-specificaiton.yml, too.

ianna · 2022-04-23T21:31:21Z

@ioanaif - the purpose of the kernel is to return unique values. The length before and after -if eval - would be an indicator that the values are unique.

ianna

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique:
[2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

I'm not sure how an IndexedArray uniqueness should be defined: either the indices should be unique or the content should be unique.

ioanaif · 2022-04-25T10:12:57Z

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique: [2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

Ah, indeed. I changed back the kernel and added the check of indexes length comparison (returns False if the indexes are not unique)

ianna · 2022-04-25T16:10:21Z

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique: [2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

Ah, indeed. I changed back the kernel and added the check of indexes length comparison (returns False if the indexes are not unique)

I think, the indices do not need to be unique. The content does. Say, an indexed array can have duplicate indices [0,0,0,1,2,3] and be unique if its content is unique: [0.1,1.1,2.1,3.1]

jpivarski · 2022-04-26T18:53:35Z

@ioanaif and @ianna, let me know when this is settled so I can review it. (I can't tell from the discussion.) Thanks!

ianna · 2022-04-27T07:46:08Z

@ioanaif and @jpivarski - I think, check on is_unique for the following IndexedArray should be True, but I'm fine with the PR if this is expected behaviour.

>>> content = ak._v2.contents.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> content
<NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray>
>>> index = ak._v2.index.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak._v2.contents.IndexedArray(index, content)
>>> iarr
<IndexedArray len='8'>
    <index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
    <content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>
>>> harr = ak._v2.Array(iarr)
>>> harr
<Array [1.1, 1.1, 1.1, 1.1, 1.1, 3.3, 3.3, 3.3] type='8 * float64'>
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray len='8'>
    <index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
    <content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>

ioanaif · 2022-04-27T08:51:12Z

It follows the same behaviour as v1, thus maybe @jpivarski can weigh in

>>> content = ak.layout.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> index = ak.layout.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak.layout.IndexedArray64(index, content)
>>> iarr
<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
>>> harr = ak.Array(iarr)
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
>>>

ianna · 2022-04-27T13:30:59Z

It follows the same behaviour as v1

yes, I worry it was wrong there...

jpivarski · 2022-04-27T16:31:05Z

The layout methods, such as layout.is_unique(), are not high-level functions: they do whatever we need to do to support the high-level functions, which have large docstrings saying what they're supposed to do. Which high-level function(s) is layout.is_unique() used in? That's what will determine if this behavior is right or wrong.

ioanaif · 2022-04-27T16:47:24Z

Which high-level function(s) is layout.is_unique() used in? That's what will determine if this behavior is right or wrong.

ak.validity_error -> layout.validityerror_parameters() -> layout.is_unique() : array = "categorical" requires contents to be unique

jpivarski · 2022-04-27T17:24:53Z

If that's the only one, then it should do what's appropriate for checking the categorical data. Categorical data is an IndexedArray (or maybe IndexedOptionArray) in which the index is allowed to have a lot of duplicates, but the nested content is not.

layout.validityerror needs a way of saying that the following is okay

<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>

because the NumpyArray doesn't have any duplicated elements. If layout.is_unique() does exactly what its name sounds like it does and complains about the IndexedArray having many 0s and 2s in its index, then perhaps layout.valididtyerror should only be applying it to the IndexedArray's content.

If layout.is_unique() is somehow IndexedArray-aware and only looks inside of the IndexedArray's content, then perhaps its name should be changed because it's misleading.

I'd prefer the function to be simple, that layout.is_unique() just checks to see if there are any duplicated entries in layout, and have layout.validityerror be aware of the fact that if an IndexedArray/IndexedOptionArray is labeled categorical, only its content must pass is_unique().

Initial pass on fixing .is_unique() for IndexedArray

9a0362e

ioanaif requested a review from ianna April 20, 2022 09:22

ianna reviewed Apr 23, 2022

View reviewed changes

Added lenght comparison for unique index

f4d2d58

Fix apply is_unique to IndexedArray content

45ec4f7

ioanaif merged commit 73ea782 into main Apr 28, 2022

ioanaif deleted the ioanaif/fix-unique-index-IndexedArray branch April 28, 2022 15:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: is_unique() for IndexedArray #1429

Fix: is_unique() for IndexedArray #1429

ioanaif commented Apr 19, 2022

ioanaif commented Apr 19, 2022

codecov bot commented Apr 19, 2022 •

edited

Loading

jpivarski commented Apr 21, 2022

ianna commented Apr 23, 2022

ianna left a comment

ioanaif commented Apr 25, 2022

ianna commented Apr 25, 2022

jpivarski commented Apr 26, 2022

ianna commented Apr 27, 2022

ioanaif commented Apr 27, 2022

ianna commented Apr 27, 2022

jpivarski commented Apr 27, 2022

ioanaif commented Apr 27, 2022

jpivarski commented Apr 27, 2022

Fix: is_unique() for IndexedArray #1429

Fix: is_unique() for IndexedArray #1429

Conversation

ioanaif commented Apr 19, 2022

ioanaif commented Apr 19, 2022

codecov bot commented Apr 19, 2022 • edited Loading

Codecov Report

jpivarski commented Apr 21, 2022

ianna commented Apr 23, 2022

ianna left a comment

Choose a reason for hiding this comment

ioanaif commented Apr 25, 2022

ianna commented Apr 25, 2022

jpivarski commented Apr 26, 2022

ianna commented Apr 27, 2022

ioanaif commented Apr 27, 2022

ianna commented Apr 27, 2022

jpivarski commented Apr 27, 2022

ioanaif commented Apr 27, 2022

jpivarski commented Apr 27, 2022

codecov bot commented Apr 19, 2022 •

edited

Loading