Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: is_unique() for IndexedArray #1429

Merged
merged 3 commits into from
Apr 28, 2022
Merged

Conversation

ioanaif
Copy link
Collaborator

@ioanaif ioanaif commented Apr 19, 2022

No description provided.

@ioanaif
Copy link
Collaborator Author

ioanaif commented Apr 19, 2022

This PR aims to fix the following failling test:

  array = ak._v2.highlevel.Array(["1chchc", "1chchc", "2sss", "3", "4", "5"])
  categorical = ak._v2.behaviors.categorical.to_categorical(array)
  assert categorical.layout.is_unique() is False

The error comes from the awkward_unique_copy kernel in IndexedArray. This kernel removes duplicates of indexes if present which makes this assert to return True.

In the previous approach, the IndexedArray_ranges_carry_next_64 kernel was selecting indexes based on a start and stop, without removing duplicates.

My fix of the kernel, although it passes the tests, it basically renders the kernel useless as if no range is selected, then it's just a copy of the existing index array.

@ianna Could please have a look as well

@codecov
Copy link

codecov bot commented Apr 19, 2022

Codecov Report

Merging #1429 (45ec4f7) into main (edfce38) will increase coverage by 0.88%.
The diff coverage is 62.44%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/jax/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_from_cupy.py 50.00% <0.00%> (+23.33%) ⬆️
src/awkward/_v2/operations/convert/ak_from_iter.py 93.75% <ø> (ø)
src/awkward/_v2/operations/convert/ak_from_jax.py 50.00% <0.00%> (-25.00%) ⬇️
src/awkward/_v2/operations/convert/ak_to_cupy.py 33.33% <0.00%> (+23.95%) ⬆️
src/awkward/_v2/operations/convert/ak_to_jax.py 33.33% <0.00%> (-41.67%) ⬇️
src/awkward/_v2/operations/describe/ak_backend.py 10.00% <0.00%> (-2.50%) ⬇️
...wkward/_v2/operations/structure/ak_argcartesian.py 78.94% <ø> (ø)
...c/awkward/_v2/operations/structure/ak_cartesian.py 89.68% <ø> (ø)
... and 90 more

@ioanaif ioanaif requested a review from ianna April 20, 2022 09:22
@jpivarski
Copy link
Member

If @ianna signs off on this, then I do too. It does change awkward_unique_copy.cpp and I don't know what the consequences of that are, but @ianna would.

Oh, and if you change the kernel definition, change kernel-specificaiton.yml, too.

@ianna
Copy link
Collaborator

ianna commented Apr 23, 2022

@ioanaif - the purpose of the kernel is to return unique values. The length before and after -if eval - would be an indicator that the values are unique.

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique:
[2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

I'm not sure how an IndexedArray uniqueness should be defined: either the indices should be unique or the content should be unique.

@ioanaif
Copy link
Collaborator Author

ioanaif commented Apr 25, 2022

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique: [2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

Ah, indeed. I changed back the kernel and added the check of indexes length comparison (returns False if the indexes are not unique)

@ianna
Copy link
Collaborator

ianna commented Apr 25, 2022

I think, the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique: [2,6,3,9,4,2] they are firstly, sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.

Ah, indeed. I changed back the kernel and added the check of indexes length comparison (returns False if the indexes are not unique)

I think, the indices do not need to be unique. The content does. Say, an indexed array can have duplicate indices [0,0,0,1,2,3] and be unique if its content is unique: [0.1,1.1,2.1,3.1]

@jpivarski
Copy link
Member

@ioanaif and @ianna, let me know when this is settled so I can review it. (I can't tell from the discussion.) Thanks!

@ianna
Copy link
Collaborator

ianna commented Apr 27, 2022

@ioanaif and @jpivarski - I think, check on is_unique for the following IndexedArray should be True, but I'm fine with the PR if this is expected behaviour.

>>> content = ak._v2.contents.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> content
<NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray>
>>> index = ak._v2.index.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak._v2.contents.IndexedArray(index, content)
>>> iarr
<IndexedArray len='8'>
    <index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
    <content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>
>>> harr = ak._v2.Array(iarr)
>>> harr
<Array [1.1, 1.1, 1.1, 1.1, 1.1, 3.3, 3.3, 3.3] type='8 * float64'>
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray len='8'>
    <index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
    <content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>

@ioanaif
Copy link
Collaborator Author

ioanaif commented Apr 27, 2022

It follows the same behaviour as v1, thus maybe @jpivarski can weigh in

>>> content = ak.layout.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> index = ak.layout.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak.layout.IndexedArray64(index, content)
>>> iarr
<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
>>> harr = ak.Array(iarr)
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
>>> 

@ianna
Copy link
Collaborator

ianna commented Apr 27, 2022

It follows the same behaviour as v1

yes, I worry it was wrong there...

@jpivarski
Copy link
Member

The layout methods, such as layout.is_unique(), are not high-level functions: they do whatever we need to do to support the high-level functions, which have large docstrings saying what they're supposed to do. Which high-level function(s) is layout.is_unique() used in? That's what will determine if this behavior is right or wrong.

@ioanaif
Copy link
Collaborator Author

ioanaif commented Apr 27, 2022

Which high-level function(s) is layout.is_unique() used in? That's what will determine if this behavior is right or wrong.

ak.validity_error -> layout.validityerror_parameters() -> layout.is_unique() : array = "categorical" requires contents to be unique

@jpivarski
Copy link
Member

If that's the only one, then it should do what's appropriate for checking the categorical data. Categorical data is an IndexedArray (or maybe IndexedOptionArray) in which the index is allowed to have a lot of duplicates, but the nested content is not.

layout.validityerror needs a way of saying that the following is okay

<IndexedArray64>
    <index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
    <content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>

because the NumpyArray doesn't have any duplicated elements. If layout.is_unique() does exactly what its name sounds like it does and complains about the IndexedArray having many 0s and 2s in its index, then perhaps layout.valididtyerror should only be applying it to the IndexedArray's content.

If layout.is_unique() is somehow IndexedArray-aware and only looks inside of the IndexedArray's content, then perhaps its name should be changed because it's misleading.

I'd prefer the function to be simple, that layout.is_unique() just checks to see if there are any duplicated entries in layout, and have layout.validityerror be aware of the fact that if an IndexedArray/IndexedOptionArray is labeled categorical, only its content must pass is_unique().

@ioanaif ioanaif merged commit 73ea782 into main Apr 28, 2022
@ioanaif ioanaif deleted the ioanaif/fix-unique-index-IndexedArray branch April 28, 2022 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants