Fix: is_unique() for IndexedArray #1429
This PR aims to fix the following failing test:
array = ak._v2.highlevel.Array(["1chchc", "1chchc", "2sss", "3", "4", "5"])
categorical = ak._v2.behaviors.categorical.to_categorical(array)
assert categorical.layout.is_unique() is False
The error comes from the In the previous approach, the My fix of the kernel passes the tests, but it basically renders the kernel useless: if no range is selected, it is just a copy of the existing @ianna Could you please have a look as well?
@ioanaif - the purpose of the kernel is to return unique values. The lengths before and after, if equal, would be an indicator that the values are unique.
I think the tolength in line 22 should be considered as an indicator of the unique values. For example, to determine if the following data is unique: [2,6,3,9,4,2], they are first sorted: [2,2,3,4,6,9], then passed via awkward_unique_copy: [2,3,4,6,9]. Since the final length is not equal to the original one, the data is not unique.
I'm not sure how an IndexedArray uniqueness should be defined: either the indices should be unique or the content should be unique.
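The sort, deduplicate, and compare-lengths check described above can be sketched in plain Python. This is only an illustration: `unique_copy` and `is_unique` are hypothetical helpers that mimic what the comment says about awkward_unique_copy and tolength, not the actual kernel code.

```python
def unique_copy(sorted_values):
    """Drop adjacent duplicates from an already-sorted sequence.

    Hypothetical stand-in for the behaviour attributed to
    awkward_unique_copy in the comment above.
    """
    out = []
    for value in sorted_values:
        if not out or out[-1] != value:
            out.append(value)
    return out


def is_unique(values):
    """Unique if deduplicating the sorted data keeps the original length."""
    tolength = len(unique_copy(sorted(values)))
    return tolength == len(values)


data = [2, 6, 3, 9, 4, 2]
print(sorted(data))               # [2, 2, 3, 4, 6, 9]
print(unique_copy(sorted(data)))  # [2, 3, 4, 6, 9]
print(is_unique(data))            # False
```

The length after deduplication (5) differs from the original length (6), so the data is reported as not unique.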
Ah, indeed. I changed the kernel back and added a length comparison on the indexes (returns False if the indexes are not unique).
I think the indices do not need to be unique. The content does. Say, an indexed array can have duplicate indices.
@ioanaif and @jpivarski - I think, check on
>>> content = ak._v2.contents.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> content
<NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray>
>>> index = ak._v2.index.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak._v2.contents.IndexedArray(index, content)
>>> iarr
<IndexedArray len='8'>
<index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
<content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>
>>> harr = ak._v2.Array(iarr)
>>> harr
<Array [1.1, 1.1, 1.1, 1.1, 1.1, 3.3, 3.3, 3.3] type='8 * float64'>
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray len='8'>
<index><Index dtype='int64' len='8'>[0 0 0 0 0 2 2 2]</Index></index>
<content><NumpyArray dtype='float64' len='4'>[1.1 2.2 3.3 4.4]</NumpyArray></content>
</IndexedArray>
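To make the competing definitions concrete, here is a plain-Python model of the array above. It does not use Awkward Array; `project` and `has_no_duplicates` are hypothetical helpers, and the lists simply copy the index and content from the session.

```python
content = [1.1, 2.2, 3.3, 4.4]
index = [0, 0, 0, 0, 0, 2, 2, 2]


def project(index, content):
    """Values seen through the index, as the high-level array shows them."""
    return [content[i] for i in index]


def has_no_duplicates(values):
    return len(set(values)) == len(values)


projected = project(index, content)
print(projected)                     # [1.1, 1.1, 1.1, 1.1, 1.1, 3.3, 3.3, 3.3]
print(has_no_duplicates(index))      # False: indices repeat
print(has_no_duplicates(content))    # True: the underlying content is unique
print(has_no_duplicates(projected))  # False: the visible values repeat
```

The False reported by is_unique() in the session agrees with the index-based and projected-value definitions, but not with a definition that looks only at the content.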
It follows the same behaviour as v1, thus maybe @jpivarski can weigh in
>>> content = ak.layout.NumpyArray([1.1, 2.2, 3.3, 4.4])
>>> index = ak.layout.Index64([0, 0, 0, 0, 0, 2, 2, 2])
>>> iarr = ak.layout.IndexedArray64(index, content)
>>> iarr
<IndexedArray64>
<index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
<content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
>>> harr = ak.Array(iarr)
>>> harr.layout.is_unique()
False
>>> harr.layout
<IndexedArray64>
<index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
<content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
Yes, I worry it was wrong there...
The layout methods, such as
If that's the only one, then it should do what's appropriate for checking the categorical data. Categorical data is an IndexedArray (or maybe IndexedOptionArray) in which the
<IndexedArray64>
<index><Index64 i="[0 0 0 0 0 2 2 2]" offset="0" length="8" at="0x7f9eba404210"/></index>
<content><NumpyArray format="d" shape="4" data="1.1 2.2 3.3 4.4" at="0x7f9eba4040b0"/></content>
</IndexedArray64>
because the NumpyArray doesn't have any duplicated elements. If If I'd prefer the function to be simple, that
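A content-only check of the kind this comment seems to favour for categorical data could look like the sketch below. It is a plain-Python model under stated assumptions: `to_categories` and `categories_are_unique` are hypothetical names, and this is not how ak._v2.behaviors.categorical.to_categorical is implemented.

```python
def to_categories(values):
    """Split values into (index, categories), one category per distinct value.

    Hypothetical model of a categorical layout: `categories` plays the role
    of the IndexedArray content and `index` points into it.
    """
    categories = []
    position = {}
    index = []
    for value in values:
        if value not in position:
            position[value] = len(categories)
            categories.append(value)
        index.append(position[value])
    return index, categories


def categories_are_unique(categories):
    """Simple check that looks only at the content and ignores the index."""
    return len(set(categories)) == len(categories)


index, categories = to_categories(["1chchc", "1chchc", "2sss", "3", "4", "5"])
print(index)                              # [0, 0, 1, 2, 3, 4]
print(categories)                         # ['1chchc', '2sss', '3', '4', '5']
print(categories_are_unique(categories))  # True
```

Under this model the index may repeat freely while the categories stay unique, which is the situation in the IndexedArray64 example above.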