BUG: regression, is_unique is incorrect since pandas 2.1.0 #57911

morotti · 2024-03-19T12:04:05Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print("pandas version", pd.__version__)

values = [1, 1, 2, 3, 4]
index = pd.Index(values)

print("===========")
print(index)
print("is_unique=", index.is_unique)

# Slice taken after is_unique was accessed on the parent index
filtered_index = index[2:].copy()

print("===========")
print(filtered_index)
print("is_unique=", filtered_index.is_unique)

# Same slice, taken from a fresh index whose is_unique was never accessed
index = pd.Index(values)
filtered_index = index[2:].copy()

print("===========")
print(filtered_index)
print("is_unique=", filtered_index.is_unique)

Issue Description

Hello,

We found a regression: index.is_unique returns an incorrect result since pandas 2.1.0.

I looked for open issues but did not find any fix or existing discussion.
Looking at the changelog, 2.1.0 included many changes that introduce copy-on-write optimizations on the index.
My best guess is that the issue is related to that: maybe index[2:] reuses something cached from the original index that is no longer correct for the slice?

I'm attaching a simple repro; it's very easy to reproduce. :)

Thank you.

Expected Behavior

pandas version 1.5.3
===========
Int64Index([1, 1, 2, 3, 4], dtype='int64')
is_unique= False
===========
Int64Index([2, 3, 4], dtype='int64')
is_unique= True
===========
Int64Index([2, 3, 4], dtype='int64')
is_unique= True

pandas version 2.2.1
===========
Index([1, 1, 2, 3, 4], dtype='int64')
is_unique= False
===========
Index([2, 3, 4], dtype='int64')
is_unique= False    # <---------------- INCORRECT
===========
Index([2, 3, 4], dtype='int64')
is_unique= True

Installed Versions

tested on:

pandas 1.5.3: PASS
pandas 2.0.0: PASS
pandas 2.0.3: PASS
pandas 2.1.0: INCORRECT
pandas 2.1.4: INCORRECT
pandas 2.2.1 (latest): INCORRECT

Aloqeely · 2024-03-20T12:39:34Z

While this is a lazy fix, I've found that if we disable the body of IndexEngine's _update_from_sliced method in pandas/_libs/index.pyx (so that self.need_unique_check and self.need_monotonic_check are both 1), is_unique works as expected and all related tests pass.
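
To make the suspected mechanism concrete, here is a minimal pure-Python sketch. `ToyEngine`, `update_from_sliced` and `invalidate` are hypothetical names made up for illustration; this is not pandas' actual Cython `IndexEngine`, only a model of a cached flag being copied from a parent to its slice, under the assumption that this is roughly what goes wrong.

```python
class ToyEngine:
    """Toy stand-in for a cached index engine (NOT pandas' real IndexEngine)."""

    def __init__(self, values):
        self.values = list(values)
        self.need_unique_check = True  # no cached answer yet
        self.unique = False            # placeholder until computed

    @property
    def is_unique(self):
        # Compute once, then serve the cached answer.
        if self.need_unique_check:
            self.unique = len(set(self.values)) == len(self.values)
            self.need_unique_check = False
        return self.unique

    def update_from_sliced(self, parent):
        # Blindly inherit the parent's cached state (the suspected problem).
        self.unique = parent.unique
        self.need_unique_check = parent.need_unique_check

    def invalidate(self):
        # The "lazy fix": force the next access to recompute.
        self.need_unique_check = True


parent = ToyEngine([1, 1, 2, 3, 4])
parent_result = parent.is_unique   # caches False on the parent

sliced = ToyEngine(parent.values[2:])
sliced.update_from_sliced(parent)
stale_result = sliced.is_unique    # inherited False, although [2, 3, 4] is unique

sliced.invalidate()
fresh_result = sliced.is_unique    # recomputed from the slice's own values

print(parent_result, stale_result, fresh_result)  # False False True
```

In this toy model, resetting the flag trades the cached shortcut for a recomputation on the slice, which is why it behaves correctly but gives up the optimization.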

lithomas1 · 2024-03-21T23:34:49Z

Hi, I can reproduce this on main - tentatively marking this for 2.2.2.

I haven't looked into this further myself - but further investigation/PRs would be welcome.

rob-sil · 2024-03-22T04:17:30Z

take

morotti · 2024-03-22T14:36:44Z

Thank you for working on a fix.

Would it be possible to backport the fix to the 2.1.x branch as well?
It would be really helpful, since the regression was introduced in 2.1.0.

phofl · 2024-03-23T01:36:42Z

We don't support the 2.1.x branch anymore; we will only release this on 2.2.x.
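
For anyone who has to stay on an affected version, one possible stopgap is to cross-check uniqueness without going through any cached state. The helper below is a hypothetical sketch (`is_unique_uncached` is not a pandas function); it assumes the values are hashable and does not treat repeated NaN values the way pandas does.

```python
def is_unique_uncached(values):
    """Return True if no element repeats, using a fresh scan of the values.

    Hypothetical helper: works on any iterable of hashable items and
    never consults a cached flag.
    """
    seen = set()
    for value in values:
        if value in seen:
            return False
        seen.add(value)
    return True


print(is_unique_uncached([1, 1, 2, 3, 4]))  # False
print(is_unique_uncached([2, 3, 4]))        # True
```

Since a pandas Index is iterable, something like `is_unique_uncached(filtered_index)` should work for the repro above, at the cost of a full scan on every call.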

morotti · 2024-04-29T16:39:16Z

Quick update, in case anybody runs into this bug.
I found one way to patch it in 2.1.x that only modifies Python code, so it does not require compilation.

The bug seems to be in the logic of _update_from_sliced() in the Cython code, which is only called by one Python function.
As far as I understand, the unique and monotonic properties can only be maintained if the original index was unique.

--- a/pandas/core/indexes/base.py 2024-03-28 14:14:22.086831383 +0000
+++ b/pandas/core/indexes/base.py 2024-03-28 14:31:57.291193802 +0000
@@ -5397,7 +5397,8 @@ class Index(IndexOpsMixin, PandasObject)
         result = type(self)._simple_new(res, name=self._name, refs=self._references)
         if "_engine" in self._cache:
             reverse = slobj.step is not None and slobj.step < 0
-            result._engine._update_from_sliced(self._engine, reverse=reverse)  # type: ignore[union-attr]  # noqa: E501
+            if self._engine.is_unique:
+                result._engine._update_from_sliced(self._engine, reverse=reverse)  # type: ignore[union-attr]  # noqa: E501

         return result
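
The uniqueness part of that reasoning can be checked by brute force outside pandas. The sketch below uses plain lists and hypothetical helper names (`all_slices`, `is_unique`); it only illustrates that a cached "unique" answer carries over to every slice, while a cached "not unique" answer does not.

```python
from itertools import product


def all_slices(values):
    """Yield values[start:stop:step] for a range of starts, stops and steps."""
    n = len(values)
    bounds = [None] + list(range(n + 1))
    steps = [-2, -1, 1, 2]
    for start, stop, step in product(bounds, bounds, steps):
        yield values[start:stop:step]


def is_unique(values):
    return len(set(values)) == len(values)


unique_parent = [5, 3, 8, 1]
duplicated_parent = [1, 1, 2, 3, 4]

# Every slice of a unique parent is unique, so True is safe to inherit.
true_is_inherited = all(is_unique(s) for s in all_slices(unique_parent))

# Some slices of a non-unique parent are unique (e.g. [2, 3, 4]),
# so False is not safe to inherit.
false_is_inherited = all(not is_unique(s) for s in all_slices(duplicated_parent))

print(true_is_inherited, false_is_inherited)  # True False
```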

Aloqeely · 2024-04-29T17:33:56Z

Thanks @morotti! There's an open PR for fixing this bug at #57958, but it has not been merged yet.

morotti added the Bug and Needs Triage labels Mar 19, 2024

lithomas1 added the Regression and Index labels and removed the Needs Triage label Mar 21, 2024

lithomas1 added this to the 2.2.2 milestone Mar 21, 2024

github-actions bot assigned rob-sil Mar 22, 2024

rob-sil mentioned this issue Mar 22, 2024 in BUG: Fix is_unique regression for slices of Indexes #57958 (merged)

lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024

mroeschke closed this as completed in #57958 Aug 6, 2024
