BUG: regression, is_unique is incorrect since pandas 2.1.0 #57911

Closed
2 of 3 tasks
morotti opened this issue Mar 19, 2024 · 7 comments · Fixed by #57958
Labels: Bug; Index (related to the Index class or subclasses); Regression (functionality that used to work in a prior pandas version)

Comments

@morotti
Contributor

morotti commented Mar 19, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print("pandas version", pd.__version__)

values = [1, 1, 2, 3, 4]
index = pd.Index(values)

# Case 1: is_unique is evaluated on the original index before slicing.
print("===========")
print(index)
print("is_unique=", index.is_unique)

filtered_index = index[2:].copy()

print("===========")
print(filtered_index)
print("is_unique=", filtered_index.is_unique)


# Case 2: a fresh index is sliced without evaluating is_unique on it first.
index = pd.Index(values)
filtered_index = index[2:].copy()

print("===========")
print(filtered_index)
print("is_unique=", filtered_index.is_unique)

Issue Description

Hello,

We found a regression: index.is_unique returns an incorrect result since pandas 2.1.0.

I looked for open issues but did not find any fix or existing discussion.
According to the changelog, 2.1.0 introduced many copy-on-write optimizations on the index.
I think the issue could be related to that. My best guess is that index[2:] reuses something cached from the original index that is no longer correct for the slice.

I'm attaching a simple repro; it's very easy to reproduce. :)

Thank you.

Expected Behavior

pandas version 1.5.3
===========
Int64Index([1, 1, 2, 3, 4], dtype='int64')
is_unique= False
===========
Int64Index([2, 3, 4], dtype='int64')
is_unique= True
===========
Int64Index([2, 3, 4], dtype='int64')
is_unique= True
pandas version 2.2.1
===========
Index([1, 1, 2, 3, 4], dtype='int64')
is_unique= False
===========
Index([2, 3, 4], dtype='int64')
is_unique= False    # <---------------- INCORRECT
===========
Index([2, 3, 4], dtype='int64')
is_unique= True

Installed Versions

tested on:

  • pandas 1.5.3: PASS
  • pandas 2.0.0: PASS
  • pandas 2.0.3: PASS
  • pandas 2.1.0: INCORRECT
  • pandas 2.1.4: INCORRECT
  • pandas 2.2.1 (latest): INCORRECT
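For anyone who wants to check whether their installed version is affected, the comparison can be sketched as below. This is only an illustration, not part of the original report: `cache_free_is_unique` is a hypothetical helper that recomputes uniqueness from the raw values with NumPy, so it does not depend on whatever the index has cached.

```python
import numpy as np
import pandas as pd


def cache_free_is_unique(index: pd.Index) -> bool:
    """Hypothetical helper: recompute uniqueness from the raw values only."""
    values = index.to_numpy()
    # np.unique drops duplicates, so equal sizes mean every value is distinct.
    return np.unique(values).size == values.size


index = pd.Index([1, 1, 2, 3, 4])
_ = index.is_unique  # evaluate is_unique on the parent before slicing
filtered_index = index[2:].copy()

# On an affected version the two lines below are expected to disagree.
print("is_unique        =", filtered_index.is_unique)
print("cache-free check =", cache_free_is_unique(filtered_index))
```

If the two printed values differ, the installed pandas most likely has the regression described in this issue.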
morotti added the Bug and Needs Triage labels on Mar 19, 2024
@Aloqeely
Member

While this is a lazy fix, I've found that in pandas/_libs/index.pyx, if we omit the functionality of IndexEngine's _update_from_sliced method (so that self.need_unique_check and self.need_monotonic_check are both 1), is_unique works as expected and all related tests pass.
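To illustrate why carrying cached flags over to a slice can go wrong, here is a toy model in plain Python. It is a sketch under assumptions, not the pandas implementation: `ToyEngine` and both `update_from_sliced_*` methods are hypothetical, and only the flag names `need_unique_check` and `unique` are borrowed from the discussion. The model assumes the sliced engine copies the parent's cached uniqueness state as-is.

```python
class ToyEngine:
    """Hypothetical stand-in for an index engine with a lazily cached flag."""

    def __init__(self, values):
        self.values = list(values)
        self.need_unique_check = True  # nothing cached yet
        self.unique = False

    @property
    def is_unique(self):
        # Compute uniqueness once, then serve the cached answer.
        if self.need_unique_check:
            self.unique = len(set(self.values)) == len(self.values)
            self.need_unique_check = False
        return self.unique

    def update_from_sliced_naive(self, parent):
        # Assumed faulty behaviour: copy the parent's state unconditionally.
        self.unique = parent.unique
        self.need_unique_check = parent.need_unique_check

    def update_from_sliced_guarded(self, parent):
        # A slice of a unique parent is unique; a slice of a non-unique
        # parent may or may not be, so that case must be recomputed.
        if not parent.need_unique_check and parent.unique:
            self.unique = True
            self.need_unique_check = False


parent = ToyEngine([1, 1, 2, 3, 4])
parent.is_unique  # caches "not unique" on the parent

naive = ToyEngine([2, 3, 4])
naive.update_from_sliced_naive(parent)

guarded = ToyEngine([2, 3, 4])
guarded.update_from_sliced_guarded(parent)

print(naive.is_unique, guarded.is_unique)
```

In this model the naive slice inherits the parent's "not unique" answer even though its own values are distinct, while the guarded slice recomputes it. If the parent's flag was never evaluated, both variants recompute, which matches the third case in the reproducible example.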

lithomas1 added the Regression and Index labels and removed the Needs Triage label on Mar 21, 2024
lithomas1 added this to the 2.2.2 milestone on Mar 21, 2024
@lithomas1
Member

Hi, I can reproduce this on main, so I'm tentatively marking it for 2.2.2.

I haven't looked into this further myself, but further investigation and PRs would be welcome.

@rob-sil
Contributor

rob-sil commented Mar 22, 2024

take

@morotti
Contributor Author

morotti commented Mar 22, 2024

Thank you for working on a fix.

Would it be possible to backport the fix to the 2.1.x branch as well?
It would be really helpful, since the regression was introduced in 2.1.0.

@phofl
Member

phofl commented Mar 23, 2024

We don't support the 2.1.x branch anymore; we will only release this on 2.2.x.

lithomas1 modified the milestones: 2.2.2, 2.2.3 on Apr 10, 2024
@morotti
Contributor Author

morotti commented Apr 29, 2024

Quick update, in case anybody runs into this bug.
I found a way to patch it in 2.1.x that only modifies Python code, so no recompilation is required.

The bug seems to be in the logic of _update_from_sliced() in the Cython code, which is called by only one Python function.
As far as I understand, the unique and monotonic properties can only be carried over to the slice if the original index was unique.

--- a/pandas/core/indexes/base       2024-03-28 14:14:22.086831383 +0000
+++ b/pandas/core/indexes/base.py 2024-03-28 14:31:57.291193802 +0000
@@ -5397,7 +5397,8 @@ class Index(IndexOpsMixin, PandasObject)
         result = type(self)._simple_new(res, name=self._name, refs=self._references)
         if "_engine" in self._cache:
             reverse = slobj.step is not None and slobj.step < 0
-            result._engine._update_from_sliced(self._engine, reverse=reverse)  # type: ignore[union-attr]  # noqa: E501
+            if self._engine.is_unique:
+                result._engine._update_from_sliced(self._engine, reverse=reverse)  # type: ignore[union-attr]  # noqa: E501

         return result
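For readers who cannot patch the installed pandas, a user-level workaround can be sketched as follows. This is an assumption-laden illustration rather than something proposed in the thread: `rebuild_without_cache` is a hypothetical helper that builds a new Index from a copy of the values, so the result starts with no cached engine state. It is only meant for plain NumPy-backed indexes; other index types may lose dtype or level information this way.

```python
import pandas as pd


def rebuild_without_cache(index: pd.Index) -> pd.Index:
    """Hypothetical helper: build a fresh Index from a copy of the values."""
    # A newly constructed Index has not computed or inherited any flags yet.
    return pd.Index(index.to_numpy(copy=True), name=index.name)


index = pd.Index([1, 1, 2, 3, 4], name="key")
_ = index.is_unique  # evaluate is_unique on the parent before slicing

sliced = index[2:]
rebuilt = rebuild_without_cache(sliced)

print(rebuilt)
print("is_unique=", rebuilt.is_unique)
```

The trade-off is an extra copy of the values on every slice, so this only makes sense where the slices are small or the correctness of is_unique matters more than the allocation.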

@Aloqeely
Member

Thanks @morotti! There's an open PR that fixes this bug at #57958, but it has not been merged yet.
