Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: how to annotate DataFrame.__getitem__ #46616

Open
twoertwein opened this issue Apr 2, 2022 · 7 comments
Open

TYP: how to annotate DataFrame.__getitem__ #46616

twoertwein opened this issue Apr 2, 2022 · 7 comments
Labels
Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking

Comments

@twoertwein
Copy link
Member

I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.

There are two problems with __getitem__:

  • We allow any Hashable key (one would expect that this always returns a Series) but slice (is Hashable) returns a DataFrame
  • Columns can be a multiindex, df["a"] can return a DataFrame.

The MS stubs seems to make two assumptions: 1) columns can only be of type str (and maybe a few more types - but not Hashable) and 2) multiindex doesn't exist. In practice, this will cover almost all cases.

I don't think there is a solution for the multiindex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[Multiindex] type, so we will always end up with incompatible & overlapping overloads.

The Hashable issue can partly be addressed:

# cover most common cases that return a Series
@overloads
def __getitem__(self, key :Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overloads
def __getitem__(self, key : list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else
@overloads
def __getitem__(self, key : Hashable) -> Any:  #  or Series | DataFrame (but might create many errors, typshed also uses Any in some cases to avoid unions)
    ...

Do you see a way to cover all cases of __getitem__ and if not which assumptions are you willing to make? @simonjayhawkins @Dr-Irv

@twoertwein twoertwein added Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking labels Apr 2, 2022
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 2, 2022

I just pushed a PR for this and other mypy related issues in the MS stubs, and this is what I am using that works for the tests that are there:

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...
    @overload
    def __getitem__(self, rows: slice) -> DataFrame: ...
    @overload
    def __getitem__(
        self,
        idx: Union[
            Tuple,
            Series[_bool],
            DataFrame,
            List[_str],
            List[Hashable],
            Index,
            np_ndarray_str,
            Sequence[Tuple[Scalar, ...]],
        ],
    ) -> DataFrame: ...

@phofl
Copy link
Member

phofl commented Apr 3, 2022

I think

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...

this is incorrect. If the idx is duplicated in the DataFrame, this returns a df.


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
df["a"]

this returns a DataFrame instead of a Series.

@twoertwein
Copy link
Member Author

Thanks @phofl for finding another issue!

I'm not sure what the best solution for __getitem__ is: while it seems impossible to type 100%-correctly, it is a widely used method.

I would be inclined to a solution similar to @Dr-Irv's (but to somewhat accept Hashable) but we would then probably need

  1. a documentation about this issue (maybe something similar to numpy's comments on their type annotations) and
  2. [not sure about the following] to not make the life for MultiIndex-fans difficult, maybe a "pseudo-class" class DataFrameMultIndexColumn(DataFrame) which only changes the overloads for __getitem__. People who knowingly have a DataFrame with MultiIndx for columns can then simply annotate/cast their DataFrame to this special sub-class

@simonjayhawkins
Copy link
Member

stepping back a bit, It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance? Has this changed?

@twoertwein
Copy link
Member Author

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I'm looking forward to that (especially if that is done sooner than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here or when trying to consolidate the copied and inline annotations.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 6, 2022

this is incorrect. If the idx is duplicated in the DataFrame, this returns a df.

So this is an issue because the type of the result is based on the "insides" of the calling DataFrame and you can't track that as a type, since python is a dynamically typed language. Typing support is using static typing concepts.

With respect to typing, there are two ways to look at this:

  1. Make sure the type annotations are returning the union of all possible results, which accurately reflects behavior, but forces users to cast results.
  2. Have the type annotations handle the majority of normal use cases, and document that they may be incorrect if the underlying object has "unusual" properties (e.g., duplicate column names).

As a user, I would vote for (2), which is why we've used those annotations in the MS stubs.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 6, 2022

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I'm looking forward to that (especially if that is done sooner than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here or when trying to consolidate the copied and inline annotations.

I've spent a fair bit of time improving the annotations in the MS stubs, and developing out the test framework there. In particular, I just created a huge PR to make things work right with mypy . Once that is accepted (possibly in the next few days), the next steps would be to first bring over the testing mechanism and integrated it into the CI, and then bring over the actual stubs themselves.

We'd have to figure out a way to manage the transition from MS publishing the MS stubs in pylance (and Visual Studio Code) releases (and no longer maintaining them) to us having a pandas release with pandas supported stubs. I think the easiest way to do this would be to have a separate project with these stubs that is a submodule to pandas and a submodule to the MS stubs (which have more stubs than just pandas). But that would require both teams to agree that is the (relatively short term) way of managing these stubs. I'm open to other suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

No branches or pull requests

4 participants