TYP: how to annotate DataFrame.getitem #46616

twoertwein · 2022-04-02T14:52:26Z

I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.

There are two problems with __getitem__:

We allow any Hashable key (one would expect that this always returns a Series) but slice (is Hashable) returns a DataFrame
Columns can be a multiindex, df["a"] can return a DataFrame.

The MS stubs seems to make two assumptions: 1) columns can only be of type str (and maybe a few more types - but not Hashable) and 2) multiindex doesn't exist. In practice, this will cover almost all cases.

I don't think there is a solution for the multiindex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[Multiindex] type, so we will always end up with incompatible & overlapping overloads.

The Hashable issue can partly be addressed:

# cover most common cases that return a Series
@overloads
def __getitem__(self, key :Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overloads
def __getitem__(self, key : list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else
@overloads
def __getitem__(self, key : Hashable) -> Any:  #  or Series | DataFrame (but might create many errors, typshed also uses Any in some cases to avoid unions)
    ...

Do you see a way to cover all cases of __getitem__ and if not which assumptions are you willing to make? @simonjayhawkins @Dr-Irv

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2022-04-02T15:27:36Z

I just pushed a PR for this and other mypy related issues in the MS stubs, and this is what I am using that works for the tests that are there:

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...
    @overload
    def __getitem__(self, rows: slice) -> DataFrame: ...
    @overload
    def __getitem__(
        self,
        idx: Union[
            Tuple,
            Series[_bool],
            DataFrame,
            List[_str],
            List[Hashable],
            Index,
            np_ndarray_str,
            Sequence[Tuple[Scalar, ...]],
        ],
    ) -> DataFrame: ...

phofl · 2022-04-03T16:24:34Z

I think

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...

this is incorrect. If the idx is duplicated in the DataFrame, this returns a df.


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
df["a"]

this returns a DataFrame instead of a Series.

twoertwein · 2022-04-03T21:58:35Z

Thanks @phofl for finding another issue!

I'm not sure what the best solution for __getitem__ is: while it seems impossible to type 100%-correctly, it is a widely used method.

I would be inclined to a solution similar to @Dr-Irv's (but to somewhat accept Hashable) but we would then probably need

a documentation about this issue (maybe something similar to numpy's comments on their type annotations) and
[not sure about the following] to not make the life for MultiIndex-fans difficult, maybe a "pseudo-class" class DataFrameMultIndexColumn(DataFrame) which only changes the overloads for __getitem__. People who knowingly have a DataFrame with MultiIndx for columns can then simply annotate/cast their DataFrame to this special sub-class

simonjayhawkins · 2022-04-04T08:28:27Z

stepping back a bit, It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance? Has this changed?

twoertwein · 2022-04-05T00:28:54Z

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I'm looking forward to that (especially if that is done sooner than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here or when trying to consolidate the copied and inline annotations.

Dr-Irv · 2022-04-06T15:58:23Z

this is incorrect. If the idx is duplicated in the DataFrame, this returns a df.

So this is an issue because the type of the result is based on the "insides" of the calling DataFrame and you can't track that as a type, since python is a dynamically typed language. Typing support is using static typing concepts.

With respect to typing, there are two ways to look at this:

Make sure the type annotations are returning the union of all possible results, which accurately reflects behavior, but forces users to cast results.
Have the type annotations handle the majority of normal use cases, and document that they may be incorrect if the underlying object has "unusual" properties (e.g., duplicate column names).

As a user, I would vote for (2), which is why we've used those annotations in the MS stubs.

Dr-Irv · 2022-04-06T16:05:50Z

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I'm looking forward to that (especially if that is done sooner than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here or when trying to consolidate the copied and inline annotations.

I've spent a fair bit of time improving the annotations in the MS stubs, and developing out the test framework there. In particular, I just created a huge PR to make things work right with mypy . Once that is accepted (possibly in the next few days), the next steps would be to first bring over the testing mechanism and integrated it into the CI, and then bring over the actual stubs themselves.

We'd have to figure out a way to manage the transition from MS publishing the MS stubs in pylance (and Visual Studio Code) releases (and no longer maintaining them) to us having a pandas release with pandas supported stubs. I think the easiest way to do this would be to have a separate project with these stubs that is a submodule to pandas and a submodule to the MS stubs (which have more stubs than just pandas). But that would require both teams to agree that is the (relatively short term) way of managing these stubs. I'm open to other suggestions.

twoertwein added Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking labels Apr 2, 2022

twoertwein mentioned this issue May 2, 2022

ENH: Add set of tests to type check the public APIs #45252

Open

twoertwein mentioned this issue Jul 16, 2022

TYP: Update timestamps.pyi #47756

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TYP: how to annotate DataFrame.getitem #46616

TYP: how to annotate DataFrame.getitem #46616

twoertwein commented Apr 2, 2022

Dr-Irv commented Apr 2, 2022

phofl commented Apr 3, 2022

twoertwein commented Apr 3, 2022

simonjayhawkins commented Apr 4, 2022

twoertwein commented Apr 5, 2022

Dr-Irv commented Apr 6, 2022

Dr-Irv commented Apr 6, 2022

TYP: how to annotate DataFrame.__getitem__ #46616

TYP: how to annotate DataFrame.__getitem__ #46616

Comments

twoertwein commented Apr 2, 2022

Dr-Irv commented Apr 2, 2022

phofl commented Apr 3, 2022

twoertwein commented Apr 3, 2022

simonjayhawkins commented Apr 4, 2022

twoertwein commented Apr 5, 2022

Dr-Irv commented Apr 6, 2022

Dr-Irv commented Apr 6, 2022

TYP: how to annotate DataFrame.getitem #46616

TYP: how to annotate DataFrame.getitem #46616