Typing Master Tracker #25601
Comments
I think this can be started at any time, since it's independent of how the types are written.
It'd be helpful to hear from people who have experience typing a large code base about where best to start. https://mypy.readthedocs.io/en/latest/existing_code.html provides some guidance. |
Since I was asked, and I'm in the middle of annotating our codebase at $work, I'll share some of what I've learnt.
#25576 only allowed stubgen to run on pandas to generate the .pyi files; the generated files are far from complete, and most things are annotated as I've also been taking advantage of Python 3.7
In general I think type checking in CI is a good idea, but I think it should be an "optional" pass while the types are being written. I enabled mypy in our CI for a project a few years ago, and it was discouraging and frustrating to get started because of all the untyped code, which then caused us to ignore it and eventually disable it. If I were to do it again, I'd make the CI linting more of a warning than an error initially, and make mypy as permissive as possible until everything was properly typed, which might take time and may even require a mypy plugin for complex things.
I can't speak too much on this since I'm not all that familiar with pandas itself, but I would expect that good typing on the main data types and the public API would be most useful for people using pandas. If thorough/correct typing on the main data types is too difficult without a plugin, perhaps My approach at work has been to add annotations to new functions/files, and to add them to existing functions as I've had need to (if mypy complains, for example). I've found that a few very common files/classes were used in a lot of places, and annotating those removed a large percentage of the mypy errors. Mypy is pretty good at guessing types. I think the link that @TomAugspurger provided about annotating existing code bases is pretty much how I started, and it has been helpful. |
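To make the "warning rather than error" idea above concrete, here is a minimal sketch of an advisory CI step. The helper name `run_advisory_check` is hypothetical and not part of pandas or mypy; the sketch only assumes that the checker signals problems through a non-zero exit code, as mypy does.

```python
import subprocess
import sys
from typing import List


def run_advisory_check(command: List[str]) -> int:
    """Run a checker, report its findings, and never fail the build.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the checker's output as a warning instead of an error.
        print("WARNING: type check reported problems:", file=sys.stderr)
        print(result.stdout, file=sys.stderr)
    # Always return 0 so that CI treats this step as advisory.
    return 0
```

A CI script could call something like `sys.exit(run_advisory_check([sys.executable, "-m", "mypy", "pandas"]))` while types are still being written, and switch to propagating the real exit code once the code base is clean.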
Right now in #25622 I've whitelisted existing modules that have type annotations and made what I would consider "simple" updates (imports, typo fixes) to get a clean return run against those files. When you look at all files that have annotations, that would leave the following which are currently causing failures but probably need deeper inspection to resolve the issue:
pandas/core/base.py
pandas/core/arrays/array_.py
pandas/core/arrays/datetimelike.py
pandas/core/arrays/sparse.py
pandas/core/arrays/period.py
pandas/core/arrays/integer.py
pandas/core/indexes/datetimelike.py
pandas/core/indexes/period.py
pandas/core/internals/blocks.py
I think each of them may need a dedicated PR(s) before adding into our whitelist, but I think that's the best go forward path to get momentum on this. Feedback welcome |
FYI I've added a Typing label and added this as a project which might be easier to navigate than this issue. Leaving this open of course for ongoing discussions |
I've worked on adding types to dtypes/dtypes.py and I think there's a need for some guidelines on how type hints are added. It's quite tricky not to go down an unproductive path. I propose some rules to be added to
For an example, I've made a PR at #26327 that illustrates some of the issues I've encountered. See e.g. this:
Also see here:
I'm absolutely not an expert on type hints, so this is a proposal to discuss, and hopefully at some point in the future we can add some guidelines for how we do type hinting in pandas. |
@topper-123 thanks for the input - this is great!
Agreed
I would disagree here - if the signature is so complex that we can't annotate it then I would think it's surely a risk in our code for bugs. We would either want to alias the type or refactor the code to make things more apparent (the latter is obviously easier said than done). Are there particular spots in the code base already you can point to which highlight this point though?
Yea I think one parameter per line is an easy standard to follow.
I don't have a strong preference on this yet, but I slightly disagree for now. I think import machinery and standards are a larger discussion than type hints; problems here may drive that discussion, but I don't want it to preclude us from typing
I'd be OK with this too. I think this was a fairly recent change on the mypy side (see python/mypy#5677) so probably a moot point going forward
Not sure I follow this one. We are trying to limit use of If you wanted to draft up something for |
Yeah, a complex type signature can be a sign that the code should be restructured. But not always. For example Pandas considers a The clearest example of an untypable parameter would be the But also consider the
For example try adding the line The above is fixable and maybe should be fixed, but if everything imports from everywhere, the risk of such issues will be very large.
Yeah, this point was largely related to the import issue in point 4 above, but can also be more general. For example, a lot of methods accept a list-like parameter and end up transforming it into e.g. an ndarray. I think in such cases using e.g. Anyway, the point was more that mypy/PyCharm etc should always be able to know the types as precisely as possible, and e.g. |
(typing noob here, sorry if I speak nonsense) Shouldn't we for that reason try to define some of our own type unions? Exactly because we have a flexible but still specific set of values that are typically accepted? E.g. a |
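As a rough illustration of the "own type unions" idea, a project-level alias could look like the sketch below. The alias `ListLike` and the function `total` are hypothetical names chosen for this example, not existing pandas definitions, and the union members are deliberately limited to standard-library types.

```python
from typing import List, Tuple, Union

# Hypothetical alias for illustration; not an existing pandas name.
ListLike = Union[List[float], Tuple[float, ...]]


def total(values: ListLike) -> float:
    """Accept any member of the union and normalise it once."""
    as_list = list(values)  # normalise to a single concrete type
    return float(sum(as_list))
```

Defining the alias once keeps signatures short, and the set of accepted types can later be widened in a single place.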
Existing annotations are mostly cleaned up so #25882 will be closed out soon. I think the next logical follow up will be to add annotations to items exposed via our API. Issue coming soon |
Just passing by, but I thought I might put my two cents in. Please keep in mind that this comes from a place where the cost of a potential failure is far higher than the cost of additional dev time.
Personally I'd advise against such an approach. There are a few reasons for that:
It makes perfect sense, though it is worth pointing out that complex signatures can, but do not always, indicate good candidates for A trivial example would be
Circular imports when annotating code are indeed a nightmare, module imports instead of In the most severe cases extracting annotations into a stub file could provide an additional layer of "insulation".
In case of import issues this really shouldn't be necessary, but I don't really understand the other use case. |
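On the circular-import point, one widely used option (a swapped-in illustration, not something the surviving text of this thread spells out) is to import a module only for the type checker via `typing.TYPE_CHECKING` and refer to it in a string annotation. The function `to_text` is a hypothetical example, and `decimal` merely stands in for a module that would otherwise create an import cycle.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only; this import is never executed at
    # runtime, so it cannot take part in a circular import.
    import decimal


def to_text(value: "decimal.Decimal") -> str:
    """Hypothetical function annotated with a forward reference."""
    return str(value)
```

Because the annotation is a string, it is not evaluated when the function is defined, so the module does not have to be importable at that point.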
@zero323 Many thanks for sharing your wisdom here. |
Closing as I haven't updated this tracker and this is done in separate issues |
Per the pandas-dev chat today, opening up a master tracker for TODOs to improve typing.
Glancing at the Python docs, here's what I see for "new in version X.X.X" which we may want to consider:
3.5.2
3.5.3
3.5.4
3.6
3.6.1
3.7.2