-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Has anyone brought up the idea of a compliance suite for Pandas? #1915
Comments
Indeed they have: https://data-apis.org/blog/announcing_the_consortium/ |
@itamarst I am in that consortium and the goal is slightly different than what you describe. At the current iteration, it is more about library consumers of dataframes rather than end users. There is a vast difference in the capabilities for Dask and Modin, for example, so in the context of the consortium, we wanted to create a way for libraries to accept all of the types without having to individually case each one. I think what you describe would be beneficial, because I often see many people say "such and such library does cover the full pandas API, but it gets pretty close". It would be good to have something more precise that can be evaluated instead of having to take anyone's word for it. We have something that evaluated Modin's API against pandas using I'm going to reopen this to track it. Let me know if you're interested in such a suite. |
Ah, I see. Some more context: This idea came to me when looking at I am working on Modin as consulting work on behalf of an organization (G-Research) that is interested in speeding up Pandas code, so they want (A) to have things in good working order and fast for things they use and (B) have good relationship+knowledge for when there's bug fixes needed. I understand you talked to Alex Scammon a month ago or so? So, yes, they are interested in having me work on such a suite, and more broadly on a personal level I think it would be great if all the Pandas-like libraries benefited from it. As a first pass, my thought was that the Pandas test suite would be a reasonable starting point. So as a prototype I'm going to try to take some pandas test modules, and see if they can be generalized to run multiple libraries, Pandas and Modin to begin with. If that seems workable we can discuss next steps (versioning and change over time seems like one big issue; Dask support seems like another, although I imagine it's not your particular concern, unless/until Modin ever exposes an publicly lazy API). Does that seem reasonable? Do you have other ideas for approaching this? |
Yes. I think that it would be good for the overall community as well. There's a lot of opaqueness to users coming from Pandas, and asking "will it run?" should be a simple answer in an ideal world.
The Pandas test suite tests very small dataframes. The only reason I haven't tried this for Modin CI is because we need to test partitioning and communication on larger dataframes, otherwise the computation is mostly trivial. For this use-case, it makes more sense. Since the tests are so small, the overheads of IPC and RPCs will dominate and Modin will take a lot longer on those tests since there are so many. Modin will likely not expose a publicly lazy API for a while at least. Dask support is interesting to me from a functional coverage comparison perspective, but it does have some different semantics in terms of ordering, which might make correctness evaluation difficult. |
I guess large datasets are more likely to encounter Modin bugs, so definitely useful to have too, but perhaps that's phase 2 or 3 or 5. Presumably that is at least somewhat covered by the Modin test suite, as opposed to API coverage. So: I will start prototyping with just Pandas and Modin, using some bit of the Pandas test suite as a basis, and see how it goes. |
Started work at https://github.com/pythonspeed/pandas-compliance-suite-prototype, initially just writing a design-as-verb document. |
OK, I have expanded it with some initial implementation ideas. I am now leaning towards using type annotation analysis, rather than the the Pandas test suite. Your feedback would be very welcome! |
Here is my specific proposal: Proposal1. Pandas adds highly-specific full-coverage type annotationsBy highly-specific, I mean e.g. using This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Pandas. 2. Modin etc. add highly-specific full-coverage type annotations
This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Modin. 3. Users of Pandas can use static analysis (mypy etc.) to validate that switching to Modin will workSimply by having items 1 and 2, switching import from Pandas to Modin will allow type checking if APIs are compatible. No additional work needed by maintainers. 4. Modin etc. can optionally have a runtime checking mode for users attempting to switch from Pandas to Modin.Static analysis may be difficult for some users. So using e.g. typeguard, maintainers of Modin etc. can enable runtime checking with some sort of API flag or environment variable, so there's no cost by default. This is a small amount of work, probably. 5. A new tool can generate diffs between type annotations for Pandas and type annotations for Modin etc., for documentation purposes and for
|
@itamarst That seems like a very useful utility. There is a lot of undocumented behavior in pandas and Modin that just floats around in maintainer heads. This would either force the documentation of a particular behavior or force its removal. |
So I guess next step is socialize this idea with the Pandas devs, since would need some buy-in from them to add/keep maintained the type annotations. Do you have some good relationships there? Or do you think the data APIs consortium would be a good starting point? Otherwise I can see how G-Research is on contacts. |
I do have a good relationship with them, I can reach out to a couple of them. The type annotations should not add any burden to the maintenance of pandas, so hopefully it should be too much of an issue. I will see if someone there can comment here. |
Hey, have you heard anything back re Pandas type annotations? |
Yes, there's no overall issue for type hints. I pointed @TomAugspurger here, I think he will comment when he has the chance. They do want to add type hints, the closest issue for this in the pandas issue tracker is here: pandas-dev/pandas#28142 |
@TomAugspurger I'd be happy to do a quick video call about this too, since it's higher bandwidth. |
Pandas has an ongoing effort to add types to the public API, though I'm not following it closely. In general, the idea of matching types seems sensible. It's like a stricter version of verifying that the function signatures match. One question: pandas types return |
@TomAugspurger the idea is that as a user, I just change my import from |
So after some research it seems like type annotations, while definitely worth pursuing, are probably not a good short-term answer. So after some discussion, I wrote up another approach that might be helpful in the short term (and also helpful to pure Pandas-only users). |
Thanks @itamarst, it seems much more doable. A couple of questions: |
|
https://github.com/g-research/bearcat now has a tracing-based prototype, demonstrating some of the many many edge cases that need to be handled. |
There are clear issues with this (e.g. bugs in Pandas—how would they be handled?) but I suspect they would be fixable.
Is this something Modin would be interested in? Any idea if anyone has discussed this before?
The text was updated successfully, but these errors were encountered: