Add a Roadmap #27478
Thanks for this, great work. Added some comments/ideas, but looks great.
Couple of points that I'd like to add in the future (there is no consensus atm):
- Standardize the plotting API (move pandas.plotting.* -> df.plot(kind='*'), e.g. andrews_curves)
- Build the IO adapters (pandas.read_* and DataFrame.to_*) as a set of standard modules and allow an ecosystem of third-party pandas IO packages (CLN: Implement io modules as plugins #26804)
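To make the IO plugin idea more concrete, here is a minimal sketch of what a reader registry could look like. This is not pandas code and not the design proposed in #26804: `_READERS`, `register_reader`, and `read` are hypothetical names chosen for illustration, and the reader returns plain row dicts rather than a DataFrame so the example only needs the standard library.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical registry mapping a format name to a reader function.
_READERS: Dict[str, Callable[[str], List[dict]]] = {}


def register_reader(fmt: str):
    """Decorator a (hypothetical) third-party package would use to plug in a reader."""
    def decorator(func):
        _READERS[fmt] = func
        return func
    return decorator


@register_reader("csv")
def _read_csv(text: str) -> List[dict]:
    # Stand-in for a real reader: parse CSV text into a list of row dicts.
    return list(csv.DictReader(io.StringIO(text)))


def read(fmt: str, text: str) -> List[dict]:
    """Dispatch to whichever reader is registered for `fmt`."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return reader(text)


rows = read("csv", "a,b\n1,2\n3,4\n")
```

A real implementation would more likely discover third-party packages through packaging entry points than through an explicit decorator, but the dispatch step would look similar.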
Since we mention the benchmarks, maybe we could also add a section about the CI? There are many things I'd like to see implemented to improve our workflow (the main ones are #26930 and #23115).
doc/source/roadmap.rst
Outdated
We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
handling of data types through operations.
Not sure if there is agreement on also implementing DataFrames based on memory maps. I think it was discussed in the past. If people are happy with it, we could mention it here too.
I'm not sure I know enough about memmaps to know whether this is a good idea / feasible for pandas.
Conditional on having a good way to keep it up to date, I have no strong opinion. Otherwise I'd rather not have it.
Depends largely on: who is the target audience?
I think it adds a lot of value to users, developers of other tools in the ecosystem, and also companies or institutions who can be incentivized to provide funds to see the points in the roadmap implemented. Also to ourselves: at least for me, it's good to have an overview of the points we all think we should be moving towards. I think having this page in the top level of the docs will make it visible enough to not be obviously outdated. In any case, all the points will take months (or years) to be implemented, so I don't see the roadmap becoming outdated easily. We can add a check to the release process to verify that completed items are deleted from the roadmap. We may forget to add new points, or to remove points that stop being a priority, but I think we can keep it updated enough. Even if we can't, to me it's better to have a not very accurate roadmap than to have none.
That's my hope. I don't expect many of these to be checked off within the year.
Happy to have something here like that. Would you mind pushing it here, or as a followup? I was struggling with summarizing what remains to be done with those linked issues.
Touching on the Target Audience thing again, two ideas for niches this might help with
doc/source/development/roadmap.rst
Outdated
typical pandas use cases (for example, support for nullable integers).

We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative (which is the way, unless I misunderstand, to simplify internals). Maybe I misunderstood since the beginning, but I thought that Arrow is read-only, which means a huge API change.
but I thought that Arrow is read-only
I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?
At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative
I wasn't at the sprint, so I may be out of sync with the group. But in my mind, we'll continue to support extension arrays indefinitely. My "by default" was aimed at what to do when given a non-array sequence (buffer from an IO routine, a list, etc). I'm not sure what we would want to do if a user does pd.Series(np.array([1, None, 3])).
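For reference, a small sketch of how that case behaves today, next to the nullable integer extension type. It assumes a pandas version that ships the `Int64` dtype (0.24 or later); the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# NumPy has no integer dtype that can hold None, so it falls back to object.
arr = np.array([1, None, 3])

# pandas keeps the object dtype it was handed instead of re-inferring it.
from_array = pd.Series(arr)

# A plain list is inferred instead: the missing value forces float64.
from_list = pd.Series([1, None, 3])

# Opting in to the nullable integer extension type keeps the integers.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(arr.dtype, from_array.dtype, from_list.dtype, nullable.dtype)
```

The open question in the comment above is which of these results a user should get when no dtype is requested.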
Also, this discussion is surfacing a problem with the roadmap as written.
- It should be clear (at least to us, possibly in the roadmap document) that these are speculative. Even guaranteed funding doesn't necessarily mean that something will be implemented.
- Each roadmap entry should link to a GitHub issue / design doc where we talk through specifics.
OK. Anyway, on this point I think that if Arrow is a way to simplify the internals, it is because it becomes the only alternative: making it the default shouldn't lower the complexity, right?
I see. But isn't the easiest fix to this just making the Int EA the default integer container? Do we need Arrow for this?
I think this kind of discussion belongs in a dedicated issue, not a roadmap meta-issue :)
but I thought that Arrow is read-only
I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?
There are plans for mutability in arrow, see https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.pgrntoqlanlq (although it might not be at the array level, but in any case, as you said, the buffers are mutable)
The buffers passed between components working with Arrow are immutable. But when you use Arrow-style memory in a project, you are free to mutate the buffers as long as you don't pass them out to a different consumer. This is currently not reflected in the pyarrow package, as the main focus of the project was more on using Arrow as an interchange format, and it is only slowly shifting to using Arrow as data storage where data is actively worked on.
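The owner-mutates, consumer-reads split described here can be illustrated with nothing but the standard library. This is a sketch of the convention only, not of how pyarrow manages memory: `OwnedBuffer` is a hypothetical class, and `memoryview.toreadonly()` requires Python 3.8 or later.

```python
class OwnedBuffer:
    """Hypothetical owner of a block of memory (not a pyarrow class)."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)  # private, mutable storage

    def set_byte(self, index: int, value: int) -> None:
        # The owner is free to mutate its own storage in place.
        self._data[index] = value

    def export(self) -> memoryview:
        # Consumers only ever receive a read-only, zero-copy view.
        return memoryview(self._data).toreadonly()


buf = OwnedBuffer(b"\x01\x02\x03")
buf.set_byte(0, 9)       # allowed: nothing has been handed out yet

view = buf.export()      # from here on the owner should treat the data as frozen
first = view[0]          # the consumer sees the owner's earlier write

try:
    view[1] = 7          # rejected: the exported view is read-only
    consumer_could_write = True
except TypeError:
    consumer_could_write = False
```

Note that the read-only view only protects against writes by the consumer. Nothing in this sketch stops the owner from mutating after export, which is why the rule is a convention between components.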
Also note that in the case of the string type in arrow where all string data is stored in a single data array, mutation is impossible. The in-place modification of NumPy strings is only possible because for every row the same fixed amount of space is allocated but even there you need to copy when you would like to insert a new string that is longer than the current longest string.
Thus for a non-object string type, we need to include this in the API discussions.
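A simplified model of the two layouts shows why the difference matters. The sketch below is pure Python and only mimics the ideas: the variable-length layout stores all string bytes in one buffer plus an offsets list, while the fixed-width layout pads every row to the same size. All function names are hypothetical, and neither function reflects the actual pyarrow or NumPy implementations.

```python
from typing import List, Tuple


def encode_variable(strings: List[str]) -> Tuple[List[int], bytes]:
    """Pack strings into one contiguous data buffer plus an offsets list."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def get_variable(offsets: List[int], data: bytes, i: int) -> str:
    """Row i is the slice between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def encode_fixed(strings: List[str], width: int) -> bytearray:
    """Pack strings so that every row occupies `width` bytes, NUL padded."""
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        if len(raw) > width:
            raise ValueError("string does not fit in the fixed width")
        out += raw.ljust(width, b"\x00")
    return out


def set_fixed(buf: bytearray, width: int, i: int, value: str) -> None:
    """Overwrite row i in place; only possible if the value fits in `width`."""
    raw = value.encode("utf-8")
    if len(raw) > width:
        raise ValueError("string does not fit; the array must be re-allocated")
    buf[i * width:(i + 1) * width] = raw.ljust(width, b"\x00")


offsets, data = encode_variable(["ab", "cde", "f"])
# Replacing "ab" with a longer string would shift every later byte and
# every later offset, so both buffers would have to be rebuilt.

fixed = encode_fixed(["ab", "cde", "f"], width=3)
set_fixed(fixed, 3, 0, "xyz")  # fits in the row width: in-place update works
```

In the fixed-width case a value longer than `width` still forces a full copy, which matches the caveat about NumPy strings above.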
Maybe too technical, or just not important enough for a "big picture" talk, but simplifying the indexing code is certainly on the roadmap (thinking of https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code )
I will look soon
doc/source/development/roadmap.rst
Outdated
We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
handling of data types through operations.
I am not sure this will ever be the default. Sure we can support it thru EA but default is a big ask here.
I'll strike "by default"
Again, I would also strike the "this should simplify"... again, Int64 already provides the solution to NaNs in ints. I would just say that Arrow is very efficient and portable across languages...
Should we remove this entire section?
I have no problem with removing it, but I think the following could also work.
Apache Arrow Interoperability
-----------------------------
`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases.
A more integrated support for Arrow memory and data types within DataFrame and
Series will allow users to exploit the performances of such platform, its I/O
capabilities, and to allow for better interoperability with other languages and
libraries supporting it.
Added a short numba section if you want to take a look. I emphasized the rolling / window
Going forward, it'd be nice if we had a GitHub issue dedicated to each of these things to scope out the problem and serve as a discussion site for specifics (like #27478 (comment)). During those discussions we may determine that the roadmap item is not appropriate and remove it from the roadmap. We shouldn't take these items as gospel handed down from on high.
Haven't reviewed deeply yet, but just to bring attention to it: we also have something like this on our donations page, https://pandas.pydata.org/donate.html. So we probably want to align these.
Any objections to merging this as is?
@TomAugspurger you said you'd take a look at my ideas in #27652 and see if they are worth including. Thanks.
Right, I'm still mulling that over. Once that discussion gets to the point
where we think it's ready we can go through the evolution process to add it
to this document.
@Dr-Irv and thanks for opening that issue! (It's also on my to-do list to look at it.) But it is not a blocker for merging this PR; this PR is only meant to kickstart a roadmap document.
lgtm.
ok by me. I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?
@TomAugspurger @jorisvandenbossche Thanks for replying. No reason to stand in front of the merge
Do you mean for adding to the roadmap, or for inclusion to pandas? For adding to pandas, no. That'll be the same process (just a PR). For adding to the roadmap, I suppose it's a higher bar in the sense that there was no bar before :)
Co-Authored-By: Simon Hawkins <[email protected]>
right, once we merge, then we should have a 'policy' of what the bar is to include in the roadmap. ok for now just as consensus.
also ok with this on 0.25.1 as then it will be in the main docs sooner, but up to you.
Two minor wording comments, but fine to go ahead as well
Moved to 0.25.1 so this gets in the docs sooner.
Yep, I also don't really know what would work best. Some random ideas:
I suppose the first is fine to try for now; we only need to have the discipline to actually do that (although that is true for all options: you always need to ensure it is on the right forum).
This PR adds a roadmap document. This is useful when pursuing funding; we can point to a list of known items that we'd like to do if we had the person time (and funding) to tackle them.
Let's have two discussions
cc @pandas-dev/pandas-core