Add a Roadmap #27478

Merged
merged 30 commits into from
Aug 1, 2019

Conversation

TomAugspurger
Contributor

This PR adds a roadmap document. This is useful when pursuing funding; we can point to a list of known items that we'd like to do if we had the person time (and funding) to tackle them.

Let's have two discussions

  1. Do we want this? Roadmaps tend to go stale. How can we keep this up to date?
  2. If so, what items should go on it? I've mostly picked from https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#heading=h.qm48l6dargmd plus a couple pet peeves of mine.

cc @pandas-dev/pandas-core

@TomAugspurger TomAugspurger added the Admin Administrative tasks related to the pandas project label Jul 19, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Jul 19, 2019
Member

@datapythonista datapythonista left a comment

Thanks for this, great work. Added some comments/ideas, but looks great.

Couple of points that I'd like to add in the future (there is no consensus atm):

  • Standardize the plotting API (move pandas.plotting.* -> df.plot(kind='*'), e.g. andrews_curves)
  • Build the IO adapters (pandas.read_* and DataFrame.to_*) as a set of standard modules and allow an ecosystem of third-party pandas IO packages (CLN: Implement io modules as plugins #26804)

Since we mention the benchmarks, maybe we could also add a section about the CI? There are many things I'd like to see implemented to improve our workflow (the main ones are #26930 and #23115).

doc/source/roadmap.rst

We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
handling of data types through operations.
Member

Not sure if there is agreement on also implementing DataFrames based on memory maps. I think it was discussed in the past. If people are happy with it, we could mention it here too.

Contributor Author

I'm not sure I know enough about memmaps to know whether this is a good idea / feasible for pandas.

@jbrockmendel
Member

Do we want this? Roadmaps tend to go stale. How can we keep this up to date?

Conditional on having a good way to keep it up to date, I have no strong opinion. Otherwise I'd rather not have it.

If so, what items should go on it?

Depends largely on: who is the target audience?

@datapythonista
Member

  1. Do we want this?

I think it adds a lot of value to users, developers of other tools in the ecosystem, and also companies or institutions who can be incentivized to provide funds to see the points in the roadmap implemented. Also to ourselves: at least to me, it's good to have an overview of the points we all think we should be moving towards.

I think having this page in the top level of the docs will make it visible enough to not be obviously outdated. But in any case, I think all the points will take months (or years) to be implemented, I don't see the roadmap being outdated easily. We can add a check to the release process to double check that items completed are deleted from the roadmap. We may forget to add new points, or remove points that stop being a priority, but I think we can keep it updated enough, and even if it's not, to me it's better to have a not very accurate roadmap, than not having it.

@TomAugspurger
Contributor Author

But in any case, I think all the points will take months (or years) to be implemented, I don't see the roadmap being outdated easily.

That's my hope. I don't expect many of these to be checked off within the year.

Since we mention the benchmarks, may be we could also add a section about the CI?

Happy to have something here like that. Would you mind pushing it here, or as a followup? I was struggling with summarizing what remains to be done with those linked issues.

@jbrockmendel
Member

Touching on the Target Audience thing again, two ideas for niches this might help with

  • clarifying what Big Picture things are being actively worked on (possibly by whom)
  • we have Good First Issue tags but there is a gap between newcomer and veteran where contributors might appreciate ideas on where help is needed

typical pandas use cases (for example, support for nullable integers).

We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
Member

@toobaz toobaz Jul 19, 2019

At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative (which is the way, unless I misunderstand, to simplify internals). Maybe I misunderstood since the beginning, but I thought that Arrow is read-only, which means a huge API change.

Contributor Author

but I thought that Arrow is read-only

I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?

Contributor Author

At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative

I wasn't at the sprint, so I may be out of sync with the group. But in my mind, we'll continue to support extension arrays indefinitely. My "by default" was aimed at what to do when given a non-array sequence (buffer from an IO routine, a list, etc). I'm not sure what we would want to do if a user does pd.Series(np.array([1, None, 3])).
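
For reference, a minimal sketch of the cases being discussed, assuming a pandas version where the nullable `Int64` extension dtype is available. The variable names are illustrative only, and this shows current behaviour rather than any proposed default:

```python
import numpy as np
import pandas as pd

# A NumPy array containing None cannot be an integer array,
# so NumPy falls back to object dtype and pandas keeps it.
arr = np.array([1, None, 3])
s_object = pd.Series(arr)

# A plain list with a missing value is converted to float64 (None -> NaN).
s_default = pd.Series([1, None, 3])

# Explicitly requesting the nullable integer extension dtype
# keeps integer values alongside a missing-value marker.
s_nullable = pd.Series([1, None, 3], dtype="Int64")

print(arr.dtype, s_object.dtype, s_default.dtype, s_nullable.dtype)
```

The open question in the comment above is which of these results a user should get when no dtype is given.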

Contributor Author

Also, this discussion is surfacing a problem with the roadmap as written.

  1. It should be clear (at least to us, possibly in the roadmap document) that these are speculative. Even guaranteed funding doesn't necessarily mean that something will be implemented.
  2. Each roadmap entry should link to a GitHub issue / design doc where we talk through specifics.

Member

OK. Anyway, on this point I think that if Arrow is a way to simplify the internals, it is because it becomes the only alternative: making it the default shouldn't lower the complexity, right?

Member

I see. But isn't the easiest fix to this just making the Int EA the default integer container? Do we need Arrow for this?

Contributor Author

I think this kind of discussion belongs in a dedicated issue, not a roadmap meta-issue :)

Member

but I thought that Arrow is read-only

I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?

There are plans for mutability in arrow, see https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.pgrntoqlanlq (although it might not be at the array level, but in any case, as you said, the buffers are mutable)

Contributor

The buffers passed between components working with Arrow are immutable. But when you use Arrow-style memory in a project, you are free to mutate the buffers as long as you don't pass them out to a different consumer. This is currently not reflected in the pyarrow package, as the main focus of the project was more on using Arrow as an interchange format and is only slowly shifting toward using Arrow as data storage where data is actively worked on.

Contributor

Also note that in the case of the string type in arrow where all string data is stored in a single data array, mutation is impossible. The in-place modification of NumPy strings is only possible because for every row the same fixed amount of space is allocated but even there you need to copy when you would like to insert a new string that is longer than the current longest string.

Thus for a non-object string type, we need to include this in the API discussions.
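
To illustrate the difference described above, here is a rough sketch. The NumPy part uses real fixed-width string behaviour; the `pack`, `get` and `replace` helpers are hypothetical, a toy pure-Python model of a "single data buffer plus offsets" layout, and are not pyarrow API:

```python
import numpy as np

# NumPy fixed-width strings: every row has the same allocated width,
# so in-place assignment works but longer values are silently truncated.
fixed = np.array(["ab", "cd"])
fixed[0] = "xyz"
print(fixed)  # the first element no longer holds the full string


def pack(strings):
    """Toy layout: one contiguous bytes buffer plus start/end offsets."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(encoded), offsets


def get(data, offsets, i):
    """Read row i by slicing the shared buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def replace(data, offsets, i, new):
    """Changing a row's length shifts every later offset, so rebuild."""
    rows = [get(data, offsets, j) for j in range(len(offsets) - 1)]
    rows[i] = new
    return pack(rows)


data, offsets = pack(["ab", "cde"])
new_data, new_offsets = replace(data, offsets, 0, "abcd")
print(offsets, new_offsets)
```

In the toy model a replacement of a different length cannot be written in place, which is the API consequence raised in the comment.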

@toobaz
Member

toobaz commented Jul 19, 2019

Maybe too technical, or just not important enough for a "big picture" talk, but simplifying the indexing code is certainly on the roadmap (thinking of https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code )

@jreback
Contributor

jreback commented Jul 20, 2019

I will look soon

We'd like have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
handling of data types through operations.
Contributor

I am not sure this will ever be the default. Sure we can support it thru EA but default is a big ask here.

Contributor Author

I'll strike "by default"

Member

Again, I would also strike the "this should simplify"... again, Int64 already provides the solution to NaNs in ints. I would just say that Arrow is very efficient and portable across languages...

Contributor Author

Should we remove this entire section?

Member

I have no problem with removing it, but I think the following could also work.

Apache Arrow Interoperability
-----------------------------

`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases.

A more integrated support for Arrow memory and data types within DataFrame and
Series will allow users to exploit the performances of such platform, its I/O
capabilities, and to allow for better interoperability with other languages and
libraries supporting it.

@TomAugspurger
Contributor Author

Added a short numba section if you want to take a look. I emphasized the rolling / window .apply aspect.
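
As an illustration of the rolling / window `.apply` use case mentioned here, a small sketch with a user-defined function. The function name and data are made up for the example, and no numba is involved; it only shows the kind of Python-level callback that would be a candidate for JIT compilation:

```python
import pandas as pd


def value_range(window):
    """Spread of the values in one window (receives a NumPy array with raw=True)."""
    return window.max() - window.min()


s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

# The callback is invoked once per window, in Python, which is why
# custom window functions tend to be slow on large inputs.
result = s.rolling(3).apply(value_range, raw=True)
print(result)
```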


Going forward, it'd be nice if we had a GitHub issue dedicated to each of these things to scope out the problem and serve as a discussion site for specifics (like #27478 (comment)). During those discussions we may determine that the roadmap item is not appropriate and remove it from the roadmap. We shouldn't take these items as gospel handed down from on high.

@WillAyd
Member

WillAyd commented Jul 25, 2019

Haven't reviewed deeply yet but just to bring attention to it we also have something like this on our donations page:

https://pandas.pydata.org/donate.html

So probably want to align these

@TomAugspurger
Contributor Author

Any objections to merging this as is?

@Dr-Irv
Contributor

Dr-Irv commented Aug 1, 2019

@TomAugspurger you said you'd take a look at my ideas in #27652 and see if they are worth including. Thanks.

@TomAugspurger
Contributor Author

TomAugspurger commented Aug 1, 2019 via email

@jorisvandenbossche
Member

@Dr-Irv and thanks for opening that issue! (it's also on my to-do list to look at it) But it is not a blocker for merging this PR; this PR is only meant to kickstart a roadmap document

Member

@simonjayhawkins simonjayhawkins left a comment

lgtm.

@jreback
Contributor

jreback commented Aug 1, 2019

ok by me. I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

@Dr-Irv
Contributor

Dr-Irv commented Aug 1, 2019

@TomAugspurger @jorisvandenbossche Thanks for replying. No reason to stand in front of the merge

@TomAugspurger
Contributor Author

I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

Do you mean for adding to the roadmap, or for inclusion to pandas?

For adding to pandas, no. That'll be the same process (just a PR).

For adding to the roadmap, I suppose it's a higher bar in the sense that there was no bar before :)

@jreback
Contributor

jreback commented Aug 1, 2019

I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

Do you mean for adding to the roadmap, or for inclusion to pandas?

For adding to pandas, no. That'll be the same process (just a PR).

For adding to the roadmap, I suppose it's a higher bar in the sense that there was no bar before :)

right, once we merge, then we should have a 'policy' of what the bar is to include in the roadmap. ok for now just as consensus.

@jreback
Contributor

jreback commented Aug 1, 2019

also ok with this on 0.25.1 as then it will be in the main docs sooner, but up to you.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Two minor wording comments, but fine to go ahead as well

@TomAugspurger TomAugspurger modified the milestones: 1.0, 0.25.1 Aug 1, 2019
@TomAugspurger
Contributor Author

Moved to 0.25.1 so this gets in the docs sooner.

@TomAugspurger TomAugspurger merged commit 95be01d into pandas-dev:master Aug 1, 2019
@TomAugspurger TomAugspurger deleted the roadmap branch August 1, 2019 18:04
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 1, 2019
@jorisvandenbossche
Member

Regarding the "Roadmap evolution" you added, one concern I have is about using github issues for this, as it is very easy to miss an issue in the large flood of github notifications.

Indeed. I'm not sure what's best here. I don't want to implement too much overhead, especially as we're still learning what process works best for pandas. As a compromise, perhaps we require that the mailing list also be notified?

Yep, I also don't really know what would work best. Some random ideas:

  • use github issue in the pandas repo + notify the mailing list of the discussion (what you included now)
  • use mailing list for discussion itself
  • use github issue in separate repo (pandas-design, as we had at some point, which can be a lower traffic repo with higher level discussions, easier to follow the notifications)
  • use a separate forum such as discourse for such discussions

I suppose the first is fine to try for now; we only need to have the discipline to actually do that (although that is true for all options: you always need to ensure the discussion is on the right forum)

jreback pushed a commit that referenced this pull request Aug 2, 2019
quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019