Add a Roadmap #27478

TomAugspurger · 2019-07-19T15:10:58Z

This PR adds a roadmap document. This is useful when pursuing funding; we can point to a list of known items that we'd like to do if we had the person time (and funding) to tackle them.

Let's have two discussions

Do we want this? Roadmaps tend to go stale. How can we keep this up to date?
If so, what items should go on it? I've mostly picked from https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#heading=h.qm48l6dargmd plus a couple pet peeves of mine.

cc @pandas-dev/pandas-core

datapythonista

Thanks for this, great work. Added some comments/ideas, but looks great.

Couple of points that I'd like to add in the future (there is no consensus atm):

Standardize the plotting API (move pandas.plotting.* -> df.plot(kind='*'), e.g. andrews_curves)
Build the IO adapters (pandas.read_* and DataFrame.to_*) as a set of standard modules and allow an ecosystem of third-party pandas IO packages (CLN: Implement io modules as plugins #26804)
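To make the second point a bit more concrete, here is a minimal sketch of what a plugin-style reader registry could look like. Everything in it (`register_reader`, `read`, the `"csv-lite"` format) is hypothetical and made up for illustration; it is not the design proposed in #26804 and none of these names exist in pandas.

```python
# Hypothetical registry: none of these names exist in pandas.
_READERS = {}


def register_reader(fmt):
    """Decorator that records a reader function under a format name."""
    def decorator(func):
        _READERS[fmt] = func
        return func
    return decorator


def read(fmt, source):
    """Dispatch to whichever reader was registered for `fmt`."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return reader(source)


@register_reader("csv-lite")
def read_csv_lite(text):
    # Toy reader standing in for a third-party package: returns a list of rows.
    return [line.split(",") for line in text.splitlines()]


rows = read("csv-lite", "a,b\n1,2")
print(rows)  # [['a', 'b'], ['1', '2']]
```

A third-party package would only need to call the registration hook; the core library would not have to know about the format in advance.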

Since we mention the benchmarks, maybe we could also add a section about the CI? There are many things I'd like to see implemented to improve our workflow (the main ones are #26930 and #23115).

datapythonista · 2019-07-19T15:33:35Z

doc/source/roadmap.rst

+
+We'd like have a pandas DataFrame be backed by Arrow memory and data types
+by default. This should simplify pandas internals and ensure more consistent
+handling of data types through operations.


Not sure if there is agreement on also implementing DataFrames based on memory maps. I think it was discussed in the past. If people are happy with it, we could mention it here too.

I'm not sure I know enough about memmaps to know whether this is a good idea / feasible for pandas.

jbrockmendel · 2019-07-19T15:58:11Z

Do we want this? Roadmaps tend to go stale. How can we keep this up to date?

Conditional on having a good way to keep it up to date, I have no strong opinion. Otherwise I'd rather not have it.

If so, what items should go on it?

Depends largely on: who is the target audience?

datapythonista · 2019-07-19T16:10:21Z

Do we want this?

I think it adds a lot of value to users, to developers of other tools in the ecosystem, and also to companies or institutions who can be incentivized to provide funds to see the points in the roadmap implemented. Also to ourselves: at least to me, it's good to have an overview of the points we all think we should be moving towards.

I think having this page in the top level of the docs will make it visible enough to not be obviously outdated. But in any case, I think all the points will take months (or years) to be implemented, I don't see the roadmap being outdated easily. We can add a check to the release process to double check that completed items are deleted from the roadmap. We may forget to add new points, or remove points that stop being a priority, but I think we can keep it updated enough, and even if we can't, to me a not very accurate roadmap is better than no roadmap.

TomAugspurger · 2019-07-19T18:10:00Z

But in any case, I think all the points will take months (or years) to be implemented, I don't see the roadmap being outdated easily.

That's my hope. I don't expect many of these to be checked off within the year.

Since we mention the benchmarks, maybe we could also add a section about the CI?

Happy to have something here like that. Would you mind pushing it here, or as a followup? I was struggling with summarizing what remains to be done with those linked issues.

jbrockmendel · 2019-07-19T21:33:16Z

Touching on the Target Audience thing again, two ideas for niches this might help with:

clarifying what Big Picture things are being actively worked on (possibly by whom)
we have Good First Issue tags but there is a gap between newcomer and veteran where contributors might appreciate ideas on where help is needed

toobaz · 2019-07-19T23:02:20Z

doc/source/development/roadmap.rst

+typical pandas use cases (for example, support for nullable integers).
+
+We'd like have a pandas DataFrame be backed by Arrow memory and data types
+by default. This should simplify pandas internals and ensure more consistent


At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative (which is the way, unless I misunderstand, to simplify internals). Maybe I misunderstood from the beginning, but I thought that Arrow is read-only, which would mean a huge API change.

but I thought that Arrow is read-only

I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche do you know if there are plans for mutable arrays?

At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative

I wasn't at the sprint, so I may be out of sync with the group. But in my mind, we'll continue to support extension arrays indefinitely. My "by default" was aimed at what to do when given a non-array sequence (buffer from an IO routine, a list, etc). I'm not sure what we would want to do if a user does pd.Series(np.array([1, None, 3])).
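For reference, a small sketch of how that last case behaves with NumPy-backed storage, next to the opt-in nullable integer dtype. It assumes a pandas version that ships the `Int64` extension dtype; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holding a None cannot use an integer dtype,
# so NumPy falls back to a generic object array.
values = np.array([1, None, 3])
print(values.dtype)  # object

# pandas keeps the object dtype when handed that array as-is.
as_object = pd.Series(values)
print(as_object.dtype)  # object

# Opting in to the nullable integer extension dtype keeps the data
# integer-typed and tracks the missing value separately.
as_nullable = pd.Series([1, None, 3], dtype="Int64")
print(as_nullable.dtype)  # Int64
print(as_nullable.isna().tolist())  # [False, True, False]
```

The open question in the comment is which of these outcomes should apply when no dtype is requested.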

Also, this discussion is surfacing a problem with the roadmap as written.

It should be clear (at least to us, possibly in the roadmap document) that these are speculative. Even guaranteed funding doesn't necessarily mean that something will be implemented.

Each roadmap entry should link to a GitHub issue / design doc where we talk through specifics.

OK. Anyway, on this point I think that if Arrow is a way to simplify the internals, it is because it becomes the only alternative: making it the default shouldn't lower the complexity, right?

I see. But isn't the easiest fix to this just making the Int EA the default integer container? Do we need Arrow for this?

I think this kind of discussion belongs in a dedicated issue, not a roadmap meta-issue :)

but I thought that Arrow is read-only

I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche do you know if there are plans for mutable arrays?

There are plans for mutability in arrow, see https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.pgrntoqlanlq (although it might not be at the array level, but in any case, as you said, the buffers are mutable)

The buffers passed between components working with Arrow are immutable. But when you use Arrow-style memory in a project, you are free to mutate the buffers as long as you don't pass them out to a different consumer. This is currently not reflected in the pyarrow package, as the main focus of the project has been on using Arrow as an interchange format, and it is only slowly shifting to using Arrow as data storage where data is actively worked on.

Also note that in the case of the string type in arrow where all string data is stored in a single data array, mutation is impossible. The in-place modification of NumPy strings is only possible because for every row the same fixed amount of space is allocated but even there you need to copy when you would like to insert a new string that is longer than the current longest string.
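To illustrate why in-place mutation is a problem for that layout, here is a simplified pure-Python model of a string column stored as one contiguous data buffer plus an offsets list. This is only a sketch of the idea; it is not the pyarrow API, and the helper names are made up.

```python
def build_string_column(strings):
    """Pack strings into one contiguous byte buffer plus an offsets list."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return data, offsets


def get_string(data, offsets, i):
    """Read row i by slicing the shared buffer between two offsets."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


def replace_string(data, offsets, i, new):
    """Replace row i; returns fresh buffers because later rows must shift."""
    rows = [get_string(data, offsets, j) for j in range(len(offsets) - 1)]
    rows[i] = new
    return build_string_column(rows)


data, offsets = build_string_column(["a", "bcd", "ef"])
print(offsets)  # [0, 1, 4, 6]

new_data, new_offsets = replace_string(data, offsets, 1, "longer")
print(new_offsets)  # [0, 1, 7, 9]
print(get_string(new_data, new_offsets, 2))  # ef
```

Changing the length of one row moves every later offset, so in this model the column is rebuilt rather than edited in place.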

Thus for a non-object string type, we need to include this in the API discussions.

toobaz · 2019-07-19T23:08:49Z

Maybe too technical, or just not important enough for a "big picture" talk, but simplifying the indexing code is certainly on the roadmap (thinking of https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code )

jreback · 2019-07-20T16:44:09Z

I will look soon

jreback · 2019-07-23T11:47:57Z

doc/source/development/roadmap.rst

+
+We'd like have a pandas DataFrame be backed by Arrow memory and data types
+by default. This should simplify pandas internals and ensure more consistent
+handling of data types through operations.


I am not sure this will ever be the default. Sure we can support it thru EA but default is a big ask here.

I'll strike "by default"

Again, I would also strike the "this should simplify"... again, Int64 already provides the solution to NaNs in ints. I would just say that Arrow is very efficient and portable across languages...

Should we remove this entire section?

I have no problem with removing it, but I think the following could also work.

Apache Arrow Interoperability
-----------------------------

`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases. A more integrated support for Arrow memory and data
types within DataFrame and Series will allow users to exploit the performances
of such platform, its I/O capabilities, and to allow for better
interoperability with other languages and libraries supporting it.

TomAugspurger · 2019-07-23T13:41:43Z

Added a short numba section if you want to take a look. I emphasized the rolling / window .apply aspect.

Going forward, it'd be nice if we had a GitHub issue dedicated to each of these things to scope out the problem and serve as a discussion site for specifics (like #27478 (comment)). During those discussions we may determine that the roadmap item is not appropriate and remove it from the roadmap. We shouldn't take these items as gospel handed down from on high.

WillAyd · 2019-07-25T14:36:53Z

Haven't reviewed deeply yet but just to bring attention to it we also have something like this on our donations page:

https://pandas.pydata.org/donate.html

So probably want to align these

TomAugspurger · 2019-08-01T13:17:10Z

Any objections to merging this as is?

Dr-Irv · 2019-08-01T13:40:39Z

@TomAugspurger you said you'd take a look at my ideas in #27652 and see if they are worth including. Thanks.

TomAugspurger · 2019-08-01T13:43:08Z

Right, I'm still mulling that over. Once that discussion gets to the point where we think it's ready we can go through the evolution process to add it to this document.

jorisvandenbossche · 2019-08-01T13:44:08Z

@Dr-Irv thanks for opening that issue! (It's also on my to-do list to look at it.) But it is not a blocker for merging this PR; this PR is only meant to kickstart a roadmap document.

simonjayhawkins

lgtm.

jreback · 2019-08-01T13:56:00Z

ok by me. I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

Dr-Irv · 2019-08-01T13:56:31Z

@TomAugspurger @jorisvandenbossche Thanks for replying. No reason to stand in front of the merge

TomAugspurger · 2019-08-01T13:58:23Z

I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

Do you mean for adding to the roadmap, or for inclusion to pandas?

For adding to pandas, no. That'll be the same process (just a PR).

For adding to the roadmap, I suppose it's a higher bar in the sense that there was no bar before :)

jreback · 2019-08-01T14:02:37Z

I don't think we need a more formal process than a PR, though IIUC you are suggesting a higher bar for major changes here?

Do you mean for adding to the roadmap, or for inclusion to pandas?

For adding to pandas, no. That'll be the same process (just a PR).

For adding to the roadmap, I suppose it's a higher bar in the sense that there was no bar before :)

right, once we merge, then we should have a 'policy' of what the bar is to include in the roadmap. ok for now just as consensus.

jreback · 2019-08-01T14:03:03Z

also ok with this on 0.25.1 as then it will be in the main docs sooner, but up to you.

jorisvandenbossche

Two minor wording comments, but fine to go ahead as well

TomAugspurger · 2019-08-01T18:04:02Z

Moved to 0.25.1 so this gets in the docs sooner.

jorisvandenbossche · 2019-08-01T20:50:46Z

Regarding the "Roadmap evolution" you added, one concern I have is about using github issues for this, as it is very easy to miss an issue in the large flood of github notifications.

Indeed. I'm not sure what's best here. I don't want to implement too much overhead, especially as we're still learning what process works best for pandas. As a compromise, perhaps we require that the mailing list also be notified?

Yep, I also don't really know what would work best. Some random ideas:

use github issue in the pandas repo + notify the mailing list of the discussion (what you included now)
use mailing list for discussion itself
use github issue in separate repo (pandas-design, as we had at some point, which can be a lower traffic repo with higher level discussions, easier to follow the notifications)
use a separate forum such as discourse for such discussions

I suppose the first is fine to try for now; we only need to have the discipline to actually do that (although that is true for all options: you always need to ensure the discussion is on the right forum).

TomAugspurger added 2 commits July 18, 2019 19:43

added roadmap

965ecd1

added roadmap

98656c8

TomAugspurger added the Admin Administrative tasks related to the pandas project label Jul 19, 2019

TomAugspurger added this to the 1.0 milestone Jul 19, 2019

datapythonista reviewed Jul 19, 2019

datapythonista mentioned this pull request Jul 19, 2019

Project roadmaps python-sprints/dataframe-summit#2

Open

datapythonista mentioned this pull request Jul 19, 2019

Towards "pandas 1.0" #10000

Closed

update roadmap

12f1f67

move to development

fb844ae

toobaz reviewed Jul 19, 2019

TomAugspurger added 2 commits July 22, 2019 09:50

Merge remote-tracking branch 'upstream' into roadmap

c640c73

indexing

c310370

toobaz reviewed Jul 23, 2019

toobaz reviewed Jul 23, 2019

toobaz reviewed Jul 23, 2019

TomAugspurger added 2 commits July 23, 2019 05:39

Merge remote-tracking branch 'upstream' into roadmap

d5573bb

typos

4aef936

jreback requested changes Jul 23, 2019

TomAugspurger added 2 commits July 23, 2019 08:36

numba

200ac63

reword

8dbd981

arrow

9ac38f0

simonjayhawkins approved these changes Aug 1, 2019

jreback approved these changes Aug 1, 2019

TomAugspurger and others added 3 commits August 1, 2019 09:00

Update doc/source/development/roadmap.rst

c3b5b5f

Co-Authored-By: Simon Hawkins <[email protected]>

Update doc/source/development/roadmap.rst

d3c9424

Co-Authored-By: Simon Hawkins <[email protected]>

Update doc/source/development/roadmap.rst

a10f78c

Co-Authored-By: Simon Hawkins <[email protected]>

jorisvandenbossche approved these changes Aug 1, 2019

TomAugspurger added 4 commits August 1, 2019 09:13

link to tracker

6a05c2b

numba link

ce5a2e0

Merge remote-tracking branch 'upstream/master' into roadmap

7ac38b5

fix link

ecdffeb

TomAugspurger modified the milestones: 1.0, 0.25.1 Aug 1, 2019

TomAugspurger merged commit 95be01d into pandas-dev:master Aug 1, 2019

TomAugspurger deleted the roadmap branch August 1, 2019 18:04

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 1, 2019

Backport PR pandas-dev#27478: Add a Roadmap

2fde819

meeseeksmachine mentioned this pull request Aug 1, 2019

Backport PR #27478 on branch 0.25.x (Add a Roadmap) #27698

Merged

jreback pushed a commit that referenced this pull request Aug 2, 2019

Backport PR #27478: Add a Roadmap (#27698)

e3f8348

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019

DOC: Add a Roadmap (pandas-dev#27478)

333edf9

h-vetinari mentioned this pull request Oct 10, 2019

Pandas Enhancement Proposals? #28568

Closed
