Add a Roadmap #27478

Merged: 30 commits, Aug 1, 2019
Changes from 4 commits

Commits:
965ecd1  added roadmap (TomAugspurger, Jul 19, 2019)
98656c8  added roadmap (TomAugspurger, Jul 19, 2019)
12f1f67  update roadmap (TomAugspurger, Jul 19, 2019)
fb844ae  move to development (TomAugspurger, Jul 19, 2019)
c640c73  Merge remote-tracking branch 'upstream' into roadmap (TomAugspurger, Jul 22, 2019)
c310370  indexing (TomAugspurger, Jul 22, 2019)
d5573bb  Merge remote-tracking branch 'upstream' into roadmap (TomAugspurger, Jul 23, 2019)
4aef936  typos (TomAugspurger, Jul 23, 2019)
200ac63  numba (TomAugspurger, Jul 23, 2019)
8dbd981  reword (TomAugspurger, Jul 23, 2019)
9ac38f0  arrow (TomAugspurger, Jul 25, 2019)
d2883c4  Merge remote-tracking branch 'upstream' into roadmap (TomAugspurger, Jul 25, 2019)
8c65297  Merge remote-tracking branch 'upstream' into roadmap (TomAugspurger, Jul 29, 2019)
4e1af82  Intro (TomAugspurger, Jul 29, 2019)
5702a18  cleanup (TomAugspurger, Jul 29, 2019)
755a5e4  case (TomAugspurger, Jul 29, 2019)
a549cf7  str (TomAugspurger, Jul 29, 2019)
da01cb4  added evolution (TomAugspurger, Jul 29, 2019)
b52d6b9  typos (TomAugspurger, Jul 29, 2019)
bf1338b  missing function (TomAugspurger, Jul 29, 2019)
fb6980c  scope and ML (TomAugspurger, Jul 29, 2019)
85cf5ee  Merge remote-tracking branch 'upstream/master' into roadmap (TomAugspurger, Jul 31, 2019)
65653ee  add note on in / out (TomAugspurger, Jul 31, 2019)
c3b5b5f  Update doc/source/development/roadmap.rst (TomAugspurger, Aug 1, 2019)
d3c9424  Update doc/source/development/roadmap.rst (TomAugspurger, Aug 1, 2019)
a10f78c  Update doc/source/development/roadmap.rst (TomAugspurger, Aug 1, 2019)
6a05c2b  link to tracker (TomAugspurger, Aug 1, 2019)
ce5a2e0  numba link (TomAugspurger, Aug 1, 2019)
7ac38b5  Merge remote-tracking branch 'upstream/master' into roadmap (TomAugspurger, Aug 1, 2019)
ecdffeb  fix link (TomAugspurger, Aug 1, 2019)
1 change: 1 addition & 0 deletions doc/source/development/index.rst
@@ -16,3 +16,4 @@ Development
internals
extending
developer
roadmap
130 changes: 130 additions & 0 deletions doc/source/development/roadmap.rst
@@ -0,0 +1,130 @@
.. _roadmap:

=======
Roadmap
=======

This page provides an overview of the major themes in pandas' development.
Implementation of these goals may be hastened with dedicated funding.

Extensibility
-------------

Pandas Extension Arrays provide 3rd party libraries the ability to
extend pandas' supported types. In theory, these 3rd party types can do
everything one of pandas' own types can. In practice, many places in pandas will
unintentionally convert the ExtensionArray to a NumPy array of Python objects,
causing performance issues and the loss of type information. These problems are
especially pronounced for nested data.

We'd like to improve the handling of extension arrays throughout the library,
making their behavior more consistent with the handling of NumPy arrays.
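
A minimal sketch of this fallback (assuming a pandas version that ships the
nullable ``Int64`` extension type)::

    import pandas as pd

    # A Series backed by an ExtensionArray (the nullable integer type).
    s = pd.Series([1, 2, None], dtype="Int64")
    print(type(s.array).__name__)   # IntegerArray

    # NumPy has no native representation for the mask, so the conversion
    # falls back to an object-dtype array of Python scalars, losing both
    # speed and the original dtype.
    print(s.to_numpy().dtype)       # object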

String Dtype
------------

Currently, pandas stores text data in an ``object``-dtype NumPy array.
Each array stores Python strings. This is pragmatic, since we rely on NumPy
for storage and Python for string operations, but it is memory inefficient
and slow. We'd like to provide a native string type for pandas.
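
For illustration only, a quick way to see the cost of the current
``object``-dtype storage (not part of the proposal)::

    import pandas as pd

    s = pd.Series(["apple", "banana", "cherry"] * 100_000)
    print(s.dtype)                    # object: a NumPy array of Python str objects
    print(s.memory_usage(deep=True))  # deep=True counts the per-string Python overhead
    print(s.str.upper().head())       # .str methods loop over Python objects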

The most obvious alternative is Apache Arrow. Currently, Arrow provides
storage for string data. We can work with the Arrow developers to implement
the operations needed for pandas users (for example, ``Series.str.upper``).
These operations could be implemented in Numba
(as prototyped in `Fletcher <https://github.com/xhochy/fletcher>`__)
or in the Apache Arrow C++ library.

Apache Arrow Interoperability
-----------------------------

`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases (for example, support for nullable integers).
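
As a rough sketch of that alignment (assuming ``pyarrow`` is installed; this is
an illustration, not part of the proposal)::

    import pandas as pd
    import pyarrow as pa

    # Arrow integers are nullable: the missing value is recorded in a
    # validity bitmap instead of forcing an upcast to float.
    arr = pa.array([1, None, 3])
    print(arr.type)        # int64
    print(arr.null_count)  # 1

    # Round-trip a DataFrame through Arrow memory.
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    table = pa.Table.from_pandas(df)
    print(table.schema)
    print(table.to_pandas().dtypes)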

We'd like to have a pandas DataFrame be backed by Arrow memory and data types
by default. This should simplify pandas internals and ensure more consistent
@toobaz (Member), Jul 19, 2019:

At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative (which is the way, unless I misunderstand, to simplify internals). Maybe I misunderstood since the beginning, but I thought that Arrow is read-only, which means a huge API change.

Contributor Author:

> but I thought that Arrow is read-only

I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?

Contributor Author:

> At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative

I wasn't at the sprint, so I may be out of sync with the group. But in my mind, we'll continue to support extension arrays indefinitely. My "by default" was aimed at what to do when given a non-array sequence (a buffer from an IO routine, a list, etc.). I'm not sure what we would want to do if a user does ``pd.Series(np.array([1, None, 3]))``.

Contributor Author:

Also, this discussion is surfacing a problem with the roadmap as written.

  1. It should be clear (at least to us, possibly in the roadmap document) that these are speculative. Even guaranteed funding doesn't necessarily mean that something will be implemented.
  2. Each roadmap entry should link to a GitHub issue / design doc where we talk through specifics.

Member:

OK. Anyway, on this point I think that if Arrow is a way to simplify the internals, it is because it becomes the only alternative: making it the default shouldn't lower the complexity, right?

Member:

I see. But isn't the easiest fix to this just making the Int EA the default integer container? Do we need Arrow for this?

Contributor Author:

I think this kind of discussion belongs in a dedicated issue, not a roadmap meta-issue :)

Member:

> but I thought that Arrow is read-only
>
> I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche if there are plans for mutable arrays?

There are plans for mutability in arrow, see https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.pgrntoqlanlq (although it might not be at the array level, but in any case, as you said, the buffers are mutable)

Contributor:

The buffers passed between components working with Arrow are immutable. But when you use Arrow-style memory in a project, you are free to mutate the buffers as long as you don't pass them out to a different consumer. This is currently not reflected in the pyarrow package, as the main focus of the project was on using Arrow as an interchange format, and it is only slowly shifting to using Arrow as a data storage where data is actively worked on.

Contributor:

Also note that in the case of the string type in Arrow, where all string data is stored in a single data array, mutation is impossible. In-place modification of NumPy strings is only possible because the same fixed amount of space is allocated for every row, and even there you need to copy when you want to insert a new string that is longer than the current longest string.

Thus, for a non-object string type, we need to include this in the API discussions.

handling of data types through operations.
Contributor:

I am not sure this will ever be the default. Sure we can support it thru EA but default is a big ask here.

Contributor Author:

I'll strike "by default"

Member:

Again, I would also strike the "this should simplify"... again, Int64 already provides the solution to NaNs in ints. I would just say that Arrow is very efficient and portable across languages...

Contributor Author:

Should we remove this entire section?

Member:

I have no problem with removing it, but I think the following could also work.

Apache Arrow Interoperability
-----------------------------

`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases.

A more integrated support for Arrow memory and data types within DataFrame and
Series would allow users to exploit the performance of the platform and its I/O
capabilities, and would allow for better interoperability with other languages
and libraries that support it.


Block Manager Rewrite
---------------------

We'd like to replace pandas' current internal data structures (a collection of
1- or 2-D arrays) with a simpler collection of 1-D arrays.

Pandas' internal data model is quite complex. A DataFrame is made up of
one or more 2-dimensional "blocks", with one or more blocks per dtype. This
collection of 2-D arrays is managed by the BlockManager.
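
The block structure can be seen today through a private, version-dependent
attribute; the sketch below is illustrative only and relies on internals that
may change::

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "a": np.arange(3),        # int64
        "b": np.arange(3.0),      # float64
        "c": np.arange(3) * 2,    # int64, consolidated into the same block as "a"
    })

    # ``_data`` is the private BlockManager (``_mgr`` in newer releases); its repr
    # shows a single 2-D int64 block holding "a" and "c" and a float64 block for "b".
    print(df._data)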

The primary benefit of the BlockManager is improved performance on certain
operations (construction from a 2D array, binary operations, reductions across the columns),
especially for wide DataFrames. However, the BlockManager substantially increases the
complexity and maintenance burden of pandas.

By replacing the BlockManager we hope to achieve

* Substantially simpler code
* Easier extensibility with new logical types
* Better user control over memory use and layout
* Improved microperformance
* Option to provide a C / Cython API to pandas' internals

See `these design documents <https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals>`__
for more.

Weighted Operations
-------------------

In many fields, sample weights are necessary to correctly estimate population
statistics. We'd like to support weighted operations (like ``mean``, ``sum``, ``std``,
etc.), possibly with an API similar to ``DataFrame.groupby``.

See https://github.com/pandas-dev/pandas/issues/10030 for more.
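
For context, this is what the workaround looks like today; the ``weightby`` call
at the end is purely hypothetical, sketching the kind of API alluded to above::

    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})

    # Today, a weighted mean is computed by hand:
    weighted_mean = (df["x"] * df["w"]).sum() / df["w"].sum()
    print(weighted_mean)  # 2.25

    # A groupby-like API (hypothetical, not implemented) might read:
    # df.weightby("w")["x"].mean()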

Documentation Improvements
--------------------------

We'd like to improve the content, structure, and presentation of pandas documentation.
Some specific goals include

* Overhaul the HTML theme with a modern, responsive design.
Member:

Could link here to the issue discussing this (#15556). We could probably do that in several places, do we want to do that in general?

Contributor Author:

Yes. My preference is for each of these to be a small summary of a larger discussion / proposal (GitHub issue or some other design document).

* Improve the "Getting Started" documentation, designing and writing learning paths
for users different backgrounds (e.g. brand new to programming, familiar with
other languages like R, already familiar with Python).
* Improve the overall organization of the documentation and specific subsections
of the documentation to make navigation and finding content easier.

Package Docstring Validation
----------------------------

To improve the quality and consistency of pandas docstrings, we've developed
tooling to check docstrings in a variety of ways.
https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py
contains the checks.

Like many other projects, pandas uses the
`numpydoc <https://numpydoc.readthedocs.io/en/latest/>`__ style for writing
docstrings. With the collaboration of the numpydoc maintainers, we'd like to
move the checks to a package other than pandas so that other projects can easily
use them as well.
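
As a small illustration of the numpydoc conventions those checks enforce (the
function itself is hypothetical, written only for this example)::

    def weighted_mean(values, weights):
        """
        Compute the weighted arithmetic mean of ``values``.

        Parameters
        ----------
        values : array-like
            The data to average.
        weights : array-like
            Weights for each observation; must match ``values`` in length.

        Returns
        -------
        float
            The weighted mean.

        Examples
        --------
        >>> weighted_mean([2.0, 4.0], [1.0, 3.0])
        3.5
        """
        total = sum(v * w for v, w in zip(values, weights))
        return total / sum(weights)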

Performance Monitoring
----------------------

Pandas uses `airspeed velocity <https://asv.readthedocs.io/en/stable/>`__ to
monitor for performance regressions. ASV itself is a fabulous tool, but requires
some additional work to be integrated into an open source project's workflow.
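
For reference, an ASV benchmark is just a Python class with timed methods; the
sketch below mimics the style of pandas' benchmark suite (names invented for
illustration)::

    import numpy as np
    import pandas as pd

    class ConcatFrames:
        """Time concatenating many small DataFrames."""

        def setup(self):
            # setup() runs before the timed method and is excluded from the timing.
            self.frames = [pd.DataFrame(np.random.randn(1_000, 10)) for _ in range(20)]

        def time_concat(self):
            # Methods prefixed with ``time_`` are what asv measures.
            pd.concat(self.frames)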

The `asv-runner <https://github.com/asv-runner>`__ organization, currently made up
of pandas maintainers, provides tools built on top of ASV. We have a physical
machine for running a number of projects' benchmarks, and tools for managing the
benchmark runs and reporting on results.

We'd like to fund improvements and maintenance of these tools to

* Be more stable. Currently, they're maintained on the nights and weekends when
a maintainer has free time.
* Tune the system for benchmarks to improve stability, following
https://pyperf.readthedocs.io/en/latest/system.html
* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
benchmarks are only run nightly.