Add a Roadmap #27478
At the sprint I think we talked about Arrow as a powerful alternative to numpy, not as a default, and even less as the only alternative (which, unless I misunderstand, is the way to simplify internals). Maybe I have misunderstood from the beginning, but I thought that Arrow is read-only, which means a huge API change.
I believe arrays are immutable right now, but the underlying buffers are mutable. @jorisvandenbossche do you know if there are plans for mutable arrays?
I wasn't at the sprint, so I may be out of sync with the group. But in my mind, we'll continue to support extension arrays indefinitely. My "by default" was aimed at what to do when given a non-array sequence (buffer from an IO routine, a list, etc). I'm not sure what we would want to do if a user does `pd.Series(np.array([1, None, 3]))`.
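For reference, a minimal sketch of what that call does, assuming a reasonably recent pandas with the nullable integer extension dtype available. The variable names are illustrative only; this shows observed behaviour, not a proposal for what the default should be.

```python
import numpy as np
import pandas as pd

# A NumPy array mixing ints and None cannot use an integer dtype,
# so NumPy falls back to an object array of Python objects.
values = np.array([1, None, 3])

# pandas keeps that object dtype when wrapping the array.
s_object = pd.Series(values)

# Explicitly requesting the nullable integer extension dtype keeps
# integer storage and tracks the missing value separately.
s_nullable = pd.Series([1, None, 3], dtype="Int64")

print(values.dtype)
print(s_object.dtype)
print(s_nullable.dtype)
print(s_nullable.isna().tolist())
```

The open question in the comment above is which of these two results a user should get when no dtype is given.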
Also, this discussion is surfacing a problem with the roadmap as written.
OK. Anyway, on this point I think that if Arrow is a way to simplify the internals, it is because it becomes the only alternative: making it the default shouldn't lower the complexity, right?
I see. But isn't the easiest fix to this just making the `Int` EA the default integer container? Do we need Arrow for this?
I think this kind of discussion belongs in a dedicated issue, not a roadmap meta-issue :)
There are plans for mutability in arrow, see https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.pgrntoqlanlq (although it might not be at the array level, but in any case, as you said, the buffers are mutable)
The buffers passed between components working with Arrow are immutable. But when you use Arrow-style memory in a project, you are free to mutate the buffers as long as you don't pass them out to a different consumer. This is currently not reflected in the `pyarrow` package, as the main focus of the project was on using Arrow as an interchange format, and it is only slowly shifting to using Arrow as data storage where data is actively worked on.
Also note that in the case of the string type in arrow, where all string data is stored in a single data array, mutation is impossible. The in-place modification of NumPy strings is only possible because for every row the same fixed amount of space is allocated, but even there you need to copy when you would like to insert a new string that is longer than the current longest string.
Thus for a non-`object` string type, we need to include this in the API discussions.
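To make the layout point concrete, here is a pure-Python sketch of the two storage schemes described above. It is an illustration only: the helper names (`encode_variable`, `set_variable`, `encode_fixed`, `set_fixed`, `get_fixed`) are hypothetical and this is not how pyarrow or NumPy are implemented internally.

```python
def encode_variable(strings):
    """Arrow-style layout: one contiguous data buffer plus an offsets list."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return data, offsets


def get_variable(data, offsets, i):
    """Read row i by slicing between two consecutive offsets."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


def set_variable(data, offsets, i, value):
    """Replacing a row rebuilds both buffers, i.e. it is a copy, not an in-place write."""
    values = [get_variable(data, offsets, j) for j in range(len(offsets) - 1)]
    values[i] = value
    return encode_variable(values)


def encode_fixed(strings, width):
    """NumPy-style layout: every row owns exactly `width` zero-padded bytes."""
    data = bytearray(width * len(strings))
    for i, s in enumerate(strings):
        raw = s.encode("utf-8")
        data[i * width:i * width + len(raw)] = raw
    return data


def set_fixed(data, width, i, value):
    """In-place write, possible only while the value fits in the row width."""
    raw = value.encode("utf-8")
    if len(raw) > width:
        raise ValueError("value does not fit; the whole array must be re-allocated")
    data[i * width:(i + 1) * width] = raw.ljust(width, b"\x00")


def get_fixed(data, width, i):
    """Read row i and strip the zero padding."""
    return bytes(data[i * width:(i + 1) * width]).rstrip(b"\x00").decode("utf-8")


var_data, var_offsets = encode_variable(["a", "bb", "ccc"])
new_data, new_offsets = set_variable(var_data, var_offsets, 1, "longer")

fixed = encode_fixed(["a", "bb", "ccc"], width=3)
set_fixed(fixed, 3, 1, "xyz")
```

In the variable-width case every offset after the changed row shifts, which is why the sketch returns new buffers; in the fixed-width case the write stays in place until a value exceeds the row width.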
I am not sure this will ever be the default. Sure we can support it thru EA but default is a big ask here.
I'll strike "by default"
Again, I would also strike the "this should simplify"... again, `Int64` already provides the solution to NaNs in ints. I would just say that Arrow is very efficient and portable across languages...
Should we remove this entire section?
I have no problem with removing it, but I think the following could also work.
Could link here to the issue discussing this (#15556). We could probably do that in several places, do we want to do that in general?
Yes. My preference is for each of these to be a small summary of a larger discussion / proposal (GitHub issue or some other design document).