-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9761: [C/C++] Add experimental C stream inferface #8052
Conversation
cc @wesm |
int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out); | ||
// Callback to get the next array | ||
// (if no error and the array is released, the stream has ended) | ||
// Return value: 0 if successful, an `errno`-compatible error code otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
silly question is errno
-compatible a well defined unix/windows term? someplace where I can read more about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to phrase it: it returns value that are errno
error codes (in case of error). A number of values are standard in C++: https://en.cppreference.com/w/cpp/error/errno_macros
// Release callback: release the stream's own resources. | ||
// Note that arrays returned by `get_next` must be individually released. | ||
void (*release)(struct ArrowArrayStream*); | ||
// Opaque producer-specific data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit ntew line here or remove the new line above?
const char* (*get_last_error)(struct ArrowArrayStream*); | ||
|
||
// Release callback: release the stream's own resources. | ||
// Note that arrays returned by `get_next` must be individually released. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are arrays produced by this stream tied to the lifecycle of this stream (must they be released first?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I'd say no.
// Return value: 0 if successful, an `errno`-compatible error code otherwise. | ||
int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out); | ||
// Callback to get the next array | ||
// (if no error and the array is released, the stream has ended) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
released
is something defined in the ArrowArray-spec. Is it stronger or weaker guarantee then returning a nullptr here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nullptr cannot be returned. The callback returns an int
. However, we could say that returning -1
means end of stream.
I have one suggestion: The proposal indicates that the stream is finished when the array resulting from get_next is released. This seems a bit odd, how about just setting its length to 0? Or is it possible in the stream API that individual chunks are of length 0 but subsequent chunks are not? |
It's possible to have some chunks with 0 length in a stream, yes. I don't see any reason to forbid it in this API (and such corner cases are often annoying to deal with). |
However, as said above, perhaps returning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. The next steps would be to integrate this in pyarrow and do the plumbing to get something working end-to-end with a third party project (e.g. DuckDB)
ec7c5aa
to
1064c70
Compare
1064c70
to
04c25d8
Compare
I added some docs, please take a look. |
04c25d8
to
b71d965
Compare
Rebased. Does someone want to review the doc at https://pitrou.net/arrowdevdoc/format/CStreamInterface.html ? |
I took a quick look and seemed reasonable (nothing jumped out at me that needed changes). |
b71d965
to
61e2c5d
Compare
Rebased again. We can probably merge soon, if CI agrees. |
One thing we noticed: There is no way to |
Would |
I would think so, no? |
Ok, I just wanted to make sure you weren't looking to go back a number of rows given as argument :-) |
I think JDBC might in some cases allow for "scrolling" type of rewind vs restart (not saying this should support it but it is worth mentioning. |
Each environment (Python, C++, JDBC, etc.) has its own feature set for streams / iterators, which doesn't make our task easy :-/ |
Starting simple sounds good to me. At least for JDBC, I don't think it is required for driver implementations to support scrolling. |
Rewinding doesn't strike me as something which needs to be part of the C stream protocol. APIs can still provide rewind and other semantics while using a simple-as-possible stream as a building block. In the case of a SQL view which can be viewed multiple times for example, let the SQL view need not be a stream. It could instead be a function returning streams (each beginning at the start of the view): -ArrowArrayStream* MakeSqlView(...);
+SqlView* MakeSqlView(...);
+ArrowArrayStream* GetStream(SqlView*); Rewind, scrolling, offsets, reverse iteration, etc can all be accommodated in this fashion so IMHO they don't belong in the protocol. |
Another thing that occurred to me is whether we want to enable batch-level metadata (which would be implementation-defined). This is supported in Flight for example https://github.com/apache/arrow/blob/master/format/Flight.proto#L316 |
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets). The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).
61e2c5d
to
9ade9a9
Compare
9ade9a9
to
87e8f73
Compare
I don't think it makes sense to continue hesitating on the API. I think I'm going to merge the PR as-is. The interface is marked experimental so can still evolve (or even be removed). |
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets). The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details). Closes apache#8052 from pitrou/ARROW-9761-c-array-stream Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets). The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details). Closes apache#8052 from pitrou/ARROW-9761-c-array-stream Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets).
The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).