Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9761: [C/C++] Add experimental C stream inferface #8052

Closed
wants to merge 7 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Aug 25, 2020

The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets).

The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).

@pitrou
Copy link
Member Author

pitrou commented Aug 25, 2020

cc @wesm

@github-actions
Copy link

int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
// Callback to get the next array
// (if no error and the array is released, the stream has ended)
// Return value: 0 if successful, an `errno`-compatible error code otherwise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

silly question is errno-compatible a well defined unix/windows term? someplace where I can read more about it?

Copy link
Member Author

@pitrou pitrou Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to phrase it: it returns value that are errno error codes (in case of error). A number of values are standard in C++: https://en.cppreference.com/w/cpp/error/errno_macros

// Release callback: release the stream's own resources.
// Note that arrays returned by `get_next` must be individually released.
void (*release)(struct ArrowArrayStream*);
// Opaque producer-specific data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit ntew line here or remove the new line above?

const char* (*get_last_error)(struct ArrowArrayStream*);

// Release callback: release the stream's own resources.
// Note that arrays returned by `get_next` must be individually released.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are arrays produced by this stream tied to the lifecycle of this stream (must they be released first?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I'd say no.

// Return value: 0 if successful, an `errno`-compatible error code otherwise.
int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
// Callback to get the next array
// (if no error and the array is released, the stream has ended)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

released is something defined in the ArrowArray-spec. Is it stronger or weaker guarantee then returning a nullptr here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nullptr cannot be returned. The callback returns an int. However, we could say that returning -1 means end of stream.

@hannes
Copy link

hannes commented Aug 26, 2020

I have one suggestion: The proposal indicates that the stream is finished when the array resulting from get_next is released. This seems a bit odd, how about just setting its length to 0? Or is it possible in the stream API that individual chunks are of length 0 but subsequent chunks are not?

@pitrou
Copy link
Member Author

pitrou commented Aug 26, 2020

It's possible to have some chunks with 0 length in a stream, yes. I don't see any reason to forbid it in this API (and such corner cases are often annoying to deal with).

@pitrou
Copy link
Member Author

pitrou commented Aug 26, 2020

However, as said above, perhaps returning -1 (rather than 0 for success or a positive errno-like value for errors) could mean an end of stream. Thoughts?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. The next steps would be to integrate this in pyarrow and do the plumbing to get something working end-to-end with a third party project (e.g. DuckDB)

@pitrou pitrou force-pushed the ARROW-9761-c-array-stream branch 6 times, most recently from ec7c5aa to 1064c70 Compare September 1, 2020 14:35
@pitrou pitrou marked this pull request as ready for review September 1, 2020 14:35
@pitrou
Copy link
Member Author

pitrou commented Sep 1, 2020

I added some docs, please take a look.

@pitrou pitrou changed the title ARROW-9761: [C/C++] Add experimental C ArrowArrayStream ABI ARROW-9761: [C/C++] Add experimental C stream inferface Sep 1, 2020
@pitrou
Copy link
Member Author

pitrou commented Sep 9, 2020

Rebased. Does someone want to review the doc at https://pitrou.net/arrowdevdoc/format/CStreamInterface.html ?

@emkornfield
Copy link
Contributor

Rebased. Does someone want to review the doc at https://pitrou.net/arrowdevdoc/format/CStreamInterface.html ?

I took a quick look and seemed reasonable (nothing jumped out at me that needed changes).

@pitrou
Copy link
Member Author

pitrou commented Sep 21, 2020

Rebased again. We can probably merge soon, if CI agrees.

@hannes
Copy link

hannes commented Sep 21, 2020

One thing we noticed: There is no way to rewind such a stream, but we would really like to be able to do so. For example, if someone defines an Arrow-backed SQL VIEW, it can be scanned multiple times. Once it has been read once, it is currently invalidated and cannot be used again. Would it be possible to add a callback for that? If not supported, it could be set to NULL.

@pitrou
Copy link
Member Author

pitrou commented Sep 21, 2020

Would rewind go back to the start of stream always?

@hannes
Copy link

hannes commented Sep 21, 2020

Would rewind go back to the start of stream always?

I would think so, no?

@pitrou
Copy link
Member Author

pitrou commented Sep 21, 2020

Ok, I just wanted to make sure you weren't looking to go back a number of rows given as argument :-)

@emkornfield
Copy link
Contributor

I think JDBC might in some cases allow for "scrolling" type of rewind vs restart (not saying this should support it but it is worth mentioning.

@pitrou
Copy link
Member Author

pitrou commented Sep 21, 2020

Each environment (Python, C++, JDBC, etc.) has its own feature set for streams / iterators, which doesn't make our task easy :-/

@emkornfield
Copy link
Contributor

Starting simple sounds good to me. At least for JDBC, I don't think it is required for driver implementations to support scrolling.

@bkietz
Copy link
Member

bkietz commented Sep 21, 2020

Rewinding doesn't strike me as something which needs to be part of the C stream protocol. APIs can still provide rewind and other semantics while using a simple-as-possible stream as a building block.

In the case of a SQL view which can be viewed multiple times for example, let the SQL view need not be a stream. It could instead be a function returning streams (each beginning at the start of the view):

-ArrowArrayStream* MakeSqlView(...);
+SqlView* MakeSqlView(...);
+ArrowArrayStream* GetStream(SqlView*);

Rewind, scrolling, offsets, reverse iteration, etc can all be accommodated in this fashion so IMHO they don't belong in the protocol.

@wesm
Copy link
Member

wesm commented Sep 21, 2020

Another thing that occurred to me is whether we want to enable batch-level metadata (which would be implementation-defined). This is supported in Flight for example

https://github.com/apache/arrow/blob/master/format/Flight.proto#L316

The goal is to have a standardized ABI to communicate streams of homogeneous
arrays or record batches (for example for database result sets).

The trickiest part is error reporting.  This proposal tries to strike
a compromise between simplicity (an integer error code mapping to errno values)
and expressivity (an optional description string for application-specific
and context-specific details).
@pitrou
Copy link
Member Author

pitrou commented Oct 1, 2020

I don't think it makes sense to continue hesitating on the API. I think I'm going to merge the PR as-is. The interface is marked experimental so can still evolve (or even be removed).

@pitrou pitrou closed this in 97879eb Oct 1, 2020
@pitrou pitrou deleted the ARROW-9761-c-array-stream branch October 1, 2020 10:00
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets).

The trickiest part is error reporting.  This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).

Closes apache#8052 from pitrou/ARROW-9761-c-array-stream

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets).

The trickiest part is error reporting.  This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).

Closes apache#8052 from pitrou/ARROW-9761-c-array-stream

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants