bigquery: Expose Apache Arrow data #8100
Comments
Thanks for reaching out, @zeroshade, about exposing Arrow and supporting ADBC. I was the person working on the integration with our internal Storage Read API using Arrow in our SDK, and I have to say that multiple paths were considered for doing so. One of those paths included giving users access to Arrow records, but as the surface area was already becoming too big, we ended up removing that from the initial release. Some of the foundation for that work is still there under the hood. Regarding your point on writing Arrow records: unfortunately, our Storage Write API doesn't support Arrow. For the time being it supports writing Protobufs. Here you can see that most of the logic for iterating within Arrow records is in this internal
One snag we hit when discussing such public exposure was how to avoid exposing a major version of the Go Arrow library, as you mentioned. Which approach do you recommend? Basically copying the I'll be more than happy to work on exposing an Arrow iterator and to collaborate on the ADBC driver. I'm a bit busy with some work related to the Storage Write API in a separate SDK (NodeJS), but in the coming weeks I'll have more cycles to work on this.
Are there plans to add support for writing Arrow Record streams directly to it? It would be really cool and efficient if so.
The route I ended up taking for Snowflake, which I think can work here, was to expose the IPC streams directly as Ideally, you could expose some way to identify the different read streams so that in ADBC we could let the consumer control the parallelism. In ADBC one of the exposed interfaces is For reference: I'm one of the primary contributors to the Arrow Go library and the ADBC Go library, so I'm open to whatever we need to get this moving, since it's such a highly requested feature for ADBC. I'm happy to contribute PRs here or make modifications in the Arrow / ADBC interfaces if that makes this easier and more sensible. 😄 Looking forward to working with you on this when you have the cycles! 😀
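To make the "let the consumer control the parallelism" idea concrete, here is a minimal sketch using only the Go standard library. `Batch`, `GroupByPartition`, and `ProcessPartitions` are hypothetical names invented for illustration; they are not part of the BigQuery client or ADBC. The sketch assumes each batch carries an identifier for the read stream it came from.

```go
package main

import "sync"

// Batch is a hypothetical stand-in for one serialized Arrow record batch,
// tagged with the read stream (partition) that produced it.
type Batch struct {
	PartitionID string
	Data        []byte
}

// GroupByPartition buckets batches by partition ID, preserving the arrival
// order of batches within each partition.
func GroupByPartition(batches []Batch) map[string][]Batch {
	groups := make(map[string][]Batch)
	for _, b := range batches {
		groups[b.PartitionID] = append(groups[b.PartitionID], b)
	}
	return groups
}

// ProcessPartitions calls handle once per partition, running at most
// maxParallel handlers concurrently. The caller, not the library, chooses
// maxParallel. It returns handle's result keyed by partition ID.
func ProcessPartitions(groups map[string][]Batch, maxParallel int,
	handle func(id string, bs []Batch) int) map[string]int {
	if maxParallel < 1 {
		maxParallel = 1
	}
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	results := make(map[string]int, len(groups))
	sem := make(chan struct{}, maxParallel) // counting semaphore
	for id := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel handlers are running
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			n := handle(id, groups[id]) // groups is only read, never written
			mu.Lock()
			results[id] = n
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}
```

With an identifier per read stream, a driver could hand each group to its own reader and leave the degree of concurrency to the caller.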
Hi @alvarowolfx, based on some of the discussions here, I tried to expose Arrow data in this PR: https://github.com/googleapis/google-cloud-go/pull/8500/files
Hey @k-anshul, thanks for the PR; I'll take a look at it. This work was on hold because we have some planned work to support Arrow data fetching on other query APIs, so we need to think of an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC. I can share a draft PR with some work that I have done here and will add some comments to your PR.
Thanks. Looking forward to it.
@alvarowolfx If you are already working in parallel on exposing ArrowIterator, should I close my PR and wait for your changes, or take input from your work and modify my PR accordingly?
We have some planned work to support Arrow data fetching on other query APIs, so we need an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC. This PR therefore detaches the Storage API from the Arrow decoder and creates a new ArrowIterator interface. The new interface is implemented by the Storage iterator and can later be implemented for other query interfaces that support Arrow. Resolves #8100
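As a rough illustration of the decoupling described above, the following sketch shows an iterator interface that yields serialized batches, one implementation of it, and a consumer that depends only on the interface. It uses the standard library only. `BatchIterator`, `RecordBatch`, `ErrDone`, `NewSliceIterator`, and `Drain` are hypothetical stand-ins, not the library's actual definitions.

```go
package main

import "errors"

// ErrDone is a hypothetical sentinel returned when no batches remain.
var ErrDone = errors.New("no more batches")

// RecordBatch is a hypothetical container for one serialized Arrow batch.
type RecordBatch struct {
	Data        []byte // serialized record batch bytes
	PartitionID string // source partition, e.g. a read stream
}

// BatchIterator is a hypothetical interface that any Arrow-capable query
// path could implement, independent of how the bytes were fetched.
type BatchIterator interface {
	Next() (*RecordBatch, error)
	SerializedSchema() []byte
}

// sliceIterator is one possible implementation, backed by an in-memory slice.
type sliceIterator struct {
	schema  []byte
	batches []RecordBatch
	pos     int
}

// NewSliceIterator wraps pre-fetched batches in a BatchIterator.
func NewSliceIterator(schema []byte, batches []RecordBatch) BatchIterator {
	return &sliceIterator{schema: schema, batches: batches}
}

func (it *sliceIterator) Next() (*RecordBatch, error) {
	if it.pos >= len(it.batches) {
		return nil, ErrDone
	}
	b := &it.batches[it.pos]
	it.pos++
	return b, nil
}

func (it *sliceIterator) SerializedSchema() []byte { return it.schema }

// Drain consumes any BatchIterator and reports how many batches and how
// many payload bytes it saw. It never inspects the concrete type.
func Drain(it BatchIterator) (int, int, error) {
	count, total := 0, 0
	for {
		b, err := it.Next()
		if errors.Is(err, ErrDone) {
			return count, total, nil
		}
		if err != nil {
			return count, total, err
		}
		count++
		total += len(b.Data)
	}
}
```

Because `Drain` only sees the interface, a second implementation backed by a different query API could be swapped in without touching the consumer.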
🤖 I have created a release *beep* *boop*

---

## [1.57.0](https://togithub.com/googleapis/google-cloud-go/compare/bigquery/v1.56.0...bigquery/v1.57.0) (2023-10-30)

### Features

* **bigquery/biglake:** Promote to GA ([e864fbc](https://togithub.com/googleapis/google-cloud-go/commit/e864fbcbc4f0a49dfdb04850b07451074c57edc8))
* **bigquery/storage/managedwriter:** Support default value controls ([#8686](https://togithub.com/googleapis/google-cloud-go/issues/8686)) ([dfa8e22](https://togithub.com/googleapis/google-cloud-go/commit/dfa8e22edf560211ae2a2ebf1f9a23b86887c7be))
* **bigquery:** Expose Apache Arrow data through ArrowIterator ([#8506](https://togithub.com/googleapis/google-cloud-go/issues/8506)) ([c8e7692](https://togithub.com/googleapis/google-cloud-go/commit/c8e76923621b379fb7deb6dfb944011af1d980bd)), refs [#8100](https://togithub.com/googleapis/google-cloud-go/issues/8100)
* **bigquery:** Introduce query preview features ([#8653](https://togithub.com/googleapis/google-cloud-go/issues/8653)) ([f29683b](https://togithub.com/googleapis/google-cloud-go/commit/f29683bcd06567e4fc2d404f53bedbea5b5f0f90))

### Bug Fixes

* **bigquery:** Handle storage read api Recv call errors ([#8666](https://togithub.com/googleapis/google-cloud-go/issues/8666)) ([c73963f](https://togithub.com/googleapis/google-cloud-go/commit/c73963f64ef667daa8a33a5a4cc2156818fc6914))
* **bigquery:** Update golang.org/x/net to v0.17.0 ([174da47](https://togithub.com/googleapis/google-cloud-go/commit/174da47254fefb12921bbfc65b7829a453af6f5d))
* **bigquery:** Update grpc-go to v1.56.3 ([343cea8](https://togithub.com/googleapis/google-cloud-go/commit/343cea8c43b1e31ae21ad50ad31d3b0b60143f8c))
* **bigquery:** Update grpc-go to v1.59.0 ([81a97b0](https://togithub.com/googleapis/google-cloud-go/commit/81a97b06cb28b25432e4ece595c55a9857e960b7))

---

This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).
Is your feature request related to a problem? Please describe.
Currently, even though Apache Arrow is used internally by the BigQuery client, neither the Arrow iterator nor the Arrow data itself is exposed by the client. There is no way to retrieve the data in its columnar format, which imposes the performance cost of transposing the data to get it row by row.
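To illustrate the transposition cost mentioned above, here is a small standard-library sketch. `Columns` is a hypothetical columnar layout, not an Arrow type. Scanning one column touches a single contiguous slice, while converting to rows allocates a new slice per row and copies every value.

```go
package main

// Columns is a hypothetical columnar layout: one contiguous slice per
// column, so Data[c][r] is the value of column c at row r.
type Columns struct {
	Names []string
	Data  [][]int64
}

// SumColumn scans a single column in place, with no allocation.
func (c Columns) SumColumn(col int) int64 {
	var sum int64
	for _, v := range c.Data[col] {
		sum += v
	}
	return sum
}

// ToRows transposes the columns into row-major form. It allocates one
// slice per row and copies every value, which is the overhead a consumer
// pays when only a row-by-row API is available.
func (c Columns) ToRows() [][]int64 {
	if len(c.Data) == 0 {
		return nil
	}
	numRows := len(c.Data[0])
	rows := make([][]int64, numRows)
	for r := 0; r < numRows; r++ {
		row := make([]int64, len(c.Data))
		for ci := range c.Data {
			row[ci] = c.Data[ci][r]
		}
		rows[r] = row
	}
	return rows
}
```

A consumer that wants columnar data back would then have to undo this transposition, paying the cost twice.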
Describe the solution you'd like
Ideally, the Arrow IPC streams could be exposed directly so that consumers can retrieve the data without needing to use the same major version of the Go Arrow libraries. If possible, it would also help to expose the ability to parallelize fetching (for example, when multiple partitions are available). At a minimum, though, exposing an interface that satisfies the RecordReader interface would be very useful.
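One way to picture the version-independence of raw IPC streams: an Arrow IPC stream is a schema message followed by record batch messages, so a library can hand out plain bytes behind an `io.Reader` without importing Arrow at all. The sketch below uses only the standard library; `IPCStream` is a hypothetical helper, and it assumes the schema and batches are already serialized as IPC messages.

```go
package main

import (
	"bytes"
	"io"
)

// IPCStream returns a reader that yields the serialized schema followed by
// each serialized record batch, in order. No Arrow package is imported, so
// the producer does not pin consumers to any Arrow major version.
func IPCStream(schema []byte, batches [][]byte) io.Reader {
	readers := make([]io.Reader, 0, len(batches)+1)
	readers = append(readers, bytes.NewReader(schema))
	for _, b := range batches {
		readers = append(readers, bytes.NewReader(b))
	}
	return io.MultiReader(readers...)
}
```

A consumer could then pass the returned reader to the IPC reader of whichever Arrow Go version it already depends on (for example, that version's `ipc.NewReader`), which is the property the request is after.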
Describe alternatives you've considered
The only alternative I can think of would be to bypass this library entirely and implement fetching from BigQuery's API directly, which would be inconvenient and ultimately unmaintainable.
Additional context
I'm looking to build a BigQuery driver for Arrow Database Connectivity (ADBC) and, for ease of deployment, was going to build it in Go (as we have already done for the Arrow Flight SQL and Snowflake ADBC drivers). At conferences where I've spoken about ADBC, the second-biggest request has been for a BigQuery driver, so I'd like to start looking into this. I thought it would be easy since I saw that BigQuery already uses Arrow, but I couldn't find anything in the pkg.go.dev docs on how to actually retrieve the Arrow data directly. If I'm wrong, then I'd happily take a link to any documentation on how to retrieve the Arrow data (and, even better, write Arrow data to BigQuery!).
Thank you very much!