bigquery: Expose Apache Arrow data #8100

Closed
zeroshade opened this issue Jun 15, 2023 · 7 comments · Fixed by #8506
Assignees
Labels
api: bigquery (Issues related to the BigQuery API.) · type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)

Comments

@zeroshade

Is your feature request related to a problem? Please describe.
Currently, even though Apache Arrow is used internally by the BigQuery client, neither the Arrow iterator nor the Arrow data itself is exposed externally. There is no way to retrieve the data in its columnar format, which imposes a performance cost: the data has to be transposed to be consumed row by row.

Describe the solution you'd like
Ideally the Arrow IPC streams could be exposed directly, so consumers can retrieve the data without having to use the same major version of the Go Arrow library. Where possible, the ability to parallelize fetching (based on multiple partitions being available, etc.) could be exposed as well. At a minimum, though, exposing an interface that satisfies the RecordReader interface (sketched below) would be very useful.
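
For reference, the RecordReader interface in the Arrow Go library looks roughly like this (paraphrased from the arrow/array package; exact details vary by major version):

```go
package array // paraphrased from github.com/apache/arrow/go/v13/arrow/array

import "github.com/apache/arrow/go/v13/arrow"

// RecordReader reads a stream of records that share a common schema.
type RecordReader interface {
	// Retain and Release manage the reader's reference count.
	Retain()
	Release()

	// Schema returns the shared schema of the records in the stream.
	Schema() *arrow.Schema

	// Next advances to the next record, returning false once the
	// stream is exhausted.
	Next() bool

	// Record returns the current record; it is only valid until the
	// next call to Next.
	Record() arrow.Record

	// Err returns any error encountered during iteration.
	Err() error
}
```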

Describe alternatives you've considered
The only alternative I can think of would be to bypass this library entirely and implement fetching from BigQuery's API directly, which would ultimately be inconvenient and unmaintainable.

Additional context
I'm looking to build a BigQuery driver for Arrow Database Connectivity (ADBC) and, for ease of deployment, was going to build it in Go (as we have already done for the Arrow Flight SQL and Snowflake ADBC drivers). At conferences where I've spoken about ADBC, the second-biggest request has been for a BigQuery driver, so I'd like to start looking into this. I thought it would be easy since I saw that BigQuery already uses Arrow, but I couldn't find anything in the pkg.go.dev docs on how to actually retrieve the Arrow data directly. If I'm wrong, I'd happily take a link to any documentation on how to retrieve the Arrow data (and even better, how to write Arrow data to BigQuery!).

Thank you very much!

@zeroshade zeroshade added the triage me I really want to be triaged. label Jun 15, 2023
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Jun 15, 2023
@shollyman shollyman assigned alvarowolfx and unassigned shollyman Jun 15, 2023
@noahdietz noahdietz added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Jun 15, 2023
@alvarowolfx
Contributor

Thanks for reaching out about exposing Arrow and supporting ADBC, @zeroshade. I was the person working on the integration of our internal Storage Read API with Arrow in this SDK, and I have to say that multiple paths were considered for doing so. One of those paths included giving users access to Arrow records, but as the surface area was already becoming too big, we decided to drop that from the initial release. Some of the foundation for that work is there under the hood, though.

Regarding your point on writing Arrow records: unfortunately our Storage Write API doesn't have support for Arrow. For the time being it supports writing Protobufs.

Here you can see that most of the logic for iterating over Arrow records lives in the internal arrowIterator type (type arrowIterator struct { ... }), which could be extended into a public version available to users. It also reads BigQuery Read Streams in parallel for improved performance (internally we saw 4-50x improvements in download speed and a ~50% reduction in allocations). The fan-out pattern is sketched below.
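
To illustrate the parallel-read idea only (this is not the library's actual internals; readStream, serializedBatch, and the stream names are hypothetical), a minimal fan-out could look like:

```go
package main

import (
	"context"
	"sync"
)

// serializedBatch stands in for a serialized Arrow record batch as
// returned by one BigQuery Storage Read stream (hypothetical type).
type serializedBatch []byte

// readStream is a hypothetical function that drains one Storage Read
// stream, forwarding each batch it receives onto out.
func readStream(ctx context.Context, streamName string, out chan<- serializedBatch) error {
	// ... call ReadRows for streamName and forward batches ...
	return nil
}

// readAllStreams starts one goroutine per read stream and merges the
// resulting batches onto a single channel, closing it once all streams
// are drained.
func readAllStreams(ctx context.Context, streams []string) <-chan serializedBatch {
	out := make(chan serializedBatch)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			_ = readStream(ctx, name, out) // error handling elided for brevity
		}(s)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}
```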

One snag we hit when discussing such public exposure was how to avoid exposing a major version of the Go Arrow library, like you mentioned. Which approach do you recommend? Basically copying the RecordReader interface on our end and making the (future) ArrowIterator conform to it publicly?

I'll be more than happy to work on exposing an Arrow iterator and also to collaborate on the ADBC driver. I'm a bit busy with some work related to the Storage Write API in a separate SDK (Node.js), but in the coming weeks I'll have more cycles to work on this.

@zeroshade
Author

Regarding your point on writing Arrow records: unfortunately our Storage Write API doesn't have support for Arrow. For the time being it supports writing Protobufs.

Are there plans to add support for writing Arrow Record streams directly to it? It would be really cool and efficient if so.

One snag we hit when discussing such public exposure was how to avoid exposing a major version of the Go Arrow library, like you mentioned. Which approach do you recommend? Basically copying the RecordReader interface on our end and making the (future) ArrowIterator conform to it publicly?

The route I ended up taking for Snowflake, which I think could work here too, was to expose the IPC streams directly as io.Readers. That way you're not exposing a major version of the Go Arrow library, and it's easy for the consumer to wrap the stream in an ipc.Reader from whichever version of the Arrow library they want to use. A consumer-side sketch follows.
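
As a sketch of the consumer side (assuming a stream io.Reader handed out by the client, which is exactly the part under discussion here), decoding with the consumer's own Arrow version might look like:

```go
package bqconsumer

import (
	"fmt"
	"io"

	"github.com/apache/arrow/go/v13/arrow/ipc"
)

// processIPCStream consumes Arrow record batches from a raw IPC stream.
// The stream is assumed to come from the BigQuery client; how it would
// be obtained is what this issue proposes to define.
func processIPCStream(stream io.Reader) error {
	rdr, err := ipc.NewReader(stream)
	if err != nil {
		return fmt.Errorf("creating IPC reader: %w", err)
	}
	defer rdr.Release()

	for rdr.Next() {
		rec := rdr.Record()
		// The record is only valid until the next call to Next;
		// Retain it if it must outlive this iteration.
		fmt.Printf("batch: %d rows x %d cols\n", rec.NumRows(), rec.NumCols())
	}
	return rdr.Err()
}
```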

Ideally, you could expose some way to identify the different read streams so that in ADBC we could let the consumer control the parallelism. One of the interfaces ADBC exposes is ExecutePartitions, which executes a query and returns a schema plus a collection of "partition IDs". These IDs can then be fed back into the connection via ReadPartition to read a single stream of Arrow records, allowing the consumer to control the parallelism of the reads if they want to. The relevant Go definitions are sketched below.
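
For context, the relevant pieces of the ADBC Go API look roughly like the following (paraphrased from github.com/apache/arrow-adbc/go/adbc; in the real package these methods live on the full Statement and Connection interfaces, and exact signatures may differ by version):

```go
package adbcsketch

import (
	"context"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/arrow/array"
)

// Partitions describes the distinct partitions of a query result as a
// set of opaque, serializable partition IDs.
type Partitions struct {
	NumPartitions uint64
	PartitionIDs  [][]byte
}

// partitionedStatement is the execution side: run the query once and
// get back the result schema plus the partition IDs.
type partitionedStatement interface {
	ExecutePartitions(ctx context.Context) (*arrow.Schema, Partitions, int64, error)
}

// partitionedConnection is the read side: each partition ID becomes an
// independent stream of Arrow records, so the consumer chooses how many
// partitions to read concurrently.
type partitionedConnection interface {
	ReadPartition(ctx context.Context, serializedPartition []byte) (array.RecordReader, error)
}
```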

For reference: I'm one of the primary contributors to the Arrow Go library and the ADBC Go library, so I'm open to whatever we need to get this moving, since it's such a highly requested feature for ADBC. I'm happy to contribute PRs here or to modify the Arrow / ADBC interfaces if that makes this easier. 😄

Looking forward to working with you on this when you have the cycles! 😀

@k-anshul

Hi @alvarowolfx

Based on some of the discussions here, I tried to expose Arrow data in this PR: https://github.com/googleapis/google-cloud-go/pull/8500/files
I would really appreciate it if you could take a look at these changes, and I'm more than willing to collaborate on any design enhancements you may propose.
For additional context, this has been very helpful for us in developing a BigQuery data connector for our application, where we ingest records from BigQuery into DuckDB.
We see roughly a 10x performance improvement when ingesting records as an Arrow stream versus row by row: for 2 million rows of a public dataset, fetching all the records as an Arrow stream takes 1.2 seconds, while doing so row by row takes 12.5 seconds on GCP. (The row-by-row path being compared is sketched below.)
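
For reference, the row-by-row path in the comparison above is the standard RowIterator loop (a minimal sketch; the project ID and query are placeholders):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	q := client.Query("SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`")
	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Every it.Next call transposes columnar data into a single row of
	// values; that per-row overhead is what the Arrow path avoids.
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		_ = row // process the row
	}
}
```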

@alvarowolfx
Contributor

Hey @k-anshul, thanks for the PR, I'll take a look at it. This work was on hold because we have some planned work to support Arrow data fetching on other query APIs, so we need to think of an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC.

I can share a draft PR with some of the work I have done here, and I'll add some comments to your PR.

@k-anshul

Thanks. Looking forward to it.

@alvarowolfx
Contributor

@k-anshul I pushed some draft work in this PR: #8506

@k-anshul

@alvarowolfx If you are already working in parallel on exposing an ArrowIterator, should I close my PR and wait for your changes, or take input from your work and modify my PR accordingly?
I am fine either way.

gcf-merge-on-green bot pushed a commit that referenced this issue Oct 23, 2023
We have some planned work to support Arrow data fetching on other query APIs, so we need an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC. This PR therefore detaches the Storage API from the Arrow decoder and creates a new ArrowIterator interface. The new interface is implemented by the Storage iterator and can later be implemented by other query interfaces that support Arrow.

Resolves #8100
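
A minimal sketch of using the API this PR adds, as I understand the shape it shipped with in bigquery v1.57.0 (method names are taken from the PR; consult the package docs for the authoritative signatures):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"github.com/apache/arrow/go/v13/arrow/ipc"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The Arrow path is served via the Storage Read API.
	if err := client.EnableStorageReadClient(ctx); err != nil {
		log.Fatal(err)
	}

	q := client.Query("SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`")
	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// ArrowIterator exposes the underlying record batches, and
	// NewArrowIteratorReader adapts it into a raw IPC stream, so any
	// Arrow version's ipc.Reader can decode it.
	arrowIt, err := it.ArrowIterator()
	if err != nil {
		log.Fatal(err)
	}
	rdr, err := ipc.NewReader(bigquery.NewArrowIteratorReader(arrowIt))
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Release()

	for rdr.Next() {
		fmt.Printf("read batch with %d rows\n", rdr.Record().NumRows())
	}
	if err := rdr.Err(); err != nil {
		log.Fatal(err)
	}
}
```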
gcf-merge-on-green bot pushed a commit that referenced this issue Oct 30, 2023
🤖 I have created a release *beep* *boop*
---


## [1.57.0](https://togithub.com/googleapis/google-cloud-go/compare/bigquery/v1.56.0...bigquery/v1.57.0) (2023-10-30)


### Features

* **bigquery/biglake:** Promote to GA ([e864fbc](https://togithub.com/googleapis/google-cloud-go/commit/e864fbcbc4f0a49dfdb04850b07451074c57edc8))
* **bigquery/storage/managedwriter:** Support default value controls ([#8686](https://togithub.com/googleapis/google-cloud-go/issues/8686)) ([dfa8e22](https://togithub.com/googleapis/google-cloud-go/commit/dfa8e22edf560211ae2a2ebf1f9a23b86887c7be))
* **bigquery:** Expose Apache Arrow data through ArrowIterator  ([#8506](https://togithub.com/googleapis/google-cloud-go/issues/8506)) ([c8e7692](https://togithub.com/googleapis/google-cloud-go/commit/c8e76923621b379fb7deb6dfb944011af1d980bd)), refs [#8100](https://togithub.com/googleapis/google-cloud-go/issues/8100)
* **bigquery:** Introduce query preview features ([#8653](https://togithub.com/googleapis/google-cloud-go/issues/8653)) ([f29683b](https://togithub.com/googleapis/google-cloud-go/commit/f29683bcd06567e4fc2d404f53bedbea5b5f0f90))


### Bug Fixes

* **bigquery:** Handle storage read api Recv call errors ([#8666](https://togithub.com/googleapis/google-cloud-go/issues/8666)) ([c73963f](https://togithub.com/googleapis/google-cloud-go/commit/c73963f64ef667daa8a33a5a4cc2156818fc6914))
* **bigquery:** Update golang.org/x/net to v0.17.0 ([174da47](https://togithub.com/googleapis/google-cloud-go/commit/174da47254fefb12921bbfc65b7829a453af6f5d))
* **bigquery:** Update grpc-go to v1.56.3 ([343cea8](https://togithub.com/googleapis/google-cloud-go/commit/343cea8c43b1e31ae21ad50ad31d3b0b60143f8c))
* **bigquery:** Update grpc-go to v1.59.0 ([81a97b0](https://togithub.com/googleapis/google-cloud-go/commit/81a97b06cb28b25432e4ece595c55a9857e960b7))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).