bigquery: Expose Apache Arrow data #8100

zeroshade · 2023-06-15T15:31:08Z

Is your feature request related to a problem? Please describe.
Currently, even though Apache Arrow is used internally by the BigQuery client, neither the Arrow iterator nor the Arrow data is exposed externally by the client. There is no way to retrieve the data in its columnar format, which imposes the performance cost of transposing the data to read it row by row.

Describe the solution you'd like
Ideally, the Arrow IPC streams could be exposed directly so that consumers can retrieve the data without needing to use the same major version of the Go Arrow libraries. If possible, it would also help to expose a way to parallelize fetching (for example, when multiple partitions are available). At a minimum, though, exposing something that satisfies the RecordReader interface would be very useful.

Describe alternatives you've considered
The only alternative I can think of would be to bypass this library entirely and implement fetching from BigQuery's API directly, which would be inconvenient and ultimately unmaintainable.

Additional context
I'm looking to build a BigQuery driver for Arrow Database Connectivity (ADBC) and, for ease of deployment, was going to build it in Go (as we have already done for the Arrow Flight SQL and Snowflake ADBC drivers). At conferences where I've spoken about ADBC, the second most common request has been for a BigQuery driver, so I'd like to start looking into this. I thought it would be easy since I saw that BigQuery already uses Arrow, but I couldn't find anything in the pkg.go.dev docs on how to actually retrieve the Arrow data directly. If I'm wrong, I'd happily take a link to any documentation on how to retrieve the Arrow data (and, even better, write Arrow data to BigQuery!).

Thank you very much!

alvarowolfx · 2023-06-15T21:42:43Z

Thanks for reaching out, @zeroshade, about exposing Arrow and support for ADBC. I was the person working on the integration with our internal Storage Read API using Arrow in our SDK, and I have to say that multiple paths were considered for doing so. One of those paths included giving users access to Arrow records, but as the surface area was already becoming too big, we ended up deciding to remove that from the initial release. Some foundation for that work is there under the hood, though.

Regarding your point on writing Arrow records, unfortunately our Storage Write API doesn't support Arrow. For the time being it supports writing Protobufs.

Here you can see that most of the logic for iterating over Arrow records is in the internal arrowIterator, which can be extended to have a public version available to users:

google-cloud-go/bigquery/storage_iterator.go, line 36 in fc3d840:
type arrowIterator struct {

It also reads BigQuery Read Streams in parallel for better performance (internally we saw 4-50x improvements in download speed and a ~50% reduction in allocations).

One snag that we hit when discussing such public exposure was avoiding exposing a major version of the Go Arrow library, like you mentioned. Which approach do you recommend? Basically copying the RecordReader interface on our end and making the (future) ArrowIterator conform to it publicly?

I'll be more than happy to work on exposing an Arrow iterator and also to collaborate on the ADBC driver. I'm a bit busy with some work related to the Storage Write API in a separate SDK (NodeJS), but in the coming weeks I'll have more cycles to work on this.

zeroshade · 2023-06-16T15:08:29Z

> In regards to your point on writing Arrow records, unfortunately our Storage Write doesn't have support for Arrow. For the time being it supports writing Protobufs.

Are there plans to add support for writing Arrow Record streams directly to it? It would be really cool and efficient if so.

> One snag that we had when discussing such public exposure, was around avoiding exposing a major version of the Go Arrow library like you mentioned. Which approach do you recommend? Basically copying the RecordReader interface in our end and making the (future) ArrowIterator conform to that publicly?

The route I ended up taking for Snowflake, which I think can work here, was to expose the IPC streams directly as io.Readers. This way you're not exposing a major version of the Go Arrow library and it's easy for the consumer to manage the ipc.Reader of whatever version of the Arrow library they want to use.

Ideally, you could expose some way to identify the different read streams so that in ADBC we could let the consumer control the parallelism. In ADBC one of the exposed interfaces is ExecutePartitions, which executes a query and returns a schema + a collection of "Partition IDs". These IDs can then be fed into the connection via ReadPartition to read a single stream of Arrow records, allowing the consumer to control the parallelism of the read stream if they desire to do so.
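As a rough sketch of what consumer-controlled parallelism over partition IDs could look like, the following uses only the standard library. PartitionSource, memSource, and ReadAll are hypothetical names made up for this illustration; they are not the ADBC or BigQuery APIs, and the in-memory byte slices stand in for real read streams.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sort"
	"sync"
)

// PartitionSource is a hypothetical interface: it lists partition IDs and
// opens one independent byte stream per ID.
type PartitionSource interface {
	PartitionIDs() []string
	ReadPartition(id string) (io.Reader, error)
}

// memSource is an in-memory stand-in for a set of server-side read streams.
type memSource struct{ data map[string][]byte }

func (m memSource) PartitionIDs() []string {
	ids := make([]string, 0, len(m.data))
	for id := range m.data {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func (m memSource) ReadPartition(id string) (io.Reader, error) {
	b, ok := m.data[id]
	if !ok {
		return nil, fmt.Errorf("unknown partition %q", id)
	}
	return bytes.NewReader(b), nil
}

// ReadAll drains every partition with at most `workers` concurrent readers
// and returns the number of bytes read per partition. The caller, not the
// source, decides how much parallelism to use.
func ReadAll(src PartitionSource, workers int) (map[string]int64, error) {
	if workers < 1 {
		workers = 1
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	sizes := make(map[string]int64)
	sem := make(chan struct{}, workers) // bounds concurrent reads
	for _, id := range src.PartitionIDs() {
		wg.Add(1)
		sem <- struct{}{}
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			r, err := src.ReadPartition(id)
			var n int64
			if err == nil {
				n, err = io.Copy(io.Discard, r)
			}
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			sizes[id] = n
		}(id)
	}
	wg.Wait()
	return sizes, firstErr
}

func main() {
	src := memSource{data: map[string][]byte{
		"p0": []byte("abc"),
		"p1": []byte("hello"),
	}}
	sizes, err := ReadAll(src, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```

Passing workers as 1 reads the partitions sequentially, while a larger value lets the consumer trade memory and connections for throughput.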

For reference: I'm one of the primary contributors to the Arrow Go library and the ADBC Go library. So I'm open to whatever we need to get this moving since it's such a highly requested feature for ADBC. I'm happy to help contribute PRs here or make modifications in the Arrow / ADBC interfaces if necessary to make this easier and make more sense. 😄.

Looking forward to working with you on this when you have the cycles! 😀

k-anshul · 2023-08-28T12:01:55Z

Hi @alvarowolfx

Based on some of the discussions here, I tried to expose Arrow data in this PR: https://github.com/googleapis/google-cloud-go/pull/8500/files
I would really appreciate it if you could take a look at these changes, and I'm more than willing to collaborate on any design enhancements you may propose.
For additional context, this has been very helpful for us in developing a BigQuery data connector for our application, where we ingest records from BigQuery into DuckDB.
We see a performance improvement of more than 100% when ingesting records as an Arrow stream vs. ingesting row by row. There are tremendous savings in getting Arrow records from the BigQuery SDK vs. individual rows (for 2 million rows of a public dataset, it takes just 1.2 seconds to fetch all the records as an Arrow stream but 12.5 seconds to do so row by row on GCP).

alvarowolfx · 2023-08-29T12:49:46Z

Hey @k-anshul, thanks for the PR, I'll take a look at it. This work was on hold because we have some planned work to support Arrow data fetching on other query APIs, so we need to think of an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC.

I can share a draft PR with some work that I have done here and will add some comments to your PR.

k-anshul · 2023-08-29T12:59:04Z

Thanks. Looking forward to it.

alvarowolfx · 2023-08-29T15:42:29Z

@k-anshul I pushed some draft work in this PR: #8506

k-anshul · 2023-08-29T16:17:50Z

@alvarowolfx If you are already working in parallel on exposing ArrowIterator, should I close my PR and wait for your changes, or take input from your work and modify my PR accordingly?
I am fine with both ways.

We have some planned work to support Arrow data fetching on other query APIs, so we need to think of an interface that will support all of those query paths and also work as a base for other Arrow projects like ADBC. This PR therefore detaches the Storage API from the Arrow decoder and creates a new ArrowIterator interface. The new interface is implemented by the Storage iterator and can later be implemented for other query interfaces that support Arrow. Resolves #8100


zeroshade added the "triage me" label Jun 15, 2023

product-auto-label bot added the "api: bigquery" label Jun 15, 2023

blunderbuss-gcf bot assigned shollyman Jun 15, 2023

shollyman assigned alvarowolfx and unassigned shollyman Jun 15, 2023

noahdietz added the "type: feature request" label and removed the "triage me" label Jun 15, 2023

k-anshul mentioned this issue Aug 28, 2023

feat(bigquery): expose Apache Arrow data #8500

Closed

alvarowolfx mentioned this issue Aug 29, 2023

feat(bigquery): expose Apache Arrow data through ArrowIterator #8506

Merged

gcf-merge-on-green bot closed this as completed in #8506 Oct 23, 2023

release-please bot mentioned this issue Oct 23, 2023

chore(main): release bigquery 1.57.0 #8696

Merged

begelundmuller mentioned this issue Oct 23, 2023

Runtime: Incorporate official Arrow support in BigQuery SDK rilldata/rill#3302

Closed

This was referenced Nov 7, 2023

November 06, 2023 kitta65/bq-extension-vscode#251

Closed

November 06, 2023 kitta65/prettier-plugin-bq#258

Closed

November 06, 2023 kitta65/bq2cst#268

Closed
