feat(bigquery): expose Apache Arrow data #8500
Conversation
bigquery/iterator.go
Outdated
// NextRecord returns the next batch of rows as a serialised arrow.Record.
func (a *ArrowIterator) NextRecord() ([]byte, error) {
	return a.r.arrowIterator.next()
}

// Schema is available after the first call to NextRecord.
func (a *ArrowIterator) Schema() []byte {
	return a.r.arrowIterator.decoder.rawArrowSchema
}
It might be more interoperable with the Arrow libraries if you could expose a direct io.Reader for the raw IPC stream.
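For illustration, a minimal sketch of what that would enable on the consumer side, assuming a hypothetical per-stream reader (the stream argument below); ipc.NewReader is the existing Arrow Go API:

package example

import (
	"fmt"
	"io"

	"github.com/apache/arrow/go/v13/arrow/ipc"
)

// consumeIPCStream drains one raw Arrow IPC stream using the standard reader.
func consumeIPCStream(stream io.Reader) error {
	rdr, err := ipc.NewReader(stream) // reads the schema message up front
	if err != nil {
		return err
	}
	defer rdr.Release()
	for rdr.Next() {
		rec := rdr.Record() // valid until the next call to Next
		fmt.Println(rec.NumRows())
	}
	return rdr.Err()
}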
Actually I am using this to construct an array.RecordReader. Sample here (condensed into the sketch below): https://github.com/rilldata/rill/blob/09726146ec04415e68f7e41efa64e5b5db12aa7e/runtime/drivers/bigquery/arrow.go#L19
I can move this code to the SDK and make ArrowIterator conform to array.RecordReader, as proposed in other comments as well. Is array.RecordReader locked for further modifications?
It seemed tricky to expose an io.Reader that can read from multiple IPC streams.
Also, I was looking at the spec for io.Reader and saw this statement: "If some data is available but not len(p) bytes, Read conventionally returns what is available instead of waiting for more." I am wondering what the behaviour should be here: should we wait for the next API call to succeed, or return whatever is available?
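For reference, the linked sample condensed into a sketch against the methods in the diff above; error handling beyond the first fetch is trimmed, and the ipc.Reader it returns satisfies array.RecordReader:

package example

import (
	"bytes"

	"cloud.google.com/go/bigquery"
	"github.com/apache/arrow/go/v13/arrow/ipc"
)

// newRecordReader stitches the raw schema and record bytes back into a
// single Arrow IPC stream.
func newRecordReader(it *bigquery.ArrowIterator) (*ipc.Reader, error) {
	first, err := it.NextRecord() // schema is only populated after the first fetch
	if err != nil {
		return nil, err
	}
	buf := bytes.NewBuffer(it.Schema()) // raw serialised schema message
	buf.Write(first)
	// remaining batches would be appended the same way, or streamed
	// through an io.Pipe to avoid buffering everything in memory
	return ipc.NewReader(buf)
}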
The difficulty with having ArrowIterator conform to array.RecordReader is that it would require explicitly exposing the major-versioned arrow.Record interface, which I believe was meant to be avoided. If it's okay to directly include the versioned arrow.Record interface in the API, then I would agree that conforming to the array.RecordReader interface would be best.
I agree you wouldn't expose an io.Reader that reads from multiple IPC streams; rather, you'd expose an individual io.Reader for each IPC stream, and then allow the consumer to construct an array.RecordReader for each stream. That's the common way of doing it, as it allows the consumer to control the backpressure etc. It's also how I implemented the Snowflake Arrow stream handling: it provides an io.Reader per stream.
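In that per-stream model, the consumer side might look like this sketch (streams and process are hypothetical placeholders):

package example

import (
	"io"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/arrow/ipc"
)

// consumeStreams builds one reader per IPC stream, so the consumer decides
// how many streams to drain, and how fast (i.e. it controls backpressure).
func consumeStreams(streams []io.Reader, process func(arrow.Record)) error {
	for _, s := range streams {
		rdr, err := ipc.NewReader(s)
		if err != nil {
			return err
		}
		for rdr.Next() {
			process(rdr.Record())
		}
		err = rdr.Err()
		rdr.Release()
		if err != nil {
			return err
		}
	}
	return nil
}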
Thanks for the inputs @zeroshade. Should the io.Reader implementation wait for the next batch to be available, or return whatever bytes are available as per the io.Reader spec?
It should return whatever bytes are available, as per the io.Reader spec. Just make sure you don't return an io.EOF unless you actually are at the end 😄
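A sketch of that contract, with fetchNextBatch as a hypothetical stand-in for the underlying API call:

package example

import (
	"bytes"
	"io"
)

type streamReader struct {
	buf            bytes.Buffer
	done           bool
	fetchNextBatch func() ([]byte, error) // hypothetical: returns io.EOF once exhausted
}

// Read hands back whatever is buffered rather than waiting for len(p) bytes,
// and reports io.EOF only once the source is truly exhausted.
func (r *streamReader) Read(p []byte) (int, error) {
	for r.buf.Len() == 0 && !r.done {
		batch, err := r.fetchNextBatch()
		if err == io.EOF {
			r.done = true
			break
		}
		if err != nil {
			return 0, err
		}
		r.buf.Write(batch)
	}
	if r.buf.Len() == 0 {
		return 0, io.EOF // actually at the end
	}
	return r.buf.Read(p)
}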
bigquery/query.go
Outdated
if rowIter.arrowIterator == nil {
	return nil, errors.New("bigquery: results not available as arrow records")
}
Is there a particular kind of query / results that can't be returned as an Arrow record batch stream? Can we document those?
As I understood, this is not possible right now, since we enforce the usage of storageAPI. I added this as a defensive check; I can remove it if it is adding confusion.
The forceStorageAPI flag was meant only for testing, because in some cases for smaller result sets there is extra overhead involved in using the Storage API for fetching data, so it might be faster in those scenarios to use the "old" path with jobs.query/getQueryResults.
@zeroshade is there an easy way to convert a JSON stream to an Arrow stream? That way we can backport the REST API in those cases of the faster paths or small result sets.
Also, even when forcing the Storage API with the flag, if an error happens while opening a read stream (for example if you hit one of the limitations or quota limits, like the number of CreateReadSession calls per minute), we still fall back to the REST/JSON API.
@alvarowolfx As long as you have a schema, you can use https://pkg.go.dev/github.com/apache/arrow/go/v13/arrow/array#NewJSONReader or the individual FromJSON functions etc.
In addition, all of the Builders have an UnmarshalJSON method which will append the unmarshalled JSON to the builder.
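For example, a minimal sketch with a made-up schema and rows:

package example

import (
	"strings"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/arrow/array"
)

func jsonToArrow() error {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "name", Type: arrow.BinaryTypes.String},
		{Name: "age", Type: arrow.PrimitiveTypes.Int64},
	}, nil)
	// NewJSONReader consumes newline-delimited JSON, one object per row
	rdr := array.NewJSONReader(strings.NewReader(
		"{\"name\": \"alice\", \"age\": 30}\n{\"name\": \"bob\", \"age\": 25}"), schema)
	defer rdr.Release()
	for rdr.Next() {
		rec := rdr.Record()
		_ = rec // hand the arrow.Record to downstream code
	}
	return rdr.Err()
}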
Can some integration tests be added to show how the ArrowIterator interface is being used?
bigquery/query.go
Outdated
// ReadAsArrowObjects submits a query for execution and returns the results via an ArrowIterator.
// As a prerequisite, the storage read client must be enabled.
// It will always use the jobs.insert path, so it may not be efficient for small queries.
func (q *Query) ReadAsArrowObjects(ctx context.Context) (it *ArrowIterator, err error) {
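A usage sketch of the method as it stands at this point in the review; the query text is arbitrary and the end-of-iteration sentinel is an assumption, not confirmed by the diff:

package example

import (
	"context"

	"cloud.google.com/go/bigquery"
)

func queryAsArrow(ctx context.Context, client *bigquery.Client) error {
	q := client.Query("SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 100")
	it, err := q.ReadAsArrowObjects(ctx) // requires the storage read client to be enabled
	if err != nil {
		return err
	}
	for {
		rec, err := it.NextRecord() // raw serialised arrow.Record bytes
		if err != nil {
			break // assumption: iteration ends with a sentinel error such as iterator.Done
		}
		_ = rec // e.g. stitch into an IPC stream as in the earlier sketches
	}
	return nil
}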
What about other query/data paths, like reading from a Job or a Table? I feel like they should have the same interface.
I will add this for those paths as well.
#8100
This PR exposes an iterator to fetch serialised Apache Arrow records.