Implement BigQuery Driver #168
Comments
BigQuery Storage sends Arrow over gRPC, so it could be done natively for all of C++, Go, and Java. It would be interesting. Though BQS can't evaluate SQL, so we might want to add the 'inverse' of the ADBC ingest API, for scanning a table without issuing an explicit query (or specify that drivers can translate a Substrait read request to such a scan). I'm less familiar with the 'standard' BigQuery API; the REST API gives row-oriented JSON, which isn't as great.
Yeah, the Go BQ/BQS SDK has docs on how to do this: https://github.com/GoogleCloudPlatform/golang-samples/blob/f2c65eb0ee3118298a5c8b84ca22067fe84eb5db/bigquery/bigquery_storage_quickstart/main.go#L333-L369 I appreciate that doing it in C++ might be preferable first?
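For reference, here is a minimal Go sketch of the read path that quickstart demonstrates: create an Arrow-format read session, stream serialized record batches, and decode them with the Arrow IPC reader. This assumes the `cloud.google.com/go/bigquery/storage/apiv1` client and the Arrow Go module; the project and table names are placeholders, and exact package paths may vary by release.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	bqstorage "cloud.google.com/go/bigquery/storage/apiv1"
	"cloud.google.com/go/bigquery/storage/apiv1/storagepb"
	"github.com/apache/arrow/go/v15/arrow/ipc"
)

func main() {
	ctx := context.Background()

	// Hypothetical identifiers for illustration.
	const project = "my-project"
	const table = "projects/my-project/datasets/my_dataset/tables/my_table"

	client, err := bqstorage.NewBigQueryReadClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Ask the Storage API for an Arrow-format read session over the table.
	session, err := client.CreateReadSession(ctx, &storagepb.CreateReadSessionRequest{
		Parent: "projects/" + project,
		ReadSession: &storagepb.ReadSession{
			Table:      table,
			DataFormat: storagepb.DataFormat_ARROW,
		},
		MaxStreamCount: 1,
	})
	if err != nil {
		log.Fatal(err)
	}

	stream, err := client.ReadRows(ctx, &storagepb.ReadRowsRequest{
		ReadStream: session.GetStreams()[0].GetName(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Each response carries a serialized Arrow record batch; prefixing the
	// session's serialized schema yields a valid IPC stream to decode.
	schema := session.GetArrowSchema().GetSerializedSchema()
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		buf := bytes.NewBuffer(schema)
		buf.Write(resp.GetArrowRecordBatch().GetSerializedRecordBatch())
		rdr, err := ipc.NewReader(buf)
		if err != nil {
			log.Fatal(err)
		}
		for rdr.Next() {
			fmt.Println(rdr.Record().NumRows())
		}
		rdr.Release()
	}
}
```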
(Sorry for the delay.) Go might be interesting just to prove it out quickly. It may also be interesting to see Go implement the C interface and build an embeddable shared/static library to reduce the maintenance costs. (Right now Go can bind to the C interface, but not yet the other way.)
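As a rough illustration of the 'other way' (Go exposing a C interface), cgo's `-buildmode=c-shared` can produce an embeddable library plus a generated C header. The exported function below is purely hypothetical, not the real ADBC entry point; a real driver would export the ADBC functions with C-compatible signatures.

```go
// Built with: go build -buildmode=c-shared -o libbqdriver.so
// This produces libbqdriver.so and a libbqdriver.h header that C/C++
// (or any FFI-capable language) can link against.
package main

import "C"

// DriverVersion is a stand-in export for illustration only. Note the
// returned C string is allocated with C.CString and must be freed by
// the caller.
//
//export DriverVersion
func DriverVersion() *C.char {
	return C.CString("bigquery-driver 0.0.1 (illustrative)")
}

// main is required for -buildmode=c-shared but is never called.
func main() {}
```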
Obviously the Arrow interface is preferable, but I thought I'd post the C++ that the bigrquery R package uses to parse the JSON, since the output data structure is pretty similar and I happen to know where it lives: https://github.com/r-dbi/bigrquery/blob/main/src/BqField.cpp
Absolutely, but the Arrow interface only applies to (effectively) full table scans with some filters ("BigQuery Storage" != "BigQuery"), so we will need to parse one of the alternative outputs for general queries. Thanks for the reference though!
The Python BigQuery SDK uses a trick to push any query into a table and then uses the BigQuery Storage API to fetch the result. We could use that here too to simplify things.
Ah, interesting. That would be great, then. Thanks for pointing that out. I assume that would have cost/pricing implications, though, and would require you to materialize the result before reading it?
This works because BigQuery just 'handles' it in the backend through its caching mechanism. Effectively, you run the query twice: once with the BigQuery API and once with the BigQuery Storage API, but the second time it hits the cache. So I was a bit wrong about the Python SDK manually pushing it into a table - it doesn't. This shouldn't have cost implications.
Other docs on the fact that BigQuery actually writes ALL queries to a table: https://cloud.google.com/bigquery/docs/cached-results |
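Putting the two observations together, a minimal Go sketch of that fast path: run the query through the standard BigQuery API, then recover the destination table that BigQuery wrote the (cached) results to, which a Storage API read session (as in the earlier sketch) could then scan. This assumes the completed job's configuration reports the anonymous destination table, as the cached-results docs describe; the project ID and query are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()

	// Hypothetical project ID for illustration.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Step 1: run the query via the standard BigQuery API. BigQuery writes
	// the result to an (anonymous) destination table and caches it.
	q := client.Query("SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 10")
	job, err := q.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := job.Wait(ctx); err != nil {
		log.Fatal(err)
	}

	// Step 2: recover the destination table from the completed job's
	// configuration. Reading this table via the Storage API hits the
	// cache rather than re-running the query.
	cfg, err := job.Config()
	if err != nil {
		log.Fatal(err)
	}
	dst := cfg.(*bigquery.QueryConfig).Dst
	fmt.Printf("results table: projects/%s/datasets/%s/tables/%s\n",
		dst.ProjectID, dst.DatasetID, dst.TableID)
}
```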
Oops, clearly I don't understand BigQuery well enough. Thanks (again) for digging into this! |
Another useful source of inspiration: the Go SDK is in the process of implementing the same fast path as the Python SDK: googleapis/google-cloud-go#6822. To make the Go implementation straightforward, interfaces on top of the changes made here would add a method analogous to Python's.
A nice aspect is that the Go bigquery quickstart example (https://github.com/GoogleCloudPlatform/golang-samples/blob/main/bigquery/bigquery_storage_quickstart/main.go) actually uses the latest released version of Arrow (as opposed to Snowflake, which vendored a two-year-old version of Go Arrow right into their module).
I believe the Go client now exposes the Arrow iterator and data: googleapis/google-cloud-go#8506 (which is likely using the RPC API to read the data). |
This was included in the last release. |
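A sketch of consuming that surface, assuming the `ArrowIterator` accessor and `NewArrowIteratorReader` helper from googleapis/google-cloud-go#8506; exact names may differ across releases, and the project ID is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"github.com/apache/arrow/go/v15/arrow/ipc"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it, err := client.Query("SELECT 1 AS x").Read(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// ArrowIterator/NewArrowIteratorReader are the names added in
	// googleapis/google-cloud-go#8506; they expose the result as an
	// Arrow IPC stream instead of row-oriented values.
	arrowIt, err := it.ArrowIterator()
	if err != nil {
		log.Fatal(err)
	}
	rdr, err := ipc.NewReader(bigquery.NewArrowIteratorReader(arrowIt))
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Release()
	for rdr.Next() {
		fmt.Println(rdr.Record())
	}
}
```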
This could be another interesting one, as the API can return Arrow-formatted data. Perhaps implemented in Go, as I believe that's the first-class SDK?