Implement BigQuery Driver #168
Comments
BigQuery Storage sends Arrow over gRPC, so it could be done natively for all of C++, Go, and Java. It would be interesting. Though BQS can't evaluate SQL, so we might want to add the 'inverse' of the ADBC ingest API, for scanning a table without issuing an explicit query (or specify that drivers can translate a Substrait read request to such a scan). I'm less familiar with the 'standard' BigQuery API; the REST API gives row-oriented JSON, which isn't as great.
Yeah, the Go BQ/BQS SDK has docs on how to do this: https://github.com/GoogleCloudPlatform/golang-samples/blob/f2c65eb0ee3118298a5c8b84ca22067fe84eb5db/bigquery/bigquery_storage_quickstart/main.go#L333-L369 I appreciate that doing it in C++ might be preferable first?
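For reference, here is a minimal Go sketch of the read path that quickstart demonstrates: create an Arrow-format read session, stream serialized record batches, and decode them with the Arrow IPC reader. This assumes the `cloud.google.com/go/bigquery/storage/apiv1` client and the Arrow Go module; the project and table names are placeholders, and exact package paths may vary by release.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	bqstorage "cloud.google.com/go/bigquery/storage/apiv1"
	"cloud.google.com/go/bigquery/storage/apiv1/storagepb"
	"github.com/apache/arrow/go/v15/arrow/ipc"
)

func main() {
	ctx := context.Background()

	// Hypothetical identifiers for illustration.
	const project = "my-project"
	const table = "projects/my-project/datasets/my_dataset/tables/my_table"

	client, err := bqstorage.NewBigQueryReadClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Ask the Storage API for an Arrow-format read session over the table.
	session, err := client.CreateReadSession(ctx, &storagepb.CreateReadSessionRequest{
		Parent: "projects/" + project,
		ReadSession: &storagepb.ReadSession{
			Table:      table,
			DataFormat: storagepb.DataFormat_ARROW,
		},
		MaxStreamCount: 1,
	})
	if err != nil {
		log.Fatal(err)
	}

	stream, err := client.ReadRows(ctx, &storagepb.ReadRowsRequest{
		ReadStream: session.GetStreams()[0].GetName(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Each response carries a serialized Arrow record batch; prefixing the
	// session's serialized schema yields a valid IPC stream to decode.
	schema := session.GetArrowSchema().GetSerializedSchema()
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		buf := bytes.NewBuffer(schema)
		buf.Write(resp.GetArrowRecordBatch().GetSerializedRecordBatch())
		rdr, err := ipc.NewReader(buf)
		if err != nil {
			log.Fatal(err)
		}
		for rdr.Next() {
			fmt.Println(rdr.Record().NumRows())
		}
		rdr.Release()
	}
}
```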
(Sorry for the delay.) Go might be interesting just to prove it out quickly. It may also be interesting to see Go implement the C interface and build an embeddable shared/static library to reduce the maintenance costs. (Right now Go can bind to the C interface, but not yet the other way.)
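As a rough illustration of the 'other way' (Go exposing a C interface), cgo's `-buildmode=c-shared` can produce an embeddable library plus a generated C header. The exported function below is purely hypothetical, not the real ADBC entry point; a real driver would export the ADBC functions with C-compatible signatures.

```go
// Built with: go build -buildmode=c-shared -o libbqdriver.so
// This produces libbqdriver.so and a libbqdriver.h header that C/C++
// (or any FFI-capable language) can link against.
package main

import "C"

// DriverVersion is a stand-in export for illustration only. Note the
// returned C string is allocated with C.CString and must be freed by
// the caller.
//
//export DriverVersion
func DriverVersion() *C.char {
	return C.CString("bigquery-driver 0.0.1 (illustrative)")
}

// main is required for -buildmode=c-shared but is never called.
func main() {}
```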
Obviously the Arrow interface is preferable, but I thought I'd post the C++ that the bigrquery R package uses to parse the JSON, since the output data structure is pretty similar and I happen to know where it lives: https://github.com/r-dbi/bigrquery/blob/main/src/BqField.cpp
Absolutely, but the Arrow interface only applies to (effectively) full table scans with some filters ("BigQuery Storage" != "BigQuery"), so we will need to parse one of the alternative outputs for general queries. Thanks for the reference though!
The Python BigQuery SDK uses a trick to push any query into a table and then uses the BigQuery Storage API to fetch the result. We could use that here too to simplify things.
Ah, interesting. That would be great, then. Thanks for pointing that out. I assume that would have cost/pricing implications, though, and would require you to materialize the result before reading it?
This works because BigQuery just 'handles' it in the backend through its caching mechanism. Effectively, you run the query twice: once with the BigQuery API and once with the BigQuery Storage API, but the second time it hits the cache. So I was a bit wrong about the Python SDK manually pushing it into a table - it doesn't. This shouldn't have cost implications.
Other docs on the fact that BigQuery actually writes ALL queries to a table: https://cloud.google.com/bigquery/docs/cached-results |
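Putting the two observations together, a minimal Go sketch of that fast path: run the query through the standard BigQuery API, then recover the destination table that BigQuery wrote the (cached) results to, which a Storage API read session (as in the earlier sketch) could then scan. This assumes the completed job's configuration reports the anonymous destination table, as the cached-results docs describe; the project ID and query are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()

	// Hypothetical project ID for illustration.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Step 1: run the query via the standard BigQuery API. BigQuery writes
	// the result to an (anonymous) destination table and caches it.
	q := client.Query("SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 10")
	job, err := q.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := job.Wait(ctx); err != nil {
		log.Fatal(err)
	}

	// Step 2: recover the destination table from the completed job's
	// configuration. Reading this table via the Storage API hits the
	// cache rather than re-running the query.
	cfg, err := job.Config()
	if err != nil {
		log.Fatal(err)
	}
	dst := cfg.(*bigquery.QueryConfig).Dst
	fmt.Printf("results table: projects/%s/datasets/%s/tables/%s\n",
		dst.ProjectID, dst.DatasetID, dst.TableID)
}
```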
Oops, clearly I don't understand BigQuery well enough. Thanks (again) for digging into this! |
Another useful source of inspiration: the Go SDK is in the process of implementing the same fast path as the Python SDK: googleapis/google-cloud-go#6822. To make the Go implementation straightforward, interfaces on top of the changes made here would add a method analogous to Python's.
A nice aspect is that the Go bigquery quickstart example (https://github.com/GoogleCloudPlatform/golang-samples/blob/main/bigquery/bigquery_storage_quickstart/main.go) actually uses the latest released version of Arrow (as opposed to Snowflake, which vendored a two-year-old version of Go Arrow right into their module).
I believe the Go client now exposes the Arrow iterator and data: googleapis/google-cloud-go#8506 (which is likely using the RPC API to read the data). |
This was included in the last release. |
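A sketch of consuming that surface, assuming the `ArrowIterator` accessor and `NewArrowIteratorReader` helper from googleapis/google-cloud-go#8506; exact names may differ across releases, and the project ID is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"github.com/apache/arrow/go/v15/arrow/ipc"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it, err := client.Query("SELECT 1 AS x").Read(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// ArrowIterator/NewArrowIteratorReader are the names added in
	// googleapis/google-cloud-go#8506; they expose the result as an
	// Arrow IPC stream instead of row-oriented values.
	arrowIt, err := it.ArrowIterator()
	if err != nil {
		log.Fatal(err)
	}
	rdr, err := ipc.NewReader(bigquery.NewArrowIteratorReader(arrowIt))
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Release()
	for rdr.Next() {
		fmt.Println(rdr.Record())
	}
}
```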
This could be another interesting one, as the API can return Arrow-formatted data. Perhaps implemented in Go, as I believe that's the first-class SDK?