bigquery: enable transparent storage api usage #3880
Comments
Thinking about it a little more, here's some proposed logic / pseudocode for when to use the storage api. Open to suggestions or clarifications of my assumptions.

var useStorageAPI bool
var tableRef *Table
var wg sync.WaitGroup

// The jobs.query response doesn't return the table reference we'll need for the storage api,
// so we should be able to fetch it concurrently while we wait for the query to complete (not tested).
// We may not need it, but it should be a simple call that will often reduce latency.
if jobsQuery {
	wg.Add(1)
	go func() {
		defer wg.Done()
		j, _ := getJob() // error handling elided in this sketch
		tableRef = j.QueryConfig.Dst
	}()
}

switch {
// In this case we either have all the rows or it's one more RPC to get the remainder, so no need
// for added concurrency / complexity that will likely increase latency.
case jobsQuery && queryResp.TotalRows <= uint64(len(queryResp.Rows))*2:
	useStorageAPI = false
// For all other response sizes on jobs.query, use the storage api to parallelize retrieval.
case jobsQuery:
	useStorageAPI = true
// For all jobs.insert calls, including when jobs.query falls back to polling, use the storage api.
// It should be the same number of RPC calls as getQueryRows, but can be parallelized.
// There shouldn't be a latency penalty even in the case where getQueryRows only needs to be called once.
case jobsInsert:
	useStorageAPI = true
}

wg.Wait()
if useStorageAPI {
	// get data via the storage read api (using tableRef)
} else {
	// get data via the standard getQueryResults / tabledata.list path
}
More thoughts:
First-class support for the Storage API: this somewhat exceeds the original scope of this issue, but it seems like the logical outcome of adding support. Right now, using the storage api kicks you into raw mode, with none of the conveniences of iterating, unmarshaling, etc. Since all of those will necessarily need to be solved, it should be simple to add a public API for it as well.
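As a rough illustration of that direction, here is a minimal sketch using the client library's existing Query.Read / RowIterator surface. The point is that this calling pattern would ideally not change; whether rows arrive via jobs.getQueryResults or the Storage Read API would be an internal decision. This is only a sketch of the current iterator contract, not a statement of how the feature will be built.

package example

import (
	"context"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

// readAll shows the existing Query.Read / RowIterator calling pattern; the
// idea of "first-class support" is that this code stays the same while the
// client transparently uses the Storage Read API underneath when it helps.
func readAll(ctx context.Context, client *bigquery.Client, sql string) ([][]bigquery.Value, error) {
	it, err := client.Query(sql).Read(ctx)
	if err != nil {
		return nil, err
	}
	var rows [][]bigquery.Value
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, err
		}
		rows = append(rows, row)
	}
	return rows, nil
}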
Yeah, there's a lot of moving parts here: much more complexity in schema mangling code, more external dependencies, lots of corner cases, and need for multiple knobs (stream concurrency, decoder concurrency, etc.). It's definitely something I'd like to improve, but I've been pushing on other efforts recently.
Would you accept a PR roughly following that outline?
This is going to be pretty invasive and touches a lot of the library, and we're also working on some relevant changes in the server contract that may have an impact here. Let me see about getting this more formalized first; I'd rather save you frustration on some things that will end up being opaque to you.
I appreciate the transparency. Any rough idea on timelines for that? We've been very happy with what BI Engine has done for our query latency, but that becomes irrelevant when it takes 10-1000x the actual query time to retrieve the results of those queries. I'm guessing we should just plan on using the storage api directly ourselves?
Unclear on timelines, so forging ahead with direct storage interactions doesn't block you on me. The expectation has generally been that BI query patterns don't tend to return large volumes of row data, but your use case seems to be in contrast with that. I've not tested BI-based results in this size/shape before, so it's not clear to me if it's written in a way where we can create a read session with more than a single stream (and if you need the rows to retain ordering, you can only use a single stream anyway).
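For anyone experimenting with the direct path in the meantime, here is a minimal sketch of creating a read session over a table with the standalone Storage Read API client, where MaxStreamCount controls the parallelism (use a single stream if ordering must be preserved). The import paths for the client and proto packages are assumptions and may differ by release.

package example

import (
	"context"
	"fmt"

	bqstorage "cloud.google.com/go/bigquery/storage/apiv1"
	storagepb "google.golang.org/genproto/googleapis/cloud/bigquery/storage/v1"
)

// createSession opens a read session over a table, requesting up to maxStreams
// parallel streams. The service may return fewer streams than requested; if row
// ordering must be preserved, ask for a single stream.
func createSession(ctx context.Context, client *bqstorage.BigQueryReadClient, project, dataset, table string, maxStreams int32) (*storagepb.ReadSession, error) {
	req := &storagepb.CreateReadSessionRequest{
		Parent: fmt.Sprintf("projects/%s", project),
		ReadSession: &storagepb.ReadSession{
			Table:      fmt.Sprintf("projects/%s/datasets/%s/tables/%s", project, dataset, table),
			DataFormat: storagepb.DataFormat_AVRO, // or DataFormat_ARROW
		},
		MaxStreamCount: maxStreams,
	}
	return client.CreateReadSession(ctx, req)
}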
I tried implementing the read api myself and wow, it is not simple. After spending time using it, I understand why it is the way it is, since protobuf doesn't really allow for dynamic schemas, but it could and really should be abstracted by this library. There were some optimizations I couldn't do. I was also very disappointed at the performance: it took a few hundred ms to create a read session and wasn't any faster for single-page reads. There's still an opportunity for parallelization, though as you mentioned, that would introduce a need for client-side ordering.
Any updates on this?
This hasn't gotten as much attention; my focus has been elsewhere. The iterator does now expose a reference to the job, so you can go get query statistics while still using the direct query->read path to get an iterator.

One of the other prerequisites I think we need to address before making an opinionated abstraction over the query using the Storage API is around the Arrow format. There's some support in the Arrow project, but the IPC reader doesn't yet expose a good way of reading multiple independent RecordBatch instances while reusing a common reference schema. I did some hack testing and it seemed much more performant than Avro.

The follow-up issues then get into how we want to present the iterator, given that Arrow is a columnar format. Do we need a row-based approach and a transpose, or do we allow users to access column segments directly? How much data conversion do we do versus exposing the Arrow types directly?
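For context on the RecordBatch/schema point, one common workaround is to prepend the session's serialized Arrow schema to each serialized record batch and run a fresh IPC reader over the concatenation. The sketch below assumes the pre-v1 arrow Go module path (the versioned path differs by release) and is purely illustrative, not how this library will necessarily handle it.

package example

import (
	"bytes"
	"io"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/ipc"
)

// decodeBatch decodes one serialized RecordBatch from the Storage API by
// concatenating the session's serialized Arrow schema with the batch bytes and
// reading the result as a short IPC stream. (Adjust the arrow module path and
// version to whatever your build uses.)
func decodeBatch(serializedSchema, serializedBatch []byte) ([]array.Record, error) {
	buf := bytes.NewBuffer(nil)
	buf.Write(serializedSchema)
	buf.Write(serializedBatch)
	r, err := ipc.NewReader(buf)
	if err != nil {
		return nil, err
	}
	defer r.Release()
	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // keep the record alive past the reader's lifetime
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil && err != io.EOF {
		return nil, err
	}
	return recs, nil
}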
@alvarowolfx from what I can tell the Storage Read API is free for this usage, correct? The pricing page says
and my best understanding for how this was implemented is that it just does a ReadRows call on the temporary table from the query. |
Is your feature request related to a problem? Please describe.
Retrieving data from larger queries is painfully slow. In the trace below, the query completed in < 1s, but it then took over 2 minutes to retrieve 180 MB. Transferring that much data within GCP should typically take no more than a few seconds, if not sub-second.
Describe the solution you'd like
The Python library has already enabled Storage API usage by default. I'm not sure if there are any downsides to that, but it seems like it could be an easy win without exposing any more knobs. (I just noticed that @shollyman approved it.)
googleapis/python-bigquery@e75ff82
Benchmark copied from the Python PR googleapis/python-bigquery#55 (comment)
If for some reason it shouldn't be enabled automatically, adding an option to QueryConfig would also work.
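Purely as an illustration of that kind of knob, the snippet below uses an invented field name; UseStorageAPI does not exist in the Go client, and this only shows the shape an opt-in flag on QueryConfig could take.

// Hypothetical sketch only: UseStorageAPI is an invented field name and is not
// part of cloud.google.com/go/bigquery.
q := client.Query("SELECT * FROM dataset.large_table")
q.UseStorageAPI = true // invented field promoted from QueryConfig: fetch results via the Storage Read API
it, err := q.Read(ctx)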
Describe alternatives you've considered
Using the storage api manually, but that's a completely different paradigm with Avro/Arrow schemas and has a significant mental overhead compared to just using the standard BigQuery package.
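To make the "different paradigm" point concrete, here is a rough sketch of the manual Avro path for a single stream, assuming a read session created as in the earlier sketch. The genproto and goavro import paths are assumptions and may vary by release.

package example

import (
	"context"
	"io"

	bqstorage "cloud.google.com/go/bigquery/storage/apiv1"
	"github.com/linkedin/goavro/v2"
	storagepb "google.golang.org/genproto/googleapis/cloud/bigquery/storage/v1"
)

// readStream shows the extra machinery the manual path requires: stream the raw
// Avro row blocks for one stream of a read session, then decode them with a
// generic Avro codec built from the session's schema.
func readStream(ctx context.Context, client *bqstorage.BigQueryReadClient, session *storagepb.ReadSession) ([]map[string]interface{}, error) {
	if len(session.GetStreams()) == 0 {
		return nil, nil // nothing to read
	}
	codec, err := goavro.NewCodec(session.GetAvroSchema().GetSchema())
	if err != nil {
		return nil, err
	}
	stream, err := client.ReadRows(ctx, &storagepb.ReadRowsRequest{
		ReadStream: session.GetStreams()[0].GetName(),
	})
	if err != nil {
		return nil, err
	}
	var rows []map[string]interface{}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		data := resp.GetAvroRows().GetSerializedBinaryRows()
		for len(data) > 0 {
			native, rest, err := codec.NativeFromBinary(data)
			if err != nil {
				return nil, err
			}
			rows = append(rows, native.(map[string]interface{}))
			data = rest
		}
	}
	return rows, nil
}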
Additional context