Performance #133
Comments
For some more stats, we can try running the code in the "benchmark" folder for |
Need to look at this more, but it looks like TensorFlow has built something that may be fast, in C: tensorflow/tensorflow@1b881b7 |
Duplicate of #66? |
Yes. I've done some work here and can improve the time by about 2x and stop the memory from bloating. I'll post something soon-ish. Let's leave this open for the stats above. |
I haven't got around to getting this into a perfect state, but here's a dump for the moment:

Requirements

In order to go from BigQuery to Pandas we, at a minimum, need to:
Dos & Don'ts

Given the nature of Python - where objects are expensive in both construction time and memory usage - there are some things we want to avoid:
Instead we want to:
Current state

But currently neither pandas-gbq nor google-cloud-bigquery does this.
Benchmark

On an approximate benchmark, both libraries take around a minute:

import pandas as pd
from google.cloud.bigquery import Client

query = """
SELECT year, value, date FROM `bigquery-public-data.bls.unemployment_cps`
LIMIT 1000000
"""
# I ran the query to ensure it was cached
pd.read_gbq(query, project_id='x', dialect='standard')
#Requesting query... ok.
#Job ID: c7d73162-afe2-4ebb-9f7a-b9503fd553fe
#Query running...
#Query done.
#Cache hit.
#Retrieving results...
#Got 1000000 rows.
"Total time taken 58.01 s."
# Finished at 2018-03-23 02:34:45.
c = Client()
q=c.query(query)
i=q.result()
%timeit -n 1 -r 1 i.to_dataframe()
"1 loop, best of 1: 1min 12s per loop" PrototypeI did a quick prototype to see if reducing the python overhead would help. (these are notes from a notebook from a while ago, so may not be fully repro) TBC, this is v hacky, and makes a bunch of assumptions that are not generalizable. def get_lists(page):
for d in page:
dicts, = d.values()
yield (x for y in dicts for x in y.values())
def dfs(i):
i.item_to_value = lambda x,y: y
for page in i.pages:
df= pd.DataFrame(get_lists(page))
# the foreknowledge is obv a cheat
df[[0,1]] = df[[0,1]].astype(float)
df[2] = pd.to_datetime(df[2])
yield df
%timeit -n 1 -r 1 pd.concat(dfs(i))
"1 loop, best of 1: 35.8 s per loop" The biggest two changes are:
Stepping back

I think there's some case for working through whether this is generalizable. But, for larger datasets, we'd still be fairly slow relative to downloading CSVs:
And even CSVs aren't that great - everything is a string, no native formats.

Options
|
Foreknowledge isn't actually as big of a cheat as you might think. Every query creates a temporary table, from which we could get the total_rows (on RowIterator, only populated after the first page) / num_rows (on the Table) and the schema. From these we can construct the dataframe ahead of time and then populate. I'm still not sure how to get around all the JSON though. Export to GCS and download is really the only option right now. Unfortunately Parquet is supported for import to BigQuery but not export yet (I think they are working on it). |
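As a rough illustration of that idea (not from the thread): the temporary results table exposes both the row count and the schema through the public google-cloud-bigquery client, so the frame's shape and dtypes can be known before any rows are pulled. The query here is just an example:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT year, value, date "
    "FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 1000"
)
rows = job.result()                         # RowIterator; total_rows is only set after the first page
table = client.get_table(job.destination)   # the temporary table holding the query results

print(table.num_rows)                        # row count, known before downloading any rows
print([(field.name, field.field_type) for field in table.schema])
```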
I think an array-per-page would be a reasonable step. We could try blasting up the page_size to see if that helps, though I think the default might give the largest possible page already. |
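A sketch of what array-per-page could look like, assuming we iterate the RowIterator's pages and pass page_size through result(); this is illustrative only and hasn't been benchmarked:

```python
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000"
).result(page_size=50000)

arrays = []
for page in rows.pages:
    # one array per page instead of one Python object per cell
    arrays.append(np.array([row.values() for row in page]))

data = np.concatenate(arrays)
```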
From https://www.tidyverse.org/articles/2018/04/bigrquery-1-0-0/
|
We just moved internally from using a CSV loader to the
It would be good if we could move this library over to delegating to that, too. |
Great to hear! The one difference with the |
Do you know whether it'll upload a default index? (I can look if you don't know off hand) That would mean we need to make a change in that library if we want to maintain back-compat (and I also think forcing an index upload is bad). We could surmount this by setting a random column as the index, but I think that's going to result in a bunch of small bugs around index vs column differences. |
Yes, I believe it does upload the default index. I couldn't figure out a good way to avoid it when using Parquet. We probably can't always remove the index because I think we do want to upload the index when an explicit index is specified, as we do in the code sample: |
Yes - I had a look at the parquet libraries and it doesn't look like excluding the indexes is an option. I'd probably favor having parquet over optional indexes, though I'm not sure if it's worth the change. What do you think? |
There are a couple of other benefits to parquet, such as the ability to support struct and array columns (as well as we can, anyway). Let's try it out and see what tests break. My biggest concern is that it does require we add pyarrow as a dependency. |
Right - to be clear - (I think) the output will always be different - the existing version never uploads an index, and the parquet version always uploads an index. We can set the first column as the index as a workaround, though I worry that might cause some bugs. |
I found a resolution to this - if we construct the schema ourselves, we can choose to exclude the index: Here's a sample:
So we can have a param for this.

Bonus points: what should the default be? I think probably to discard the index, since that was the previous implementation. But we could alternatively attempt to assess whether the index is useful; e.g. if it's named (which we do internally) or if it's a non-default index.

PS: Not sure this behavior (i.e. discarding columns not specified in the schema without raising an error) is reliable. @tswast let me know if you know. |
Sounds like a bug in BigQuery. We should set configuration.load.ignoreUnknownValues to be safe. |
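A hypothetical sketch combining the two suggestions (the elided sample above isn't reproduced here): build the schema from the DataFrame's columns only and set ignore_unknown_values, so a serialized index gets dropped rather than loaded. The table ID and columns are placeholders, and passing a string table ID assumes a reasonably recent google-cloud-bigquery:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
# per the comment above; whether this applies to Parquet loads is the open question
job_config.ignore_unknown_values = True

client.load_table_from_dataframe(
    df,
    "my_project.my_dataset.my_table",   # placeholder table ID
    job_config=job_config,
).result()
```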
What if for the default we check if the index starts with double-underscores and drop it then? |
We could check if the name is |
Ah yeah, you're right. |
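A sketch of that heuristic (the function name is made up); pyarrow writes an unnamed pandas index to Parquet as __index_level_0__, so "unnamed or dunder-prefixed" is a reasonable proxy for "auto-generated":

```python
def looks_like_default_index(index):
    # True when the index is unnamed or carries pyarrow's auto-generated
    # "__index_level_0__"-style name, i.e. probably safe to drop
    return index.name is None or str(index.name).startswith("__")
```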
Thanks for the research on this. So has there been a conclusion on this? What do you think is the best practice to query big datasets from Google BigQuery using Python then? I am experiencing very slow performance from the pandas function |
@tsando the latest version is better than prior versions, so it's worth trying to update to master. Depending on how product trajectory changes, I may finish off the |
@max-sixty thanks, can I do that via |
actually this doesn't work for me... do you know how I can 'update to master'? |
You had it correct the first time! Why doesn't that work? |
I don't use conda, so this is just a guess, but you may want to run |
I get the following error if I remove
|
Could this be because I need to compile all of pandas again to pick up the dev version of |
My conda knowledge has capped out unfortunately. I imagine there's a standard way to install libraries via pip (and have them recognized!)? It may be worth asking on SO with a conda tag... |
It takes 2 hours to read a 2G table from BigQuery into pandas. I'm using pandas_gbq 0.7.0. So +1 to this issue. Do you think master will be faster? |
Master won't be faster than 0.7.0; you'd need a solution like the 'bulk load' one for that size (though 2 hours seems excessive! Maybe with more memory / CPU you can get 2-5x faster, but it's still going to be painful with JSON over HTTP). If exporting large datasets is important to you, please feel free to raise the prioritization of better APIs with your Google coverage. |
Does anyone have a workaround for this? I have a table that's just 2gb but with 31 million rows and it's a pain in the ass to get that data into a dataframe |
Running the 'bulk load' code - exporting to CSV, downloading, importing - is the best way for a large dataset. |
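A rough sketch of that workaround (not from the thread): the bucket and table IDs are placeholders, a string table ID assumes a recent google-cloud-bigquery, and, as noted earlier in the thread, everything comes back as strings so dtypes need fixing afterwards:

```python
import pandas as pd
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

table_id = "my_project.my_dataset.my_table"                # placeholder
destination_uri = "gs://my-bucket/export/my_table-*.csv"   # placeholder; the wildcard shards large exports

# 1. Export the table to CSV files on GCS
bq.extract_table(table_id, destination_uri).result()

# 2. Download each shard and read it into pandas
frames = []
for blob in gcs.bucket("my-bucket").list_blobs(prefix="export/my_table-"):
    path = "/tmp/" + blob.name.rsplit("/", 1)[-1]
    blob.download_to_filename(path)
    frames.append(pd.read_csv(path))

df = pd.concat(frames, ignore_index=True)
```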
I decided to do some profiling today to see where all the time is spent, following this Python profiling guide. pandas_gbq_bench.py:
We see very similar results from google-cloud-bigquery to_dataframe:
Most of the time is spent actually waiting on the BigQuery API. JSON parsing is a much smaller fraction. |
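The pandas_gbq_bench.py script and its output aren't preserved above; a minimal cProfile harness in the same spirit might look like this (the project ID is a placeholder):

```python
import cProfile
import pstats

import pandas_gbq

def bench():
    pandas_gbq.read_gbq(
        "SELECT year, value, date "
        "FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 1000000",
        project_id="my-project",   # placeholder
        dialect="standard",
    )

cProfile.run("bench()", "pandas_gbq_bench.prof")
pstats.Stats("pandas_gbq_bench.prof").sort_stats("cumulative").print_stats(20)
```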
That's great @tswast, thanks for running that. As a comparison, running the 'bulk query' path:

...total of 55 seconds. Running the same test you did took 419 seconds (and the 17-second query was cached, so we should add that back). My conclusion is that it's a combination of a) parsing the JSON and b) downloading inefficiently.

One note:
When you say "waiting", do you mean downloading? Or waiting for the query to finish?
|
I mean downloading. The BigQuery tabledata.list API for downloading JSON rows is clearly not as optimized as reading bytes from GCS. |
Released today (https://cloud.google.com/bigquery/docs/release-notes#february_22_2019): the BigQuery Storage API. It should make getting data into pandas a lot faster: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas

I'm thinking of adding a parameter: |
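From the linked guide, the pattern looks roughly like this (module paths follow the beta client that shipped with the announcement, so they may since have changed):

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

bq = bigquery.Client()
bqstorage = bigquery_storage_v1beta1.BigQueryStorageClient()

df = (
    bq.query(
        "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000"
    )
    .result()
    .to_dataframe(bqstorage_client=bqstorage)   # streams over the Storage API instead of paged JSON
)
```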
Amazing!! This is great news. I'll check it out. |
This is exciting. It even adds the ability to add simple row filters without creating a copy of the table. The only odd thing about the API is that selected fields must be specified - there is no "select *" equivalent when reading directly from the Storage API. |
I started working on adding the BigQuery Storage API to pandas-gbq (though I haven't gotten too far, yet). master...tswast:issue133-performance-bqstorage How about we add a
Should |
I got pulled into some other client work, but am slowly getting back to this. I'm intending to add a |
And released in pandas-gbq version 0.10.0! It's definitely not the last word in "performance", as I know the Avro blocks from BQ Storage -> DataFrame logic has a lot of room for optimization, but that's probably more for the |
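For reference, usage with 0.10.0 looks roughly like this (parameter name per that release; the project ID is a placeholder):

```python
import pandas_gbq

df = pandas_gbq.read_gbq(
    "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000",
    project_id="my-project",       # placeholder
    use_bqstorage_api=True,        # needs the google-cloud-bigquery-storage client installed
    dialect="standard",
)
```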
Thanks for everyone's contribution to the issue! Most of all @tswast for getting us most of the way there! |
We're starting to use BigQuery heavily but becoming increasingly 'bottlenecked' with the performance of moving moderate amounts of data from BigQuery to Python.

Here are a few stats (the two code paths are sketched below):
- google-cloud-bigquery - i.e. client.query(query).to_dataframe()
- a traditional database (pd.read_sql). That's on standard drivers, nothing like turbodbc involved

...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
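Roughly, the two paths being compared (table and connection details are placeholders):

```python
import pandas as pd
from google.cloud import bigquery
from sqlalchemy import create_engine

query = "SELECT * FROM my_table LIMIT 1000000"                  # placeholder

# BigQuery path
df_bq = bigquery.Client().query(query).to_dataframe()

# traditional-database path
engine = create_engine("postgresql://user:pass@localhost/db")   # placeholder connection string
df_sql = pd.read_sql(query, engine)
```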
We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.
A few questions - feel free to jump in with partial replies:
- Should pandas-gbq invest resources into getting a format that's query-able in exploratory workflows that can deal with more reasonable datasets? (or at least encourage Google to)