Performance #133

Closed
max-sixty opened this issue Feb 20, 2018 · 45 comments
Comments

@max-sixty
Contributor

We're starting to use BigQuery heavily but are becoming increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to python.

Here are a few stats:

  • 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using pandas-gbq
  • 36.5s: Pulling the same query with google-cloud-bigquery - i.e. client.query(query).to_dataframe()
  • 2.4s: Pulling very similar data - same types, same size, from our existing MSSQL box hosted in AWS (using pd.read_sql). That's on standard drivers, nothing like turbodbc involved

...so using BigQuery with python is at least an order of magnitude slower than traditional DBs.

We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.

A few questions - feel free to jump in with partial replies:

  • Are these results expected, or are we doing something very wrong?
  • My prior is that a lot of this slowdown is caused by pulling in HTTP pages, converting to python objects, and then writing those into arrays. Is this approach really scalable? Should pandas-gbq invest resources into getting a format that's query-able in exploratory workflows that can deal with more reasonable datasets? (or at least encourage Google to)
@tswast
Collaborator

tswast commented Feb 21, 2018

For some more stats, we can try running the code in the "benchmark" folder for google-cloud-python. https://github.com/GoogleCloudPlatform/google-cloud-python/tree/6391eeef863d0c1b71cb0c6fa86f4d9831b814fa/bigquery/benchmark That should give us a baseline.

@max-sixty
Contributor Author

Need to look at this more, but TensorFlow looks like they've built something that may be fast, in C: tensorflow/tensorflow@1b881b7

@tswast
Collaborator

tswast commented Mar 22, 2018

Duplicate of #66?

@max-sixty
Contributor Author

Yes. I've done some work here and can improve time by about 2x and stop the bloating of memory. I'll post something soon-ish. Let's leave this open for the stats above

@max-sixty
Contributor Author

I haven't got around to getting this into a perfect state, but here's a dump for the moment:

Requirements

In order to go from BigQuery to Pandas we, at a minimum, need to:

  • Get some JSON over HTTP
  • Decode the JSON into Python objects
  • Create an array from the Python objects

Dos & Don'ts

Given the nature of python - where objects are expensive in both construction time and memory usage - there are some things we want to avoid:

  • Creating a python object for each value
  • Creating an additional python class instance for each row
  • Constructing all python objects before starting to pack them into arrays

Instead we want to:

  • Construct as few objects as possible
  • Convert objects to arrays and deref the objects as soon as possible

Current state

But currently neither pandas-gbq nor google.cloud.bigquery follows those principles (sometimes, though not always, understandably):

  • Each Row is a new python object instance (though using slots reduces the memory burden)
  • All rows need to be loaded into memory before we start packing into arrays
  • I haven't spent enough time in the google.cloud.bigquery code to be sure, but at first glance each item requires a lot of python calls

Benchmark

On an approximate benchmark, both libraries take around a minute:

query="""
SELECT year, value, date FROM `bigquery-public-data.bls.unemployment_cps` 
LIMIT 1000000
"""

# I ran the query to ensure it was cached

pd.read_gbq(query, project_id='x', dialect='standard')

#Requesting query... ok.
#Job ID: c7d73162-afe2-4ebb-9f7a-b9503fd553fe
#Query running...
#Query done.
#Cache hit.

#Retrieving results...
#Got 1000000 rows.

"Total time taken 58.01 s."
# Finished at 2018-03-23 02:34:45.

c = Client()
q=c.query(query)
i=q.result()


%timeit -n 1 -r 1 i.to_dataframe()
"1 loop, best of 1: 1min 12s per loop"

Prototype

I did a quick prototype to see if reducing the python overhead would help. (These are notes from a notebook from a while ago, so they may not be fully reproducible.)

To be clear, this is very hacky and makes a bunch of assumptions that are not generalizable.

def get_lists(page):
    # Pull the raw values out of each row's JSON dict without constructing
    # Row objects.
    for d in page:
        dicts, = d.values()
        yield (x for y in dicts for x in y.values())

def dfs(i):
    # Return the raw JSON item instead of constructing a Row object per item.
    i.item_to_value = lambda iterator, item: item
    for page in i.pages:
        df = pd.DataFrame(get_lists(page))
        # the foreknowledge of the column types is obviously a cheat
        df[[0, 1]] = df[[0, 1]].astype(float)
        df[2] = pd.to_datetime(df[2])
        yield df

%timeit -n 1 -r 1 pd.concat(dfs(i))
# "1 loop, best of 1: 35.8 s per loop"

The two biggest changes are:

  • Pack each page into a df before moving on to the next page (python will throw away the temporary objects, so you're limited by the memory of the array-packed data rather than the memory of the python objects)
  • Don't do excess work on each row - in particular, don't construct an object per row

...and we're obviously not doing any work around type checking, so this is a rosy picture.

Stepping back

I think there's some case for working through whether this is generalizable. But, for larger datasets, we'd still be fairly slow relative to downloading CSVs:

  • Still constructing python objects before putting into arrays
  • The specific JSON format is very inefficient - a dictionary for each row, and another for each value

And even CSVs aren't that great - everything comes back as a string, with no native types.

Options

  • Stay where we are
  • Roll out some of these changes (e.g. closest change is to construct an array-per-page)
  • Allow larger datasets through a different path - probably exporting to a GCS file (CSV / parquet?), downloading that file, and loading that

@tswast
Collaborator

tswast commented Apr 21, 2018

Foreknowledge isn't actually as big of a cheat as you might think. Every query creates a temporary table, from which we could get the total_rows (on RowIterator, only populated after the first page) / num_rows (on the Table) and the schema. From these we can construct the dataframe ahead of time and then populate it.
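
Something like this sketch (untested; it assumes the RowIterator.schema / .pages / Row.values() interfaces of google-cloud-bigquery at the time, and the type mapping is rough) - name and type the columns from the schema up front, and pack each page straight into arrays; total_rows could additionally be used to pre-allocate:

import pandas as pd
from google.cloud import bigquery

# Very rough BigQuery-type -> pandas-dtype mapping, for illustration only.
_BQ_TO_DTYPE = {
    "INTEGER": "float64",  # float so NULLs stay representable
    "FLOAT": "float64",
}

def query_to_dataframe(client, query):
    rows = client.query(query).result()  # RowIterator; the schema is known up front
    columns = [field.name for field in rows.schema]
    frames = []
    for page in rows.pages:
        # Pack each page into an array-backed frame immediately so the
        # temporary per-row Python objects can be garbage collected.
        frames.append(pd.DataFrame([row.values() for row in page], columns=columns))
    df = pd.concat(frames, ignore_index=True)
    # Apply the schema-derived dtypes after the fact.
    for field in rows.schema:
        dtype = _BQ_TO_DTYPE.get(field.field_type)
        if dtype:
            df[field.name] = df[field.name].astype(dtype)
    return df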

I'm still not sure how to get around all the JSON though. Export to GCS and download is really the only option right now. Unfortunately Parquet is supported for import to BigQuery but not export yet (I think they are working on it).

@tswast
Collaborator

tswast commented Apr 21, 2018

I think an array-per-page would be a reasonable step. We could try blasting up the page_size to see if that helps, though I think the default might give the largest possible page already.
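
A quick way to experiment with that, using the same public table as the benchmark above (list_rows accepts a page_size argument, though the backend may cap the effective page size regardless of what we request):

from google.cloud import bigquery

client = bigquery.Client()
table_ref = bigquery.TableReference.from_string(
    'bigquery-public-data.bls.unemployment_cps')
table = client.get_table(table_ref)

rows = client.list_rows(table, page_size=100000)
for page in rows.pages:
    # See how many rows the API actually returns per page.
    print(page.num_items)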

@max-sixty
Contributor Author

From https://www.tidyverse.org/articles/2018/04/bigrquery-1-0-0/

Improved download speeds
The system for downloading data from BigQuery into R has been rewritten from the ground up to considerably improve performance:

By default, data is downloaded from BigQuery in pages of 10,000 rows. Previously, bigrquery downloaded then parsed each page. Now, bigrquery downloads all pages, then parses all pages. This means that you’ll now see two progress bars: one for downloading JSON from BigQuery and one for parsing that JSON into a data frame.

Because all pages are downloaded in a single pass, we can now download in parallel, using up to 6 simultaneous connections by default. This generally doesn’t result in a six-fold speed up, but should at least double download speed.

The parsing code has been rewritten in C++. This eliminates several expensive intermediate computations, and means that bigrquery no longer requires readr.

All up, I can now download the first million rows of publicdata.samples.natality in about a minute, about 8x faster than the previous version. This data frame takes up 170 MB of space in BigQuery and 140 MB of memory in R, so a minute to download doesn’t seem unreasonable. The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON format, which is difficult to optimise further because I’m already using the fastest C++ JSON parser, RapidJson. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach.

@max-sixty
Contributor Author

We just moved internally from using a CSV loader to the google.cloud.bigquery loader, and it got 3x faster:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 2), columns=list('ab')).assign(date=pd.Timestamp('2017'), s='abc')
%timeit -n 1 -r 1 write_gbq(df, dataset_id='ishares', table_name='test', if_exists='replace')

# new: 1 loop, best of 1: 11 s per loop
# old: 1 loop, best of 1: 32.3 s per loop

It would be good if we could move this library over to delegating to that, too.

@tswast
Collaborator

tswast commented Aug 2, 2018

Great to hear! The one difference with the google.cloud.bigquery loader that I know of is that it uploads the index column and pandas-gbq doesn't. Should be doable to "drop indexes" in pandas-gbq or add a feature to google.cloud.bigquery to do that for us.

@max-sixty
Contributor Author

Do you know whether it'll upload a default index? (I can look if you don't know off hand)

That would mean we need to make a change in that library if we want to maintain back-compat (and I also think forcing an index upload is bad). We could surmount this by setting a random column as the index, but I think that's going to result in a bunch of small bugs around index vs column differences.

@tswast
Collaborator

tswast commented Aug 2, 2018

Yes, I believe it does upload the default index. I couldn't figure out a good way to avoid it when using Parquet.

We probably can't always remove the index because I think we do want to upload the index when an explicit index is specified, as we do in the code sample:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/9d64d3094bea8002fce329c96c70cb6f48684633/docs/bigquery/snippets.py#L3098-L3129

@max-sixty
Contributor Author

Yes - I had a look at the parquet libraries and it doesn't look like excluding the indexes is an option.

I'd probably favor having parquet over optional indexes, though I'm not sure if it's worth the change. What do you think?

@tswast
Collaborator

tswast commented Aug 3, 2018

There are a couple of other benefits to parquet, such as the ability to support struct and array columns (as well as we can, anyway). Let's try it out and see what tests break.

My biggest concern is that it does require we add pyarrow as a dependency.

@max-sixty
Contributor Author

Let's try it out and see what tests break.

Right - to be clear - (I think) the output will always be different - the existing version never uploads an index, and the parquet version always uploads an index.

We can set the first column as the index as a workaround, though I worry that might cause some bugs.

@max-sixty
Contributor Author

max-sixty commented Aug 7, 2018

I found a resolution to this - if we construct the schema ourselves, we can choose to exclude the index:

Here's a sample:

    config = bigquery.job.LoadJobConfig()
    config.write_disposition = if_exists_map[if_exists]
    # the schema is built from the dataframe's columns only, so the index is excluded
    config.schema = _bq_schema(df, schema_overrides)

    job = client.load_table_from_dataframe(
        dataframe=df,
        destination=table,
        job_config=config)

So we can have a param on the write function that decides whether to upload the index.

Bonus points: what should the default be? I think probably to discard the index, since that was the previous implementation. But we could alternatively attempt to assess whether the index is useful; e.g. if named (which we do internally) or if a non-default index.
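
A sketch of the "is the index useful" heuristic (hypothetical helper name, not implemented anywhere yet): drop the index when it's an unnamed default RangeIndex, keep it otherwise. An explicit index=True/False parameter on the write function could override it.

import pandas as pd

def _should_upload_index(df):
    # Treat an unnamed RangeIndex as "not useful" and drop it; keep anything
    # that looks intentional (named, or not a default integer range).
    is_default = isinstance(df.index, pd.RangeIndex) and df.index.name is None
    return not is_default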

PS: Not sure this behavior (i.e. discarding columns not specified in the schema without raising an error) is reliable. @tswast lmk if you know

@tswast
Collaborator

tswast commented Aug 7, 2018

PS: Not sure this behavior (i.e. discarding columns not specified in the schema without raising an error) is reliable. @tswast lmk if you know

Sounds like a bug in BigQuery. We should set configuration.load.ignoreUnknownValues to be safe.
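
In the Python client that REST field corresponds, as far as I know, to LoadJobConfig.ignore_unknown_values, so the snippet above would gain roughly one line (if_exists_map, _bq_schema, etc. are the names from that snippet):

    config = bigquery.job.LoadJobConfig()
    config.write_disposition = if_exists_map[if_exists]
    config.schema = _bq_schema(df, schema_overrides)
    # Ask BigQuery to skip values not described in the schema
    # (configuration.load.ignoreUnknownValues) rather than rely on the
    # undocumented dropping behaviour.
    config.ignore_unknown_values = True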

@tswast
Collaborator

tswast commented Aug 7, 2018

Bonus points: what should the default be? I think probably to discard the index, since that was the previous implementation. But we could alternatively attempt to assess whether the index is useful; e.g. if named (which we do internally) or if a non-default index.

What if for the default we check if the index starts with double-underscores and drop it then?

@max-sixty
Contributor Author

What if for the default we check if the index starts with double-underscores and drop it then?

We could check if the name is None?
It's __index... only in the parquet file (because it can't be None there), so that would be too late for us to check that. Or am I misunderstanding?

@tswast
Collaborator

tswast commented Aug 7, 2018

It's __index... only in the parquet file (because it can't be None there)

Ah yeah, you're right.

@tsando

tsando commented Sep 28, 2018

Thanks for the research on this. So has there been a conclusion? What do you think is the best practice for querying big datasets from Google BigQuery using python? I am experiencing very slow performance with the pandas read_gbq function.

@max-sixty
Contributor Author

@tsando the latest version is better than prior versions, so it's worth trying to update to master.

Depending on how the product trajectory changes, I may finish off the read_gbq_bulk function.

@tsando

tsando commented Oct 1, 2018

@max-sixty thanks, can I do that via $ pip install git+https://github.com/pydata/pandas-gbq.git?

@tsando

tsando commented Oct 1, 2018

Actually this doesn't work for me... do you know how I can 'update to master'?

@max-sixty
Contributor Author

You had it correct the first time! Why doesn't that work?

@tsando

tsando commented Oct 3, 2018

Well, I have anaconda in my environment, so when I install as above I get two versions - one from conda and one from pip:

[screenshot of the installed package versions]

So I tried to uninstall the conda version, but if I do, then I get an error when I call the function complaining that the module is not found.

Thanks for your help!

@max-sixty
Contributor Author

I don't use conda, so this is just a guess, but you may want to run conda install git+https://github.com/pydata/pandas-gbq.git rather than pip install ...

@tsando

tsando commented Oct 3, 2018

I get the following error if I remove pandas-gbq from anaconda and install the dev version via pip:

ImportError: Load data from Google BigQuery

the pandas-gbq package is not installed
see the docs: https://pandas-gbq.readthedocs.io

you can install via pip or conda:
pip install pandas-gbq
conda install pandas-gbq -c conda-forge

@tsando

tsando commented Oct 3, 2018

Could this be because I need to compile all of pandas again to pick up the dev version of pandas-gbq?

@max-sixty
Contributor Author

My conda knowledge has capped out unfortunately. I imagine there's a standard way to install libraries via pip (and have them recognized!)? It may be worth asking on SO with a conda tag...

@melissachang
Contributor

It takes 2 hours to read a 2G table from BigQuery into pandas. I'm using pandas_gbq 0.7.0. So +1 to this issue.

Do you think master will be faster?

@max-sixty
Contributor Author

Master won't be faster than 0.7.0; you'd need a solution like the 'bulk load' one for that size (though 2 hours seems excessive! Maybe with more memory / CPU you can get 2-5x faster, but it's still going to be painful with JSON over HTTP)

If exporting large datasets is important to you, please feel free to raise the prioritization of better APIs with your Google coverage

@ruojol

ruojol commented Oct 26, 2018

Does anyone have a workaround for this? I have a table that's just 2gb but with 31 million rows and it's a pain in the ass to get that data into a dataframe

@max-sixty
Contributor Author

Running the 'bulk load' code (exporting to CSV, downloading, and importing) is the best way for a large dataset.
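
For anyone looking for a concrete starting point, here's a minimal sketch of that path (the bucket, dataset, and table names are placeholders; error handling, cleanup of the temporary table, and type coercion are all glossed over):

import os

import pandas as pd
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

dest = bigquery.TableReference.from_string('my-project.my_dataset.tmp_extract')

# 1. Run the query into a destination table.
job_config = bigquery.QueryJobConfig()
job_config.destination = dest
bq.query('SELECT * FROM `my-project.my_dataset.big_table`',
         job_config=job_config).result()

# 2. Extract the destination table to (possibly sharded) CSVs on GCS.
bq.extract_table(dest, 'gs://my-bucket/extract/part-*.csv').result()

# 3. Download each shard and parse it with pandas.
frames = []
for blob in gcs.bucket('my-bucket').list_blobs(prefix='extract/'):
    local_path = os.path.join('/tmp', blob.name.replace('/', '_'))
    blob.download_to_filename(local_path)
    frames.append(pd.read_csv(local_path))
df = pd.concat(frames, ignore_index=True)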

@tswast
Collaborator

tswast commented Oct 26, 2018

I decided to do some profiling today to see where all the time is spent, following this Python profiling guide.

pandas_gbq_bench.py:

import pandas_gbq
pandas_gbq.read_gbq(
    "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    dialect='standard')
$ python -m cProfile -o pandas_gbq.cprof pandas_gbq_bench.py
$ pyprof2calltree -k -i pandas_gbq.cprof

[kcachegrind view of the profile, opened via pyprof2calltree]

$ head -n 100 pandas_gbq_bench.txt
         253357679 function calls (253138077 primitive calls) in 328.945 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    766/1    0.015    0.000  328.947  328.947 {built-in method builtins.exec}
        1    1.400    1.400  328.947  328.947 pandas_gbq_bench.py:2(<module>)
        1    0.000    0.000  327.030  327.030 gbq.py:622(read_gbq)
        1    0.791    0.791  291.680  291.680 gbq.py:327(run_query)
  5552453    2.724    0.000  290.075    0.000 page_iterator.py:197(_items_iter)
       62    0.001    0.000  222.242    3.585 client.py:332(_call_api)
    64/60    0.001    0.000  222.241    3.704 retry.py:249(retry_wrapped_func)
    64/60    0.077    0.001  222.240    3.704 retry.py:140(retry_target)
       62    0.001    0.000  222.160    3.583 _http.py:214(api_request)
       57    0.000    0.000  218.804    3.839 page_iterator.py:218(_page_iter)
       57    0.001    0.000  218.804    3.839 page_iterator.py:341(_next_page)
       56    0.001    0.000  218.803    3.907 table.py:1127(_get_next_page_response)
       62    0.000    0.000  206.398    3.329 _http.py:142(_make_request)
       62    0.001    0.000  206.398    3.329 _http.py:185(_do_request)
       62    0.001    0.000  206.396    3.329 requests.py:181(request)
       63    0.001    0.000  206.368    3.276 sessions.py:441(request)
       63    0.003    0.000  205.819    3.267 sessions.py:589(send)
    42157    0.130    0.000  198.129    0.005 socket.py:572(readinto)
    42157    0.110    0.000  197.840    0.005 ssl.py:998(recv_into)
    42157    0.081    0.000  197.700    0.005 ssl.py:863(read)
    42157    0.056    0.000  197.612    0.005 ssl.py:624(read)
    42157  197.556    0.005  197.556    0.005 {method 'read' of '_ssl._SSLSocket' objects}
   102716    0.210    0.000  194.229    0.002 {method 'readline' of '_io.BufferedReader' objects}
       63    0.001    0.000  189.560    3.009 adapters.py:388(send)
       63    0.002    0.000  189.512    3.008 connectionpool.py:447(urlopen)
       63    0.002    0.000  189.476    3.008 connectionpool.py:322(_make_request)
       63    0.001    0.000  188.468    2.992 client.py:1287(getresponse)
       63    0.002    0.000  188.465    2.992 client.py:290(begin)
       63    0.003    0.000  188.380    2.990 client.py:257(_read_status)
  5552508    4.559    0.000   68.547    0.000 page_iterator.py:122(next)
  5552452    6.358    0.000   59.783    0.000 table.py:1321(_item_to_row)
  5552452   32.403    0.000   50.283    0.000 _helpers.py:197(_row_tuple_from_json)
        1    0.291    0.291   32.305   32.305 gbq.py:603(_parse_data)
        1    0.853    0.853   30.564   30.564 frame.py:334(__init__)
        1    0.000    0.000   26.148   26.148 frame.py:7453(_to_arrays)
        1    8.938    8.938   23.952   23.952 __init__.py:130(lmap)
      250    0.001    0.000   16.248    0.065 models.py:810(content)
202383/638    0.702    0.000   16.247    0.025 {method 'join' of 'bytes' objects}
       62    0.071    0.001   15.755    0.254 models.py:868(json)
    57160    0.034    0.000   15.588    0.000 models.py:741(generate)
    57160    0.035    0.000   15.554    0.000 response.py:415(stream)
    57160    0.296    0.000   15.519    0.000 response.py:571(read_chunked)
       67    0.001    0.000   15.042    0.225 __init__.py:302(loads)
       67    0.001    0.000   15.041    0.224 decoder.py:334(decode)
       67   15.039    0.224   15.039    0.224 decoder.py:345(raw_decode)
 33314712    9.321    0.000   13.310    0.000 table.py:1073(__getitem__)
 11104904    6.516    0.000    7.641    0.000 _helpers.py:38(_int_from_json)
   101190    0.280    0.000    6.193    0.000 response.py:535(_update_chunk_length)
   101127    0.249    0.000    5.511    0.000 response.py:549(_handle_chunk)
   201745    0.484    0.000    5.263    0.000 client.py:596(_safe_read)

We see very similar results from google-cloud-bigquery to_dataframe:

from google.cloud import bigquery
client = bigquery.Client()

table_ref = bigquery.TableReference.from_string(
    'bigquery-public-data.usa_names.usa_1910_2013')
table = client.get_table(table_ref)
rows = client.list_rows(table)
rows.to_dataframe()

[kcachegrind view of the to_dataframe profile]

Most of the time is spent actually waiting on the BigQuery API. JSON parsing is a much smaller fraction.

@max-sixty
Contributor Author

max-sixty commented Oct 27, 2018

That's great @tswast, thanks for running that.

As a comparison, running the 'bulk query' path:

  • 17 seconds to run the query with a destination table
  • 17 seconds (also) to extract the table to a CSV
  • 10 seconds to download the CSV
  • 10 seconds to parse to a df

...total of 55 seconds

Running the same test you did took 419 seconds (and the 17 second query was cached, so we should add that back).
Looking at your stats, it looks like about half of that is reading from the API (though I couldn't see the absolute numbers). Looking at my profile, 170s is reading from the socket.
I'm not sure why reading the JSON over HTTP takes 170-200 seconds, while downloading the CSV takes 10s. But I know very little about this area, so I'm not surprised that I'm surprised.

My conclusion is that it's a combination of a) parsing the JSON and b) downloading inefficiently.
Is that right?

One note:

Most of the time is spent actually waiting on the BigQuery API

When you say "waiting", do you mean downloading? Or waiting for the query to finish?

         260937310 function calls (260707117 primitive calls) in 418.889 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    51634  171.764    0.003  171.764    0.003 {method 'read' of '_ssl._SSLSocket' objects}
  5552452   56.045    0.000   93.541    0.000 _helpers.py:197(_row_tuple_from_json)
       65   46.776    0.720   46.776    0.720 decoder.py:370(raw_decode)
 33314712   19.022    0.000   26.235    0.000 table.py:1073(__getitem__)
      378   17.039    0.045   46.087    0.122 {map}
 11104904   11.411    0.000   13.319    0.000 _helpers.py:38(_int_from_json)
  5552452   10.476    0.000   10.476    0.000 {zip}
  5552452    7.478    0.000  105.049    0.000 table.py:1321(_item_to_row)
33321903/33321778    7.217    0.000    7.229    0.000 {isinstance}
      194    7.113    0.037    7.113    0.037 {_codecs.utf_8_decode}
  5552508    5.432    0.000  114.877    0.000 page_iterator.py:122(next)
 27762292    4.400    0.000    4.400    0.000 schema.py:73(field_type)
  5552508    4.396    0.000    4.396    0.000 {next}
        5    4.228    0.846    4.228    0.846 {pandas._libs.lib.maybe_convert_objects}
 27762276    4.178    0.000    4.178    0.000 schema.py:82(mode)
  5552453    3.811    0.000  350.807    0.000 page_iterator.py:197(_items_iter)
 27994555    3.256    0.000    3.256    0.000 {method 'append' of 'list' objects}
5552514/5552513    3.148    0.000    3.148    0.000 {iter}
   114672    2.921    0.000    2.921    0.000 {built-in method decompress}
        1    2.526    2.526  417.437  417.437 gbq.py:24(read_gbq)
  5552453    2.406    0.000    2.406    0.000 table.py:1142(schema)
  5552452    2.270    0.000    2.813    0.000 table.py:1070(__len__)
 11104904    1.908    0.000    1.908    0.000 _helpers.py:33(_not_null)
 16657356    1.908    0.000    1.908    0.000 _helpers.py:62(_string_from_json)
  5552452    1.624    0.000    1.624    0.000 table.py:987(__init__)
  5552453    1.593    0.000    4.741    0.000 gbq.py:602(<genexpr>)
        3    1.560    0.520    1.560    0.520 {pandas._libs.missing.isnaobj}
        1    1.559    1.559    1.559    1.559 {pandas._libs.lib.to_object_array_tuples}
        1    1.452    1.452  418.889  418.889 <string>:1(<module>)
        1    1.151    1.151   58.203   58.203 frame.py:334(__init__)
        1    1.001    1.001  352.762  352.762 gbq.py:323(run_query)
   229202    0.820    0.000    2.691    0.000 socket.py:340(read)
5829895/5829880    0.582    0.000    0.583    0.000 {len}
       62    0.566    0.009   54.458    0.878 models.py:872(json)
   229202    0.555    0.000    3.412    0.000 httplib.py:667(_safe_read)
        1    0.454    0.454    0.454    0.454 {method 'read' of 'file' objects}
   115609    0.451    0.000  171.478    0.001 socket.py:410(readline)
        1    0.415    0.415   60.529   60.529 gbq.py:599(_parse_data)
        2    0.368    0.184    0.408    0.204 internals.py:5017(_stack_arrays)
   114735    0.286    0.000    5.938    0.000 response.py:564(_update_chunk_length)
   366738    0.280    0.000    0.280    0.000 {method 'write' of 'cStringIO.StringO' objects}
   114735    0.277    0.000    3.198    0.000 response.py:71(decompress)
231175/1910    0.267    0.000   13.432    0.007 {method 'join' of 'str' objects}
    65018    0.245    0.000   13.135    0.000 response.py:600(read_chunked)
   524477    0.242    0.000    0.242    0.000 {method 'read' of 'cStringIO.StringO' objects}
       10    0.223    0.022    1.785    0.178 missing.py:189(_isna_ndarraylike)
   114672    0.221    0.000    3.633    0.000 response.py:578(_handle_chunk)
   669812    0.180    0.000    0.180    0.000 {method 'seek' of 'cStringIO.StringO' objects}
   114672    0.113    0.000    3.311    0.000 response.py:289(_decode)
    51634    0.079    0.000  171.851    0.003 ssl.py:642(read)
   344874    0.077    0.000    0.077    0.000 {cStringIO.StringIO}
    51634    0.068    0.000  171.929    0.003 ssl.py:759(recv)
   229416    0.067    0.000    0.067    0.000 {min}

@tswast
Collaborator

tswast commented Oct 28, 2018

When you say "waiting", do you mean downloading? Or waiting for the query to finish?

I mean downloading. The BigQuery tabledata.list API for downloading JSON rows is clearly not as optimized as reading bytes from GCS.

@tswast
Collaborator

tswast commented Feb 22, 2019

The BigQuery Storage API was released today (https://cloud.google.com/bigquery/docs/release-notes#february_22_2019). It should make getting data into pandas a lot faster: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas

I'm thinking of adding a use_bqstorage=False parameter to read_gbq() to optionally use this API.
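
For reference, the pattern from the linked docs looks roughly like this (assuming google-cloud-bigquery 1.9.0+ and the google-cloud-bigquery-storage package; module paths are the v1beta1 ones current at the time):

from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

client = bigquery.Client()
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

table = client.get_table(bigquery.TableReference.from_string(
    'bigquery-public-data.usa_names.usa_1910_2013'))
# Passing the storage client makes to_dataframe stream Avro blocks via the
# BigQuery Storage API instead of paging JSON through tabledata.list.
df = client.list_rows(table).to_dataframe(bqstorage_client=bqstorage_client)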

@max-sixty
Contributor Author

Amazing!! This is great news. I'll check it out.

@smith-m

smith-m commented Mar 10, 2019

This is exciting. It even adds the ability to apply simple row filters without creating a copy of the table. The only odd thing about the API is that selected fields must be specified - there is no "select *" equivalent when reading directly from the storage API.

@tswast
Collaborator

tswast commented Mar 16, 2019

I started working on adding the BigQuery Storage API to pandas-gbq (though I haven't gotten too far, yet). master...tswast:issue133-performance-bqstorage

How about we add a use_bqstorage_api parameter to read_gbq? It should default to False for a couple of reasons:

  1. It increases the overall cost.
  2. It cannot read query results for small (<100 MB or so) result sets unless they are written to a destination table.

Should google-cloud-bigquery-storage be an optional dependency? I'm thinking yes, but I guess I should see how big the overall package is before making a decision. It'd probably be easier for folks to use if all they had to install was pandas-gbq.

@tswast
Collaborator

tswast commented Apr 2, 2019

I got pulled into some other client work, but am slowly getting back to this. I'm intending to add a use_bqstorage_api parameter to the %%bigquery magic in googleapis/google-cloud-python#7640. Implementation for pandas-gbq will be very similar.

@sebbacon

For reference, it appears @tswast merged this BQ Storage API support in #270

@tswast
Collaborator

tswast commented Apr 25, 2019

For reference, it appears @tswast merged this BQ Storage API support in #270

And released in pandas-gbq version 0.10.0!

It's definitely not the last word in "performance", as I know the Avro blocks from BQ Storage -> DataFrame logic has a lot of room for optimization, but that's probably more for the google-cloud-bigquery package to fix. (Which is still me, but just tracked in a different place.) I've filed googleapis/google-cloud-python#7805 to track further speed-ups we can get with the BQ Storage API.
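
For anyone landing here from search, usage in 0.10.0 looks roughly like this (it needs the optional google-cloud-bigquery-storage dependency installed):

import pandas_gbq

df = pandas_gbq.read_gbq(
    "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    dialect="standard",
    use_bqstorage_api=True,  # opt in to the faster BigQuery Storage API path
)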

@tswast tswast closed this as completed Apr 25, 2019
@max-sixty
Contributor Author

Thanks for everyone's contribution to the issue! Most of all @tswast for getting us most of the way there!
