Performance #133
Comments
For some more stats, we can try running the code in the "benchmark" folder for |
Need to look at this more, but it looks like TensorFlow has built something that may be fast, in C: tensorflow/tensorflow@1b881b7 |
Duplicate of #66? |
Yes. I've done some work here and can improve the time by about 2x and stop the memory from bloating. I'll post something soon-ish. Let's leave this open for the stats above. |
I haven't got around to getting this into a perfect state, but here's a dump for the moment:

Requirements

In order to go from BigQuery to Pandas we, at a minimum, need to:
Dos & Don'ts

Given the nature of Python - where objects are expensive in both construction time and memory usage - there are some things we want to avoid:
Instead we want to:
Current state

But currently neither pandas-gbq nor google-cloud-bigquery does this.
Benchmark

On an approximate benchmark, both libraries take around a minute:

import pandas as pd
from google.cloud.bigquery import Client

query = """
SELECT year, value, date FROM `bigquery-public-data.bls.unemployment_cps`
LIMIT 1000000
"""
# I ran the query to ensure it was cached
pd.read_gbq(query, project_id='x', dialect='standard')
#Requesting query... ok.
#Job ID: c7d73162-afe2-4ebb-9f7a-b9503fd553fe
#Query running...
#Query done.
#Cache hit.
#Retrieving results...
#Got 1000000 rows.
"Total time taken 58.01 s."
# Finished at 2018-03-23 02:34:45.
c = Client()
q=c.query(query)
i=q.result()
%timeit -n 1 -r 1 i.to_dataframe()
"1 loop, best of 1: 1min 12s per loop" PrototypeI did a quick prototype to see if reducing the python overhead would help. (these are notes from a notebook from a while ago, so may not be fully repro) TBC, this is v hacky, and makes a bunch of assumptions that are not generalizable. def get_lists(page):
for d in page:
dicts, = d.values()
yield (x for y in dicts for x in y.values())
def dfs(i):
i.item_to_value = lambda x,y: y
for page in i.pages:
df= pd.DataFrame(get_lists(page))
# the foreknowledge is obv a cheat
df[[0,1]] = df[[0,1]].astype(float)
df[2] = pd.to_datetime(df[2])
yield df
%timeit -n 1 -r 1 pd.concat(dfs(i))
"1 loop, best of 1: 35.8 s per loop" The biggest two changes are:
Stepping back

I think there's some case for working through whether this is generalizable. But, for larger datasets, we'd still be fairly slow relative to downloading CSVs:
And even CSVs aren't that great - everything is a string, no native formats.

Options
|
Foreknowledge isn't actually as big of a cheat as you might think. Every query creates a temporary table, from which we could get the total_rows (on RowIterator, only populated after the first page) / num_rows (on the Table) and the schema. From these we can construct the dataframe ahead of time and then populate. I'm still not sure how to get around all the JSON though. Export to GCS and download is really the only option right now. Unfortunately Parquet is supported for import to BigQuery but not export yet (I think they are working on it). |
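As a rough illustration of that idea (not from the thread): the temporary results table exposes both the row count and the schema through the public google-cloud-bigquery client, so the frame's shape and dtypes can be known before any rows are pulled. The query here is just an example:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT year, value, date "
    "FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 1000"
)
rows = job.result()                         # RowIterator; total_rows is only set after the first page
table = client.get_table(job.destination)   # the temporary table holding the query results

print(table.num_rows)                        # row count, known before downloading any rows
print([(field.name, field.field_type) for field in table.schema])
```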
I think an array-per-page would be a reasonable step. We could try blasting up the page_size to see if that helps, though I think the default might give the largest possible page already. |
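A sketch of what array-per-page could look like, assuming we iterate the RowIterator's pages and pass page_size through result(); this is illustrative only and hasn't been benchmarked:

```python
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000"
).result(page_size=50000)

arrays = []
for page in rows.pages:
    # one array per page instead of one Python object per cell
    arrays.append(np.array([row.values() for row in page]))

data = np.concatenate(arrays)
```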
From https://www.tidyverse.org/articles/2018/04/bigrquery-1-0-0/
|
We just moved internally from using a CSV loader to the
It would be good if we could move this library over to delegating to that, too. |
Great to hear! The one difference with the |
Do you know whether it'll upload a default index? (I can look if you don't know off hand) That would mean we need to make a change in that library if we want to maintain back-compat (and I also think forcing an index upload is bad). We could surmount this by setting a random column as the index, but I think that's going to result in a bunch of small bugs around index vs column differences. |
Yes, I believe it does upload the default index. I couldn't figure out a good way to avoid it when using Parquet. We probably can't always remove the index because I think we do want to upload the index when an explicit index is specified, as we do in the code sample: |
Yes - I had a look at the parquet libraries and it doesn't look like excluding the indexes is an option. I'd probably favor having parquet over optional indexes, though I'm not sure if it's worth the change. What do you think? |
There are a couple of other benefits to parquet, such as the ability to support struct and array columns (as well as we can, anyway). Let's try it out and see what tests break. My biggest concern is that it does require we add pyarrow as a dependency. |
Right - to be clear - (I think) the output will always be different - the existing version never uploads an index, and the parquet version always uploads an index. We can set the first column as the index as a workaround, though I worry that might cause some bugs. |
I found a resolution to this - if we construct the schema ourselves, we can choose to exclude the index: Here's a sample:
So we can have a param for this.

Bonus points: what should the default be? I think probably to discard the index, since that was the previous implementation. But we could alternatively attempt to assess whether the index is useful; e.g. if it's named (which we do internally) or if it's a non-default index.

PS: Not sure this behavior (i.e. discarding columns not specified in the schema without raising an error) is reliable. @tswast let me know if you know. |
Sounds like a bug in BigQuery. We should set configuration.load.ignoreUnknownValues to be safe. |
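A hypothetical sketch combining the two suggestions (the elided sample above isn't reproduced here): build the schema from the DataFrame's columns only and set ignore_unknown_values, so a serialized index gets dropped rather than loaded. The table ID and columns are placeholders, and passing a string table ID assumes a reasonably recent google-cloud-bigquery:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
# per the comment above; whether this applies to Parquet loads is the open question
job_config.ignore_unknown_values = True

client.load_table_from_dataframe(
    df,
    "my_project.my_dataset.my_table",   # placeholder table ID
    job_config=job_config,
).result()
```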
What if for the default we check if the index starts with double-underscores and drop it then? |
We could check if the name is |
Ah yeah, you're right. |
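A sketch of that heuristic (the function name is made up); pyarrow writes an unnamed pandas index to Parquet as __index_level_0__, so "unnamed or dunder-prefixed" is a reasonable proxy for "auto-generated":

```python
def looks_like_default_index(index):
    # True when the index is unnamed or carries pyarrow's auto-generated
    # "__index_level_0__"-style name, i.e. probably safe to drop
    return index.name is None or str(index.name).startswith("__")
```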
Thanks for the research on this. So has there been a conclusion on this? What do you think is the best practice to query big datasets from Google BigQuery using Python then? I am experiencing very slow performance from the pandas function |
@tsando the latest version is better than prior versions, so it's worth trying to update to master. Depending on how product trajectory changes, I may finish off the |
@max-sixty thanks, can I do that via |
actually this doesn't work for me... do you know how I can 'update to master'? |
You had it correct the first time! Why doesn't that work? |
I don't use conda, so this is just a guess, but you may want to run |
I get the following error if I remove
|
Could this be because I need to compile all of pandas again to pick up the dev version of |
My conda knowledge has capped out unfortunately. I imagine there's a standard way to install libraries via pip (and have them recognized!)? It may be worth asking on SO with a conda tag... |
It takes 2 hours to read a 2G table from BigQuery into pandas. I'm using pandas_gbq 0.7.0. So +1 to this issue. Do you think master will be faster? |
Master won't be faster than 0.7.0; you'd need a solution like the 'bulk load' one for that size (though 2 hours seems excessive! Maybe with more memory / CPU you can get 2-5x faster, but it's still going to be painful with JSON over HTTP). If exporting large datasets is important to you, please feel free to raise the prioritization of better APIs with your Google coverage. |
Does anyone have a workaround for this? I have a table that's just 2gb but with 31 million rows and it's a pain in the ass to get that data into a dataframe |
Running the 'bulk load' code - exporting to CSV, downloading, importing - is the best way for a large dataset. |
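A rough sketch of that workaround (not from the thread): the bucket and table IDs are placeholders, a string table ID assumes a recent google-cloud-bigquery, and, as noted earlier in the thread, everything comes back as strings so dtypes need fixing afterwards:

```python
import pandas as pd
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

table_id = "my_project.my_dataset.my_table"                # placeholder
destination_uri = "gs://my-bucket/export/my_table-*.csv"   # placeholder; the wildcard shards large exports

# 1. Export the table to CSV files on GCS
bq.extract_table(table_id, destination_uri).result()

# 2. Download each shard and read it into pandas
frames = []
for blob in gcs.bucket("my-bucket").list_blobs(prefix="export/my_table-"):
    path = "/tmp/" + blob.name.rsplit("/", 1)[-1]
    blob.download_to_filename(path)
    frames.append(pd.read_csv(path))

df = pd.concat(frames, ignore_index=True)
```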
I decided to do some profiling today to see where all the time is spent, following this Python profiling guide. pandas_gbq_bench.py:
We see very similar results from google-cloud-bigquery to_dataframe:
Most of the time is spent actually waiting on the BigQuery API. JSON parsing is a much smaller fraction. |
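The pandas_gbq_bench.py script and its output aren't preserved above; a minimal cProfile harness in the same spirit might look like this (the project ID is a placeholder):

```python
import cProfile
import pstats

import pandas_gbq

def bench():
    pandas_gbq.read_gbq(
        "SELECT year, value, date "
        "FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 1000000",
        project_id="my-project",   # placeholder
        dialect="standard",
    )

cProfile.run("bench()", "pandas_gbq_bench.prof")
pstats.Stats("pandas_gbq_bench.prof").sort_stats("cumulative").print_stats(20)
```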
That's great @tswast, thanks for running that. As a comparison, running the 'bulk query' path:

...total of 55 seconds. Running the same test you did took 419 seconds (and the 17-second query was cached, so we should add that back). My conclusion is that it's a combination of a) parsing the JSON and b) downloading inefficiently.

One note:
When you say "waiting", do you mean downloading? Or waiting for the query to finish?
|
I mean downloading. The BigQuery tabledata.list API for downloading JSON rows is clearly not as optimized as reading bytes from GCS. |
Released today (https://cloud.google.com/bigquery/docs/release-notes#february_22_2019): the BigQuery Storage API. It should make getting data into pandas a lot faster: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas

I'm thinking of adding a parameter: |
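From the linked guide, the pattern looks roughly like this (module paths follow the beta client that shipped with the announcement, so they may since have changed):

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

bq = bigquery.Client()
bqstorage = bigquery_storage_v1beta1.BigQueryStorageClient()

df = (
    bq.query(
        "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000"
    )
    .result()
    .to_dataframe(bqstorage_client=bqstorage)   # streams over the Storage API instead of paged JSON
)
```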
Amazing!! This is great news. I'll check it out. |
This is exciting. It even adds the ability to add simple row filters without creating a copy of the table. The only odd thing about the API is that selected fields must be specified - there is no "select *" equivalent when reading directly from the Storage API. |
I started working on adding the BigQuery Storage API to pandas-gbq (though I haven't gotten too far, yet). master...tswast:issue133-performance-bqstorage How about we add a
Should |
I got pulled into some other client work, but am slowly getting back to this. I'm intending to add a |
And released in pandas-gbq version 0.10.0! It's definitely not the last word in "performance", as I know the Avro blocks from BQ Storage -> DataFrame logic has a lot of room for optimization, but that's probably more for the |
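For reference, usage with 0.10.0 looks roughly like this (parameter name per that release; the project ID is a placeholder):

```python
import pandas_gbq

df = pandas_gbq.read_gbq(
    "SELECT year, value FROM `bigquery-public-data.bls.unemployment_cps` LIMIT 100000",
    project_id="my-project",       # placeholder
    use_bqstorage_api=True,        # needs the google-cloud-bigquery-storage client installed
    dialect="standard",
)
```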
Thanks for everyone's contribution to the issue! Most of all @tswast for getting us most of the way there! |
We're starting to use BigQuery heavily but becoming increasingly 'bottlenecked' with the performance of moving moderate amounts of data from BigQuery to Python.

Here are a few stats (the two code paths are sketched below):
- google-cloud-bigquery - i.e. client.query(query).to_dataframe()
- a traditional database (pd.read_sql). That's on standard drivers, nothing like turbodbc involved

...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
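Roughly, the two paths being compared (table and connection details are placeholders):

```python
import pandas as pd
from google.cloud import bigquery
from sqlalchemy import create_engine

query = "SELECT * FROM my_table LIMIT 1000000"                  # placeholder

# BigQuery path
df_bq = bigquery.Client().query(query).to_dataframe()

# traditional-database path
engine = create_engine("postgresql://user:pass@localhost/db")   # placeholder connection string
df_sql = pd.read_sql(query, engine)
```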
We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.
A few questions - feel free to jump in with partial replies:
- Should pandas-gbq invest resources into getting a format that's query-able in exploratory workflows that can deal with more reasonable datasets? (or at least encourage Google to)