refactor(bigquery): to_dataframe uses faster to_arrow + to_pandas when pyarrow is available #10027
Conversation
Force-pushed from 12654c9 to 374608b.
Looks fine code-wise IMO.
Two remarks:
- Do we have a common representative table at hand to verify the stated performance gains and compare the results? If not, I can still manually create a dummy table with 10M floats, just like in the bug description (one way to do that is sketched after this list).
- The coverage check failure is legitimate and should be fixed.
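For reference, a minimal sketch of creating such a dummy table; the destination table path is a placeholder, not anything from this thread:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table path; the 10,000 x 1,000 cross join yields 10M float rows.
sql = """
CREATE OR REPLACE TABLE `my-project.benchmark.dummy_floats` AS
SELECT RAND() AS x
FROM UNNEST(GENERATE_ARRAY(1, 10000)) AS a,
     UNNEST(GENERATE_ARRAY(1, 1000)) AS b
"""
client.query(sql).result()
```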
In https://friendliness.dev/2019/07/29/bigquery-arrow/, we sampled the tlc_green_trips public data by running a SQL query (sketched below) and writing the results to a destination table, so that we could read directly from the table and keep query time out of the benchmark.
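The exact query isn't preserved in this thread; a sampling query of roughly that shape, written to a destination table, might look like the following. The public table path and the 1% fraction are assumptions, chosen to match the `tlc_green_{}pct` naming used in the benchmark script below:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Write the sample to a destination table so benchmarks read it directly.
job_config = bigquery.QueryJobConfig(
    destination="swast-scratch.to_dataframe_benchmark.tlc_green_1pct"
)
sql = """
SELECT *
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2015`
WHERE RAND() < 0.01
"""
client.query(sql, job_config=job_config).result()
```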
Force-pushed (…pyarrow is available) from 45ea992 to de7af59.
Using the same n1-standard-4 instance from #9997, I tested the speedup in the same way.
benchmark_bq.py:

```python
import sys

from google.cloud import bigquery

client = bigquery.Client()
table_id = "swast-scratch.to_dataframe_benchmark.tlc_green_{}pct".format(sys.argv[1])
dataframe = client.list_rows(table_id).to_dataframe(create_bqstorage_client=True)
print("Got {} rows.".format(len(dataframe.index)))
```

Before:

After:
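As an aside, the script's positional argument selects the sampled table: for example, `python benchmark_bq.py 1` reads `swast-scratch.to_dataframe_benchmark.tlc_green_1pct`.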
The speedup here, at 1.542x, is a bit better than what we saw in #9997, though still not quite the 2x I was seeing in the summer. I think we're seeing a larger difference here because we're reading from multiple streams in parallel: the download finishes faster, and there may be more dataframes to concatenate into the final result.
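To illustrate why that favors the new path, here is a minimal sketch (not the library's actual code; `record_batches` stands for a hypothetical list of per-stream `pyarrow.RecordBatch` objects) of concatenating in Arrow and converting once versus concatenating many pandas DataFrames:

```python
import pandas
import pyarrow

def concat_in_pandas(record_batches):
    # Old path: convert every downloaded batch to pandas, then pay for
    # a many-way pandas.concat at the end.
    frames = [batch.to_pandas() for batch in record_batches]
    return pandas.concat(frames, ignore_index=True)

def concat_in_arrow(record_batches):
    # New path: stitch the batches together in Arrow first (cheap, no
    # copy of the column buffers), then convert to pandas exactly once.
    table = pyarrow.Table.from_batches(record_batches)
    return table.to_pandas()
```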
As with #9997, the observed performance gains are not huge on my 50 Mbps internet connection (network I/O consumes most of the time), but they are more or less consistently reproducible.
Related to similar PR #9997, but for the google-cloud-bigquery library.

Fixes https://issuetracker.google.com/140579733
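As a minimal sketch of the user-facing behavior (the table id is a placeholder): with pyarrow installed, `to_dataframe()` takes the same route as calling `to_arrow()` and converting the result once:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

# Explicit version of the fast path: build a pyarrow.Table, convert once.
df = client.list_rows(table_id).to_arrow().to_pandas()

# After this change, to_dataframe() picks that path on its own
# whenever pyarrow is importable.
df = client.list_rows(table_id).to_dataframe()
```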