ENH: Add use_bqstorage_api option to read_gbq #270
Conversation
The BigQuery Storage API provides a way to read query results quickly (and using multiple threads). It only works with large query results (roughly 125 MB or larger), but as of 1.11.1, the google-cloud-bigquery library can fall back to the BigQuery API to download results when a request to the BigQuery Storage API fails. As this API can increase costs (and may not be enabled on the user's project), this option is disabled by default.
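For illustration, a minimal sketch of the new option in use; the query and project ID below are placeholders, and only `use_bqstorage_api` comes from this PR:

```python
import pandas_gbq

# Hypothetical query and project ID; use_bqstorage_api=True opts in to
# downloading results via the BigQuery Storage API (off by default).
df = pandas_gbq.read_gbq(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_current` "
    "GROUP BY name",
    project_id="my-project",
    use_bqstorage_api=True,
)
```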
Force-pushed from ed4aaf2 to 5b34b28
FYI: I ran the test I added with the BQ Storage API on a 4-CPU node instance.
🚀
`<https://github.com/googleapis/google-cloud-python/pull/7633>`__
(fixed in version 1.11.0), you must write your query results to a
destination table. To do this with ``read_gbq``, supply a
``configuration`` dictionary.
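To illustrate the workaround the docs describe, a hedged sketch; the project, dataset, and table IDs are placeholders:

```python
import pandas_gbq

# Sketch of the pre-1.11.0 workaround: route query results through a
# destination table by supplying a ``configuration`` dictionary.
# All resource names here are hypothetical.
config = {
    "query": {
        "destinationTable": {
            "projectId": "my-project",
            "datasetId": "my_dataset",
            "tableId": "bqstorage_results",
        },
        # Overwrite the destination table if it already exists.
        "writeDisposition": "WRITE_TRUNCATE",
    }
}
df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.some_table`",
    project_id="my-project",
    configuration=config,
    use_bqstorage_api=True,
)
```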
We could allow importing this only for versions >1.11.0.
A few thoughts on this:

- I'd be more comfortable bumping the required version once there have been a few more releases of google-cloud-bigquery past 1.11. I like people to have a few options in case there's a problematic regression in 1.11.1.
- The failure on small results will be quite obvious (an InternalError stacktrace) if they do try to use it with an older version. If someone really wants to try it with an older version, they'll find out pretty quickly why they should upgrade, or they really know what they're doing and that their results will always be large enough for the BigQuery Storage API to read from.
- Someday we might want to make use_bqstorage_api=True the default (maybe after the BQ Storage API goes GA?). I think it'd definitely be worth bumping the minimum version at that time.
Yes, that sounds very reasonable. Agreed re: making the default True when it's GA (or at least soliciting opinions first).

To be clear, I was suggesting the 1.11 minimum only for using bqstorage (probably implemented above, when attempting the import). Other versions would still run everything else.
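A minimal sketch of what that per-feature version gate could look like; the constant name, message wording, and use of pkg_resources are assumptions, not the PR's actual code:

```python
import pkg_resources

# Assumed minimum google-cloud-bigquery version for the bqstorage code
# path (1.11.1 includes the fallback behavior described above).
BQSTORAGE_MINIMUM_BIGQUERY_VERSION = "1.11.1"


def _check_bigquery_version_for_bqstorage():
    """Raise if google-cloud-bigquery is too old for use_bqstorage_api.

    Only called on the bqstorage code path, so everything else keeps
    working with older google-cloud-bigquery releases.
    """
    installed = pkg_resources.get_distribution(
        "google-cloud-bigquery"
    ).parsed_version
    minimum = pkg_resources.parse_version(BQSTORAGE_MINIMUM_BIGQUERY_VERSION)
    if installed < minimum:
        raise ImportError(
            "use_bqstorage_api requires google-cloud-bigquery >= {}; "
            "version {} is installed.".format(
                BQSTORAGE_MINIMUM_BIGQUERY_VERSION, installed
            )
        )
```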
@@ -727,6 +740,21 @@ def _localize_df(schema_fields, df):
     return df


 def _make_bqstorage_client(use_bqstorage_api, credentials):
Should this take a project too?
The BigQuery Storage API client doesn't actually take a project. The project is set when creating a BQ Storage API read session by to_dataframe, based on the BQ client's project.

Aside: the BQ Storage API client is partially auto-generated, and the client generator doesn't support a default project ID. Since I consider the BQ Storage API a lower-level client that's used by the normal BQ client from to_dataframe, I felt it wasn't worth adding that kind of handwritten helper to the auto-generated layer.
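Given that, a hedged sketch of what _make_bqstorage_client might look like; the beta import path and error message are assumptions based on the client libraries of the time, not necessarily the PR's exact code:

```python
def _make_bqstorage_client(use_bqstorage_api, credentials):
    """Sketch: build an optional BigQuery Storage API client.

    Note the absence of a ``project`` argument: to_dataframe() sets the
    project on the read session from the main BQ client's project.
    """
    if not use_bqstorage_api:
        return None

    try:
        # Imported lazily so the dependency stays optional.
        from google.cloud import bigquery_storage_v1beta1
    except ImportError:
        raise ImportError(
            "Install the google-cloud-bigquery-storage and fastavro "
            "packages to use the BigQuery Storage API."
        )

    return bigquery_storage_v1beta1.BigQueryStorageClient(
        credentials=credentials
    )
```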
docs/source/changelog.rst
@@ -39,8 +39,16 @@ Enhancements
   (contributed by @johnpaton)
 - Read ``project_id`` in :func:`to_gbq` from provided ``credentials`` if
   available (contributed by @daureg)
<<<<<<< Updated upstream
FYI merge conflict
Good catch, thanks!
Thank you @tswast! 🚀