
BigQuery: to_dataframe respects progress_bar_type when used with BQ Storage API #7697

Merged
merged 3 commits into googleapis:master from issue7654-bqstorage-progress-bar on Apr 23, 2019

Conversation


@tswast tswast commented Apr 12, 2019

When the BigQuery Storage API was used in to_dataframe, the progress_bar_type argument was ignored. This PR fixes that by creating a concurrent queue that worker threads can send progress updates to.

A fake queue is used when no progress bar is requested, so that worker threads don't keep filling a queue that is never read.
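
As an illustration of the pattern described above (a minimal sketch, not the PR's actual code; the function names, the ThreadPoolExecutor usage, and the interval value below are assumptions):

```python
# Minimal sketch of the queue-based progress pattern described above.
# Function names, executor usage, and the interval are illustrative
# assumptions; only the general shape matches the PR's description.
import concurrent.futures
import queue
import time

import tqdm

_PROGRESS_INTERVAL = 0.2  # seconds between progress-bar refreshes (assumed value)


def _download_stream(stream_pages, progress_queue):
    """Worker: convert one stream's pages, reporting row counts as it goes."""
    for page_rows in stream_pages:  # stand-in for real result pages
        # ... convert the page to a DataFrame here ...
        progress_queue.put(page_rows)


def _drain(progress_queue, progress_bar):
    """Main thread: apply all pending updates to the progress bar."""
    while not progress_queue.empty():
        progress_bar.update(progress_queue.get_nowait())


def download_with_progress(streams, total_rows):
    progress_queue = queue.Queue()
    progress_bar = tqdm.tqdm(total=total_rows, unit="rows")
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(_download_stream, pages, progress_queue) for pages in streams
        ]
        while not all(f.done() for f in futures):
            _drain(progress_queue, progress_bar)
            time.sleep(_PROGRESS_INTERVAL)
        _drain(progress_queue, progress_bar)  # pick up any last-minute updates
    progress_bar.close()
```

For example, `download_with_progress([(100, 250), (175,)], total_rows=525)` would drive the bar to 525 rows as the two simulated streams finish.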

This fix depends on:

Closes #7654

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label Apr 12, 2019
@googleapis googleapis deleted a comment from googlebot Apr 12, 2019
@tswast tswast changed the title fix: KeyboardInterrupt during to_dataframe (with BQ Storage API) no longer hangs fix: to_dataframe respects progress_bar_type with BQ Storage API Apr 12, 2019
@sduskis sduskis added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Apr 15, 2019
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from 0d96dbd to 9812563 Compare April 17, 2019 17:39

tswast commented Apr 17, 2019

Screenshots of this progress bar in action:

Terminal (IPython): [screenshot: bqstorage_tqdm]

Notebook: [screenshot: bqstorage_tqdm_notebook]

@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch 4 times, most recently from ce0a302 to 6ed43f0 Compare April 17, 2019 23:01
@tswast tswast changed the title fix: to_dataframe respects progress_bar_type with BQ Storage API BigQuery: to_dataframe respects progress_bar_type when used with BQ Storage API Apr 18, 2019
@tswast tswast added api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. labels Apr 18, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 19, 2019
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from 6ed43f0 to b82eb6d Compare April 22, 2019 18:27
@tswast tswast requested a review from shollyman April 22, 2019 18:30
@tswast tswast marked this pull request as ready for review April 22, 2019 18:30
@tswast tswast requested a review from crwilcox as a code owner April 22, 2019 18:30
@@ -1274,6 +1275,16 @@ def __repr__(self):
return "Row({}, {})".format(self._xxx_values, f2i)


class _FakeQueue(object):
Contributor
Should this be more closely named according to its use?
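
For readers following along, a no-op stand-in of this kind might look roughly like the sketch below; only the class name `_FakeQueue` comes from the diff above, and the method set shown is an assumption:

```python
class _FakeQueue(object):
    """Stand-in for queue.Queue that silently discards progress updates.

    Used when no progress bar was requested, so worker threads can call
    put() unconditionally without an unread queue growing without bound.
    (Only the class name appears in the diff above; these methods are
    assumed for illustration.)
    """

    def put(self, item, block=True, timeout=None):
        """Accept and drop the update."""

    def put_nowait(self, item):
        """Accept and drop the update."""
```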

@@ -1408,7 +1426,23 @@ def _to_dataframe_bqstorage_stream(
# the end using manually-parsed schema.
return pandas.concat(frames)[columns]

def _to_dataframe_bqstorage(self, bqstorage_client, dtypes):
def _process_progress_updates(self, progress_queue, progress_bar):
Contributor
Does this work well for large tables as the number of updates grows, since you're potentially walking it every _PROGRESS_INTERVAL seconds?

@tswast tswast (Contributor Author) Apr 22, 2019

I decreased _PROGRESS_INTERVAL because we get far too many updates per second, but you're right that with large tables (many streams) this gets worse.

Originally I tried a continuous update loop for tqdm, but there were locking issues when writing to stderr/stdout outside the main thread.

Right now I handle this by dropping updates when the queue fills up, which means the progress bar can be grossly inaccurate for large tables. Maybe I should use two queues instead. 🤔 One queue filled by the workers, and a "sum" thread that adds up the counts and forwards a total to the main thread every _PROGRESS_INTERVAL via a second queue.
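
A sketch of the drop-on-full behavior described in this comment (the maxsize value and helper name are assumptions, not the PR's code):

```python
import queue

# Bounded so memory stays flat even when many streams report at once;
# the size here is an assumption for illustration.
worker_queue = queue.Queue(maxsize=100)


def _report_progress(rows_downloaded):
    """Worker-side helper (hypothetical name): never block on a full queue."""
    try:
        worker_queue.put_nowait(rows_downloaded)
    except queue.Full:
        # The main thread isn't draining fast enough; drop this update.
        # This is why the bar can undercount badly on tables with many
        # streams.
        pass
```

The two-queue idea above would instead let workers always succeed at put(), with a separate "sum" thread aggregating counts and forwarding a single total to the main thread each _PROGRESS_INTERVAL.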

Contributor
I suppose the problem is also bounded by the largest DataFrame users can fit in RAM. Our truly large tables aren't really going to be pulled into DataFrames without heavy projection or filtering, so perhaps this is a non-issue for now.

* Add unit test for progress bar.

* Add test for full queue.
  The worker queue runs in a background thread, so it's more likely to be able to keep up with the other workers that are adding to the worker queue.
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from b82eb6d to 71112b0 Compare April 23, 2019 00:30
@tswast tswast merged commit 4dc8c36 into googleapis:master Apr 23, 2019
@tswast tswast deleted the issue7654-bqstorage-progress-bar branch April 23, 2019 21:32
Successfully merging this pull request may close these issues.

BigQuery: Implement progress_bar_type support when bqstorage_client is used
5 participants