Core: Mitigate busy reopen loop in ResumableBidiRpc consuming 100% CPU #8193
Conversation
Force-pushed from 2839d0b to d201ab1.
@crwilcox Could you comment on the impact to Firestore?
FWIW, I modified the
@crwilcox, what's the good word on this PR?
@sduskis I had left comments in the python chat for Peter last week. Let me copy them here.
```python
        self._entry_lock = threading.Lock()

    def __enter__(self):
        with self._entry_lock:
```
There is a lock, but deque should be thread-safe. Was this done to keep things more straightforward? There is a reference to the first element and then removal of it in a check, which I think would have to be changed if we moved to a non-locking implementation?
The initial idea was that the lock would serve as a gatekeeper: if any thread needs to sleep while holding it, other threads trying to enter would also be forced to wait. With a lock, achieving that correctly is straightforward. The lock was placed there before implementing the rest of the logic - one could say to avoid accidentally overlooking an edge case within the logic, yes.
(deque operations themselves are thread-safe, but the rest of the logic surrounding them might not be, thus I went with a conservative approach)
Here's one risky scenario; assume `_past_entries == [1, 2, 3, 4]` and `access_limit == 3`:

```
T   1    2    3    4    5
----|----|----|----|----|--
     |______________|
                    ↑A
      |______________|
                     ↑B
```

- Thread A enters the manager at T=4+ε, computes its `cutoff_time`, and determines that entry `1` is now irrelevant.
- Just before left-popping an item, thread B enters at T=4+2ε, determines the same, and removes `1` from the queue. It then figures out that there are already three entries in the window, and goes to sleep.
- Thread A is resumed and pops the leftmost item, but that's actually entry `2` now! With only `3` and `4` left in the queue, the thread proceeds without waiting, because it only saw two items in its window.
Since the queue length and the entries in it are essentially shared state, having a lock around them avoids tricky scenarios. Besides, the lock is released almost immediately once new entries are again allowed, as the lock-holding thread is put to sleep for just the right amount of time.
Do you concur, or have I overlooked something?
(BTW, double checking concurrency logic is always appreciated, so thanks!)
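For reference, here is a minimal sketch of the kind of lock-guarded sliding-window throttle being discussed. The names `_past_entries`, `_entry_lock`, `access_limit`, `time_window`, and `cutoff_time` come from the diff and comments above; the method bodies are a reconstruction under those assumptions, not the PR's verbatim code:

```python
import datetime
import threading
import time
from collections import deque


class _Throttle:
    """Limit entries into the context to ``access_limit`` per ``time_window``."""

    def __init__(self, access_limit, time_window):
        self._access_limit = access_limit    # max entries per sliding window
        self._time_window = time_window      # a datetime.timedelta
        self._past_entries = deque()         # timestamps of recent entries
        self._entry_lock = threading.Lock()  # guards all the deque logic

    def __enter__(self):
        with self._entry_lock:
            cutoff_time = datetime.datetime.now() - self._time_window

            # Drop the entries that have fallen out of the sliding window.
            while self._past_entries and self._past_entries[0] < cutoff_time:
                self._past_entries.popleft()

            if len(self._past_entries) >= self._access_limit:
                # Window is full - sleep (while still holding the lock, so
                # other entrants queue up behind us) until the oldest entry
                # expires from the window.
                to_wait = (self._past_entries[0] - cutoff_time).total_seconds()
                time.sleep(to_wait)

            self._past_entries.append(datetime.datetime.now())
            return self

    def __exit__(self, *exc_info):
        pass
```

Because the waiting thread sleeps while still holding `_entry_lock`, the prune/check/pop sequence is atomic with respect to other entrants, which rules out the interleaving shown in the timeline above.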
The lock is by far the easier approach here. There would be some complicated bits to make this lock-free. I was just curious whether this was considered :)
The commit renames the `entry_cap` parameter to `access_limit`, and changes the type of the `time_window` argument from `float` to `timedelta`.
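A hypothetical usage after that rename, building on the `_Throttle` sketch above (the `reopen()` call guarded here is illustrative, not this repo's exact API):

```python
import datetime

# At most 5 entries within any 10-second sliding window.
reopen_throttle = _Throttle(
    access_limit=5,
    time_window=datetime.timedelta(seconds=10),
)


def reopen_with_throttle(rpc):
    # Entering the context may sleep until a slot in the window frees up.
    with reopen_throttle:
        rpc.reopen()  # hypothetical re-open call guarded by the throttle
```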
I might be wrong about the usage of this feature, but I want to suggest an alternative algorithm. Since
I found a Python implementation at rate_limit.py.
Updating to
Not overly familiar with this code base, but it looks like there might need to be a version bump to
@TheKevJames We've made that bump and cut another release; please use 0.42.1. Thanks!
Closes #7910.
This PR fixes the issue with `ResumableBidiRpc` that can enter a busy re-open loop, consuming lots of CPU in the process. The comment on the issue explains this in more detail.

How to test
Steps to reproduce:
Actual result (before the fix):
The `ResumableBidiRpc` class tries to re-establish the stream many times in rapid succession, resulting in a 100% CPU spike and a ton of log output.

This also happens if the streaming pull is running normally and the internet connection is shut down in the middle of it - eventually the busy re-open loop starts. What's worse, this happens in more than a single thread - both the gRPC channel thread (?) and the consumer helper thread try to re-establish the stream.
Expected result (after the fix):
The re-open calls are throttled, and the CPU consumption is "normal".
Things to do/discuss:
`ResumableBidiRpc` is also used in the Firestore client - does this PR (negatively) affect it in any way?

ANSWER: No, because this PR does not change the former's default behavior.
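To illustrate that answer, a hedged sketch of what opt-in wiring could look like - the `throttle_reopen` flag name and its default are assumptions inferred from "does not change the default behavior", not API confirmed by this excerpt:

```python
from google.api_core import bidi, exceptions


def should_recover(exc):
    # Only transparently resume the stream on transient service errors.
    return isinstance(exc, exceptions.ServiceUnavailable)


def make_throttled_rpc(stub):
    """Build a ResumableBidiRpc whose re-opens are rate limited."""
    return bidi.ResumableBidiRpc(
        stub.StreamingPull,  # stands in for any bidirectional streaming method
        should_recover=should_recover,
        throttle_reopen=True,  # assumed opt-in flag; off by default, so
                               # existing callers such as Firestore keep
                               # the old behavior
    )
```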