Core: Mitigate busy reopen loop in ResumableBidiRpc consuming 100% CPU #8193
Conversation
Force-pushed from 2839d0b to d201ab1.
@crwilcox Could you comment on the impact to Firestore?
FWIW, I modified the
@crwilcox, what's the good word on this PR?
@sduskis I had left comments in the python chat for Peter last week. Let me copy them here.
```python
        self._entry_lock = threading.Lock()

    def __enter__(self):
        with self._entry_lock:
```
There is a lock, but deque should be thread-safe. Was this done to keep things more straightforward? There is a reference to the first element and then removal of it in a check, which I think would have to be changed if we moved to a non-locking implementation?
The initial idea was that the lock would serve as a gatekeeper: if any thread needs to sleep while holding it, other threads trying to enter would also be forced to wait. With a lock, achieving that correctly is straightforward. The lock was placed there before implementing the rest of the logic - one could say to avoid accidentally overlooking an edge case within the logic, yes.
(deque operations themselves are thread-safe, but the rest of the logic surrounding them might not be, thus I went with a conservative approach)
Here's one risky scenario; assume `_past_entries == [1, 2, 3, 4]` and `access_limit == 3`:

```
T   1    2    3    4    5
----|----|----|----|----|--
     |______________|
                    ↑A
      |______________|
                     ↑B
```

- Thread A enters the manager at T=4+ε, computes its `cutoff_time`, and determines that entry `1` is now irrelevant.
- Just before left-popping an item, thread B enters at T=4+2ε, determines the same, and removes `1` from the queue. It then figures out that there are already three entries in the window, and goes to sleep.
- Thread A is resumed and pops the leftmost item, but that's actually entry `2` now! With only `3` and `4` left in the queue, the thread proceeds without waiting, because it only saw two items in its window.
Since the queue length and the entries in it are essentially shared state, having a lock around them avoids tricky scenarios. Besides, the lock is released almost immediately once new entries are again allowed, as the lock-holding thread is put to sleep for just the right amount of time.
Do you concur, or have I overlooked something?
(BTW, double checking concurrency logic is always appreciated, so thanks!)
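For reference, here is a minimal sketch of the kind of lock-guarded sliding-window throttle being discussed. The names `_past_entries`, `_entry_lock`, `access_limit`, `time_window`, and `cutoff_time` come from the diff and comments above; the method bodies are a reconstruction under those assumptions, not the PR's verbatim code:

```python
import datetime
import threading
import time
from collections import deque


class _Throttle:
    """Limit entries into the context to ``access_limit`` per ``time_window``."""

    def __init__(self, access_limit, time_window):
        self._access_limit = access_limit    # max entries per sliding window
        self._time_window = time_window      # a datetime.timedelta
        self._past_entries = deque()         # timestamps of recent entries
        self._entry_lock = threading.Lock()  # guards all the deque logic

    def __enter__(self):
        with self._entry_lock:
            cutoff_time = datetime.datetime.now() - self._time_window

            # Drop the entries that have fallen out of the sliding window.
            while self._past_entries and self._past_entries[0] < cutoff_time:
                self._past_entries.popleft()

            if len(self._past_entries) >= self._access_limit:
                # Window is full - sleep (while still holding the lock, so
                # other entrants queue up behind us) until the oldest entry
                # expires from the window.
                to_wait = (self._past_entries[0] - cutoff_time).total_seconds()
                time.sleep(to_wait)

            self._past_entries.append(datetime.datetime.now())
            return self

    def __exit__(self, *exc_info):
        pass
```

Because the waiting thread sleeps while still holding `_entry_lock`, the prune/check/pop sequence is atomic with respect to other entrants, which rules out the interleaving shown in the timeline above.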
The lock is by far the easier approach here. There would be some complicated bits to make this lock-free. I was just curious whether this was considered :)
The commit renames the `entry_cap` parameter to `access_limit`, and changes the type of the `time_window` argument from `float` to `timedelta`.
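A hypothetical usage after that rename, building on the `_Throttle` sketch above (the `reopen()` call guarded here is illustrative, not this repo's exact API):

```python
import datetime

# At most 5 entries within any 10-second sliding window.
reopen_throttle = _Throttle(
    access_limit=5,
    time_window=datetime.timedelta(seconds=10),
)


def reopen_with_throttle(rpc):
    # Entering the context may sleep until a slot in the window frees up.
    with reopen_throttle:
        rpc.reopen()  # hypothetical re-open call guarded by the throttle
```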
I might be wrong about the usage of this feature, but I want to suggest an alternative algorithm. Since
I found a Python implementation at rate_limit.py.
Updating to
Not overly familiar with this code base, but it looks like there might need to be a version bump to
@TheKevJames We've made that bump and cut another release; please use 0.42.1. Thanks!
Closes #7910.
This PR fixes the issue with `ResumableBidiRpc` that can enter a busy re-open loop, consuming lots of CPU in the process. The comment on the issue explains this in more detail.

How to test
Steps to reproduce:
Actual result (before the fix):
The `ResumableBidiRpc` class tries to re-establish the stream many times in rapid succession, resulting in a 100% CPU spike and a ton of log output.

This also happens if the streaming pull is running normally and the internet connection is shut down in the middle of it - eventually the busy re-open loop starts. What's worse, this happens in more than a single thread - both the gRPC channel thread (?) and the consumer helper thread try to re-establish the stream.
Expected result (after the fix):
The re-open calls are throttled, and the CPU consumption is "normal".
Things to do/discuss:
`ResumableBidiRpc` is also used in the Firestore client - does this PR (negatively) affect it in any way?

ANSWER: No, because this PR does not change the former's default behavior.
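To illustrate that answer, a hedged sketch of what opt-in wiring could look like - the `throttle_reopen` flag name and its default are assumptions inferred from "does not change the default behavior", not API confirmed by this excerpt:

```python
from google.api_core import bidi, exceptions


def should_recover(exc):
    # Only transparently resume the stream on transient service errors.
    return isinstance(exc, exceptions.ServiceUnavailable)


def make_throttled_rpc(stub):
    """Build a ResumableBidiRpc whose re-opens are rate limited."""
    return bidi.ResumableBidiRpc(
        stub.StreamingPull,  # stands in for any bidirectional streaming method
        should_recover=should_recover,
        throttle_reopen=True,  # assumed opt-in flag; off by default, so
                               # existing callers such as Firestore keep
                               # the old behavior
    )
```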