
kvevent: Ensure out of quota events correctly handled #87464

Merged
1 commit merged into cockroachdb:master on Sep 8, 2022

Conversation

miretskiy
Contributor

Ensure that out-of-quota events are not lost and are propagated to the consumer when necessary.

Prior to this change, it was possible for an out-of-quota notification to be "lost" because the "blocked" bit would be cleared when an event was enqueued.
Instead of relying on a boolean bit, we now keep track of the number of producers currently blocked, and issue a flush request if there are blocked producers and zero events currently queued.

Fixes #86828

Release justification: bug fix
Release note: None
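To make this concrete, here is a minimal self-contained sketch of the counter-based bookkeeping described above (hypothetical type and method names; this is not the actual blocking_buffer.go code):

package kvevent

import "sync"

type event struct{}

// blockingBuffer is a simplified stand-in for the changefeed event buffer.
type blockingBuffer struct {
	signalCh chan struct{} // buffered with capacity 1
	mu       struct {
		sync.Mutex
		numBlocked int     // producers currently blocked waiting for quota
		queue      []event // events waiting to be consumed
	}
}

// producerBlocked records that a producer ran out of quota and signals the
// consumer. A counter replaces the old single "blocked" boolean, so
// concurrently blocked producers cannot clear each other's state.
func (b *blockingBuffer) producerBlocked() {
	b.mu.Lock()
	b.mu.numBlocked++
	b.mu.Unlock()
	b.notifyOutOfQuota()
}

// notifyOutOfQuota performs a non-blocking send on the 1-slot channel: if a
// signal is already pending, the consumer will observe it, so nothing is lost.
func (b *blockingBuffer) notifyOutOfQuota() {
	select {
	case b.signalCh <- struct{}{}:
	default:
	}
}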

@cockroach-teamcity
Member

This change is Reviewable

@miretskiy marked this pull request as ready for review September 6, 2022 22:31
@miretskiy requested a review from a team as a code owner September 6, 2022 22:31
@miretskiy requested review from HonoreDB and ajwerner and removed request for a team September 6, 2022 22:31
Contributor

@HonoreDB left a comment


Nice!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

b.mu.Lock()
b.mu.numBlocked++
b.mu.Unlock()
b.notifyOutOfQuota()
Contributor


The question I have is whether this can lead to back-to-back emission of flush events. I suppose that's fine, especially now, but in a world with some processing parallelism it might get weird.

Can we come up with something simple to prevent that? Either bookkeeping or more careful synchronization?

Contributor Author


The question I have is whether this can lead to back-to-back emission of flush events. I suppose that's fine, especially now, but in a world with some processing parallelism it might get weird.

Can we come up with something simple to prevent that? Either bookkeeping or more careful synchronization?

I don't think this can lead to back-to-back emission of flush events, with the exception of one corner case.
I think there are a few important observations:

  1. signalCh has a buffer of 1; this buffer ensures that the single-threaded consumer (Get) never misses a notification.
  2. Producers (there can be multiple, e.g. during backfill) may concurrently run out of quota; however, only 1 of those producers will be at the head of the (quota pool) queue, and thus only 1 producer will invoke notify out of quota.
  3. (This really flows from 1 & 2) -- a producer either puts an event into the blocking buffer queue, OR it puts a notification onto the channel that it is out of quota.

With this in mind, the condition to trigger a flush is: the queue is empty and we have some producers blocked
(if !ok && b.mu.numBlocked > 0 ...). That means that a producer is blocked and there is nothing else in the blocking buffer queue to cause the consumer to try to release any resources -- i.e. every allocated resource is buffered.
So, we trigger a flush. Because this flush was triggered when there were no outstanding events in the blocking buffer
queue, every outstanding event/allocation must be released once the flush completes.
The original producer that was blocked will then attempt to acquire resources, now that everything has been released.
If the original producer fails to acquire a resource (which means our parent memory monitor is very tight), then that's fine -- we will trigger a no-op flush -- and we'll do that once a second (that's the corner case).

If it acquired a resource -- great, it enqueues the event. We have made forward progress; if the next producer also blocks -- well, that means that perhaps our memory buffer is tiny, or events are huge -- either way, a flush will be triggered correctly -- even if that means we flush 1 event.

Is my analysis sound? Do you still think more book-keeping is needed? Or should we perhaps just copy the above as a comment, since it's not obvious at all?
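For concreteness, roughly what the consumer-side check looks like, continuing the simplified sketch from the PR description (again hypothetical names, not the actual blocking_buffer.go code):

// pop returns the next event, or asks for a flush when the queue is empty
// but producers are blocked on quota: at that point every allocated byte is
// already sitting downstream, so only a flush can release resources.
func (b *blockingBuffer) pop() (e event, needFlush bool, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.mu.queue) > 0 {
		e, b.mu.queue = b.mu.queue[0], b.mu.queue[1:]
		return e, false, true
	}
	// Queue is drained (!ok) and producers are blocked: request a flush.
	if b.mu.numBlocked > 0 {
		return event{}, true, false
	}
	return event{}, false, false
}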

Contributor Author

@miretskiy Sep 7, 2022


Another corner case, I suppose, is: between the original producer acquiring quota and before it has enqueued the event...
A no-op flush can happen. Is this the corner case you were worried about @ajwerner ?

Actually, that doesn't happen because we don't notify when we acquire quota -- we notify when we enqueue.
So, I don't think this can happen either.

Contributor


  1. Publisher enqueues and is blocked, so sends on the channel
  2. Consumer is signaled by the send, locks the mutex, notices that there's a blocked producer, creates a flush event
  3. Second publisher enqueues and sends on the channel (or not)
  4. Consumer now sends another flush event

Contributor Author


  • Publisher enqueues and is blocked, so sends on the channel
  • Consumer is signaled by the send, locks the mutex, notices that there's a blocked producer, creates a flush event
  • Second publisher enqueues and sends on the channel (or not)
  • Consumer now sends another flush event

I don't think 3 can happen, because all blocked producers are blocked in a queue of their own (in the quota pool); thus,
after step 2 (flush), the only producer that can wake up is the first producer -- and that producer either produces
an event or is blocked again (as described in the corner case above).

Contributor Author


I stand corrected; I guess a no-op flush is possible.

@miretskiy
Contributor Author

notifyOutOfQuota now takes in a boolean indicating whether a flush is possible (a rough sketch follows the list below).
This is to address the following scenario:

* two goroutines are blocked
* flush gets emitted
* flush occurs and memory gets freed such that many, many messages could be enqueued
* head of queue gets unblocked and enqueues 1 message
* consumer consumes the one message
* consumer asks for the next message, but the second blocked goroutine has not yet become unblocked
* consumer sees blocked goroutines and flushes again, but now with only 1 message
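A rough sketch of the new shape, replacing the zero-argument version from the earlier sketch (hypothetical; the actual call site, quoted in the review below, passes quota.allocated == 0 || quota.canAllocateBelow > 0 as the argument):

// notifyOutOfQuota now takes canFlush: only wake the consumer when a flush
// could plausibly help; otherwise the producer stays blocked quietly instead
// of triggering another near-empty flush (the scenario above).
func (b *blockingBuffer) notifyOutOfQuota(canFlush bool) {
	if !canFlush {
		return
	}
	select {
	case b.signalCh <- struct{}{}:
	default:
	}
}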

Contributor

@ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)


pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 334 at r2 (raw file):

	fulfilled, tryAgainAfter = r.acquireQuota(ctx, quota)
	if !fulfilled {
		quota.notifyOutOfQuota(quota.allocated == 0 || quota.canAllocateBelow > 0)

Can you explain this in commentary? I don't get it.

Contributor

@ajwerner left a comment


LGTM

I wish we had testing. Can we make the buffer size metamorphic?
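For what it's worth, a minimal sketch of a metamorphic buffer size, assuming the util.ConstantWithMetamorphicTestRange helper (the constant name and values here are made up):

package kvevent

import "github.com/cockroachdb/cockroach/pkg/util"

// Sketch only: in metamorphic test builds the buffer capacity is randomized
// to a small value so the out-of-quota / flush paths get exercised often;
// normal builds keep the production default.
var bufferEntriesDefault = util.ConstantWithMetamorphicTestRange(
	"kvevent-buffer-entries", // hypothetical constant name
	4096,                     // default for production builds
	1,                        // min under metamorphic testing
	16,                       // max under metamorphic testing
)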

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 8, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 8, 2022

Build succeeded:

@craig (craig bot) merged commit 05b4853 into cockroachdb:master on Sep 8, 2022
@blathers-crl

blathers-crl bot commented Sep 8, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 12a1b04 to blathers/backport-release-21.2-87464: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 9, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
craig bot pushed a commit that referenced this pull request Sep 10, 2022
87611: authors: add Ganeshprasad Rajashekhar Biradar to authors. r=biradarganesh25 a=biradarganesh25

Release note: None

Release justification: non-production code change

87737: kvevent: Avoid busy loop during out of quota r=miretskiy a=miretskiy

Previous PR #87464 erroneously removed code to ensure that consumer is notified about out of quota events once.  Rectify this issue.

Release Justification: bug fix
Release note: None

Co-authored-by: Ganeshprasad Rajashekhar Biradar <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 10, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 10, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
Successfully merging this pull request may close these issues.

CDC as Export / Job Never Completes