
kvevent: Ensure out of quota events correctly handled #87464

Merged
1 commit merged into cockroachdb:master on Sep 8, 2022

Conversation

miretskiy
Contributor

Ensure that out-of-quota events are not lost and are propagated to the consumer when necessary.

Prior to this change, it was possible for an out-of-quota notification to be "lost" because the "blocked" bit would be cleared when an event was enqueued.
Instead of relying on a boolean bit, we now keep track of the number of producers currently blocked, and issue a flush request if there are blocked producers and zero events currently queued.

Fixes #86828

Release justification: bug fix
Release note: None
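To make this concrete, here is a minimal self-contained sketch of the counter-based bookkeeping described above (hypothetical type and method names; this is not the actual blocking_buffer.go code):

package kvevent

import "sync"

type event struct{}

// blockingBuffer is a simplified stand-in for the changefeed event buffer.
type blockingBuffer struct {
	signalCh chan struct{} // buffered with capacity 1
	mu       struct {
		sync.Mutex
		numBlocked int     // producers currently blocked waiting for quota
		queue      []event // events waiting to be consumed
	}
}

// producerBlocked records that a producer ran out of quota and signals the
// consumer. A counter replaces the old single "blocked" boolean, so
// concurrently blocked producers cannot clear each other's state.
func (b *blockingBuffer) producerBlocked() {
	b.mu.Lock()
	b.mu.numBlocked++
	b.mu.Unlock()
	b.notifyOutOfQuota()
}

// notifyOutOfQuota performs a non-blocking send on the 1-slot channel: if a
// signal is already pending, the consumer will observe it, so nothing is lost.
func (b *blockingBuffer) notifyOutOfQuota() {
	select {
	case b.signalCh <- struct{}{}:
	default:
	}
}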

@cockroach-teamcity
Member

This change is Reviewable

@miretskiy marked this pull request as ready for review September 6, 2022 22:31
@miretskiy requested a review from a team as a code owner September 6, 2022 22:31
@miretskiy requested review from HonoreDB and ajwerner and removed request for a team September 6, 2022 22:31
Contributor

@HonoreDB left a comment


Nice!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

b.mu.Lock()
b.mu.numBlocked++
b.mu.Unlock()
b.notifyOutOfQuota()
Contributor


The question I have is whether this can lead to back-to-back emission of flush events. I suppose that's fine, especially now, but in a world with some processing parallelism it might get weird.

Can we come up with something simple to prevent that? Either bookkeeping or more careful synchronization?

Contributor Author


The question I have is whether this can lead to back-to-back emission of flush events. I suppose that's fine, especially now, but in a world with some processing parallelism it might get weird.

Can we come up with something simple to prevent that? Either bookkeeping or more careful synchronization?

I don't think this can lead to back-to-back emission of flush events, with the exception of one corner case.
I think there are a few important observations:

  1. signalCh has a buffer of 1; this buffer ensures that the single-threaded consumer (Get) never misses a notification.
  2. Producers (there can be multiple, e.g. during backfill) may concurrently run out of quota; however, only 1 of those producers will be at the head of the (quota pool) queue, and thus only 1 producer will invoke notify out of quota.
  3. (This really flows from 1 & 2) -- a producer either puts an event into the blocking buffer queue, OR it puts a notification onto the channel that it is out of quota.

With this in mind, the condition to trigger a flush is: the queue is empty and we have some producers blocked
(if !ok && b.mu.numBlocked > 0 ...). That means that a producer is blocked and there is nothing else in the blocking buffer queue to cause the consumer to try to release any resources -- i.e. every allocated resource is buffered.
So, we trigger a flush. Because this flush was triggered when there were no outstanding events in the blocking buffer
queue, every outstanding event/allocation must be released once the flush completes.
The original producer that was blocked will then attempt to acquire resources, now that everything has been released.
If the original producer fails to acquire a resource (which means our parent memory monitor is very tight), then that's fine -- we will trigger a no-op flush -- and we'll do that once a second (that's the corner case).

If it acquired a resource -- great, it enqueues the event. We have made forward progress; if the next producer also blocks -- well, that means that perhaps our memory buffer is tiny, or events are huge -- either way, a flush will be triggered correctly -- even if that means we flush 1 event.

Is my analysis sound? Do you still think more book-keeping is needed? Or should we perhaps just copy the above as a comment, since it's not obvious at all?
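For concreteness, roughly what the consumer-side check looks like, continuing the simplified sketch from the PR description (again hypothetical names, not the actual blocking_buffer.go code):

// pop returns the next event, or asks for a flush when the queue is empty
// but producers are blocked on quota: at that point every allocated byte is
// already sitting downstream, so only a flush can release resources.
func (b *blockingBuffer) pop() (e event, needFlush bool, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.mu.queue) > 0 {
		e, b.mu.queue = b.mu.queue[0], b.mu.queue[1:]
		return e, false, true
	}
	// Queue is drained (!ok) and producers are blocked: request a flush.
	if b.mu.numBlocked > 0 {
		return event{}, true, false
	}
	return event{}, false, false
}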

Contributor Author

@miretskiy Sep 7, 2022


Another corner case, I suppose, is: between the original producer acquiring quota and before it has enqueued the event...
A no-op flush can happen. Is this the corner case you were worried about @ajwerner ?

Actually, that doesn't happen because we don't notify when we acquire quota -- we notify when we enqueue.
So, I don't think this can happen either.

Contributor


  1. Publisher enqueues and is blocked, so sends on the channel
  2. Consumer is signaled by the send, locks the mutex, notices that there's a blocked producer, creates a flush event
  3. Second publisher enqueues and sends on the channel (or not)
  4. Consumer now sends another flush event

Contributor Author


  • Publisher enqueues and is blocked, so sends on the channel
  • Consumer is signaled by the send, locks the mutex, notices that there's a blocked producer, creates a flush event
  • Second publisher enqueues and sends on the channel (or not)
  • Consumer now sends another flush event

I don't think 3 can happen, because all blocked producers are blocked in a queue of their own (in the quota pool); thus,
after step 2 (flush), the only producer that can wake up is the first producer -- and that producer either produces
an event or is blocked again (as described in the corner case above).

Contributor Author


I stand corrected; I guess a no-op flush is possible.

@miretskiy
Contributor Author

notifyOutOfQuota now takes in a boolean indicating whether a flush is possible (a rough sketch follows the list below).
This is to address the following scenario:

* two goroutines are blocked
* flush gets emitted
* flush occurs and memory gets freed such that many, many messages could be enqueued
* head of queue gets unblocked and enqueues 1 message
* consumer consumes the one message
* consumer asks for the next message, but the second blocked goroutine has not yet become unblocked
* consumer sees blocked goroutines and flushes again, but now with only 1 message
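A rough sketch of the new shape, replacing the zero-argument version from the earlier sketch (hypothetical; the actual call site, quoted in the review below, passes quota.allocated == 0 || quota.canAllocateBelow > 0 as the argument):

// notifyOutOfQuota now takes canFlush: only wake the consumer when a flush
// could plausibly help; otherwise the producer stays blocked quietly instead
// of triggering another near-empty flush (the scenario above).
func (b *blockingBuffer) notifyOutOfQuota(canFlush bool) {
	if !canFlush {
		return
	}
	select {
	case b.signalCh <- struct{}{}:
	default:
	}
}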

Contributor

@ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)


pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 334 at r2 (raw file):

	fulfilled, tryAgainAfter = r.acquireQuota(ctx, quota)
	if !fulfilled {
		quota.notifyOutOfQuota(quota.allocated == 0 || quota.canAllocateBelow > 0)

Can you explain this in commentary? I don't get it.

Contributor

@ajwerner left a comment


LGTM

I wish we had testing. Can we make the buffer size metamorphic?
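For what it's worth, a minimal sketch of a metamorphic buffer size, assuming the util.ConstantWithMetamorphicTestRange helper (the constant name and values here are made up):

package kvevent

import "github.com/cockroachdb/cockroach/pkg/util"

// Sketch only: in metamorphic test builds the buffer capacity is randomized
// to a small value so the out-of-quota / flush paths get exercised often;
// normal builds keep the production default.
var bufferEntriesDefault = util.ConstantWithMetamorphicTestRange(
	"kvevent-buffer-entries", // hypothetical constant name
	4096,                     // default for production builds
	1,                        // min under metamorphic testing
	16,                       // max under metamorphic testing
)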

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 8, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 8, 2022

Build succeeded:

@craig (craig bot) merged commit 05b4853 into cockroachdb:master on Sep 8, 2022
@blathers-crl

blathers-crl bot commented Sep 8, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 12a1b04 to blathers/backport-release-21.2-87464: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 9, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
craig bot pushed a commit that referenced this pull request Sep 10, 2022
87611: authors: add Ganeshprasad Rajashekhar Biradar to authors. r=biradarganesh25 a=biradarganesh25

Release note: None

Release justification: non-production code change

87737: kvevent: Avoid busy loop during out of quota r=miretskiy a=miretskiy

Previous PR #87464 erroneously removed code to ensure that consumer is notified about out of quota events once.  Rectify this issue.

Release Justification: bug fix
Release note: None

Co-authored-by: Ganeshprasad Rajashekhar Biradar <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 10, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 10, 2022
Previous PR cockroachdb#87464 erroneously removed code to ensure
that consumer is notified about out of quota events
once.  Rectify this issue.

Release Justification: bug fix
Release note: None
Successfully merging this pull request may close these issues.

CDC as Export / Job Never Completes