-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keyring: fixes for keyring replication on cluster join #14987
Conversation
The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests.
Aside: there's a couple other existing cases of this in |
|
When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID. Ensure that we're setting the correct reply index in the blocking query. Note that the "not found" case does not return an error, just an empty key. So as a belt-and-suspenders, update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key.
12902d4
to
e8a0ec1
Compare
Wait until we're sure the FSM is current before we try to initialize the keyring. Also, if a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans
ed7bafb
to
45f7352
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just thinking about Limiters
// Rate limit how often we attempt replication | ||
limiter.Wait(ctx) | ||
err := limiter.Wait(ctx) | ||
if err != nil { | ||
goto ERR_WAIT // rate limit exceeded | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like if we have our own blocking mechanism, we should be using Allow
instead? Otherwise we are kind of double-waiting, which is fine but awkward given the error. (no need to change anything now)
https://pkg.go.dev/golang.org/x/time/rate#Limiter
edit: actually no, Wait
has the nice property of waiting a minimal amount of time, unblocking once a resource becomes free
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this API is weirdly hard to use correctly and the error case is only for exceeding the burst. I first thought we probably wanted to use Reserve
but that has the same double-waiting problem. Maybe we should just remove the burst (set it to Inf
) so that we guarantee we'll wait and not have to handle this error so awkwardly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread the API again 😊 . Only when the rate is set to rate.Inf
is the burst-limit ignored. So this is good as-is, awkward as it seems.
I've got a clue that the core problem here is actually related to bugs in garbage collection, but I'm going to merge this in and push up a new PR with that fix once I've got that pinned down. |
Oops, forgot the changelog entry. Will add that to #15009 |
Hello! Any idea when |
I can't commit to exact dates but I can tell you we're in the process of finalizing the release now, so it'll be "very soon." |
getActiveKeyset := func() (*keyset, error) { | ||
e.lock.RLock() | ||
defer e.lock.RUnlock() | ||
keyset, err := e.activeKeySetLocked() | ||
return keyset, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the after-the-fact drive-by review:
Looks like we could replace activeKeySetLocked
with this func as the only other place the key is read is Encrypt
and it similarly only needs the lock for the duration of fetching the key.
The fact neither reader of this field need the lock other than for this field signals to me that the lock should be encapsulated in the getter and not the caller's concern.
Not a big deal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good catch. activeKeySetLocked
ends up holding onto the lock longer than it needs to as well -- it doesn't need to hold the lock while querying the state store. I'll open another quickie PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up PR: #15026
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Fixes #14981. Note to reviewers: these are definitely bugs but I don't have 100% confidence we've solved the problem because there's a log line missing in the reports from both the issue author and our internal user that I would expect to see. So I'm still trying to repro exactly, but it's worth getting some early eyes on this PR anyways.
Don't unblock early if rate limit burst exceeded. The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests.
Set
MinQueryIndex
on stale queries. When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID.Note that the "not found" case does not return an error, just an empty key. Update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. (Peers aren't shuffled so we'd expect to hit the same server repeatedly.)
Move the keyring initialize step to wait until we're sure the FSM is current.
If a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans