-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error when etcd3 watch finds delete event with nil prevKV #76675
Error when etcd3 watch finds delete event with nil prevKV #76675
Conversation
Hi @ryanmcnamara. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@liggitt, should we do the same thing for "update" events, I believe in this part of the code that would be events that |
I'll be testing this change out on our stack that initially ran into #76624 using this branch: https://github.com/ryanmcnamara/kubernetes/tree/rm/error-on-nil-event-demo |
I believe so, yes |
@@ -53,5 +54,9 @@ func parseEvent(e *clientv3.Event) *event { | |||
if e.PrevKv != nil { | |||
ret.prevValue = e.PrevKv.Value | |||
} | |||
return ret | |||
if ret.isDeleted && ret.prevValue == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the previous value existed, but was a zero-length byte array, does this show up as nil here? do we know if e.PrevKv
as nil as well when we encountered this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the previous value existed, but was a zero-length byte array, does this show up as nil here?
I expect it would show up as the zero length byte array
do we know if e.PrevKv as nil as well when we encountered this?
All I can be certain is that when I encountered this either e.PrevKv was nil or e.PrevKv.Value was nil. I expect that e.PrevKv was nil though, given the info I had on compaction taking place and this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect it would show up as the zero length byte array
expectations are funny things... if it uses protobuf, zero-length and nil arrays can be serialized identically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was mostly trying to fix the specific problem I encountered, to error here if and only if the prev value was nil, and not erroring on the empty array as I'm not sure we have that guarantee here.
Do you think this should error for both? error if len(ret.preValue)==0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if e.PrevKv is nil, that seems like a much clearer indicator of a problem
Will deploy this change as is now, wait for the repro (which could take several days) and confirm that it handles it better |
/assign @jpbetz |
@fedebongio: GitHub didn't allow me to request PR reviews from the following users: jingyih. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
nice graph! do you have any insight as to what happened between 1:55 and 2:00 (based on apiserver or etcd logs)? do you know which apiserver/etcd member was affected? anything suspect in the logs for that member? |
The event started exactly when the kube controller manager leader switched (this happens at 1:55). The apiserver it's now hitting is the impacted one. That's how this issue has always reproduced. Looking at etcd logs and metrics I don't see anything interesting. Each time this happens I notice load going quite high on the master host, but only around I don't really think this is the issue though, because the load stays well below I can take another look at the logs and see if I find anything Do you want to continue debugging on the issue? Seems like this pr is useful regardless of what's causing the delay in events. |
func parseEvent(e *clientv3.Event) (*event, error) { | ||
if !e.IsCreate() && e.PrevKv == nil { | ||
// If the previous value is nil, error. One example of how this is possible is if the previous value has been compacted already. | ||
return nil, fmt.Errorf("etcd event was not a create and has a nil prevKV, key: %s, modRevision: %d, isDelete %v", string(e.Kv.Key), e.Kv.ModRevision, e.Type == clientv3.EventTypeDelete) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpbetz can you sanity check that all non-create events should have a PrevKv set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(we only use this method in one place, on a watch, started with the PrevKv option)
opts := []clientv3.OpOption{clientv3.WithRev(wc.initialRev + 1), clientv3.WithPrevKV()}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like etcd not providing a prevKV
if it was compacted was intentional. See my comment here: #76624 (comment)
I'm betting this really is due to compaction. It looks like we're using clientv3.WithPrevKV
correctly.
21a74cd
to
7406fcc
Compare
@liggitt done |
/priority critical-urgent |
7406fcc
to
bf65c0a
Compare
looks like |
/lgtm |
bf65c0a
to
c6eb078
Compare
c6eb078
to
5043806
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: liggitt, ryanmcnamara The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
once this merges, can you pick it to 1.12, 1.13, and 1.14 (using |
picks opened |
…5-upstream-release-1.12 Automated cherry pick of #76675: Error when etcd3 watch finds delete event with nil prevKV
…5-upstream-release-1.14 Automated cherry pick of #76675: Error when etcd3 watch finds delete event with nil prevKV
…5-upstream-release-1.13 Automated cherry pick of #76675: Error when etcd3 watch finds delete event with nil prevKV
What type of PR is this?
/kind bug
What this PR does / why we need it:
See #76624 (comment), taking half of that suggestion (for now), which is to end the watch when a delete event does not have a prevKV
Does this PR introduce a user-facing change?: