-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gossip: provide online method to clear leaked gossip infos #85505
gossip: provide online method to clear leaked gossip infos #85505
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/gossip/gossip.go
line 1097 at r1 (raw file):
if err != nil { if errors.HasType(err, KeyNotPresentError{}) { // Info object does not exist. Nothing to do.
nit: it may not be present on this node, yet but already populated on some other node.
Don't you want to issue the replacement key anyway, as a sort of "poison"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/gossip/gossip.go
line 1097 at r1 (raw file):
Previously, knz (kena) wrote…
nit: it may not be present on this node, yet but already populated on some other node.
Don't you want to issue the replacement key anyway, as a sort of "poison"?
Yeah, I think that's the more ideal thing to do. However, if the info is not present on this node yet, we don't know what value to associate with it when poisoning. We could give it a nil value, but I have concerns about taking a gossip key that code reasonably expects will always have a value if present (e.g. "store"
-> roachpb.StoreDescriptor
), and then giving it no key. If any code is making assumptions about the value, we could cause issues (e.g. crash loops) which would then be hard to work around without a full cluster power cycle.
Gossiping a poison entry with the existing key-value avoids this issue, but then has the problem you mention.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/gossip/gossip.go
line 1097 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, I think that's the more ideal thing to do. However, if the info is not present on this node yet, we don't know what value to associate with it when poisoning. We could give it a nil value, but I have concerns about taking a gossip key that code reasonably expects will always have a value if present (e.g.
"store"
->roachpb.StoreDescriptor
), and then giving it no key. If any code is making assumptions about the value, we could cause issues (e.g. crash loops) which would then be hard to work around without a full cluster power cycle.Gossiping a poison entry with the existing key-value avoids this issue, but then has the problem you mention.
What do you think?
yeah I didn't think about the fact that consumers likely make assumption about the format of the value. I guess this is fine.
But then please adjust the comment to "Info object not known on this node. We can't force a deletion preemptively, e.g. with a poison entry, because we do not have a valid value object to populate."
521f7b3
to
42f5537
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)
pkg/gossip/gossip.go
line 1097 at r1 (raw file):
Previously, knz (kena) wrote…
yeah I didn't think about the fact that consumers likely make assumption about the format of the value. I guess this is fine.
But then please adjust the comment to "Info object not known on this node. We can't force a deletion preemptively, e.g. with a poison entry, because we do not have a valid value object to populate."
Done.
3693a15
to
24fac70
Compare
Fixes cockroachdb#85013. Needed for cockroachlabs/support#1709. This commit introduces a new `crdb_internal.unsafe_clear_gossip_info` builtin function which allows admin users to manually clear info objects from the cluster's gossip network. The function does so by re-gossiping an identical value for the specified key but with a TTL that is long enough to reasonably ensure full propagation to all nodes in the cluster but short enough to expire quickly once propagated. The function is best-effort. It is possible for the info object with the low TTL to fail to reach full propagation before reaching its TTL. For instance, this is possible during a transient network partition. The effect of this is that the existing gossip info object with a higher (or no) TTL would remain in the gossip network on some nodes and may eventually propagate back out to other nodes once the partition heals. Release note: None
24fac70
to
dfcb6ef
Compare
TFTR! bors r=knz |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from dfcb6ef to blathers/backport-release-21.2-85505: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above. error creating merge commit from dfcb6ef to blathers/backport-release-22.1-85505: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Fixes #85013.
Needed (in v21.2.X) for cockroachlabs/support#1709.
This commit introduces a new
crdb_internal.unsafe_clear_gossip_info
builtinfunction which allows admin users to manually clear info objects from the
cluster's gossip network. The function does so by re-gossiping an identical
value for the specified key but with a TTL that is long enough to reasonably
ensure full propagation to all nodes in the cluster but short enough to expire
quickly once propagated.
The function is best-effort. It is possible for the info object with the low
TTL to fail to reach full propagation before reaching its TTL. For instance,
this is possible during a transient network partition. The effect of this is
that the existing gossip info object with a higher (or no) TTL would remain
in the gossip network on some nodes and may eventually propagate back out to
other nodes once the partition heals.
@knz: I'm assigning this to you for a review both because you're as good a
person as any to look at gossip-related changes, and because limited SQL
access to the cluster's gossip network is a nuanced subject in the
context of multi-tenancy.
Release note: None