-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VAULT-708] Zero out request counter on preSeal #11970
Conversation
Dropping a note, I'm not able to figure out yet the best test to add a couple checks for this, (there aren't any direct unit tests of preSeal afaict) any suggestion on that would be super helpful! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a jira issue or something that this came from? I would just like to get a bit more context about this (and maybe see how the counter is used)
@@ -2183,6 +2183,9 @@ func (c *Core) preSeal() error { | |||
if err := c.unloadMounts(context.Background()); err != nil { | |||
result = multierror.Append(result, fmt.Errorf("error unloading mounts: %w", err)) | |||
} | |||
|
|||
atomic.StoreUint64(c.counters.requests, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say, at a minimum, this should have some subset of the PR comment explaining why we're going to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roger, can do that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so has c.counters.requests
already been persisted by this point? or, how does that stuff work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, so c.counter.requests
is being persisted every syncinternval here https://github.com/hashicorp/vault/blob/main/vault/counters.go#L145 by the active node ( and synced to the active if a standby ). The default on syncInterval is every 30 seconds. This is the block where this logic happens https://github.com/hashicorp/vault/blob/main/vault/core_metrics.go#L101-L126
hey @swayne275 ! Yeah, so this PR is intended to fix this bug https://hashicorp.atlassian.net/browse/VAULT-708 |
I haven't reviewed the code itself yet, but the Jira explicitly mentions that this bug does not occur in OSS Vault, but only in ENT Vault. That makes me think the bug is in an ENT specific file (i.e. one ending in |
@raskchanky ah yeah I should've clarified this bit, my bad. So the bug only happens in ENT version, but that is due to a different reason. In OSS, the syncCounters grpc call does not happen, cause in OSS that call is a no-op https://github.com/hashicorp/vault/blob/main/vault/request_handling_util.go#L29. So when stepDown happens, the new active node only loads the request counters from memory but does not get a syncCounters RPC call from the old active node with the same value, so it does not double. This is the ENT file where this rpc call is implemented https://github.com/hashicorp/vault-enterprise/blob/main/vault/request_handling_util_ent.go#L166, but there I don't think Core knows if this is the first such call after unseal or not. |
Take a look at TestInteg_Counters_With_Sync, which should have most of the building blocks you'll need. You'll want to issue a few requests to the active node so that the active node has a nonzero counter, verify that by reading the current count, then do a step-down and read the count from the new active node. Prior to your change it should return a nonzero count, after your change, zero. |
@ncabatoff awesome, thank you! Will take a look at that test! That one seems to be in the ent repo, which makes sense, so I should probably open this PR against the ENT repo, and use this one for merging the OSS change. Also, wanted to clarify one point. When I do this bit |
Yup, I was confused. |
@ncabatoff @swayne275 @raskchanky moving this conversation to this PR https://github.com/hashicorp/vault-enterprise/pull/2020 |
The issue that is currently happening, is that when the active node steps down and becomes a standby node, the first time it syncs its requests to the new active node, it sends the total count of requests that it holds in memory to it. Since it was the active node before, that value is the total requests across all nodes in the cluster for the current month. This causes the request counter on the new active node to effectively double, as the new active node also had loaded the same value during unseal from storage.
The fix implemented here, is to zero out the internal request counters during the preSeal function. This function is invoked as part of the step down operation of the active node, hence before the node resumes emitting metrics this will have happened, preventing the request counts from doubling on the new active node.
As this is being placed in the preSeal operation, this will be invoked even when the node is not changing from active to standby mode. However, that should be a safe operation to do, and while that may result in losing a few requests that should've been emitted to the active node ( or persisted to storage by the active node ), the request counter does not guarantee exactness so this should be a tolerable loss. The duration for which data might be lost is determined by https://github.com/hashicorp/vault/blob/main/vault/counters.go#L37, and defaults to 30 seconds. Also, there does not appear to be a better alternative than
preSeal
for adding this check, in this stepdown block https://github.com/hashicorp/vault/blob/main/vault/ha.go#L592-L639, but alternate suggestions are welcome!