Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VAULT-708] Zero out request counter on preSeal #11970

Merged
merged 3 commits into from
Jul 7, 2021

Conversation

pmmukh
Copy link
Contributor

@pmmukh pmmukh commented Jun 30, 2021

The issue that is currently happening, is that when the active node steps down and becomes a standby node, the first time it syncs its requests to the new active node, it sends the total count of requests that it holds in memory to it. Since it was the active node before, that value is the total requests across all nodes in the cluster for the current month. This causes the request counter on the new active node to effectively double, as the new active node also had loaded the same value during unseal from storage.

The fix implemented here, is to zero out the internal request counters during the preSeal function. This function is invoked as part of the step down operation of the active node, hence before the node resumes emitting metrics this will have happened, preventing the request counts from doubling on the new active node.

As this is being placed in the preSeal operation, this will be invoked even when the node is not changing from active to standby mode. However, that should be a safe operation to do, and while that may result in losing a few requests that should've been emitted to the active node ( or persisted to storage by the active node ), the request counter does not guarantee exactness so this should be a tolerable loss. The duration for which data might be lost is determined by https://github.com/hashicorp/vault/blob/main/vault/counters.go#L37, and defaults to 30 seconds. Also, there does not appear to be a better alternative than preSeal for adding this check, in this stepdown block https://github.com/hashicorp/vault/blob/main/vault/ha.go#L592-L639, but alternate suggestions are welcome!

@pmmukh
Copy link
Contributor Author

pmmukh commented Jun 30, 2021

Dropping a note, I'm not able to figure out yet the best test to add a couple checks for this, (there aren't any direct unit tests of preSeal afaict) any suggestion on that would be super helpful!

@vercel vercel bot temporarily deployed to Preview – vault-storybook June 30, 2021 23:41 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 30, 2021 23:41 Inactive
Copy link
Contributor

@swayne275 swayne275 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a jira issue or something that this came from? I would just like to get a bit more context about this (and maybe see how the counter is used)

@@ -2183,6 +2183,9 @@ func (c *Core) preSeal() error {
if err := c.unloadMounts(context.Background()); err != nil {
result = multierror.Append(result, fmt.Errorf("error unloading mounts: %w", err))
}

atomic.StoreUint64(c.counters.requests, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say, at a minimum, this should have some subset of the PR comment explaining why we're going to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roger, can do that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so has c.counters.requests already been persisted by this point? or, how does that stuff work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, so c.counter.requests is being persisted every syncinternval here https://github.com/hashicorp/vault/blob/main/vault/counters.go#L145 by the active node ( and synced to the active if a standby ). The default on syncInterval is every 30 seconds. This is the block where this logic happens https://github.com/hashicorp/vault/blob/main/vault/core_metrics.go#L101-L126

@pmmukh
Copy link
Contributor Author

pmmukh commented Jul 1, 2021

Is there a jira issue or something that this came from? I would just like to get a bit more context about this (and maybe see how the counter is used)

hey @swayne275 ! Yeah, so this PR is intended to fix this bug https://hashicorp.atlassian.net/browse/VAULT-708

@vercel vercel bot temporarily deployed to Preview – vault July 1, 2021 16:50 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook July 1, 2021 16:50 Inactive
@raskchanky
Copy link
Contributor

I haven't reviewed the code itself yet, but the Jira explicitly mentions that this bug does not occur in OSS Vault, but only in ENT Vault. That makes me think the bug is in an ENT specific file (i.e. one ending in _ent.go) and which only resides in the ENT Vault repo.

@pmmukh
Copy link
Contributor Author

pmmukh commented Jul 1, 2021

@raskchanky ah yeah I should've clarified this bit, my bad. So the bug only happens in ENT version, but that is due to a different reason. In OSS, the syncCounters grpc call does not happen, cause in OSS that call is a no-op https://github.com/hashicorp/vault/blob/main/vault/request_handling_util.go#L29. So when stepDown happens, the new active node only loads the request counters from memory but does not get a syncCounters RPC call from the old active node with the same value, so it does not double. This is the ENT file where this rpc call is implemented https://github.com/hashicorp/vault-enterprise/blob/main/vault/request_handling_util_ent.go#L166, but there I don't think Core knows if this is the first such call after unseal or not.

@ncabatoff
Copy link
Collaborator

Dropping a note, I'm not able to figure out yet the best test to add a couple checks for this, (there aren't any direct unit tests of preSeal afaict) any suggestion on that would be super helpful!

Take a look at TestInteg_Counters_With_Sync, which should have most of the building blocks you'll need. You'll want to issue a few requests to the active node so that the active node has a nonzero counter, verify that by reading the current count, then do a step-down and read the count from the new active node. Prior to your change it should return a nonzero count, after your change, zero.

@pmmukh
Copy link
Contributor Author

pmmukh commented Jul 6, 2021

Take a look at TestInteg_Counters_With_Sync, which should have most of the building blocks you'll need. You'll want to issue a few requests to the active node so that the active node has a nonzero counter, verify that by reading the current count, then do a step-down and read the count from the new active node. Prior to your change it should return a nonzero count, after your change, zero.

@ncabatoff awesome, thank you! Will take a look at that test! That one seems to be in the ent repo, which makes sense, so I should probably open this PR against the ENT repo, and use this one for merging the OSS change. Also, wanted to clarify one point. When I do this bit read the count from the new active node, prior to my change I think it'll return 2X ( if I made X requests to the active before ) and after my change it'll return X, as X is being loaded in from storage during unseal. Does that logic make sense to you ?

@ncabatoff
Copy link
Collaborator

When I do this bit read the count from the new active node, prior to my change I think it'll return 2X ( if I made X requests to the active before ) and after my change it'll return X, as X is being loaded in from storage during unseal. Does that logic make sense to you ?

Yup, I was confused.

@pmmukh
Copy link
Contributor Author

pmmukh commented Jul 7, 2021

@pmmukh pmmukh requested a review from swayne275 July 7, 2021 18:39
@pmmukh pmmukh merged commit 6b8e4dc into main Jul 7, 2021
@pmmukh pmmukh deleted the vault-708-request-counter-sync-fix branch July 7, 2021 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants