
When starting or changing the active pod, Vault does not work for a very long time (many PKI certs in backend storage) #21042

Closed
ser6iy opened this issue Jun 7, 2023 · 13 comments

@ser6iy

ser6iy commented Jun 7, 2023

When starting or changing the active pod, Vault does not work for a very long time (many PKI certs in backend storage).
Checked on versions 1.12.4, 1.12.6, 1.13.2.
Automatic cleanup is enabled; it's just that the certificates must be stored for a certain time, and there can be a lot of them.

DynamoDB is used as the backend. At the time of starting or switching the active pod, it maxes out all of the table's on-demand capacity, and the active pod uses roughly 10x the memory of its normal state.

      storage "dynamodb" {
        ha_enabled = "true"
        max_parallel = 128
        region = "us-east-1"
        table  = "vault.storage"
      }

With a small number of PKI certificates (up to 100), everything loads and switches instantly.
With around 20,000,000 PKI certs, the active pod hangs in a non-working state for about 25 minutes.
Debug logs from active pod:

2023-06-01T18:37:40.536Z [INFO]  core: restoring leases
2023-06-01T18:37:40.536Z [DEBUG] expiration: collecting leases
2023-06-01T18:37:40.544Z [DEBUG] identity: loading entities
2023-06-01T18:37:40.551Z [DEBUG] identity: entities collected: num_existing=34
2023-06-01T18:37:40.551Z [DEBUG] identity: entities loading: progress=0
2023-06-01T18:37:40.589Z [DEBUG] expiration: leases collected: num_existing=147
2023-06-01T18:37:40.893Z [INFO]  expiration: lease restore complete
2023-06-01T18:37:40.907Z [INFO]  identity: entities restored
2023-06-01T18:37:40.907Z [DEBUG] identity: identity loading groups
2023-06-01T18:37:40.912Z [DEBUG] identity: groups collected: num_existing=0
2023-06-01T18:37:40.912Z [INFO]  identity: groups restored
2023-06-01T18:37:40.912Z [DEBUG] identity: identity loading OIDC clients
2023-06-01T18:37:40.935Z [DEBUG] core: request forwarding setup function
2023-06-01T18:37:40.935Z [DEBUG] core: clearing forwarding clients
2023-06-01T18:37:40.935Z [DEBUG] core: done clearing forwarding clients
2023-06-01T18:37:40.935Z [DEBUG] core: leaving request forwarding setup function
2023-06-01T18:37:40.935Z [INFO]  core: usage gauge collection is disabled
2023-06-01T18:37:41.109Z [DEBUG] core.cluster-listener: performing server cert lookup
2023-06-01T18:37:41.138Z [DEBUG] core.request-forward: got request forwarding connection
2023-06-01T18:37:44.636Z [DEBUG] core.cluster-listener: performing server cert lookup
2023-06-01T18:37:44.665Z [DEBUG] core.request-forward: got request forwarding connection
2023-06-01T18:47:41.041Z [DEBUG] core.autoseal: seal health test passed
2023-06-01T18:57:41.065Z [DEBUG] core.autoseal: seal health test passed
2023-06-01T19:02:15.905Z [INFO]  core: post-unseal setup complete

Vault has enough resources.
The problem is that it reads all PKI certificates from backend storage into memory, and until that finishes it does not work and does not respond to requests; it just hangs.

This behavior in high-availability mode is strange, to say the least: the service should start serving requests and then index or check certificates for expiry in the background.

@kubawi kubawi added the core (Issues and Pull-Requests specific to Vault Core), secret/pki, and cryptosec labels Jun 7, 2023
@cipherboy
Contributor

I do not understand this point:

The problem is that it reads all PKI certificates from backend storage into memory, and until that finishes it does not work and does not respond to requests; it just hangs.

When does PKI do this? We delay tidy until one full interval_duration has passed after startup. Tidy does not keep the certificates in memory; it cleans them up (with the list itself being the expensive portion, as discussed in #21041). The PKI engine does not otherwise store certificates in memory or attempt to fetch them all as part of startup or regular operations.

Can you share your config/auto-tidy?

Is this a limitation of your backend data store? Note that HashiCorp recommends using the Raft backend, which, to my knowledge, does not have this limitation.
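
For comparison, a minimal integrated-storage (Raft) stanza has the same shape as the DynamoDB stanza above. This is only an illustrative sketch; the path, node ID, and addresses are placeholders rather than values from this issue:

    storage "raft" {
      path    = "/vault/data"
      node_id = "node-a"
    }

    # Raft also needs the cluster addresses set, for example:
    # api_addr     = "https://vault-0.example.internal:8200"
    # cluster_addr = "https://vault-0.example.internal:8201"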

@maxb
Contributor

maxb commented Jun 7, 2023

Oh dear... it looks like listing all the certificates in the PKI store has been added in a place that blocks post-unseal processing:

err = b.initializeStoredCertificateCounts(ctx)
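
To make the distinction concrete, here is a standalone Go sketch (not Vault's actual code; toyStorage, initializeBlocking, and initializeBackground are hypothetical names). Counting every stored certificate by listing the whole prefix inside the initialize path holds up post-unseal setup until the scan completes, while deferring the same count to a background goroutine lets initialization return immediately and the count converge later:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // toyStorage stands in for a physical backend such as DynamoDB.
    type toyStorage struct{ keys []string }

    // List simulates a paged LIST over the certs/ prefix; with tens of
    // millions of entries this is the expensive call.
    func (s *toyStorage) List(prefix string) []string {
        time.Sleep(2 * time.Second) // pretend this is the ~25-minute scan
        return s.keys
    }

    type backend struct {
        storage   *toyStorage
        certCount atomic.Int64
    }

    // initializeBlocking mirrors the problematic pattern: the caller (and
    // therefore post-unseal setup) waits for the full listing.
    func (b *backend) initializeBlocking() {
        serials := b.storage.List("certs/")
        b.certCount.Store(int64(len(serials)))
    }

    // initializeBackground defers the expensive count, so initialization
    // returns immediately and the count is filled in later.
    func (b *backend) initializeBackground() {
        go func() {
            serials := b.storage.List("certs/")
            b.certCount.Store(int64(len(serials)))
        }()
    }

    func main() {
        b1 := &backend{storage: &toyStorage{keys: make([]string, 1000)}}
        start := time.Now()
        b1.initializeBlocking()
        fmt.Printf("blocking init returned after %s, count=%d\n",
            time.Since(start), b1.certCount.Load())

        b2 := &backend{storage: &toyStorage{keys: make([]string, 1000)}}
        start = time.Now()
        b2.initializeBackground()
        fmt.Printf("background init returned after %s, count so far=%d\n",
            time.Since(start), b2.certCount.Load())

        time.Sleep(3 * time.Second)
        fmt.Printf("count after background scan: %d\n", b2.certCount.Load())
    }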

@cipherboy
Contributor

Ahh yes, the count. I thought this was disabled by default:

if config.MaintainCount == false {

[cipherboy@xps15 vault]$ vault secrets enable pki
Success! Enabled the pki secrets engine at: pki/
[cipherboy@xps15 vault]$ vault read pki/config/auto-tidy
Key                                         Value
---                                         -----
acme_account_safety_buffer                  2592000
enabled                                     false
interval_duration                           12h
issuer_safety_buffer                        31536000
maintain_stored_certificate_counts          false
pause_duration                              0s
publish_stored_certificate_count_metrics    false
revocation_queue_safety_buffer              172800
safety_buffer                               259200
tidy_acme                                   false
tidy_cert_store                             false
tidy_cross_cluster_revoked_certs            false
tidy_expired_issuers                        false
tidy_move_legacy_ca_bundle                  false
tidy_revocation_queue                       false
tidy_revoked_cert_issuer_associations       false
tidy_revoked_certs                          false

@ser6iy
Author

ser6iy commented Jun 7, 2023

Can you share your config/auto-tidy?

$ vault read pki/config/auto-tidy
Key                                      Value
---                                      -----
enabled                                  false
interval_duration                        168h
issuer_safety_buffer                     31536000
pause_duration                           0s
revocation_queue_safety_buffer           0
safety_buffer                            172800
tidy_cert_store                          true
tidy_cross_cluster_revoked_certs         false
tidy_expired_issuers                     false
tidy_move_legacy_ca_bundle               false
tidy_revocation_queue                    false
tidy_revoked_cert_issuer_associations    true
tidy_revoked_certs 

I tried both options for "enabled", false and true; with the config above, the service reads all PKI certificates from storage before it starts working.
If my memory serves me, on version 1.10.3 I did not notice this behavior, and I only encountered it from version 1.12.x onward.
But in the old version, auto-tidy was not yet implemented.
Should I expect a fix for this problem in upcoming releases?

Thanks again, everyone, for the quick response ))

@cipherboy
Contributor

@ser6iy Hmm, it looks like the config option to disable counting wasn't backported to 1.12. Could you upgrade and see if that fixes it? It seems like you tested on 1.13.2, where I would've expected it to be respected (regardless of enabled status)...

And yes, we got customer requests for metrics around the number of certificates issued, which has unfortunately resulted in issues like this...

@cipherboy cipherboy added the bug (Used to indicate a potential bug) label Jun 7, 2023
@ser6iy
Author

ser6iy commented Jun 7, 2023

@cipherboy
I don't see "maintain_stored_certificate_counts" in the documentation:
https://developer.hashicorp.com/vault/api-docs/secret/pki#configure-automatic-tidy
The config I posted is taken from the current version, 1.13.2, but I originally created it on version 1.12.4 and then carried the setup through the upgrades to 1.12.6 and then to 1.13.2.
So the config was not changed after upgrading to the new version.

I will try adding this parameter tomorrow and check whether it works on version 1.13.2.

@cipherboy
Contributor

Ahh I wonder if it got dropped in my reorg of that section for 1.13. Damn. I'll update that tomorrow, thank you!

@ser6iy
Author

ser6iy commented Jun 8, 2023

@cipherboy
Updated config:

{
  "enabled": false,
  "maintain_stored_certificate_counts": false,
  "tidy_revoked_cert_issuer_associations": true,
  "tidy_cert_store": true,
  "tidy_revoked_certs": true,
  "interval_duration": "168h",
  "safety_buffer": "48h"
}

After applying:
"warnings":["Endpoint ignored these unrecognized parameters: [maintain_stored_certificate_counts]"]
Current version: 1.13.2

@maxb
Contributor

maxb commented Jun 8, 2023

maintain_stored_certificate_counts was only added in 1.14.
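
For anyone hitting this: on a release that recognizes the parameter (1.14+, or the patch releases with the backported fix mentioned further down), it can be set explicitly on the same endpoint. A rough CLI sketch; on fixed releases the parameter already defaults to false (per the output posted above), and depending on your setup you may want to include your other auto-tidy parameters in the same write:

    vault write pki/config/auto-tidy \
        maintain_stored_certificate_counts=false \
        publish_stored_certificate_count_metrics=false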

@ser6iy
Author

ser6iy commented Jun 8, 2023

@maxb
In which release is a fix planned for versions 1.12.x and 1.13.x?
Or how else can I fix this problem on the current versions?

@maxb
Contributor

maxb commented Jun 8, 2023

I'm just a community contributor and have no insight into HashiCorp's future intentions.

At the moment, based on the public information in #18186 and the linked PRs, it appears to be unplanned.

@cipherboy cipherboy removed the core (Issues and Pull-Requests specific to Vault Core) label Jun 8, 2023
@cipherboy
Contributor

cipherboy commented Jun 8, 2023

@maxb is a very trusted community contributor who keeps all of us, and especially me, honest and we appreciate him for that. :-)

That said, I've added 1.13 and 1.12 backport labels so hopefully it should be in the next set of releases (1.13.3 was just cut like, yesterday or today, so this would be 1.13.4 I believe). But it looks like it doesn't apply cleanly to 1.12 so I'm curious how much work that will be...

1.14.0-rc1 seems to be available now, which, as Max points out, actually has the fix, but also apparently lacks documentation, so I shall fix that too.

@cipherboy
Contributor

OK, a known issue has been added for 1.12 and 1.13, and the fix has been backported to those branches. The fix should be present in the next set of releases for 1.12, 1.13, and 1.14+. Thank you for reporting this! :-)
