
When starting or changing the active pod, Vault does not work for a very long time (many PKI certs in backend storage) #21042

Closed
ser6iy opened this issue Jun 7, 2023 · 13 comments

@ser6iy

ser6iy commented Jun 7, 2023

When starting or changing the active pod, Vault does not work for a very long time (many PKI certs in backend storage).
Checked on versions 1.12.4, 1.12.6, 1.13.2.
Automatic cleanup is enabled; it's just that the certificates must be stored for a certain time, and there can be a lot of them.

DynamoDB is used as the backend. At the time of starting or switching the active pod, it maxes out all of the table's on-demand capacity, and the active pod uses roughly 10x the memory of its normal state.

      storage "dynamodb" {
        ha_enabled = "true"
        max_parallel = 128
        region = "us-east-1"
        table  = "vault.storage"
      }

With a small number of PKI certificates (up to 100), everything loads and switches instantly.
With around 20,000,000 PKI certs, the active pod hangs in a non-working state for about 25 minutes.
Debug logs from active pod:

2023-06-01T18:37:40.536Z [INFO]  core: restoring leases
2023-06-01T18:37:40.536Z [DEBUG] expiration: collecting leases
2023-06-01T18:37:40.544Z [DEBUG] identity: loading entities
2023-06-01T18:37:40.551Z [DEBUG] identity: entities collected: num_existing=34
2023-06-01T18:37:40.551Z [DEBUG] identity: entities loading: progress=0
2023-06-01T18:37:40.589Z [DEBUG] expiration: leases collected: num_existing=147
2023-06-01T18:37:40.893Z [INFO]  expiration: lease restore complete
2023-06-01T18:37:40.907Z [INFO]  identity: entities restored
2023-06-01T18:37:40.907Z [DEBUG] identity: identity loading groups
2023-06-01T18:37:40.912Z [DEBUG] identity: groups collected: num_existing=0
2023-06-01T18:37:40.912Z [INFO]  identity: groups restored
2023-06-01T18:37:40.912Z [DEBUG] identity: identity loading OIDC clients
2023-06-01T18:37:40.935Z [DEBUG] core: request forwarding setup function
2023-06-01T18:37:40.935Z [DEBUG] core: clearing forwarding clients
2023-06-01T18:37:40.935Z [DEBUG] core: done clearing forwarding clients
2023-06-01T18:37:40.935Z [DEBUG] core: leaving request forwarding setup function
2023-06-01T18:37:40.935Z [INFO]  core: usage gauge collection is disabled
2023-06-01T18:37:41.109Z [DEBUG] core.cluster-listener: performing server cert lookup
2023-06-01T18:37:41.138Z [DEBUG] core.request-forward: got request forwarding connection
2023-06-01T18:37:44.636Z [DEBUG] core.cluster-listener: performing server cert lookup
2023-06-01T18:37:44.665Z [DEBUG] core.request-forward: got request forwarding connection
2023-06-01T18:47:41.041Z [DEBUG] core.autoseal: seal health test passed
2023-06-01T18:57:41.065Z [DEBUG] core.autoseal: seal health test passed
2023-06-01T19:02:15.905Z [INFO]  core: post-unseal setup complete

Vault has enough resources.
The problem is that it reads all PKI certificates from backend storage into memory, and until that finishes it does not work and does not respond to requests; it just hangs.

This behavior in high-availability mode is strange, to say the least: the service should start serving requests and then index or check certificates for expiry in the background.

@kubawi kubawi added the core (Issues and Pull-Requests specific to Vault Core), secret/pki, and cryptosec labels Jun 7, 2023
@cipherboy
Contributor

I do not understand this point:

The problem is that it reads all PKI certificates from backend storage into memory, and until that finishes it does not work and does not respond to requests; it just hangs.

When does PKI do this? We delay tidy until one full interval_duration has passed after startup. Tidy does not keep the certificates in memory; it cleans them up (with the list itself being the expensive portion, as discussed in #21041). The PKI engine does not otherwise store certificates in memory or attempt to fetch them all as part of startup or regular operations.

Can you share your config/auto-tidy?

Is this a limitation of your backend data store? Note that HashiCorp recommends using the Raft backend, which, to my knowledge, does not have this limitation.
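
For comparison, a minimal integrated-storage (Raft) stanza has the same shape as the DynamoDB stanza above. This is only an illustrative sketch; the path, node ID, and addresses are placeholders rather than values from this issue:

    storage "raft" {
      path    = "/vault/data"
      node_id = "node-a"
    }

    # Raft also needs the cluster addresses set, for example:
    # api_addr     = "https://vault-0.example.internal:8200"
    # cluster_addr = "https://vault-0.example.internal:8201"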

@maxb
Contributor

maxb commented Jun 7, 2023

Oh dear... it looks like listing all the certificates in the PKI store has been added in a place that blocks post-unseal processing:

err = b.initializeStoredCertificateCounts(ctx)
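
To make the distinction concrete, here is a standalone Go sketch (not Vault's actual code; toyStorage, initializeBlocking, and initializeBackground are hypothetical names). Counting every stored certificate by listing the whole prefix inside the initialize path holds up post-unseal setup until the scan completes, while deferring the same count to a background goroutine lets initialization return immediately and the count converge later:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // toyStorage stands in for a physical backend such as DynamoDB.
    type toyStorage struct{ keys []string }

    // List simulates a paged LIST over the certs/ prefix; with tens of
    // millions of entries this is the expensive call.
    func (s *toyStorage) List(prefix string) []string {
        time.Sleep(2 * time.Second) // pretend this is the ~25-minute scan
        return s.keys
    }

    type backend struct {
        storage   *toyStorage
        certCount atomic.Int64
    }

    // initializeBlocking mirrors the problematic pattern: the caller (and
    // therefore post-unseal setup) waits for the full listing.
    func (b *backend) initializeBlocking() {
        serials := b.storage.List("certs/")
        b.certCount.Store(int64(len(serials)))
    }

    // initializeBackground defers the expensive count, so initialization
    // returns immediately and the count is filled in later.
    func (b *backend) initializeBackground() {
        go func() {
            serials := b.storage.List("certs/")
            b.certCount.Store(int64(len(serials)))
        }()
    }

    func main() {
        b1 := &backend{storage: &toyStorage{keys: make([]string, 1000)}}
        start := time.Now()
        b1.initializeBlocking()
        fmt.Printf("blocking init returned after %s, count=%d\n",
            time.Since(start), b1.certCount.Load())

        b2 := &backend{storage: &toyStorage{keys: make([]string, 1000)}}
        start = time.Now()
        b2.initializeBackground()
        fmt.Printf("background init returned after %s, count so far=%d\n",
            time.Since(start), b2.certCount.Load())

        time.Sleep(3 * time.Second)
        fmt.Printf("count after background scan: %d\n", b2.certCount.Load())
    }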

@cipherboy
Contributor

Ahh yes, the count. I thought this was disabled by default:

if config.MaintainCount == false {

[cipherboy@xps15 vault]$ vault secrets enable pki
Success! Enabled the pki secrets engine at: pki/
[cipherboy@xps15 vault]$ vault read pki/config/auto-tidy
Key                                         Value
---                                         -----
acme_account_safety_buffer                  2592000
enabled                                     false
interval_duration                           12h
issuer_safety_buffer                        31536000
maintain_stored_certificate_counts          false
pause_duration                              0s
publish_stored_certificate_count_metrics    false
revocation_queue_safety_buffer              172800
safety_buffer                               259200
tidy_acme                                   false
tidy_cert_store                             false
tidy_cross_cluster_revoked_certs            false
tidy_expired_issuers                        false
tidy_move_legacy_ca_bundle                  false
tidy_revocation_queue                       false
tidy_revoked_cert_issuer_associations       false
tidy_revoked_certs                          false

@ser6iy
Author

ser6iy commented Jun 7, 2023

Can you share your config/auto-tidy?

$ vault read pki/config/auto-tidy
Key                                      Value
---                                      -----
enabled                                  false
interval_duration                        168h
issuer_safety_buffer                     31536000
pause_duration                           0s
revocation_queue_safety_buffer           0
safety_buffer                            172800
tidy_cert_store                          true
tidy_cross_cluster_revoked_certs         false
tidy_expired_issuers                     false
tidy_move_legacy_ca_bundle               false
tidy_revocation_queue                    false
tidy_revoked_cert_issuer_associations    true
tidy_revoked_certs 

I tried both options for "enabled", false and true; with the config above, the service reads all PKI certificates from storage before it starts working.
If my memory serves me, on version 1.10.3 I did not notice this behavior, and I only encountered it from version 1.12.x onward.
But in the old version, auto-tidy was not yet implemented.
Should I expect a fix for this problem in upcoming releases?

Thanks again, everyone, for the quick response ))

@cipherboy
Contributor

@ser6iy Hmm, it looks like the config option to disable counting wasn't backported to 1.12. Could you upgrade and see if that fixes it? It seems like you tested on 1.13.2, where I would've expected it to be respected (regardless of enabled status)...

And yes, we got customer requests for metrics around the number of certificates issued, which has unfortunately resulted in issues like this...

@cipherboy cipherboy added the bug (Used to indicate a potential bug) label Jun 7, 2023
@ser6iy
Author

ser6iy commented Jun 7, 2023

@cipherboy
I don't see "maintain_stored_certificate_counts" in the documentation:
https://developer.hashicorp.com/vault/api-docs/secret/pki#configure-automatic-tidy
The config I posted is taken from the current version, 1.13.2, but I originally created it on version 1.12.4 and then carried the setup through the upgrades to 1.12.6 and then to 1.13.2.
So the config was not changed after upgrading to the new version.

I will try adding this parameter tomorrow and check whether it works on version 1.13.2.

@cipherboy
Contributor

Ahh I wonder if it got dropped in my reorg of that section for 1.13. Damn. I'll update that tomorrow, thank you!

@ser6iy
Author

ser6iy commented Jun 8, 2023

@cipherboy
Updated config:

{
  "enabled": false,
  "maintain_stored_certificate_counts": false,
  "tidy_revoked_cert_issuer_associations": true,
  "tidy_cert_store": true,
  "tidy_revoked_certs": true,
  "interval_duration": "168h",
  "safety_buffer": "48h"
}

After applying:
"warnings":["Endpoint ignored these unrecognized parameters: [maintain_stored_certificate_counts]"]
Current version: 1.13.2

@maxb
Contributor

maxb commented Jun 8, 2023

maintain_stored_certificate_counts was only added in 1.14.
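
For anyone hitting this: on a release that recognizes the parameter (1.14+, or the patch releases with the backported fix mentioned further down), it can be set explicitly on the same endpoint. A rough CLI sketch; on fixed releases the parameter already defaults to false (per the output posted above), and depending on your setup you may want to include your other auto-tidy parameters in the same write:

    vault write pki/config/auto-tidy \
        maintain_stored_certificate_counts=false \
        publish_stored_certificate_count_metrics=false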

@ser6iy
Author

ser6iy commented Jun 8, 2023

@maxb
In which release is a fix planned for versions 1.12.x and 1.13.x?
Or how else can I fix this problem on the current versions?

@maxb
Contributor

maxb commented Jun 8, 2023

I'm just a community contributor and have no insight into HashiCorp's future intentions.

At the moment, based on the public information in #18186 and the linked PRs, it appears to be unplanned.

@cipherboy cipherboy removed the core (Issues and Pull-Requests specific to Vault Core) label Jun 8, 2023
@cipherboy
Contributor

cipherboy commented Jun 8, 2023

@maxb is a very trusted community contributor who keeps all of us, and especially me, honest and we appreciate him for that. :-)

That said, I've added 1.13 and 1.12 backport labels so hopefully it should be in the next set of releases (1.13.3 was just cut like, yesterday or today, so this would be 1.13.4 I believe). But it looks like it doesn't apply cleanly to 1.12 so I'm curious how much work that will be...

1.14.0-rc1 seems to be available now, which, as Max points out, actually has the fix, but also apparently lacks documentation, so I shall fix that too.

@cipherboy
Contributor

OK, a known issue has been added for 1.12 and 1.13, and the fix has been backported to those branches. The fix should be present in the next set of releases for 1.12, 1.13, and 1.14+. Thank you for reporting this! :-)
