identitystore entity leak causes major performance issues #8761
Can we safely delete these entities? Preferably by deleting the bucket keys directly.
Resolves hashicorp/vault#8761 Allows users to specify an alias type field for IAM and GCE logins which will then switch between the current behavior (a unique ID for IAM, and the instance ID for GCE) and a newly created role_id field. The role_id is a UUID generated when the role is created. All existing roles without a role_id will have one generated and saved when the role is read or written.
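To illustrate why the alias type matters for entity growth, here is a minimal sketch. It assumes, as a simplification, that one entity is created per distinct alias name; the `alias_name` and `count_entities` helpers and the login records are hypothetical and are not the plugin's code.

```python
import uuid


def alias_name(login, alias_type):
    """Pick the alias name for a GCE login (illustrative, not the plugin's logic)."""
    if alias_type == "instance_id":
        return login["instance_id"]
    if alias_type == "role_id":
        return login["role_id"]
    raise ValueError(f"unknown alias type: {alias_type}")


def count_entities(logins, alias_type):
    """Assumption: one entity per distinct alias name."""
    return len({alias_name(login, alias_type) for login in logins})


# One role, 1000 short-lived instances logging in once each.
role_id = str(uuid.uuid4())
logins = [{"instance_id": f"instance-{i}", "role_id": role_id} for i in range(1000)]

by_instance = count_entities(logins, "instance_id")  # grows with every new instance
by_role = count_entities(logins, "role_id")          # stays at one per role
```

Under these assumptions, aliasing on the instance ID yields one entity per instance, while aliasing on the role ID keeps the count bounded by the number of roles.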
Is there any broader discussion happening on managing identities more sustainably? For our use cases it would work nicely to have the detailed identity info (GCP instance ID, etc.) stored inside a batch token, and have the batch tokens be truly stateless. I'm not sure whether that would be compatible with the broader identities API, though, since we don't use that. Generally we try to use batch tokens wherever possible for better performance.
I think this got automatically closed because of the GCP auth plugin reference. Can it be reopened? I am unable to reopen it.
After cleaning up the stale entities, our instance now allocates about 0.5% of the memory it did at peak, and our failovers now take seconds instead of more than 10 minutes. Identity store operations are fast again, so we expect GCE auth to be performant as well. Thank you for your work in addressing the causes and improving metrics around this @briankassouf, @pcman312, @tyrannosaurus-becks, and all others involved. For future and ongoing remediation, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.
Yes, this is something we want to do as part of a larger effort, though I can't speak to the timeline. I'm going to close this issue because I don't think there's anything left to be done here, and the tidy work will be tracked on our internal roadmap.
Wow, I think this might be the root cause of the performance/memory issues we've been having with Vault with GCP Auth + GCS Storage backend (we have multiple batch compute workloads with spot instances hitting Vault with 1000s of new nodes a day). Is there any more guidance on how to deal with large numbers of entities and the performance implications of the alias configuration? @duplo83 Do you have any tips for cleanup, or on how you identified stale entities to be deleted? Or did you just go ahead and delete most of them and let them get regenerated?
It almost certainly is from the situation that you described @sidewinder12s. If you are on a recent enough version of Vault, you can obtain entity count metrics. You can also configure the GCP auth backend now to not generate so many unique identities. We had a script run through and delete the old identities. There are a few considerations though. I could share the script if it's okay with @briankassouf. We modified it from what Brian gave us.
I am now seeing huge latency spikes with GCS writes (apparently because object writes are limited to once per second) for a few storage packer buckets. What's weird is that this is impacting Vault login/availability for a few users, but those users are using batch tokens, which I thought didn't touch the storage backend. Might you have any idea what would be trying to write the storage packer entries multiple times a second, days after I stopped deleting entities? I don't see much information in commits, issues, or discussion pages about how storage packer behaves.
Batch token creation still creates an identity in the identity store (which uses the storage packer buckets).
Yup, I think I was seeing the high rate of change for metadata keys that is called out in the docs. Specifically, GCS has an update limit of once per second for a single object. Thanks for your help/responses.
@dustin-decker @sidewinder12s Any idea what happens to the stale identities if I upgrade Vault from 1.1.2 to the latest version, or to the version where this issue is fixed? My entity store is packed full of stale identities and I can't issue a simple vault list without overwhelming the cluster.
If I recall correctly, and unless something has changed, upgrading will stop the leak but not clean up the existing stale entities. You'd have to script that operation along these lines: #8761 (comment)
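For readers without access to the linked comment, here is a minimal sketch of what such a cleanup loop could look like. It is not the script referenced in this thread. The callables are hypothetical wrappers you would write around Vault's identity HTTP API (listing `identity/entity/id`, then reading and deleting `identity/entity/id/<id>`), and the staleness rule is an assumption you must supply yourself.

```python
from typing import Callable, Iterable, List, Optional


def purge_stale_entities(
    list_ids: Callable[[], Iterable[str]],
    read_entity: Callable[[str], Optional[dict]],
    delete_entity: Callable[[str], None],
    is_stale: Callable[[dict], bool],
    dry_run: bool = True,
) -> List[str]:
    """Return the IDs judged stale; delete them only when dry_run is False.

    list_ids, read_entity and delete_entity are hypothetical wrappers around
    the identity API; is_stale is a caller-supplied rule (for example, "the
    GCE instance behind this alias no longer exists").
    """
    stale: List[str] = []
    for entity_id in list_ids():
        entity = read_entity(entity_id)
        if entity is None:
            # Entity vanished between list and read; nothing to do.
            continue
        if is_stale(entity):
            if not dry_run:
                delete_entity(entity_id)
            stale.append(entity_id)
    return stale


# Usage against an in-memory stand-in for the identity store.
store = {
    "e1": {"name": "entity_old", "aliases": [{"name": "instance-gone"}]},
    "e2": {"name": "entity_live", "aliases": [{"name": "instance-live"}]},
}
live_instances = {"instance-live"}


def alias_is_gone(entity: dict) -> bool:
    return all(a["name"] not in live_instances for a in entity["aliases"])


would_delete = purge_stale_entities(
    lambda: list(store), store.get, store.pop, alias_is_gone, dry_run=True
)
```

Defaulting to a dry run lets you review the candidate list before anything is deleted. Note that on a store as large as the ones described in this thread, listing every entity ID is itself expensive, so a real run would likely need throttling.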
I wish someone would add a periodic function to run the cleanup.
@dustin-decker We also see similar issues in our Vault cluster where the /v1/identity/entity We want to clean up the entities. I saw you mentioned a script. Could you share it? We would like to know if we can reuse or build upon it.
You should ask Hashi for a script; IIRC their engineering team created it. FWIW neither Dustin nor I work at the company where this was used (do note that this specific issue has been closed for 4 years).
Describe the bug
The identitystore entities only grow and never shrink, causing long failover times (more than 10 minutes) due to entity loading and a storagepacker bottleneck during authentication. It also causes high memory usage (increasing to 110GB when we list entities).
We currently have over 5.7 million entities and growing:
This means each of the 256 storagepacker buckets holds around 22.3k entities. Each time a login occurs, CreateOrFetchEntity() is called: it reads a storagepacker bucket, decompresses it, unmarshals the protobuf, modifies it, marshals it, compresses it, and writes the bucket back. As a result, we see Vault fail at about 500 GCP auth requests per MINUTE.
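The read-modify-write cycle described above can be sketched roughly as follows. This is an illustration only: JSON and zlib stand in for the protobuf encoding and compression Vault actually uses, the hash-based bucket assignment is an assumption made for the example, and all function names are hypothetical.

```python
import hashlib
import json
import zlib

NUM_BUCKETS = 256


def bucket_index(entity_id: str) -> int:
    """Map an entity ID to one of 256 buckets (assumed hashing scheme)."""
    return hashlib.md5(entity_id.encode()).digest()[0]


def pack(entities: dict) -> bytes:
    """Serialize and compress a whole bucket (JSON + zlib as stand-ins)."""
    return zlib.compress(json.dumps(entities).encode())


def unpack(blob: bytes) -> dict:
    """Decompress and deserialize a whole bucket."""
    return json.loads(zlib.decompress(blob))


def upsert_entity(blob: bytes, entity_id: str, entity: dict) -> bytes:
    """One login: the entire bucket is decoded, modified, and re-encoded."""
    entities = unpack(blob)
    entities[entity_id] = entity
    return pack(entities)


# 5.7 million entities spread over 256 buckets.
entities_per_bucket = round(5_700_000 / NUM_BUCKETS)


def demo() -> int:
    """Build one bucket of that size, add a single entity, return the new count."""
    bucket = {f"entity-{i}": {"alias": f"instance-{i}"} for i in range(entities_per_bucket)}
    blob = pack(bucket)
    blob = upsert_entity(blob, "entity-new", {"alias": "instance-new"})
    return len(unpack(blob))


if __name__ == "__main__":
    print(entities_per_bucket, demo())
```

The point of the sketch is that the work per login is proportional to the size of the whole bucket, not to the single entity being added, so the cost of each authentication grows as entities accumulate.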
This also caused a Consul storage migration to fail due to the large key sizes of the storagepacker buckets.
To Reproduce
Authenticate millions of times with unique identities.
Expected behavior
I expect the number of entities to be kept at a reasonable size.
Environment:
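Vault Server Version (retrieve with vault status):
Vault CLI Version (retrieve with vault version):
Server Operating System/Architecture: Linux, AMD64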
Vault server configuration file(s):
Additional context
We've been fighting these perf issues for over a year.
It seems the only place Vault deletes any dangling entities from storagepacker is for entities and group entities whose namespace is nil:
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L112
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L280