
identitystore entity leak causes major performance issues #8761

Closed
dustin-decker opened this issue Apr 17, 2020 · 16 comments · Fixed by hashicorp/vault-plugin-auth-gcp#91
Labels
bug Used to indicate a potential bug core/identity

Comments

@dustin-decker
Contributor

dustin-decker commented Apr 17, 2020

Describe the bug
The identitystore entities only grow and never shrink, causing long failover times (more than 10 minutes) due to entity loading and a storagepacker bottleneck during authentication. It also causes high memory usage (climbing to 110 GB when we list entities).

We currently have over 5.7 million entities and growing:

vault read -format=json sys/internal/counters/entities 
{
  "request_id": "c332e801-8d1d-1d7a-3ab2-9540b9bf280a",
  "lease_id": "",
  "lease_duration": 0,
  "renewable": false,
  "data": {
    "counters": {
      "entities": {
        "total": 5708878
      }
    }
  },
  "warnings": null
}

This means each of the 256 storagepacker buckets that contain them holds around 22.3k entities, and each time a login occurs CreateOrFetchEntity() is called, which reads a storagepacker bucket, decompresses it, unmarshals the protobuf, modifies it, marshals it, compresses it, and writes the bucket back. As a result we see Vault fail at roughly 500 GCP auth requests per MINUTE.

This also caused a Consul storage migration to fail due to the large key sizes of the storagepacker buckets.
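For context on why each login is so expensive at this scale, here is a minimal sketch of how the storagepacker assigns an entity to one of its 256 fixed buckets. The md5/first-byte details are my reading of the storagepacker code in this era of Vault, so treat them as assumptions rather than a documented contract:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// bucketKey sketches the bucket assignment: hash the item ID and keep the
// first byte (two hex characters), giving 256 possible buckets ("00".."ff").
// Every login then has to read-modify-write one entire bucket.
func bucketKey(entityID string) string {
	sum := md5.Sum([]byte(entityID))
	return hex.EncodeToString(sum[:])[0:2]
}

func main() {
	// 5,708,878 entities spread over 256 buckets is ~22.3k entities per
	// bucket, all of which are rewritten on every login that touches it.
	fmt.Println(5708878 / 256) // 22300
	fmt.Println(bucketKey("entity/c332e801-8d1d-1d7a-3ab2-9540b9bf280a"))
}
```

This also shows why deleting entities helps so dramatically: smaller buckets mean less data to decompress, unmarshal, and rewrite on every authentication.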

To Reproduce
Authenticate millions of times with unique identities.

Expected behavior
I expect entities to be maintained at a reasonable size.

Environment:

  • Vault Server Version (retrieve with vault status):
$ vault status
Key                    Value
---                    -----
Version                1.4.0+prem
  • Vault CLI Version (retrieve with vault version):
$ vault version
Vault v1.1.1 (cgo)
  • Server Operating System/Architecture:
    Linux, AMD64

Vault server configuration file(s):

# Paste your Vault config here.
# Be sure to scrub any sensitive values

Additional context
We've been fighting these perf issues for over a year.

It seems the only places Vault deletes dangling items from the storagepacker are entities and group entities whose namespace is nil:
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L112
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L280

@dustin-decker
Contributor Author

Can we safely delete these entities? Preferably by deleting the bucket keys directly.

@technologik
Contributor

cc @briankassouf

@pcman312 pcman312 added bug Used to indicate a potential bug core/identity labels Apr 17, 2020
pcman312 added a commit to hashicorp/vault-plugin-auth-gcp that referenced this issue Apr 22, 2020
Resolves hashicorp/vault#8761

Allows users to specify an alias type field for IAM and GCE logins which
will then switch between the current behavior (a unique ID for IAM, and
the instance ID for GCE) and a newly created role_id field. The role_id
is a UUID generated when the role is created. All existing roles without
a role_id will have one generated and saved when the role is read or
written.
@dustin-decker
Contributor Author

Is there any broader discussion happening about managing identities more sustainably?

For our use cases it would work nicely to have the detailed identity info (GCP instance ID, etc.) stored inside a batch token, and have batch tokens be truly stateless. I'm not sure if that would be compatible with the broader identities API, though, since we don't use it. Generally we try to use batch tokens wherever possible for better performance.

@dustin-decker
Contributor Author

I think this got automatically closed because of the GCP auth plugin reference. Can it be reopened? I am unable to reopen it.

@briankassouf briankassouf reopened this Apr 27, 2020
@briankassouf briankassouf reopened this Apr 30, 2020
@dustin-decker
Contributor Author

After cleaning up the stale entities, our instance now allocates about 0.5% of the memory it did at peak, and our failovers take seconds instead of more than 10 minutes. Identity store operations are fast again, so we expect GCE auth to be performant as well.

Thank you for your work in addressing the causes and improving metrics around this @briankassouf, @pcman312, @tyrannosaurus-becks, and all others involved.

For future and ongoing remediations, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.

@ncabatoff
Collaborator

For future and ongoing remediations, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.

Yes, this is something we want to do as part of a larger effort, though I can't speak as to timeline.

I'm going to close this issue because I don't think there's anything left to be done here, and the tidy work will be tracked on our internal roadmap.

@sidewinder12s

sidewinder12s commented Sep 2, 2020

Wow, I think this might be the root cause of the performance/memory issues we've been having with Vault using GCP auth + the GCS storage backend (we have multiple batch compute workloads with spot instances hitting Vault with thousands of new nodes a day).

Is there any more guidance on how to deal with large numbers of entities and the performance implications of the alias configuration?
It's also not clear to me that upgrading to the versions of Vault with the updated entity mapping logic would fix performance issues, since it sounds like Vault will still be looking up all this identity information.

@duplo83 Do you have any tips for cleanup/how you identified stale entities to be deleted? Or did you just go ahead and delete most of them and let them get regenerated?

@dustin-decker
Contributor Author

It almost certainly is from the situation you described, @sidewinder12s. If you are on a recent enough version of Vault, you can obtain entity count metrics. You can also configure the GCP auth backend now so it doesn't generate so many unique identities.

We had a script run through and delete the old identities. There are a few considerations, though. I could share the script if it's okay with @briankassouf; we modified it from what Brian gave us.

  • You have to dump all of the entities first. You'll probably need to increase your request timeouts because it may take a few minutes. You will also need sufficient free memory (ours increased by tens of gigabytes, IIRC).
  • You can then walk through the entities and delete the ones older than a TTL (this requires an additional lookup per entity).
  • You'll need to start slow, because deleting is intensive on the identity store. It will speed up as you work through them. It'll probably take days from start to finish if it's anything like what we had to do.
  • GCS is a relatively high-latency backend, so it may slow things down further.
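The pacing described above can be sketched as a simple batching helper (hypothetical, not part of Vault or the original script): split the dumped entity IDs into batches, feed each batch to the identity/entity/batch-delete endpoint, and pause between batches, starting with a small batch size and growing it as the store drains:

```go
package main

import "fmt"

// chunk splits entity IDs into fixed-size batches so deletions can be paced.
// The caller would POST each batch to identity/entity/batch-delete and
// sleep between batches to avoid overloading the identity store.
func chunk(ids []string, size int) [][]string {
	var batches [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}

func main() {
	ids := []string{"a", "b", "c", "d", "e", "f", "g"}
	for _, b := range chunk(ids, 3) {
		fmt.Println(b) // each batch would be one batch-delete request
	}
}
```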

@sidewinder12s

I am now seeing huge latency spikes on GCS writes (apparently because object writes are limited to once per second) for a few storagepacker buckets. What's weird is that this is impacting Vault login/availability for a few users, but those users are using batch-type tokens, which I thought didn't touch the storage backend.

Might you have any idea what would be trying to write the storagepacker entries multiple times a second, days after I stopped deleting entities? I don't see much information in commits, issues, or discussion pages about how the storagepacker behaves.

@dustin-decker
Contributor Author

dustin-decker commented Sep 29, 2020

Batch token creation still creates an identity in the identity store (which uses the storagepacker buckets).
If GCS limits object writes to once per second, it probably won't be compatible with your login rate, since there are only 256 storagepacker buckets.
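That observation gives a back-of-the-envelope throughput ceiling, assuming each login rewrites exactly one bucket and GCS allows roughly one mutation per second per object (my estimate from the thread, not a documented Vault limit):

```go
package main

import "fmt"

// maxLoginsPerSec estimates the login throughput ceiling when every login
// rewrites one storagepacker bucket and the storage backend caps each
// object at a fixed write rate.
func maxLoginsPerSec(buckets int, writesPerSecPerObject float64) float64 {
	return float64(buckets) * writesPerSecPerObject
}

func main() {
	// Best case: logins spread evenly over all 256 buckets.
	fmt.Println(maxLoginsPerSec(256, 1.0)) // ~256 logins/sec ceiling
	// Worst case: one batch workload whose entities hash to a single
	// hot bucket, matching the latency spikes described above.
	fmt.Println(maxLoginsPerSec(1, 1.0)) // ~1 login/sec
}
```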

@sidewinder12s

Yup, I think I was seeing the high rate of change for metadata keys that is called out in the docs.
I was basically slamming one packer bucket during logins from one of our batch compute services. I fairly rapidly upgraded from 1.3 to 1.5, and I think it just got lost in all the cleanup I had been doing.

Specifically, GCS has an update limitation of once per second for a single object.

Thanks for your help/responses.

@abbasalizaidi

@dustin-decker @sidewinder12s Any idea what happens to the stale identities if I upgrade Vault from 1.1.2 to the latest, or to the version where this issue is fixed? My entity store is packed full of stale identities, and I can't issue a simple vault list without overwhelming the cluster.

@dustin-decker
Contributor Author

Unless something has changed, and if I recall correctly, upgrading will stop the leak but not clean up the existing stale entities. You'd have to script that operation along these lines: #8761 (comment)

@taitelman

taitelman commented Aug 2, 2023

I wish someone would add a periodic function to run the cleanup:
just add some defaultIdentityMaxAge setting in config.hcl and run something like the following. I wonder why it's not part of Vault core.

func selectIdentities2delete(identityIDs []string) (identityToDelete []string, err error) {
	identityToDelete = make([]string, 0, len(identityIDs))

	for _, identityID := range identityIDs {
		// Read each entity individually so we can inspect its last_update_time.
		r, err := http.NewRequest(http.MethodGet, getURL(vaultEndpoint, "/v1/identity/entity/id/"+identityID), nil)
		if err != nil {
			return nil, fmt.Errorf("failed creating http request to read entity ID: %s", err.Error())
		}

		r.Header.Set(vaultTokenHeader, Token)
		r.Header.Set("Content-Type", applicationJsonContentType)
		r.Header.Set("Accept", applicationJsonContentType)

		response, err := httpRequest(r, http.StatusOK)
		if err != nil {
			return nil, fmt.Errorf("failed http request to read entity ID: %s", err.Error())
		}
		res := &logical.HTTPResponse{}
		if err := json.Unmarshal(response, res); err != nil {
			return nil, fmt.Errorf("failed to unmarshal response body for entity ID: %s", err.Error())
		}

		if lastUpdateRaw, found := res.Data["last_update_time"]; found {
			if lastUpdate, casted := lastUpdateRaw.(string); casted {
				if len(lastUpdate) > 0 {
					identityLastUpdateTime, err := time.Parse(time.RFC3339, lastUpdate)
					if err != nil {
						return nil, err
					}
					currentAge := time.Since(identityLastUpdateTime)
					if currentAge > defaultIdentityMaxAge {
						logger.Debug(fmt.Sprintf("found old identity to delete: %s age: %s", identityID, currentAge))
						identityToDelete = append(identityToDelete, identityID)
					}
				}
			} else {
				logger.Error(fmt.Sprintf("failed to cast last update time to string, identityID: %s", identityID))
			}
		} else {
			logger.Error(fmt.Sprintf("failed to find last_update_time element in response, identityID: %s", identityID))
		}
	}
	return identityToDelete, nil
}

func deleteBatch(identityIDs []string) (err error) {
	if len(identityIDs) < 1 {
		return nil
	}
	bodyMap := map[string][]string{"entity_ids": identityIDs}
	bodyBytes, err := json.Marshal(bodyMap)
	if err != nil {
		return err
	}
	body := bytes.NewBuffer(bodyBytes)
	r, err := http.NewRequest(http.MethodPost, getURL(vaultEndpoint, "/v1/identity/entity/batch-delete"), body)
	if err != nil {
		return fmt.Errorf("failed to create http request to batch delete: %s", err.Error())
	}

	r.Header.Set(vaultTokenHeader, Token)
	r.Header.Set("Content-Type", applicationJsonContentType)
	r.Header.Set("Accept", applicationJsonContentType)

	_, err = httpRequest(r, http.StatusNoContent)
	if err != nil {
		return fmt.Errorf("failed http request to batch delete entities: %s", err.Error())
	}
	return nil
}

@revanthalampally

revanthalampally commented Aug 27, 2024

@dustin-decker We also see similar issues in our Vault cluster, where /v1/identity/entity/id?list=true throws 500 and is unresponsive in one of our non-prod environments. We want to fix this in prod as soon as possible.

We want to clean up the entities. I saw you mentioned a script; could you share it? We would like to know if we can reuse or build upon it.
Also, what exactly is last_update_time? If a client logs in with the role pertaining to the entity, will last_update_time become the login time?

@technologik
Contributor

@dustin-decker We also see similar issues in our Vault cluster where the /v1/identity/entity /id?list=true is throwing 500 and is unresponsive in one of our non prod env. We want to fix this, in the prod, as soon as possible.

We want to clean up the entities. I saw you mentioned about a script. Could you share the script? We would like to know if we can reuse or build upon it. Also, what exactly is last_update_time? If a client logs in with the role pertaining to the entity, will the last_update_time become the login time?

You should ask HashiCorp for a script; IIRC their engineering team created it. FWIW, neither Dustin nor I still work at the company where this was used (do note that this specific issue has been closed for 4 years).
