
identitystore entity leak causes major performance issues #8761

Closed
dustin-decker opened this issue Apr 17, 2020 · 16 comments · Fixed by hashicorp/vault-plugin-auth-gcp#91
Labels
bug Used to indicate a potential bug core/identity

Comments

@dustin-decker
Contributor

dustin-decker commented Apr 17, 2020

Describe the bug
The identitystore entities only grow and never shrink, causing long failover times (more than 10 minutes) due to entity loading and a storagepacker bottleneck during authentication. It also causes high memory usage (climbing to 110 GB when we list entities).

We currently have over 5.7 million entities and growing:

vault read -format=json sys/internal/counters/entities 
{
  "request_id": "c332e801-8d1d-1d7a-3ab2-9540b9bf280a",
  "lease_id": "",
  "lease_duration": 0,
  "renewable": false,
  "data": {
    "counters": {
      "entities": {
        "total": 5708878
      }
    }
  },
  "warnings": null
}

This means each of the 256 storagepacker buckets that contain them holds around 22.3k entities, and each time a login occurs CreateOrFetchEntity() is called, which reads a storagepacker bucket, decompresses it, unmarshals the protobuf, modifies it, marshals it, compresses it, and writes the bucket back. As a result we see Vault fail at roughly 500 GCP auth requests per MINUTE.

This also caused a Consul storage migration to fail due to the large key sizes of the storagepacker buckets.
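For context on why each login is so expensive at this scale, here is a minimal sketch of how the storagepacker assigns an entity to one of its 256 fixed buckets. The md5/first-byte details are my reading of the storagepacker code in this era of Vault, so treat them as assumptions rather than a documented contract:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// bucketKey sketches the bucket assignment: hash the item ID and keep the
// first byte (two hex characters), giving 256 possible buckets ("00".."ff").
// Every login then has to read-modify-write one entire bucket.
func bucketKey(entityID string) string {
	sum := md5.Sum([]byte(entityID))
	return hex.EncodeToString(sum[:])[0:2]
}

func main() {
	// 5,708,878 entities spread over 256 buckets is ~22.3k entities per
	// bucket, all of which are rewritten on every login that touches it.
	fmt.Println(5708878 / 256) // 22300
	fmt.Println(bucketKey("entity/c332e801-8d1d-1d7a-3ab2-9540b9bf280a"))
}
```

This also shows why deleting entities helps so dramatically: smaller buckets mean less data to decompress, unmarshal, and rewrite on every authentication.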

To Reproduce
Authenticate millions of times with unique identities.

Expected behavior
I expect entities to be maintained at a reasonable size.

Environment:

  • Vault Server Version (retrieve with vault status):
$ vault status
Key                    Value
---                    -----
Version                1.4.0+prem
  • Vault CLI Version (retrieve with vault version):
$ vault version
Vault v1.1.1 (cgo)
  • Server Operating System/Architecture:
    Linux, AMD64

Vault server configuration file(s):

# Paste your Vault config here.
# Be sure to scrub any sensitive values

Additional context
We've been fighting these perf issues for over a year.

It seems the only places Vault deletes dangling items from the storagepacker are entities and group entities whose namespace is nil:
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L112
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L280

@dustin-decker
Contributor Author

Can we safely delete these entities? Preferably by deleting the bucket keys directly.

@technologik
Contributor

cc @briankassouf

@pcman312 pcman312 added bug Used to indicate a potential bug core/identity labels Apr 17, 2020
pcman312 added a commit to hashicorp/vault-plugin-auth-gcp that referenced this issue Apr 22, 2020
Resolves hashicorp/vault#8761

Allows users to specify an alias type field for IAM and GCE logins which
will then switch between the current behavior (a unique ID for IAM, and
the instance ID for GCE) and a newly created role_id field. The role_id
is a UUID generated when the role is created. All existing roles without
a role_id will have one generated and saved when the role is read or
written.
@dustin-decker
Contributor Author

Is there any broader discussion happening about managing identities more sustainably?

For our use cases it would work nicely to have the detailed identity info (GCP instance ID, etc.) stored inside a batch token, and have batch tokens be truly stateless. I'm not sure if that would be compatible with the broader identities API, though, since we don't use it. Generally we try to use batch tokens wherever possible for better performance.

@dustin-decker
Contributor Author

I think this got automatically closed because of the GCP auth plugin reference. Can it be reopened? I am unable to reopen it.

@briankassouf briankassouf reopened this Apr 27, 2020
@briankassouf briankassouf reopened this Apr 30, 2020
@dustin-decker
Contributor Author

After cleaning up the stale entities, our instance now allocates about 0.5% of the memory it did at peak, and our failovers take seconds instead of more than 10 minutes. Identity store operations are fast again, so we expect GCE auth to be performant as well.

Thank you for your work in addressing the causes and improving metrics around this @briankassouf, @pcman312, @tyrannosaurus-becks, and all others involved.

For future and ongoing remediations, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.

@ncabatoff
Collaborator

For future and ongoing remediations, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.

Yes, this is something we want to do as part of a larger effort, though I can't speak as to timeline.

I'm going to close this issue because I don't think there's anything left to be done here, and the tidy work will be tracked on our internal roadmap.

@sidewinder12s

sidewinder12s commented Sep 2, 2020

Wow, I think this might be the root cause of the performance/memory issues we've been having with Vault using GCP auth + the GCS storage backend (we have multiple batch compute workloads with spot instances hitting Vault with thousands of new nodes a day).

Is there any more guidance on how to deal with large numbers of entities and the performance implications of the alias configuration?
It's also not clear to me that upgrading to the versions of Vault with the updated entity mapping logic would fix performance issues, since it sounds like Vault will still be looking up all this identity information.

@duplo83 Do you have any tips for cleanup/how you identified stale entities to be deleted? Or did you just go ahead and delete most of them and let them get regenerated?

@dustin-decker
Contributor Author

It almost certainly is from the situation you described, @sidewinder12s. If you are on a recent enough version of Vault, you can obtain entity count metrics. You can also configure the GCP auth backend now so it doesn't generate so many unique identities.

We had a script run through and delete the old identities. There are a few considerations, though. I could share the script if it's okay with @briankassouf; we modified it from what Brian gave us.

  • You have to dump all of the entities first. You'll probably need to increase your request timeouts because it may take a few minutes. You will also need sufficient free memory (ours increased by tens of gigabytes, IIRC).
  • You can then walk through the entities and delete the ones older than a TTL (this requires an additional lookup per entity).
  • You'll need to start slow, because deleting is intensive on the identity store. It will speed up as you work through them. It'll probably take days from start to finish if it's anything like what we had to do.
  • GCS is a relatively high-latency backend, so it may slow things down further.
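The pacing described above can be sketched as a simple batching helper (hypothetical, not part of Vault or the original script): split the dumped entity IDs into batches, feed each batch to the identity/entity/batch-delete endpoint, and pause between batches, starting with a small batch size and growing it as the store drains:

```go
package main

import "fmt"

// chunk splits entity IDs into fixed-size batches so deletions can be paced.
// The caller would POST each batch to identity/entity/batch-delete and
// sleep between batches to avoid overloading the identity store.
func chunk(ids []string, size int) [][]string {
	var batches [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}

func main() {
	ids := []string{"a", "b", "c", "d", "e", "f", "g"}
	for _, b := range chunk(ids, 3) {
		fmt.Println(b) // each batch would be one batch-delete request
	}
}
```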

@sidewinder12s

I am now seeing huge latency spikes on GCS writes (apparently because object writes are limited to once per second) for a few storagepacker buckets. What's weird is that this is impacting Vault login/availability for a few users, but those users are using batch-type tokens, which I thought didn't touch the storage backend.

Might you have any idea what would be trying to write the storagepacker entries multiple times a second, days after I stopped deleting entities? I don't see much information in commits, issues, or discussion pages about how the storagepacker behaves.

@dustin-decker
Contributor Author

dustin-decker commented Sep 29, 2020

Batch token creation still creates an identity in the identity store (which uses the storagepacker buckets).
If GCS limits object writes to once per second, it probably won't be compatible with your login rate, since there are only 256 storagepacker buckets.
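That observation gives a back-of-the-envelope throughput ceiling, assuming each login rewrites exactly one bucket and GCS allows roughly one mutation per second per object (my estimate from the thread, not a documented Vault limit):

```go
package main

import "fmt"

// maxLoginsPerSec estimates the login throughput ceiling when every login
// rewrites one storagepacker bucket and the storage backend caps each
// object at a fixed write rate.
func maxLoginsPerSec(buckets int, writesPerSecPerObject float64) float64 {
	return float64(buckets) * writesPerSecPerObject
}

func main() {
	// Best case: logins spread evenly over all 256 buckets.
	fmt.Println(maxLoginsPerSec(256, 1.0)) // ~256 logins/sec ceiling
	// Worst case: one batch workload whose entities hash to a single
	// hot bucket, matching the latency spikes described above.
	fmt.Println(maxLoginsPerSec(1, 1.0)) // ~1 login/sec
}
```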

@sidewinder12s

Yup, I think I was seeing the high rate of change for metadata keys that is called out in the docs.
I was basically slamming one packer bucket during logins from one of our batch compute services. I fairly rapidly upgraded from 1.3 to 1.5, and I think it just got lost in all the cleanup I had been doing.

Specifically, GCS has an update limitation of once per second for a single object.

Thanks for your help/responses.

@abbasalizaidi

@dustin-decker @sidewinder12s Any idea what happens to the stale identities if I upgrade Vault from 1.1.2 to the latest, or to the version where this issue is fixed? My entity store is packed full of stale identities, and I can't issue a simple vault list without overwhelming the cluster.

@dustin-decker
Contributor Author

Unless something has changed, and if I recall correctly, upgrading will stop the leak but not clean up the existing stale entities. You'd have to script that operation along these lines: #8761 (comment)

@taitelman

taitelman commented Aug 2, 2023

I wish someone would add a periodic function to run the cleanup:
just add some defaultIdentityMaxAge setting in config.hcl and run something like the following. I wonder why it's not part of Vault core.

func selectIdentities2delete(identityIDs []string) (identityToDelete []string, err error) {
	identityToDelete = make([]string, 0, len(identityIDs))

	for _, identityID := range identityIDs {
		// Read each entity individually so we can inspect its last_update_time.
		r, err := http.NewRequest(http.MethodGet, getURL(vaultEndpoint, "/v1/identity/entity/id/"+identityID), nil)
		if err != nil {
			return nil, fmt.Errorf("failed creating http request to read entity ID: %s", err.Error())
		}

		r.Header.Set(vaultTokenHeader, Token)
		r.Header.Set("Content-Type", applicationJsonContentType)
		r.Header.Set("Accept", applicationJsonContentType)

		response, err := httpRequest(r, http.StatusOK)
		if err != nil {
			return nil, fmt.Errorf("failed http request to read entity ID: %s", err.Error())
		}
		res := &logical.HTTPResponse{}
		if err := json.Unmarshal(response, res); err != nil {
			return nil, fmt.Errorf("failed to unmarshal response body for entity ID: %s", err.Error())
		}

		if lastUpdateRaw, found := res.Data["last_update_time"]; found {
			if lastUpdate, casted := lastUpdateRaw.(string); casted {
				if len(lastUpdate) > 0 {
					identityLastUpdateTime, err := time.Parse(time.RFC3339, lastUpdate)
					if err != nil {
						return nil, err
					}
					currentAge := time.Since(identityLastUpdateTime)
					if currentAge > defaultIdentityMaxAge {
						logger.Debug(fmt.Sprintf("found old identity to delete: %s age: %s", identityID, currentAge))
						identityToDelete = append(identityToDelete, identityID)
					}
				}
			} else {
				logger.Error(fmt.Sprintf("failed to cast last update time to string, identityID: %s", identityID))
			}
		} else {
			logger.Error(fmt.Sprintf("failed to find last_update_time element in response, identityID: %s", identityID))
		}
	}
	return identityToDelete, nil
}

func deleteBatch(identityIDs []string) (err error) {
	if len(identityIDs) < 1 {
		return nil
	}
	bodyMap := map[string][]string{"entity_ids": identityIDs}
	bodyBytes, err := json.Marshal(bodyMap)
	if err != nil {
		return err
	}
	body := bytes.NewBuffer(bodyBytes)
	r, err := http.NewRequest(http.MethodPost, getURL(vaultEndpoint, "/v1/identity/entity/batch-delete"), body)
	if err != nil {
		return fmt.Errorf("failed to create http request to batch delete: %s", err.Error())
	}

	r.Header.Set(vaultTokenHeader, Token)
	r.Header.Set("Content-Type", applicationJsonContentType)
	r.Header.Set("Accept", applicationJsonContentType)

	_, err = httpRequest(r, http.StatusNoContent)
	if err != nil {
		return fmt.Errorf("failed http request to batch delete entities: %s", err.Error())
	}
	return nil
}

@revanthalampally

revanthalampally commented Aug 27, 2024

@dustin-decker We also see similar issues in our Vault cluster, where /v1/identity/entity/id?list=true throws 500 and is unresponsive in one of our non-prod environments. We want to fix this in prod as soon as possible.

We want to clean up the entities. I saw you mentioned a script; could you share it? We would like to know if we can reuse or build upon it.
Also, what exactly is last_update_time? If a client logs in with the role pertaining to the entity, will last_update_time become the login time?

@technologik
Contributor

@dustin-decker We also see similar issues in our Vault cluster where the /v1/identity/entity /id?list=true is throwing 500 and is unresponsive in one of our non prod env. We want to fix this, in the prod, as soon as possible.

We want to clean up the entities. I saw you mentioned about a script. Could you share the script? We would like to know if we can reuse or build upon it. Also, what exactly is last_update_time? If a client logs in with the role pertaining to the entity, will the last_update_time become the login time?

You should ask HashiCorp for a script; IIRC their engineering team created it. FWIW, neither Dustin nor I still work at the company where this was used (do note that this specific issue has been closed for 4 years).
