Upserting Keys via Transit seems to leak Memory #5746

Closed · craftey opened this issue Nov 9, 2018 · 5 comments · Fixed by #6225
@craftey commented Nov 9, 2018

Describe the bug
We are calling the transit backend's encrypt method (i.e. encrypt/<key>) in batch mode with a large number of keys (millions). Vault's memory consumption just increases monotonically and never decreases again. (See the attached picture from our monitoring.)

[Image: CPU and memory usage graph from our monitoring]

To Reproduce
Steps to reproduce the behavior:

  1. Call the encrypt endpoint of the transit backend in a tight loop (ideally in parallel) with key upsert, i.e. a new, previously unknown key per call (see the sketch below).
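
For reference, a minimal sketch of such a loop using the official Go API client (github.com/hashicorp/vault/api); the `transit` mount path, the key naming scheme, and the single-item (non-batch) request are illustrative assumptions, not our exact client code:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	plaintext := base64.StdEncoding.EncodeToString([]byte("some data"))

	// Every iteration targets a key name that does not exist yet,
	// so the transit backend upserts (creates) a new key per call.
	for i := 0; i < 1000000; i++ {
		path := fmt.Sprintf("transit/encrypt/key-%d", i)
		if _, err := client.Logical().Write(path, map[string]interface{}{
			"plaintext": plaintext,
		}); err != nil {
			log.Fatalf("encrypt %s: %v", path, err)
		}
	}
}
```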

Expected behavior
Memory goes down again or plateaus.

Environment:

  • Vault Server Version 0.10.1, but we also tested with 0.11.4

Vault server configuration file(s):

Default config with Google Cloud Storage (GCS) as the storage backend. No other settings (such as cache settings) have been changed from the defaults.

@jefferai (Member) commented

We cache keys. We don't expire the cache, except to update entries. You could either use mount tuning to turn off caching (at a serious performance penalty), or you may want to look into derived keys. Curious what your use case is here, as the policy management for that must be pretty tricky.
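
To make the derived-keys suggestion concrete, here is a rough sketch with the Go API client; the key name `users`, the helper functions, and the default `transit` mount path are assumptions for illustration only:

```go
package transitexample

import (
	"encoding/base64"

	vault "github.com/hashicorp/vault/api"
)

// setupDerivedKey creates a single derived transit key; per-user
// separation then comes from the encryption context rather than from
// one key per user. Key name and mount path are illustrative.
func setupDerivedKey(client *vault.Client) error {
	_, err := client.Logical().Write("transit/keys/users", map[string]interface{}{
		"derived": true,
	})
	return err
}

// encryptForUser encrypts data under the shared derived key, bound to
// the given user via the derivation context.
func encryptForUser(client *vault.Client, user string, data []byte) (string, error) {
	secret, err := client.Logical().Write("transit/encrypt/users", map[string]interface{}{
		"context":   base64.StdEncoding.EncodeToString([]byte(user)),
		"plaintext": base64.StdEncoding.EncodeToString(data),
	})
	if err != nil {
		return "", err
	}
	return secret.Data["ciphertext"].(string), nil
}
```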

@mjungsbluth commented

By mount tuning, are you referring to setting something like the force-no-cache param for the mount? I don't see that you can change that setting after the mount is enabled (vault secrets tune has no -force-no-cache parameter). As far as I can see, the only possible solution is to recreate the mount?

What is the expected memory footprint per key? The graph above shows 10GB of RAM used for 1 million keys, which means roughly 10KB per 256-bit key.

The use case is GDPR-related and creates one key per user so we can permanently delete it once the user requests it. We are already using key derivation to cover different risk classes of PII, so we have already cut the number of keys by a factor of 20. The policies are actually pretty simple, as they act on the context parameter only and not on specific keys (for this specific transit mount).

The alternative is to use a data key and handle the encryption with specific keys outside of Vault, but we would really rather not go that route, as it would mean reimplementing convergent encryption, safely generating keys, and essentially solving again the problems that Vault already solves.

From a functional point of view, as well as performance-wise, Vault is fully sufficient, and we are well aware that the transit backend was not specifically designed for this number of keys. Yet, as far as we can see, nothing in Vault's architecture prohibits supporting many keys in the transit backend.
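
For completeness, the deletion side of this use case (crypto-shredding a per-user key on an erasure request) would look roughly like this with the Go API client; the key naming and helper are illustrative, not our production code:

```go
package transitexample

import vault "github.com/hashicorp/vault/api"

// shredUserKey permanently deletes a per-user transit key so that all
// ciphertext produced under it becomes unrecoverable. Key name is
// illustrative; deletion is disabled by default and must be allowed
// via the key's config endpoint first.
func shredUserKey(client *vault.Client, user string) error {
	keyName := "user-" + user

	if _, err := client.Logical().Write("transit/keys/"+keyName+"/config", map[string]interface{}{
		"deletion_allowed": true,
	}); err != nil {
		return err
	}

	_, err := client.Logical().Delete("transit/keys/" + keyName)
	return err
}
```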

@jefferai (Member) commented

10KB per key is a lot. I wonder if you're hitting an open bug that affects various things in Go, but which people have specifically seen with GCS, where connections can leak; that could mean allocated memory isn't being cleared. See #5419 (comment)

One thing to try from that angle is setting GODEBUG=http2client=0.

I also noticed that there are some plateaus and spikes in your graph, especially as time goes on. This would be consistent with the map we're using for the cache having to grow. (We are using a sync.Map, if you're Go-inclined, which isn't great for this use case, but we didn't expect it.)

The use case makes sense. I see two options basically:

  1. I can tell you how to flip the cache bit on the mount via sys/raw.

  2. We can figure out a better caching strategy, such as a time-based expiring cache or an LRU. The main nervousness here is that it would be good not to end up causing problems for other users in terms of performance degradation due to cache keys expiring. That is probably why a time-based cache makes more sense than a strict LRU: with an LRU you have to pick a size, and we don't know what that size should be; with a time-based cache we can store as many keys as you're actively using without having to decide on a hardcoded size.

Unfortunately, the cache we usually use has no way to do a Get + Extend operation. I'm going to have some internal discussions about whether we want to maintain a fork or not.
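
Purely to illustrate the strategy being discussed (this is not Vault's actual cache, and all names here are made up), a time-based cache in which every Get also extends the entry's lifetime could look like this in Go:

```go
package ttlcache

import (
	"sync"
	"time"
)

// entry pairs a cached value with its expiry time.
type entry struct {
	value     interface{}
	expiresAt time.Time
}

// Cache is a time-based expiring cache in which every Get extends the
// entry's lifetime, so actively used keys stay resident without having
// to pick a fixed LRU size up front.
type Cache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, m: make(map[string]entry)}
}

// Put stores a value with a fresh expiry.
func (c *Cache) Put(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

// Get returns the value and extends its expiry (the "Get + Extend"
// behavior mentioned above); expired entries are treated as misses.
func (c *Cache) Get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok || time.Now().After(e.expiresAt) {
		delete(c.m, key)
		return nil, false
	}
	e.expiresAt = time.Now().Add(c.ttl)
	c.m[key] = e
	return e.value, true
}

// Sweep drops expired entries; a caller would run it periodically so
// memory is actually released for keys that are no longer in use.
func (c *Cache) Sweep() {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for k, e := range c.m {
		if now.After(e.expiresAt) {
			delete(c.m, k)
		}
	}
}
```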

@craftey (Author) commented Jan 10, 2019

We tested creating a new mount with the force-no-cache option:

vault secrets enable -path=<mount> -force-no-cache=true transit

The result was that memory usage now plateaus below 1GB, tested by upserting 40 million keys.

We also found that version 0.10.1 did not plateau with force-no-cache, possibly because of the mentioned GCS leak; memory usage increased much more slowly, but eventually hit any limit we set.
The flag GODEBUG=http2client=0 did not mitigate the GCS memory-leak behavior in 0.10.1.
It also made Vault quite unstable with our workload; we saw connection problems to GCS on a regular basis.

However, with the force-no-cache option on a new mount and Vault versions 0.11.5 or 1.0.0, memory plateaus and the duration per 10,000 keys roughly doubles compared to having the key cache enabled.

I plan to provide more detailed performance test data in a follow-up post, if I find some time for the tests.

Thanks for the help, @jefferai. And of course we would like to see some options for caching strategies; it would be fantastic if you considered a use case similar to ours in your testing and development.

@jefferai (Member) commented

@craftey I think we could make the cache an interface, such that it can either be unlimited or an LRU, based on options configured on the mount.
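
To sketch that idea (this is not the eventual implementation; the constructor, the size option, and its plumbing into the mount config are assumptions), a cache interface with an unlimited sync.Map-backed variant and an LRU-backed variant chosen per mount might look like:

```go
package keycache

import (
	"sync"

	lru "github.com/hashicorp/golang-lru"
)

// Cache abstracts the transit key cache so a mount could choose between
// an unbounded cache and a size-limited LRU.
type Cache interface {
	Get(key string) (interface{}, bool)
	Put(key string, value interface{})
}

// syncMapCache is the unlimited variant, backed by sync.Map.
type syncMapCache struct{ m sync.Map }

func (c *syncMapCache) Get(key string) (interface{}, bool) { return c.m.Load(key) }
func (c *syncMapCache) Put(key string, value interface{})  { c.m.Store(key, value) }

// lruCache is the bounded variant, backed by hashicorp/golang-lru.
type lruCache struct{ l *lru.Cache }

func (c *lruCache) Get(key string) (interface{}, bool) { return c.l.Get(key) }
func (c *lruCache) Put(key string, value interface{})  { c.l.Add(key, value) }

// NewCache picks the implementation from a hypothetical per-mount
// option: size <= 0 means unlimited, anything else means an LRU of
// that size.
func NewCache(size int) (Cache, error) {
	if size <= 0 {
		return &syncMapCache{}, nil
	}
	l, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &lruCache{l: l}, nil
}
```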
