
Recover from EDEK decryption failures and improve KMS resilience measures #823

Merged: 11 commits merged on Dec 19, 2023

Conversation

robobario (Contributor)

Type of change

  • Enhancement / new feature

Description

Hardcoded retry when KMS operations fail. Previously only EDEK generation was retried; now decryption and alias resolution are also retried 3 times in the face of errors.

The retries around KMS operations now implement a degree of exponential backoff with jitter, to help protect the KMS from thundering-herd problems.

Recover if the EDEK decryption stage fails. Previously a failed decryption stage would have been cached, so future fetch requests for that EDEK would also have had no chance of being decrypted.

Also adds a more specific exception for the case where we couldn't obtain an Encryptor with enough capacity to encrypt the batch after retrying. This is hard to name; it's not in the same class as, say, a failure to connect to the KMS. We could be happily obtaining valid EDEKs yet for some reason never have the capacity to encrypt this batch. It will be more of a concern when/if we start sharing encryption resources across channels, since some other channel might have exhausted them.

Closes #793

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Review performance test results. Ensure that any degradations to performance numbers are understood and justified.
  • Make sure all Sonar-Lint warnings are addressed or are justifiably ignored.
  • Update documentation
  • Reference relevant issue(s) and close them after merging
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

* after some amount of retries this exception is thrown.
* </p>
*/
public class EdekReservationFailureException extends EncryptionException {
robobario (author):

Struggled to name this. I want it to be clear that this is a logical failure: the KMS is not throwing exceptions, we just haven't been able to obtain an EDEK with capacity after repeated tries.

Member:

Why does it need to be Edek specific? Seems like a specific incarnation of a generic problem. UnsatisfyableRequest?

robobario (author):

👍 Have refactored to a RequestNotSatisfiable exception. You are correct that we could dream up other cases. For example, if/when we make max-encryptions-per-DEK configurable in future, it's more likely that a user could end up with recordsInBatch > maxEncryptionsPerDek, which we could fail fast on.

robobario (author):

(Sam pointed out that we should be able to refactor to use multiple DEKs to encrypt one batch, spoilsport)

@@ -270,6 +318,47 @@ void afterWeFailToLoadADekTheNextEncryptionAttemptCanSucceed() {
assertThat(encrypted).hasSize(2);
}

@Test
void afterWeFailToDecryptAnEDekTheNextEncryptionAttemptCanSucceed() {
robobario (author):

From my commit:

Note: the testing required is ugly, Sam commented on this for the EDEK
PR too. It's tough because the InBandKeyManager knows the parcel format
and we don't have test utilities to independently generate examples of
parcels yet. If we had the ability to generate serialized record examples
we could avoid going through an encrypt cycle with a real KMS implementation
just to get properly serialized records.

But I'm keen to keep that separate.

private Duration getRandomJitter(int failures, Duration backoff) {
    Duration prior = getExponentialBackoff(failures - 1);
    long maxJitter = backoff.toMillis() - prior.toMillis();
    return Duration.ofMillis(this.random.nextLong() % maxJitter);
robobario (author), Dec 14, 2023:

So, if this backoff is 1.5 seconds and the previous backoff was 1 second, then the jitter will be plus or minus 500 ms (1.5 s - 1 s).

So the jittering range grows exponentially along with the backoff.

Another approach is to use a randomization factor, so that you say "jitter within ±10%", and a 1-second delay would be randomized to somewhere between 0.9 s and 1.1 s.
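
For illustration, here is a minimal sketch of the two jitter styles discussed above. The class and method names are invented for this example; the PR itself only implements the range-based approach.

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of the two jitter styles; not the PR's implementation.
public class JitterSketch {

    // Range-based jitter (what the snippet above does): jitter by up to the gap between
    // this backoff and the previous one, so the jitter range grows with the backoff.
    // Assumes backoff > priorBackoff, so maxJitterMs is positive.
    static Duration rangeJitter(Duration backoff, Duration priorBackoff) {
        long maxJitterMs = backoff.toMillis() - priorBackoff.toMillis();
        long jitterMs = ThreadLocalRandom.current().nextLong() % maxJitterMs; // in (-maxJitterMs, maxJitterMs)
        return backoff.plusMillis(jitterMs);
    }

    // Randomization-factor jitter: jitter within +-factor of the backoff, e.g. a factor
    // of 0.1 turns a 1-second delay into something between 0.9s and 1.1s.
    static Duration factorJitter(Duration backoff, double factor) {
        double multiplier = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * factor;
        return Duration.ofMillis(Math.round(backoff.toMillis() * multiplier));
    }
}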

robobario force-pushed the resilience-decrypt-low-bar branch from ceb0afc to 0988275 on December 15, 2023 00:54
@@ -6,6 +6,9 @@

package io.kroxylicious.filter.encryption;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
Contributor:

I didn't know about this.


import java.time.Duration;

public interface BackoffStrategy {
Contributor:

I wonder if this might end up being something with utility outside encryption, but I'm happy for us to wait and see what emerges as we get into other use-cases.

robobario (author):

Yeah, I think so, but it needs some thought, similar to the question of whether having some common ByteBuffer manipulation classes somewhere would be useful to Filter authors. I'm cautious about having a commons-lang type lib that ends up being used in Filters and the Proxy implementation, with all these things depending on its APIs.

Maybe we could have a lib named something like filter-support; it could be targeted at Filter authors, and we would specify that it's not on Kroxy's classpath by default but is user managed.

* @param failures count of failures
* @return how long to delay
*/
Duration getDelay(int failures);
Contributor:

Do you foresee a backoff strategy needing to communicate "give up"?

robobario (author):

I don't think the BackoffStrategy should be what makes that decision; I imagine something else knowing about retry limits and combining that with the backoff strategy.

I think the retry logic could be more involved with CompletionStages and trigger exceptionallyCompose-type logic, so it could look something like:

Retrier retrier = Retrier.create(maxAttempts, executor, backoffStrategy);
...
Supplier<CompletionStage<X>> heavyWork;
CompletionStage<X> result = retrier.apply(heavyWork);

I'm trying not to pull in the world of dependencies but I think the API of resilience4j is something to imitate here.
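
To make that idea concrete, here is a rough sketch of what such a Retrier could look like using exceptionallyCompose and a scheduled executor. The Retrier type, its shape, and the scheduling details are hypothetical, not an API that exists in this PR; only BackoffStrategy.getDelay comes from the code under review.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: retry an async operation up to maxAttempts times,
// delaying between attempts according to a BackoffStrategy.
record Retrier(int maxAttempts, ScheduledExecutorService executor, BackoffStrategy strategy) {

    static Retrier create(int maxAttempts, ScheduledExecutorService executor, BackoffStrategy strategy) {
        return new Retrier(maxAttempts, executor, strategy);
    }

    <X> CompletionStage<X> apply(Supplier<CompletionStage<X>> heavyWork) {
        return attempt(heavyWork, 1);
    }

    private <X> CompletionStage<X> attempt(Supplier<CompletionStage<X>> heavyWork, int attemptNumber) {
        return heavyWork.get().exceptionallyCompose(throwable -> {
            if (attemptNumber >= maxAttempts) {
                // Out of attempts: propagate the last failure.
                return CompletableFuture.failedFuture(throwable);
            }
            Duration delay = strategy.getDelay(attemptNumber);
            CompletableFuture<X> retried = new CompletableFuture<>();
            // Schedule the next attempt after the backoff delay and bridge its result.
            executor.schedule(() -> attempt(heavyWork, attemptNumber + 1).whenComplete((value, error) -> {
                if (error != null) {
                    retried.completeExceptionally(error);
                }
                else {
                    retried.complete(value);
                }
            }), delay.toMillis(), TimeUnit.MILLISECONDS);
            return retried;
        });
    }
}

Usage would then look roughly like retrier.apply(() -> kms.decryptEdek(edek)), with the KMS method name purely illustrative.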

robobario (author):

On that note, I'll come back in another PR and see how it looks with a resilience lib dropped in; then we can debate whether we want to add a dependency there.

k-wall (Contributor) left a comment:

Looks good to me. Couple of questions/comments left.

robobario force-pushed the resilience-decrypt-low-bar branch 2 times, most recently from 1e88371 to e8eda88 on December 19, 2023 19:51
Why:
By the nature of our design we want KMS systems to be remote, running within a different trust perimeter than the proxy. So we need to deal with failures to communicate with the KMS, transient failures during startup/shutdown of the KMS, and so on. One aspect of this is to retry failed operations in case we have hit a one-off networking issue or other brief problem. This change extends the retries to all the asynchronous KMS operations, moving the logic out of the already overburdened InBandKeyManager.

We also want some protection from the thundering herd. As our proxies start up, or in a disaster situation, we want limits to prevent them thrashing the KMS as fast as possible, asking for alias resolution or DEK generation in a tight loop. For this we have added per-operation backoff: attempts to perform a given operation will back off, but there is not yet any global protection for the KMS that controls the aggregate rate of operations a proxy instance is allowed to perform.

Previously, if the cached future failed, it would have remained cached forever in the failed state, meaning nothing could be decrypted for that EDEK. What we want to happen is: if an EDEK encoded into a wrapped parcel cannot be decrypted (due to a KMS outage etc.), the fetch request should fail and subsequent requests should be able to succeed, with the EDEK decryption attempted again on those future requests.

Using Caffeine aligns this with the EDEK cache and gives us the useful property that there should be only one pending future in flight per key.

Note: the testing required is ugly, Sam commented on this for the EDEK
PR too. It's tough because the InBandKeyManager knows the parcel format
and we don't have test utilities to independently generate examples of
parcels yet. If we had the ability to generate serialized record examples
we could avoid going through an encrypt cycle with a real KMS implementation
just to get properly serialized records.
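
As a rough sketch of the Caffeine-based caching described above (the class, method, and KMS client names here are invented for illustration, not the actual InBandKeyManager code):

import java.util.concurrent.CompletableFuture;
import javax.crypto.SecretKey;

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Illustrative sketch: cache decrypted DEKs per EDEK so that only one decryption
// is in flight per key, and drop failed futures so a later fetch request can
// retry the KMS decryption instead of being stuck with a cached failure.
class DecryptedDekCacheSketch {

    // Hypothetical async KMS client for the sketch.
    interface Kms {
        CompletableFuture<SecretKey> decryptEdek(String edekId);
    }

    private final AsyncCache<String, SecretKey> decryptedDeks = Caffeine.newBuilder()
            .maximumSize(10_000)
            .buildAsync();

    CompletableFuture<SecretKey> decryptedDek(String edekId, Kms kms) {
        CompletableFuture<SecretKey> future = decryptedDeks.get(edekId, (key, executor) -> kms.decryptEdek(key));
        // If decryption fails (KMS outage etc.) remove the entry, so the next request
        // triggers a fresh decryption attempt rather than hitting a cached failure.
        // Explicit invalidation is shown here just to make the intent clear.
        future.whenComplete((dek, error) -> {
            if (error != null) {
                decryptedDeks.synchronous().invalidate(edekId);
            }
        });
        return future;
    }
}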
Prevents having to consider negative/zero durations and a multiplier of 1, which would lead to a fixed delay.

This case shouldn't be hit given the constraints on the delays and multiplier, but we're scared of 0.

We foresee a class of exceptions where encryption/decryption operations fail for some logical reason. It is not a failure as such, like a connection error to the KMS or an everyday NullPointerException that has crept into the encryption code, but it should still be surfaced as an exceptional case.

One example would be a request to encrypt more records than the maximum allowed per Data Encryption Key. That request could fail fast because we cannot satisfy it.
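
A minimal sketch of that fail-fast check; the method is illustrative and the exact exception name is assumed from the review thread (the precise class name may differ):

// Illustrative fail-fast guard: if the batch can never fit within one DEK's
// encryption budget, no amount of retrying will help, so fail immediately.
void checkBatchIsSatisfiable(int recordsInBatch, int maxEncryptionsPerDek) {
    if (recordsInBatch > maxEncryptionsPerDek) {
        throw new RequestNotSatisfiableException("batch of " + recordsInBatch
                + " records exceeds the per-DEK limit of " + maxEncryptionsPerDek + " encryptions");
    }
}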
Why:
There was a race condition where the test code fired a second request while a mapping was being asynchronously removed from the cache due to a failed completion stage. For now, let's just test that the failure behaves as we expect and revisit the abstractions later to see if there's some way to separate these behaviours and make them easier to test.
Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
98.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

robobario merged commit 9e41235 into kroxylicious:main on Dec 19, 2023
2 checks passed
Successfully merging this pull request may close these issues.

Topic-encryption: Handle EDEK decryption failures