Recover from EDEK decryption failures and improve KMS resilience measures #823
Conversation
 * after some amount of retries this exception is thrown.
 * </p>
 */
public class EdekReservationFailureException extends EncryptionException {
Struggled to name this. I want it to be clear that this is a logical failure: the KMS is not throwing exceptions; we just haven't been able to obtain an EDEK with capacity after repeated tries.
Why does it need to be EDEK-specific? It seems like a specific instance of a generic problem. UnsatisfyableRequest?
👍 Have refactored to a RequestNotSatisfiable exception. You are correct that we could dream up other cases. For example, in the future if/when we make the max-encryptions-per-DEK configurable, it becomes more likely that a user could end up with recordsInBatch > maxEncryptionsPerDek, which we could fail fast on.
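For illustration, a minimal sketch of that kind of fail-fast check; the exception and all names below are hypothetical stand-ins, not this PR's actual code:

// Hypothetical sketch only: a logical "cannot satisfy" failure, distinct from a KMS error.
class RequestNotSatisfiableException extends RuntimeException {
    RequestNotSatisfiableException(String message) {
        super(message);
    }
}

class BatchPreconditions {
    // Fail fast: a batch larger than the per-DEK encryption limit can never succeed,
    // no matter how many fresh DEKs we obtain from the KMS.
    static void checkBatchIsSatisfiable(int recordsInBatch, int maxEncryptionsPerDek) {
        if (recordsInBatch > maxEncryptionsPerDek) {
            throw new RequestNotSatisfiableException(
                    recordsInBatch + " records in the batch exceeds the limit of "
                            + maxEncryptionsPerDek + " encryptions per DEK");
        }
    }
}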
(Sam pointed out that we should be able to refactor to use multiple DEKs to encrypt one batch, spoilsport)
@@ -270,6 +318,47 @@ void afterWeFailToLoadADekTheNextEncryptionAttemptCanSucceed() {
        assertThat(encrypted).hasSize(2);
    }

    @Test
    void afterWeFailToDecryptAnEDekTheNextEncryptionAttemptCanSucceed() {
From my commit:

Note: the testing required is ugly; Sam commented on this for the EDEK PR too. It's tough because the InBandKeyManager knows the parcel format and we don't have test utilities to independently generate examples of parcels yet. If we had the ability to generate serialized record examples we could avoid going through an encrypt cycle with a real KMS implementation just to get properly serialized records.

But keen to keep that separate.
private Duration getRandomJitter(int failures, Duration backoff) {
    Duration prior = getExponentialBackoff(failures - 1);
    long maxJitter = backoff.toMillis() - prior.toMillis();
    return Duration.ofMillis(this.random.nextLong() % maxJitter);
So, if this backoff is 1.5 seconds and the previous backoff was 1 second, then the jitter will be plus or minus 500ms (1.5 - 1). The jittering range therefore grows exponentially along with the backoff. Another approach is to have a randomization factor, so that you say "jitter within +-10%"; a 1 second delay would then be randomized to fall between 0.9s and 1.1s.
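For illustration, a minimal sketch of that randomization-factor alternative; the class and method names are hypothetical and this is not what the PR implements:

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative alternative: jitter the computed backoff within +- (randomizationFactor * backoff).
// With factor 0.1, a 1000ms delay becomes a uniform pick from [900ms, 1100ms].
final class RandomizationFactorJitter {
    static Duration jitter(Duration backoff, double randomizationFactor) {
        long backoffMillis = backoff.toMillis();
        long spread = (long) (backoffMillis * randomizationFactor);
        if (spread == 0) {
            return backoff; // nothing to jitter for very small delays or a zero factor
        }
        long offset = ThreadLocalRandom.current().nextLong(-spread, spread + 1);
        return Duration.ofMillis(backoffMillis + offset);
    }
}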
@@ -6,6 +6,9 @@

package io.kroxylicious.filter.encryption;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
I didn't know about this.
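For context, a minimal illustrative sketch of ThreadLocalRandom usage; it keeps one generator per thread so concurrent callers don't contend on a shared java.util.Random:

import java.util.concurrent.ThreadLocalRandom;

// ThreadLocalRandom's bounded methods also avoid the nextLong() % bound pattern,
// which can return negative values.
class ThreadLocalRandomExample {
    public static void main(String[] args) {
        long jitterMillis = ThreadLocalRandom.current().nextLong(0, 500); // uniform in [0, 500)
        System.out.println("jitter: " + jitterMillis + "ms");
    }
}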
import java.time.Duration;

public interface BackoffStrategy {
I wonder if this might end up being something with utility outside encryption, but I'm happy to wait and see what emerges as we get into other use-cases.
Yeah I think so, but it needs some thought, similar to the question of whether having some ByteBuffer manipulation classes somewhere common would be useful to Filter authors. I'm cautious about having some commons-lang type lib that could end up being used in Filters and the Proxy implementation, with all these things depending on its APIs. Maybe we could have a lib named something like filter-support, targeted at Filter Authors, and we would specify that it's not on the classpath of Kroxy by default but is to be user managed.
 * @param failures count of failures
 * @return how long to delay
 */
Duration getDelay(int failures);
Do you foresee a backoff strategy needing to communicate "give up"?
I don't think the BackoffStrategy should be what makes that decision; I imagine something else knowing about retry limits and combining that with the backoff strategy. I think the retry logic could be more involved with CompletionStages and trigger the exceptionallyCompose type logic. So it could look something more like:

Retrier retrier = Retrier.create(maxAttempts, executor, backoffStrategy);
..
Supplier<CompletionStage<X>> heavyWork;
CompletionStage<X> result = retrier.apply(heavyWork);

I'm trying not to pull in the world of dependencies, but I think the API of resilience4j is something to imitate here.
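A minimal sketch of what such a Retrier could look like (hypothetical names, not this PR's implementation), assuming Java 12+ for exceptionallyCompose and the BackoffStrategy interface shown above:

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical Retrier combining a retry limit with a BackoffStrategy, in the
// exceptionallyCompose shape described in the comment above.
final class Retrier {
    private final int maxAttempts;
    private final Executor executor;
    private final BackoffStrategy backoffStrategy;

    private Retrier(int maxAttempts, Executor executor, BackoffStrategy backoffStrategy) {
        this.maxAttempts = maxAttempts;
        this.executor = executor;
        this.backoffStrategy = backoffStrategy;
    }

    static Retrier create(int maxAttempts, Executor executor, BackoffStrategy backoffStrategy) {
        return new Retrier(maxAttempts, executor, backoffStrategy);
    }

    <X> CompletionStage<X> apply(Supplier<CompletionStage<X>> heavyWork) {
        return attempt(heavyWork, 0);
    }

    private <X> CompletionStage<X> attempt(Supplier<CompletionStage<X>> heavyWork, int failures) {
        return heavyWork.get().exceptionallyCompose(throwable -> {
            int failureCount = failures + 1;
            if (failureCount >= maxAttempts) {
                // retries exhausted: "give up" is decided here, not in the BackoffStrategy
                return CompletableFuture.<X>failedStage(throwable);
            }
            Duration delay = backoffStrategy.getDelay(failureCount);
            Executor delayed = CompletableFuture.delayedExecutor(delay.toMillis(), TimeUnit.MILLISECONDS, executor);
            // re-invoke the work after the backoff delay, without blocking a thread
            return CompletableFuture.supplyAsync(() -> attempt(heavyWork, failureCount), delayed)
                    .thenCompose(stage -> stage);
        });
    }
}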
On that, I'll come back in another PR and see how it looks with a resilience lib dropped in, can debate if we want to add a dependency there.
Looks good to me. Couple of questions/comments left.
Why: By the nature of our design we want KMS systems to be remote, running within a different trust perimeter than the proxy. So we need to deal with failures to communicate with the KMS, transient failures during startup/shutdown of the KMS, and so on. One aspect of this is to retry failed operations in case we have hit a one-off networking problem or other brief issue. This change extends the retries to all the asynchronous KMS operations, moving the logic out of the already overburdened InBandKeyManager.

We also want some protection from the thundering herd. As our proxies start up, or in a disaster situation, we want some limits to prevent them thrashing the KMS as fast as possible, asking for alias resolution or DEK generation in a tight loop. For this we have added per-operation backoff: attempts to perform a given operation will back off, but there is not yet any global protection for the KMS that would control the aggregate rate of operations a proxy instance is allowed to do.
Previously, if the cached future failed it would have remained cached forever in the failed state, meaning nothing could be decrypted for that EDEK. What we want to happen is: if an EDEK encoded into a wrapped parcel cannot be decrypted (due to a KMS outage, etc.) then the fetch request should fail and subsequent requests should be able to succeed; the EDEK decryption should be attempted again on those future requests. Using Caffeine aligns it with the EDEK cache and gives us the useful property that there should be only one pending future in flight per key.

Note: the testing required is ugly; Sam commented on this for the EDEK PR too. It's tough because the InBandKeyManager knows the parcel format and we don't have test utilities to independently generate examples of parcels yet. If we had the ability to generate serialized record examples we could avoid going through an encrypt cycle with a real KMS implementation just to get properly serialized records.
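For illustration, a minimal sketch of the Caffeine property relied on here: an AsyncCache evicts an entry whose future completes exceptionally, so a later lookup for the same EDEK retries the decryption, while concurrent callers share one in-flight future. The KMS and EDEK types below are hypothetical stand-ins, not this codebase's API:

import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch only: Caffeine removes an entry whose future fails, so a failed decryption
// is not cached forever and the next fetch request gets a fresh attempt.
final class DecryptedDekCacheSketch {
    interface Kms {
        CompletableFuture<ByteBuffer> decryptEdek(ByteBuffer edek);
    }

    private final AsyncCache<ByteBuffer, ByteBuffer> decryptedDeks = Caffeine.newBuilder().buildAsync();
    private final Kms kms;

    DecryptedDekCacheSketch(Kms kms) {
        this.kms = kms;
    }

    CompletableFuture<ByteBuffer> decrypt(ByteBuffer edek) {
        // one pending future per EDEK; a failed future is evicted rather than cached forever
        return decryptedDeks.get(edek, (key, executor) -> kms.decryptEdek(key));
    }
}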
Prevents having to consider negative/zero durations and a multiplier of 1, which would lead to a fixed delay.
This case shouldn't be hit due to the constraints on the delays and multiplier, but ... scared of 0.
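A minimal sketch of the kind of constructor validation meant here, with hypothetical parameter names:

import java.time.Duration;

// Illustrative validation only: reject non-positive delays and a multiplier <= 1,
// which would otherwise degenerate into a fixed (non-exponential) delay.
final class BackoffConfigValidation {
    static void validate(Duration initialDelay, Duration maxDelay, double multiplier) {
        if (initialDelay.isZero() || initialDelay.isNegative()) {
            throw new IllegalArgumentException("initialDelay must be positive");
        }
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= initialDelay");
        }
        if (multiplier <= 1.0d) {
            throw new IllegalArgumentException("multiplier must be greater than 1");
        }
    }
}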
We foresee a class of exceptions where encryption/decryption operations fail for some logical reason. It is not a failure as such, like a connection error to the KMS or an everyday NullPointerException that has crept into the encryption code, but it should still surface as an exceptional case. One example would be a request to encrypt more records than the maximum allowed per Data Encryption Key: that request can fail fast because we cannot satisfy it.
Why: There was a race condition where the test code fired a second request while a mapping was being asynchronously removed from the cache due to a failed completion stage. For now, let's just test that the failure behaves as we expect and revisit the abstractions later to see if there's some way to separate these behaviours and make them easier to test.
Quality Gate passed: The SonarCloud Quality Gate passed, but some issues were introduced. 1 new issue.
Type of change
Description
Hardcoded retry when KMS operations fail. Previously a retry was only done for EDEK generation; now decryption and alias resolution are both retried 3 times in the face of errors.
The retries around KMS operations now implement some degree of exponential backoff with jitter to try to protect the KMS a little from thundering-herd problems.
Recover if the EDEK decryption stage fails. Previously a failed decryption stage would have been cached, and future fetch requests for that EDEK would also have had no chance of being decrypted.
Also adds a more specific exception for the case where we couldn't obtain an Encryptor with enough capacity to encrypt the batch after retrying. This is hard to name; it's not in the same class as, say, a failure to connect to the KMS. We could be happily obtaining valid EDEKs but for some reason never have the capacity to encrypt this batch. It will be more of a concern when/if we start sharing encryption resources across channels, as some other channel might have exhausted them.
Closes #793
Checklist
Please go through this checklist and make sure all applicable tasks have been done