
Recover from EDEK decryption failures and improve KMS resilience measures #823

Merged: 11 commits merged on Dec 19, 2023

Conversation

robobario (Contributor)

Type of change

  • Enhancement / new feature

Description

Hardcoded retry when KMS operations fail. Previously only EDEK generation was retried; now decryption and alias resolution are also retried 3 times in the face of errors.

The retries around KMS operations now implement a degree of exponential backoff with jitter, to help protect the KMS from thundering-herd problems.

Recover if the EDEK decryption stage fails. Previously a failed decryption stage would have been cached, so future fetch requests for that EDEK would also have had no chance of being decrypted.

Also adds a more specific exception for the case where we couldn't obtain an Encryptor with enough capacity to encrypt the batch after retrying. This is hard to name; it's not in the same class as, say, a failure to connect to the KMS. We could be happily obtaining valid EDEKs yet for some reason never have the capacity to encrypt this batch. It will be more of a concern when/if we start sharing encryption resources across channels, since some other channel might have exhausted them.

Closes #793

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Review performance test results. Ensure that any degradations to performance numbers are understood and justified.
  • Make sure all Sonar-Lint warnings are addressed or are justifiably ignored.
  • Update documentation
  • Reference relevant issue(s) and close them after merging
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

* after some amount of retries this exception is thrown.
* </p>
*/
public class EdekReservationFailureException extends EncryptionException {
robobario (author):

Struggled to name this. I want it to be clear that this is a logical failure: the KMS is not throwing exceptions, we just haven't been able to obtain an EDEK with capacity after repeated tries.

Member:

Why does it need to be Edek specific? Seems like a specific incarnation of a generic problem. UnsatisfyableRequest?

robobario (author):

👍 Have refactored to a RequestNotSatisfiable exception. You are correct that we could dream up other cases. For example, if/when we make max-encryptions-per-DEK configurable in future, it's more likely that a user could end up with recordsInBatch > maxEncryptionsPerDek, which we could fail fast on.

robobario (author):

(Sam pointed out that we should be able to refactor to use multiple DEKs to encrypt one batch, spoilsport)

@@ -270,6 +318,47 @@ void afterWeFailToLoadADekTheNextEncryptionAttemptCanSucceed() {
assertThat(encrypted).hasSize(2);
}

@Test
void afterWeFailToDecryptAnEDekTheNextEncryptionAttemptCanSucceed() {
robobario (author):

From my commit:

Note: the testing required is ugly, Sam commented on this for the EDEK
PR too. It's tough because the InBandKeyManager knows the parcel format
and we don't have test utilities to independently generate examples of
parcels yet. If we had the ability to generate serialized record examples
we could avoid going through an encrypt cycle with a real KMS implementation
just to get properly serialized records.

But I'm keen to keep that separate.

private Duration getRandomJitter(int failures, Duration backoff) {
    Duration prior = getExponentialBackoff(failures - 1);
    long maxJitter = backoff.toMillis() - prior.toMillis();
    return Duration.ofMillis(this.random.nextLong() % maxJitter);
robobario (author), Dec 14, 2023:

So, if this backoff is 1.5 seconds and the previous backoff was 1 second, then the jitter will be plus or minus 500 ms (1.5 s - 1 s).

So the jittering range grows exponentially along with the backoff.

Another approach is to use a randomization factor, so that you say "jitter within ±10%", and a 1-second delay would be randomized to somewhere between 0.9 s and 1.1 s.
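
For illustration, here is a minimal sketch of the two jitter styles discussed above. The class and method names are invented for this example; the PR itself only implements the range-based approach.

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of the two jitter styles; not the PR's implementation.
public class JitterSketch {

    // Range-based jitter (what the snippet above does): jitter by up to the gap between
    // this backoff and the previous one, so the jitter range grows with the backoff.
    // Assumes backoff > priorBackoff, so maxJitterMs is positive.
    static Duration rangeJitter(Duration backoff, Duration priorBackoff) {
        long maxJitterMs = backoff.toMillis() - priorBackoff.toMillis();
        long jitterMs = ThreadLocalRandom.current().nextLong() % maxJitterMs; // in (-maxJitterMs, maxJitterMs)
        return backoff.plusMillis(jitterMs);
    }

    // Randomization-factor jitter: jitter within +-factor of the backoff, e.g. a factor
    // of 0.1 turns a 1-second delay into something between 0.9s and 1.1s.
    static Duration factorJitter(Duration backoff, double factor) {
        double multiplier = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * factor;
        return Duration.ofMillis(Math.round(backoff.toMillis() * multiplier));
    }
}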

robobario force-pushed the resilience-decrypt-low-bar branch from ceb0afc to 0988275 on December 15, 2023 00:54
@@ -6,6 +6,9 @@

package io.kroxylicious.filter.encryption;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
Contributor:

I didn't know about this.


import java.time.Duration;

public interface BackoffStrategy {
Contributor:

I wonder if this might end up being something with utility outside encryption, but I'm happy for us to wait and see what emerges as we get into other use-cases.

robobario (author):

Yeah, I think so, but it needs some thought, similar to the question of whether having some common ByteBuffer manipulation classes somewhere would be useful to Filter authors. I'm cautious about having a commons-lang type lib that ends up being used in Filters and the Proxy implementation, with all these things depending on its APIs.

Maybe we could have a lib named something like filter-support; it could be targeted at Filter authors, and we would specify that it's not on Kroxy's classpath by default but is user managed.

* @param failures count of failures
* @return how long to delay
*/
Duration getDelay(int failures);
Contributor:

Do you foresee a backoff strategy needing to communicate "give up"?

robobario (author):

I don't think the BackoffStrategy should be what makes that decision; I imagine something else knowing about retry limits and combining that with the backoff strategy.

I think the retry logic could be more involved with CompletionStages and trigger exceptionallyCompose-type logic, so it could look something like:

Retrier retrier = Retrier.create(maxAttempts, executor, backoffStrategy);
...
Supplier<CompletionStage<X>> heavyWork;
CompletionStage<X> result = retrier.apply(heavyWork);

I'm trying not to pull in the world of dependencies but I think the API of resilience4j is something to imitate here.
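
To make that idea concrete, here is a rough sketch of what such a Retrier could look like using exceptionallyCompose and a scheduled executor. The Retrier type, its shape, and the scheduling details are hypothetical, not an API that exists in this PR; only BackoffStrategy.getDelay comes from the code under review.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: retry an async operation up to maxAttempts times,
// delaying between attempts according to a BackoffStrategy.
record Retrier(int maxAttempts, ScheduledExecutorService executor, BackoffStrategy strategy) {

    static Retrier create(int maxAttempts, ScheduledExecutorService executor, BackoffStrategy strategy) {
        return new Retrier(maxAttempts, executor, strategy);
    }

    <X> CompletionStage<X> apply(Supplier<CompletionStage<X>> heavyWork) {
        return attempt(heavyWork, 1);
    }

    private <X> CompletionStage<X> attempt(Supplier<CompletionStage<X>> heavyWork, int attemptNumber) {
        return heavyWork.get().exceptionallyCompose(throwable -> {
            if (attemptNumber >= maxAttempts) {
                // Out of attempts: propagate the last failure.
                return CompletableFuture.failedFuture(throwable);
            }
            Duration delay = strategy.getDelay(attemptNumber);
            CompletableFuture<X> retried = new CompletableFuture<>();
            // Schedule the next attempt after the backoff delay and bridge its result.
            executor.schedule(() -> attempt(heavyWork, attemptNumber + 1).whenComplete((value, error) -> {
                if (error != null) {
                    retried.completeExceptionally(error);
                }
                else {
                    retried.complete(value);
                }
            }), delay.toMillis(), TimeUnit.MILLISECONDS);
            return retried;
        });
    }
}

Usage would then look roughly like retrier.apply(() -> kms.decryptEdek(edek)), with the KMS method name purely illustrative.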

robobario (author):

On that note, I'll come back in another PR and see how it looks with a resilience lib dropped in; then we can debate whether we want to add a dependency there.

k-wall (Contributor) left a comment:

Looks good to me. Couple of questions/comments left.

robobario force-pushed the resilience-decrypt-low-bar branch 2 times, most recently from 1e88371 to e8eda88 on December 19, 2023 19:51
Why:
By the nature of our design we want KMS systems to be remote, running within a different trust perimeter than the proxy. So we need to deal with failures to communicate with the KMS, transient failures during startup/shutdown of the KMS, and so on. One aspect of this is to retry failed operations in case we have hit a one-off networking issue or other brief problem. This change extends the retries to all the asynchronous KMS operations, moving the logic out of the already overburdened InBandKeyManager.

We also want some protection from the thundering herd. As our proxies start up, or in a disaster situation, we want limits to prevent them thrashing the KMS as fast as possible, asking for alias resolution or DEK generation in a tight loop. For this we have added per-operation backoff: attempts to perform a given operation will back off, but there is not yet any global protection for the KMS that controls the aggregate rate of operations a proxy instance is allowed to perform.

Previously, if the cached future failed, it would have remained cached forever in the failed state, meaning nothing could be decrypted for that EDEK. What we want to happen is: if an EDEK encoded into a wrapped parcel cannot be decrypted (due to a KMS outage etc.), the fetch request should fail and subsequent requests should be able to succeed, with the EDEK decryption attempted again on those future requests.

Using Caffeine aligns this with the EDEK cache and gives us the useful property that there should be only one pending future in flight per key.

Note: the testing required is ugly, Sam commented on this for the EDEK
PR too. It's tough because the InBandKeyManager knows the parcel format
and we don't have test utilities to independently generate examples of
parcels yet. If we had the ability to generate serialized record examples
we could avoid going through an encrypt cycle with a real KMS implementation
just to get properly serialized records.
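
As a rough sketch of the Caffeine-based caching described above (the class, method, and KMS client names here are invented for illustration, not the actual InBandKeyManager code):

import java.util.concurrent.CompletableFuture;
import javax.crypto.SecretKey;

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Illustrative sketch: cache decrypted DEKs per EDEK so that only one decryption
// is in flight per key, and drop failed futures so a later fetch request can
// retry the KMS decryption instead of being stuck with a cached failure.
class DecryptedDekCacheSketch {

    // Hypothetical async KMS client for the sketch.
    interface Kms {
        CompletableFuture<SecretKey> decryptEdek(String edekId);
    }

    private final AsyncCache<String, SecretKey> decryptedDeks = Caffeine.newBuilder()
            .maximumSize(10_000)
            .buildAsync();

    CompletableFuture<SecretKey> decryptedDek(String edekId, Kms kms) {
        CompletableFuture<SecretKey> future = decryptedDeks.get(edekId, (key, executor) -> kms.decryptEdek(key));
        // If decryption fails (KMS outage etc.) remove the entry, so the next request
        // triggers a fresh decryption attempt rather than hitting a cached failure.
        // Explicit invalidation is shown here just to make the intent clear.
        future.whenComplete((dek, error) -> {
            if (error != null) {
                decryptedDeks.synchronous().invalidate(edekId);
            }
        });
        return future;
    }
}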
Prevents having to consider negative/zero durations and a multiplier of 1, which would lead to a fixed delay.

This case shouldn't be hit given the constraints on the delays and multiplier, but we're scared of 0.

We foresee a class of exceptions where encryption/decryption operations fail for some logical reason. It is not a failure as such, like a connection error to the KMS or an everyday NullPointerException that has crept into the encryption code, but it should still be surfaced as an exceptional case.

One example would be a request to encrypt more records than the maximum allowed per Data Encryption Key. That request could fail fast because we cannot satisfy it.
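
A minimal sketch of that fail-fast check; the method is illustrative and the exact exception name is assumed from the review thread (the precise class name may differ):

// Illustrative fail-fast guard: if the batch can never fit within one DEK's
// encryption budget, no amount of retrying will help, so fail immediately.
void checkBatchIsSatisfiable(int recordsInBatch, int maxEncryptionsPerDek) {
    if (recordsInBatch > maxEncryptionsPerDek) {
        throw new RequestNotSatisfiableException("batch of " + recordsInBatch
                + " records exceeds the per-DEK limit of " + maxEncryptionsPerDek + " encryptions");
    }
}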
Why:
There was a race condition where the test code fired a second request while a mapping was being asynchronously removed from the cache due to a failed completion stage. For now, let's just test that the failure behaves as we expect and revisit the abstractions later to see if there's some way to separate these behaviours and make them easier to test.
Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
98.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

robobario merged commit 9e41235 into kroxylicious:main on Dec 19, 2023
2 checks passed
Successfully merging this pull request may close these issues.

Topic-encryption: Handle EDEK decryption failures