Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If signature validation fails, reload JWKs and retry if new JWKs are found #88023

Conversation

justincr-elastic
Copy link
Contributor

@justincr-elastic justincr-elastic commented Jun 24, 2022

The goal of this PR is to detect JWT signature validation failure, and attempt to reload JWKs. If reload succeeds, retry JWT signature validation for the same request.

Support for reloads is being added for PKC JWKSet only, either HTTPS URL or local file. In other words, support for HMAC reloads are not being added at this time; HMAC reloads from elasticsearch-keystore are no-ops, so they do not trigger a retry of JWT signature validation.

If different JWKs are found, and they align with existing realm configuration, then JWT signature validation is retried. This functionality is useful when RSA/EC public keys in a HTTPS URL are rotated on a regular basis (ex: 24h), or completely replaced, or accidentally removed and restored.

If new JWKs are loaded, but they don't align the realm's allowed algorithms, then it is possible for a JWT realm to end up with no usable PKC JWKs. Subsequent signature validations all fail, and trigger reloads until one of them gets a new working JWKSet. No restart is required if elasticsearch.yml or elasticsearch-keystore do not change.

Out of scope for this PR:

  • HMAC JWK reloads not supported at this time.
  • Periodic background JWKSet reloads not supported at this time.
  • Serializing URL reloads in case multiple JWT signature failures overlap in the same realm in the same node at the same time.
  • Scheduled JWKSet reloads based on HTTP response Expires header not supported at this time.

Issue: #87608
Co-authored by nielsbox via PR #87609.

@justincr-elastic justincr-elastic added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v8.4.0 labels Jun 24, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @justincr-elastic, I've created a changelog YAML for you.

@justincr-elastic
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The major issue is that I don't think the latest change with ListenableFuture works. I suspect there is not enough test coverage because there is at least one obvious failure (when reloadAction != null, see below comments) that should have been caught.

I'd like to propose a new version for this part of the code:

final ActionListener<Boolean> reloadListener = ActionListener.wrap(isSame -> {
    if (false == isSame) {
        // Perform validation
    } else {
        listener.onFailure(e);
    }
}, listener::onFailure);

final ListenableFuture<Boolean> thisListenableFuture = new ListenableFuture<>();
final ListenableFuture<Boolean> actualListenableFuture = futureRef.updateAndGet(f -> f == null ? thisListenableFuture : f);
actualListenableFuture.addListener(reloadListener);

if (thisListenableFuture == actualListenableFuture) {
    try {
        // Perform reload
        // ...
        actualListenableFuture.onResponse();
    } catch () {
        actualListenableFuture.onFailure();
    } finally {
        futureRef.Set(null);
    }
}

(Some of the below comments can be skipped if above change is adopted)

I also suggest we extract this part of code to a separate method so it can be tested more easily and rigorously.

The other issue is that parseJwksAlgsPkc is syncrhonous (see below also). It needs to be async as well.

@justincr-elastic
Copy link
Contributor Author

Hi @ywangd, I pushed a commit to roll back the URL concurrency. That is nice to have, not required. It can be done in a future PR.

Is there anything remaining that blocks merging this PR for 8.4.0?

@ywangd
Copy link
Member

ywangd commented Jul 21, 2022

I pushed a commit to roll back the URL concurrency. That is nice to have, not required. It can be done in a future PR.

Personally, I am hesitate to call it "nice to have" given we know there is a racing issue without handling concurrency. It is a rare case, but most racing conditions are rare yet still important. If getting this change into 8.4.0 is of great importance (e.g. we made a committment to the users), this racing issue may not be significant enough to hold this PR off. I feel this is a decision that can use input from others.

Is there anything remaining that blocks merging this PR for 8.4.0?

Yes. parseJwksAlgsPkc() (and the underlying JwtUtil.readBytes) is a blocking call. It can block the transport_worker thread and leads to cluster stability issue as commented here. This is more likely to happen because blocking even a single transport_worker thread is enough to cause cluster health issue (doc).

Copy link
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments.
I'm not done, but you're both too quick, and it's changed on me while I was reviewing last night so it's probably better to submit something that have it get stale again.

Overall:

  • We have to fix the blocking HTTP calls. It's not OK to block on the main authentication thread.
  • I think we should limit the number of concurrent HTTP calls we make. I had the same impression that @ywangd had originally - it's weird to reload on signature failure rather than as a background polling process. But that's what OIDC does, and there's value in consistency - however by that train of reasoning this code should work like OIDC does, and run a single non-blocking HTTP call. We can't really hold a position that JWT works like OIDC, but without the safety controls.
  • Part of the challenge with converting the JWT version to be non-blocking and single-outbound-connection is that it's structured differently to the OIDC code. There were design choices that went into the OIDC realm that allow it to handle the concurrency cleanly without blocking. The JWT realm has gone down a different path, and that makes it harder to add in what we need here. It can be hard to find the right balance of "Let's make this code better than what we did last time" but also "Let's leverage the work that went into this last time and follow what we know works". I think the JWT realm is showing signs of not taking advantage of problems that have already been solved once.

In the interests of being explicit about what I'm proposing, I spent a couple of hours last night refactoring the code as it was then (commit 3164d3a) so that it was more like how this problem is solved in OIDC. Specifically

  • non blocking HTTP calls
  • a ListenableFuture to maintain a single outbound connection at a time
  • using the style of code that OIDC does (smaller methods, split the loading code into a separate class).

I've pushed my changes here

  • tvernum@1254f2d
    Feel free to incorporate it as-is, or take the bits you want. It's there so that I can offer a more concrete suggestion about what I think we should be doing, rather than just providing commentary on what's there in the PR.
    I only spent a couple of hours (and didn't do a full review of the diff) so it's probably not as polished as it could be, but the tests seem to pass, so I know it's not totally broken.

@justincr-elastic
Copy link
Contributor Author

justincr-elastic commented Jul 21, 2022

This is more likely to happen because blocking even a single transport_worker thread is enough to cause cluster health issue (doc). (by @ywangd)

We have to fix the blocking HTTP calls. It's not OK to block on the main authentication thread. (by @tvernum)

I did not fully realize a single blocking call inside a Realm.authenticate() call could lead to cluster instability. The combination of your follow up comments help me to understand the issue a little better now.

…HTTP client init back to constructor, to shrink JwkSetLoader to minimal async reload helper.
@justincr-elastic
Copy link
Contributor Author

justincr-elastic commented Jul 21, 2022

@ywangd and @tvernum, thank you for all of your feedback and advocating for the additional changes. This has been very helpful.

Do the latest commits sufficiently address all of the remaining concerns? There has been a lot of feedback, I want to make sure I don't miss something.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Some minor comments for your consideration plus a few small things that can wait for future PRs. Thanks a lot for the iterations.

Comment on lines +188 to +196
final URI jwkSetPathUri = JwtUtil.parseHttpsUri(jwkSetPath);
if (jwkSetPathUri == null) {
this.jwkSetPathUri = null; // local file path
this.httpClient = null;
} else {
this.httpClient = JwtUtil.createHttpClient(super.config, sslService);
this.jwkSetPathUri = jwkSetPathUri; // HTTPS URL
this.httpClient = JwtUtil.createHttpClient(this.config, sslService);
}
this.jwkSetLoader = new JwkSetLoader(); // PKC JWKSet loader for HTTPS URL or local file path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both httpClient and jwkSetPath can just be handled within the new JwKSetLoader class instead of cluttering the main construstor.

It does not change anything functionally. So it is not critical for this PR and you don't need to do anything. But I am writing it down as my thought for potential future refactoring opportunities.

Copy link
Contributor Author

@justincr-elastic justincr-elastic Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was design intent. If you notice now, JwkSetLoader does not handle constructing anything for JwtRealm.

If I were to refactor it, I would pull loadInternal out as well. It would be passed in as a lamba.

That is the direction I was headed in for refactoring, to make JwkSetLoader into a pure concurrency helper. No business logic.

if (this.httpClient != null) {
try {
this.httpClient.close();
} catch (IOException e) {
LOGGER.warn(() -> "Exception closing HTTPS client for realm [" + super.name() + "]", e);
LOGGER.warn(() -> "Exception closing HTTPS client for realm [" + JwtRealm.this.name() + "]", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: why not just

Suggested change
LOGGER.warn(() -> "Exception closing HTTPS client for realm [" + JwtRealm.this.name() + "]", e);
LOGGER.warn(() -> "Exception closing HTTPS client for realm [" + name() + "]", e);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In @tvernum's proposal, that log message was inside JwkSetLoader, and it had to be qualified to distinguish JwtRealm.this versus JwkSetLoader.this. I missed changing JwtRealm.this.name() to super.name() when I moved that log message out, so I will fix it now. Thanks for catching it.

I tried to move as much business logic as possible outside of JwkSetLoader. In theory, all business logic from JwtSetLoader can be moved out, leaving JwtSetLoader as a pure listener coordinator.

Copy link
Contributor Author

@justincr-elastic justincr-elastic Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, my personal preference is to consistently use super.method() and this.method(). It is for readers viewing code in GitHub or text editors. In other words, I assume they are not using an IDE.

If I used this.method() for declarations in the current class and all inherited classes, it is ambigious to the reader where the method is declared. A lot of times, I find myself pausing reading of code to lookup the origin of a method or field. By using super.method() and this.method() consistently, it helps remove that ambiguity. It is even possible to chain them like super.super.method() or super.super.super.method() to show deeper inheritance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume they are not using an IDE

We should not assume that. Qualifying calls unnecessary is an anti-pattern for the reasons @ywangd mentioned. It assumes that the caller knows exactly which declaration of the method should be called, and the purpose of virtual method calls is to avoid the caller knowing that.

Comment on lines +620 to +626
if (Arrays.equals(this.contentAndJwksAlgsPkc.sha256, newContentAndJwksAlgs.sha256)) {
// No change in JWKSet
logger.debug("Reloaded same PKC JWKs, can't retry verify JWT token=[{}]", tokenPrincipal);
listener.onFailure(primaryException);
return;
}
this.contentAndJwksAlgsPkc = newContentAndJwksAlgs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still has the racing issue (I didn't read Tim's proposed change closely enough ...). Tim and I had a quick discussion and we agreed it is minor enough that can be fixed in a future PR. I am flagging it so it is not lost. But no change is required for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let me know the details when it is convenient. Thank you.

Copy link
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to remove the TestLogging, but otherwise I think we're OK to merge.

There's more work to do to get it to the state we really want it to be in, but it's better to merge this and keep working on it.

@justincr-elastic
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

@justincr-elastic justincr-elastic merged commit 89e54be into elastic:master Jul 22, 2022
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 25, 2022
The change ensure the mutation of JWKs is done in a single thread and
visible to all other threads, which in turn ensures validation to be
correctly performed concurrently.

Relates: elastic#88023
elasticsearchmachine pushed a commit that referenced this pull request Jul 26, 2022
This PR ensures the mutation of JWKs is done in a single thread and
visible to all other threads, which in turn ensures validation to be
correctly performed concurrently.

Relates: #88023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants