Skip to content
This repository has been archived by the owner on Aug 28, 2024. It is now read-only.

Couldn't retrieve remote JWK set: connect/read timed out #802

Closed
sandboxbohemian opened this issue Dec 27, 2019 · 27 comments
Closed

Couldn't retrieve remote JWK set: connect/read timed out #802

sandboxbohemian opened this issue Dec 27, 2019 · 27 comments

Comments

@sandboxbohemian
Copy link

Environment

  • Spring boot starter:

    • active directory spring boot starter
  • OS Type: Linux

  • Java version: 1.8

Summary

We have a Spring Boot app that uses Spring Security with the Azure AD starter. We have observed sporadic connect and read timeouts while the library tries to get a response from the JWKS URL (https://login.microsoftonline.com/common/discovery/keys).

Reproduce steps

We are using the following dependencies in our stack

  • org.springframework.boot:spring-boot-starter-security:2.1.0.RELEASE
  • org.springframework.security:spring-security-oauth2-client:5.1.3.RELEASE
  • org.springframework.security:spring-security-oauth2-jose:5.1.3.RELEASE
  • com.microsoft.azure:azure-active-directory-spring-boot-starter:2.1.2
  • com.nimbusds:oauth2-oidc-sdk:5.64.4

We have the same code running in two different environments under two different tenant ids, and we have observed timeouts in only one of them. Following is the stack trace.

Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-query] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: connect timed out] with root cause
j.n.SocketTimeoutException: connect timed out
	at j.n.PlainSocketImpl.socketConnect(PlainSocketImpl.java)
	at j.n.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at j.n.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at j.n.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at j.n.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at s.s.s.SSLSocketImpl.connect(SSLSocketImpl.java:666)
	at s.net.NetworkClient.doConnect(NetworkClient.java:175)
	at s.n.w.h.HttpClient.openServer(HttpClient.java:463)
	at s.n.w.h.HttpClient.openServer(HttpClient.java:558)
	at s.n.w.p.h.HttpsClient.<init>(HttpsClient.java:264)
	at s.n.w.p.h.HttpsClient.New(HttpsClient.java:367)
	at s.n.w.p.h.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at s.n.w.p.h.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
	at s.n.w.p.h.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
	at s.n.w.p.h.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at s.n.w.p.h.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
	at s.n.w.p.h.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
	at s.n.w.p.h.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at c.n.j.u.DefaultResourceRetriever.getInputStream(DefaultResourceRetriever.java:249)
	... 66 frames truncated

Following #417 , we set the timeout parameters as below, but even then we have observed timeouts (but definitely fewer).
azure.activedirectory.jose.connect-timeout=2000
azure.activedirectory.jose.read-timeout=2000
azure.activedirectory.jose.size-limit=51200

Stepping through the code, we found that the JWTResourceRetriever is conditional, so we set it up to ensure that it gets overwritten with the parameters above.

@Bean
public ResourceRetriever getJWTResourceRetriever(@Autowired JoseConfigurationProperties properties) {
	return new DefaultResourceRetriever(properties.getConnectTimeout(), properties.getReadTimeout(), properties.getSizeLimit());
}

And in a different @configuration class to avoid cyclic dependencies

@Autowired
private AADAuthenticationFilter azureAdFilter;

@Override
protected void configure(HttpSecurity http) throws Exception {
	http.authorizeRequests()
		.antMatchers("api/**/*")
		.authenticated();
	http.authorizeRequests()
		.anyRequest()
		.permitAll();
	http.addFilterBefore(azureAdFilter, UsernamePasswordAuthenticationFilter.class);
}

JoseConfigurationproperties.java

@Configuration
@ConfigurationProperties(prefix = "azure.activedirectory.jose")
public class JoseConfigurationProperties {
	
	private int connectTimeout;
	private int readTimeout;
	private int sizeLimit;
	// getters and setters...
}

Expected Results

The connection should not have timed out. Over a period of 7 days, we have a 55-45 split between connect timeout and read timeout.
Also, since this is serving static content, is there any way to force caching of the JWKS content and not perform a fetch on every incoming request?

Actual Results

Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-query] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: connect timed out] with root cause
j.n.SocketTimeoutException: connect timed out
	at j.n.PlainSocketImpl.socketConnect(PlainSocketImpl.java)
	at j.n.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at j.n.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at j.n.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at j.n.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at s.s.s.SSLSocketImpl.connect(SSLSocketImpl.java:666)
	at s.net.NetworkClient.doConnect(NetworkClient.java:175)
	at s.n.w.h.HttpClient.openServer(HttpClient.java:463)
	at s.n.w.h.HttpClient.openServer(HttpClient.java:558)
	at s.n.w.p.h.HttpsClient.<init>(HttpsClient.java:264)
	at s.n.w.p.h.HttpsClient.New(HttpsClient.java:367)
	at s.n.w.p.h.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at s.n.w.p.h.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
	at s.n.w.p.h.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
	at s.n.w.p.h.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at s.n.w.p.h.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
	at s.n.w.p.h.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
	at s.n.w.p.h.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at c.n.j.u.DefaultResourceRetriever.getInputStream(DefaultResourceRetriever.java:249)
	... 66 frames truncated
@superrdean
Copy link
Contributor

Thanks for your feedback, we will inspect it, then give you a response later. Please stay tuned.

@superrdean
Copy link
Contributor

Hi @sandboxbohemian
The JWK URI is used to verify any JSON Web Token issued by the authorization server, what we have done in SDK is just integrates the verified process in azure-spring-boot.
When you encounter the time-out issue, the only thing you can do in codes is that adjust the time-out parameter. Or you can investigate your network environment, maybe something cause that, because as you said, one of your APP is ok, but another encounters the issues from time to time. And also, you can create an issue to the team that hosts the JWK URI and ask them for some help.
As for the proposal that cache the JWK contents, it is not recommended, because in standard OIDC process, every time we get id_token from the access token, we need to verify it instantly, if we cache the JWK contents, it may cause some security problems.

@sandboxbohemian
Copy link
Author

@neuqlz Thanks for looking into this. Although the information you provided is correct, it does not really provide enough context to me as to how this issue can be mediated.

In the code, we have increased the timeouts to 3000 ms for the ResourceRetriever used in the RemoteJWKSet. And since then the number of errors have reduced significantly, there are still requests that take over 3000 ms to complete the round trip and time out. Indefinitely increasing the timeout is not a good approach, IMO.

I am also confused as to why caching the JWK set is not a good option here. We spoke with the Microsoft Identity Support team, and they mentioned that the .Net libraries cache the JWK set by default as the JWKS content changes every 2-3 days. The nimbus library also provides a constructor as
RemoteJWKSet(final URL jwkSetURL, final ResourceRetriever resourceRetriever, final JWKSetCache jwkSetCache)
The user principal manager, however, calls the 2 argument constructor
public RemoteJWKSet(final URL jwkSetURL, final ResourceRetriever resourceRetriever)

I am not saying that caching should be implemented by default, but the option to configure it should be available as part of auto-configuration.

If you can point me to their repo, I can also create an issue report with the JWK hosting team. The endpoints I have used so far are

  1. https://login.microsoftonline.com/common/discovery/keys
  2. https://login.microsoftonline.com/{domain}/discovery/keys

@HEVARY
Copy link

HEVARY commented Jan 8, 2020

@neuqlz - Thank you for working on this, as per our discussion over teams -- do we have any update on the fix or plan of action ?

I really appreciate if we speed this process.

Thank you,
Hemanth

@superrdean
Copy link
Contributor

Hi @sandboxbohemian
I have investigated what you mentioned. You are right, we can add cache to mitigate the time-out issue. But you may still encounter time-out problems from time to time even the cache is added. So I suggest you can ask the AAD team for support to solve the problem from the root perspective.
And also there is a point you need to notice if no cache instance is put into the constructor, RemoteJWKSet will use a default cache which has 5 mins lifespan.

@HEVARY
Copy link

HEVARY commented Jan 9, 2020

@neuqlz - Let me check with the team that handles this endpoint.

@HEVARY
Copy link

HEVARY commented Jan 13, 2020

@neuqlz - Can you give an previous instances, where you have noticed a timeout from AAD even after the cache is added.
Do we have any ETA on adding the cache ?

@superrdean
Copy link
Contributor

@HEVARY
The current codes will use the default cache class even if no cache class is sent to RemoteJWKSet. That means the application customer using now has a cache, but the timeout problem still occurs.
As for the second question, @sandboxbohemian has created a PR to add specific cache, but it still has some codes to advance, I have given some comments in that PR. Once the PR is merged into the master, the specific cache mechanism will take effects.

@HEVARY
Copy link

HEVARY commented Jan 22, 2020

@neuqlz
I have discussed issue with @sandboxbohemian last week, and i was able to get application logs at the time of failure

January 17th 2020, 21:39:04.483 XXXXXXXXXXXXXXXXXXXXXXXX Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-workflow] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: Read timed out] with root cause
January 17th 2020, 21:21:18.070 XXXXXXXXXXXXXXXXXXXXXXXXX Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-workflow] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: Read timed out] with root cause
January 17th 2020, 21:08:24.178 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-query] threw exception [java.util.concurrent.ExecutionException: java.net.ConnectException: Operation timed out (Connection timed out)] with root cause
January 17th 2020, 21:08:24.177 XXXXXXXXXXXXXXXXXXXXXXXXXXXX Failed to acquire graph api token.
January 17th 2020, 21:08:24.176 XXXXXXXXXXXXXXXXXXXXXXXXX [Correlation ID: 3dc44501-37d0-4e0d-9cfe-b30faadfeed5] Execution of class com.microsoft.aad.adal4j.AcquireTokenCallable failed.
January 17th 2020, 21:08:18.520 XXXXXXXXXXXXXXXXXXXX Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-workflow] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: Read timed out] with root cause
January 17th 2020, 20:55:26.933 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Servlet.service() for servlet [dispatcherServlet] in context with path [/api/v1/src-annuities-workflow] threw exception [com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: Read timed out] with root cause

Based on my analysis, I have queried at AADGateway level and I don't see any failures from the AppID's -- I am not able to find any request with the correlation ID.

With this, I can clearly confirm the issue is not related to AAD because we were not seeing this logs at all at our end.

This can clearly be an issue with the network where the application is hosted or SpringBoot. if we raised timeout to 5000Ms and still seeing these issues then it is clearly a network problem.

@sandboxbohemian _ please share me your thoughts for the same

@HEVARY
Copy link

HEVARY commented Jan 22, 2020

I have removed appId due to security reasons. Please reach out to me if you need the appid's.

@bcannariato
Copy link
Contributor

Hi, also from the app team with @sandboxbohemian .

@neuqlz can you please respond to @HEVARY above. We have a question relating to the Correlation IDs as seen from the above logs.

He is not able to see anything on his end relating to the Correlation ID. This gets generated from the adal library. Is this something that you can track down on your end to give us more insight? Is this generated in all scenarios or just specific error scenarios?

Perhaps if you are able to search by Correlation ID on your end we can track down some successes and some failures (like the failure shown above 3dc44501-37d0-4e0d-9cfe-b30faadfeed5) to help.

Thanks

@jialindai
Copy link
Contributor

It seems @sandboxbohemian has done some customization to set timeout of JWTResourceRetriever. Actually following configurations can be used without customization (for version 2.1.8 and 2.2.2):

  • azure.activedirectory.jwt-connect-timeout
  • azure.activedirectory.jwt-read-timeout
  • azure.activedirectory.jwt-size-limit

It seems that the timeout problem is intermittent and likely caused by network. Besides troubleshooting on the network part, there is something can be done in application which can further mitigate the problem.

One approach is to make JWK key cache expiration time configurable (currently it's the default 5 minutes)

Another approach is allowing custom JWKSource implementation. In current azure-spring-boot, RemoteJWKSet is used, its simplified logic is:

  • Get keys from cache
  • In case cache is empty or expired, get keys from remote

A custom implementation could be:

  • Get keys from cache
  • If cache is empty, need to wait to retrieve keys from remote
  • In case cache is expired, use a background worker to retrieve keys from remote (with retries)
    • Before new keys are returned, the existing keys in cache are used.

Note even one of above approach is used, the network timeout problem can still block JWT token verification.

@HEVARY, @bcannariato, @sandboxbohemian, @neuqlz, please share your thoughts.

@bcannariato
Copy link
Contributor

@jialindai do you have an example of a custom JWKSource implementation? I tried searching around and couldn't find anything. Not that familiar with it outside of our use of the library.

@jialindai
Copy link
Contributor

@bcannariato , I think you can reference RemoteJWKSet which is built in implementation of JWKSource in nimbus-jose-jwt library.

@bcannariato
Copy link
Contributor

@jialindai @HEVARY we are working on the cache portion.

Wanted to just circle back on the correlation ID part. We know that is generated from the library (which we call from our app). Is anyone able to track that within their logs? I think there was some confusion about the correlation ID. I believe we wanted to know if you had any way of utilizing that correlation ID on your end to track faillures.

Let me know if this is possible or if this is unclear.

Hi, also from the app team with @sandboxbohemian .

@neuqlz can you please respond to @HEVARY above. We have a question relating to the Correlation IDs as seen from the above logs.

He is not able to see anything on his end relating to the Correlation ID. This gets generated from the adal library. Is this something that you can track down on your end to give us more insight? Is this generated in all scenarios or just specific error scenarios?

Perhaps if you are able to search by Correlation ID on your end we can track down some successes and some failures (like the failure shown above 3dc44501-37d0-4e0d-9cfe-b30faadfeed5) to help.

Thanks

@superrdean
Copy link
Contributor

@bcannariato Sorry, I am a little confused about "your end" you mentioned above. From my perspective, there are two ends, one is the app, in the app you use our SDK to do something, the other end is the AADGateway which provides the endpoint for the SDK to curl. So you should be able to see all the logs output by the SDK. And we don't store any logs in our team by the way.

@superrdean
Copy link
Contributor

@bcannariato as for the customized cache part, we can discuss it in the PR which was created by @sandboxbohemian. I had given some comments. After all the comments are solved, the PR will be merged into the master.

@bcannariato
Copy link
Contributor

Yeah I responded back on the pull request.

There might've been some confusion in terms of the azure "end" vs the SDK "end". Right now as per the above post we sometimes see failures related to a correlation ID:

[Correlation ID: 669c3636-1dcc-442f-8f10-6a0aaadfeed5] Execution of class com.microsoft.aad.adal4j.AcquireTokenCallable failed.

We were mostly wondering if this would be useful on the azure "end" to track failed requests. @HEVARY I believe we discussed this point before, you cannot see these on your end at all?

@superrdean
Copy link
Contributor

@bcannariato yep, azure "end" is the team that hosts the endpoint https://login.microsoftonline.com/common/discovery/keys. According to @HEVARY 's information, it is the AAD gateway team. I think you two can think together more deeply to figure out the timeout problem.
And I remembered that @HEVARY mentioned your application is hosted in AWS, could the reason be the network condition of AWS? That might be a way to think about it.

@HEVARY
Copy link

HEVARY commented Feb 5, 2020

Well, in this scenario the correlation is was generated at application. The ADAL library used by application generated code. If Aad generates code, we should see error like AADxxxxxxxx

@bcannariato
Copy link
Contributor

So the examples we've posted with the correlation ID @HEVARY you can see those logs in your system?

@HEVARY
Copy link

HEVARY commented Feb 5, 2020 via email

@bcannariato
Copy link
Contributor

So what use is the correlation ID then if we only see the one log statement/failure that we showed above?

Just trying to get a better understanding what we could do with the correlation ID on our end. We only see the one log statement for the one failure that I can see.

@bcannariato
Copy link
Contributor

Please see above new pull request when possible for caching help

@jialindai
Copy link
Contributor

I think there are two things mixed together:

  • Timeout when retrieving JWK. (which is the main issue we are looking)
  • There is one log show correlation id and states acquire token failed.

The code to retrieve JWK is not from adal library, it's actually from nimbus library.

The correlation id is probably related to logic to query graph api. @neuqlz , could you help to take a look?

@superrdean
Copy link
Contributor

Hi @bcannariato
As for the cache customized part, I have approved your PR, I will merge it into master once the IT passes.
I have checked the code. The code will do these two things in AADAuthenticationFilter:

  • validate IdToken (in this part, our code will use nimbus library to retrieve JWK)

  • get GraphToken by IdToken (in this part, our code will use Adal library to get GraphToken )

In the logs of your application, you got

  • com.nimbusds.jose.RemoteKeySourceException: Couldn't retrieve remote JWK set: Read timed out

  • [Correlation ID: 3dc44501-37d0-4e0d-9cfe-b30faadfeed5] Execution of class com.microsoft.aad.adal4j.AcquireTokenCallable failed

Just as @jialindai said, they are two things.

As for the first error log, now we can mitigate it by customizing the cache just like what the pull request #827 does. In order to find the root cause, you may need to check your network environment.

As for the second error log, a point that needs to be noticed is that your application has retrieved JWK successfully if you see the log, then as I mentioned above, your code will get GraphToken using Adal4j library, but it failed. So it's a different story with the timeout problem. And because you don't provide the whole exception information, I can't find the reason why it failed. We can create a new issue to investigate it deeply if you want. ^_^

And I noticed that you are using com.microsoft.azure:azure-active-directory-spring-boot-starter:2.1.2 it is pretty old now, I recommend that you update to the latest 2.1.x version.

@chenrujun
Copy link

Closing this issue.
Because it's not active for a long time.
If anyone have similar issue, please create issue in new repo: https://github.com/Azure/azure-sdk-for-java/issues

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants