Couldn't retrieve remote JWK set: connect/read timed out #802
Comments
Thanks for your feedback, we will inspect it, then give you a response later. Please stay tuned. |
Hi @sandboxbohemian |
@neuqlz Thanks for looking into this. Although the information you provided is correct, it does not really give me enough context as to how this issue can be mitigated. In the code, we have increased the timeouts to 3000 ms for the I am also confused as to why caching the JWK set is not a good option here. We spoke with the Microsoft Identity Support team, and they mentioned that the .NET libraries cache the JWK set by default, as the JWKS content changes every 2-3 days. The nimbus library also provides a constructor as I am not saying that caching should be implemented by default, but the option to configure it should be available as part of auto-configuration. If you can point me to their repo, I can also create an issue report with the JWK hosting team. The endpoints I have used so far are |
@neuqlz - Thank you for working on this. As per our discussion over Teams, do we have any update on the fix or plan of action? I would really appreciate it if we could speed up this process. Thank you, |
Hi @sandboxbohemian |
@neuqlz - Let me check with the team that handles this endpoint. |
@neuqlz - Can you give any previous instances where you noticed a timeout from AAD even after the cache was added? |
@HEVARY |
@neuqlz
Based on my analysis, I queried at the AADGateway level and I don't see any failures from the app IDs; I am not able to find any request with the correlation ID. With this, I can confirm the issue is not related to AAD, because we are not seeing these logs at all on our end. This may well be an issue with the network where the application is hosted, or with Spring Boot. If we raised the timeout to 5000 ms and are still seeing these issues, then it is clearly a network problem. @sandboxbohemian, please share your thoughts on this. |
I have removed the app IDs for security reasons. Please reach out to me if you need them. |
Hi, also from the app team with @sandboxbohemian . @neuqlz can you please respond to @HEVARY above. We have a question relating to the Correlation IDs as seen from the above logs. He is not able to see anything on his end relating to the Correlation ID. This gets generated from the adal library. Is this something that you can track down on your end to give us more insight? Is this generated in all scenarios or just specific error scenarios? Perhaps if you are able to search by Correlation ID on your end we can track down some successes and some failures (like the failure shown above 3dc44501-37d0-4e0d-9cfe-b30faadfeed5) to help. Thanks |
It seems @sandboxbohemian has done some customization to set the timeout of JWTResourceRetriever. Actually, the following configurations can be used without customization (for versions 2.1.8 and 2.2.2):
It seems that the timeout problem is intermittent and likely caused by the network. Besides troubleshooting the network, there is something that can be done in the application to further mitigate the problem. One approach is to make the JWK key cache expiration time configurable (currently it's the default 5 minutes). Another approach is to allow a custom JWKSource implementation. In the current azure-spring-boot, RemoteJWKSet is used; its simplified logic is:
A custom implementation could be:
Note that even if one of the above approaches is used, the network timeout problem can still block JWT token verification. @HEVARY, @bcannariato, @sandboxbohemian, @neuqlz, please share your thoughts. |
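As a rough illustration of the caching idea discussed in this comment, here is a minimal, stdlib-only Java sketch. It is not the commenter's code, not the pull request's code, and not how RemoteJWKSet is implemented. The names `CachedKeySet`, `fetcher`, and `lifespan` are hypothetical, and the choice to fall back to a stale copy when a refresh fails is an assumption made for the example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical helper; not part of azure-spring-boot or nimbus-jose-jwt. */
final class CachedKeySet {
    private final Supplier<String> fetcher; // performs the remote JWKS fetch; may throw on timeout
    private final Duration lifespan;        // configurable cache lifetime
    private String cached;                  // last successfully fetched JWKS JSON
    private Instant fetchedAt;              // when `cached` was fetched

    CachedKeySet(Supplier<String> fetcher, Duration lifespan) {
        this.fetcher = fetcher;
        this.lifespan = lifespan;
    }

    synchronized String get() {
        boolean fresh = cached != null
                && Instant.now().isBefore(fetchedAt.plus(lifespan));
        if (fresh) {
            return cached; // no network call while the cached copy is fresh
        }
        try {
            cached = fetcher.get();
            fetchedAt = Instant.now();
        } catch (RuntimeException e) {
            // Refresh failed (for example a connect/read timeout).
            // Assumption: serving the stale copy is acceptable; with no copy at all, fail.
            if (cached == null) {
                throw e;
            }
        }
        return cached;
    }
}

final class CachedKeySetDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        CachedKeySet keys = new CachedKeySet(() -> {
            calls[0]++;                 // stands in for the HTTP call to the JWKS URL
            return "{\"keys\":[]}";
        }, Duration.ofDays(1));
        keys.get();
        keys.get();
        System.out.println("remote fetches: " + calls[0]); // prints "remote fetches: 1"
    }
}
```

With a lifespan of a day instead of five minutes, far fewer requests are exposed to an intermittent timeout. As noted above, the very first fetch can still fail, and in that case verification is still blocked.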
@jialindai do you have an example of a custom JWKSource implementation? I tried searching around and couldn't find anything. Not that familiar with it outside of our use of the library. |
@bcannariato, I think you can reference RemoteJWKSet, which is the built-in implementation of JWKSource in the nimbus-jose-jwt library. |
@jialindai @HEVARY we are working on the cache portion. Wanted to just circle back on the correlation ID part. We know that is generated from the library (which we call from our app). Is anyone able to track that within their logs? I think there was some confusion about the correlation ID. I believe we wanted to know if you had any way of utilizing that correlation ID on your end to track failures. Let me know if this is possible or if this is unclear.
|
@bcannariato Sorry, I am a little confused about "your end" you mentioned above. From my perspective, there are two ends, one is the app, in the app you use our SDK to do something, the other end is the AADGateway which provides the endpoint for the SDK to curl. So you should be able to see all the logs output by the SDK. And we don't store any logs in our team by the way. |
@bcannariato as for the customized cache part, we can discuss it in the PR which was created by @sandboxbohemian. I have left some comments there. Once all the comments are resolved, the PR will be merged into master. |
Yeah I responded back on the pull request. There might've been some confusion in terms of the azure "end" vs the SDK "end". Right now as per the above post we sometimes see failures related to a correlation ID:
We were mostly wondering if this would be useful on the azure "end" to track failed requests. @HEVARY I believe we discussed this point before, you cannot see these on your end at all? |
@bcannariato yep, azure "end" is the team that hosts the endpoint https://login.microsoftonline.com/common/discovery/keys. According to @HEVARY 's information, it is the AAD gateway team. I think you two can think together more deeply to figure out the timeout problem. |
Well, in this scenario the correlation ID was generated by the application: the ADAL library used by the application generated the code. If AAD had generated the code, we would see an error like AADxxxxxxxx |
So the examples we've posted with the correlation ID @HEVARY you can see those logs in your system? |
So what use is the correlation ID then if we only see the one log statement/failure that we showed above? Just trying to get a better understanding what we could do with the correlation ID on our end. We only see the one log statement for the one failure that I can see. |
Please see above new pull request when possible for caching help |
I think there are two things mixed together:
The code that retrieves the JWK is not from the adal library; it's actually from the nimbus library. The correlation ID is probably related to the logic that queries the Graph API. @neuqlz, could you help take a look? |
Hi @bcannariato
In the logs of your application, you got
Just as @jialindai said, these are two separate things. As for the first error log, we can now mitigate it by customizing the cache, just as pull request #827 does. To find the root cause, you may need to check your network environment. As for the second error log, note that if you see that log, your application has already retrieved the JWK successfully. Then, as I mentioned above, your code tries to get the GraphToken using the Adal4j library, and that is what failed. So it's a different story from the timeout problem. And because you didn't provide the whole exception information, I can't find the reason why it failed. We can create a new issue to investigate it in depth if you want. ^_^ I also noticed that you are using com.microsoft.azure:azure-active-directory-spring-boot-starter:2.1.2, which is pretty old now; I recommend that you update to the latest 2.1.x version. |
Closing this issue. |
Environment
Spring boot starter:
OS Type: Linux
Java version: 1.8
Summary
We have a Spring Boot app that uses Spring Security with the Azure AD starter. We have observed sporadic connect and read timeouts while the library tries to get a response from the JWKS URL (https://login.microsoftonline.com/common/discovery/keys).
Reproduce steps
We are using the following dependencies in our stack
org.springframework.boot:spring-boot-starter-security:2.1.0.RELEASE
org.springframework.security:spring-security-oauth2-client:5.1.3.RELEASE
org.springframework.security:spring-security-oauth2-jose:5.1.3.RELEASE
com.microsoft.azure:azure-active-directory-spring-boot-starter:2.1.2
com.nimbusds:oauth2-oidc-sdk:5.64.4
We have the same code running in two different environments under two different tenant ids, and we have observed timeouts in only one of them. Following is the stack trace.
Following #417 , we set the timeout parameters as below, but even then we have observed timeouts (but definitely fewer).
azure.activedirectory.jose.connect-timeout=2000
azure.activedirectory.jose.read-timeout=2000
azure.activedirectory.jose.size-limit=51200
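To make concrete what the two timeout values bound, here is a small stdlib-only Java sketch. It is an illustration under assumptions, not the starter's code: the class and method names (`JwksTimeoutSketch`, `open`) are hypothetical, and it only assumes that the connect and read timeouts end up applied to the HTTP connection in a way comparable to `HttpURLConnection`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical illustration; not part of the Azure AD starter. */
final class JwksTimeoutSketch {

    /** Prepares (but does not yet open) a connection with both timeouts applied. */
    static HttpURLConnection open(String jwksUrl, int connectTimeoutMs, int readTimeoutMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(jwksUrl).toURL().openConnection();
            // Upper bound on establishing the TCP connection.
            conn.setConnectTimeout(connectTimeoutMs);
            // Upper bound on each blocking read once connected.
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `JwksTimeoutSketch.open("https://login.microsoftonline.com/common/discovery/keys", 2000, 2000)` mirrors the two timeout properties above. A connect timeout fires before any response is received, while a read timeout fires after the connection is established, which is why the two can show up in different proportions in the logs.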
Stepping through the code, we found that the JWTResourceRetriever is conditional, so we set it up to ensure that it gets overwritten with the parameters above. And in a different @Configuration class to avoid cyclic dependencies
JoseConfigurationproperties.java
Expected Results
The connection should not have timed out. Over a period of 7 days, we have a 55-45 split between connect timeout and read timeout.
Also, since this is serving static content, is there any way to force caching of the JWKS content and not perform a fetch on every incoming request?
Actual Results