[BUG] After a certain number of timeouts ARM ends up in a bad state where all future calls also timeout #37694
Comments
Hi @jfleuryStatcan Is there a real-world scenario behind this repro? I don't think any requests will succeed with the |
Hi @jfleuryStatcan. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue. |
The first 20 times the loop runs, it is supposed to time out. After that the timeout is set back to the default, so the call should succeed. But it doesn't. |
We've had real-world events where our application hits timeouts with the default setting and then ends up in this bad state where every call times out until the application is restarted. |
I think there may be a bug in your repro code. I can't repro if I eliminate the outermost if block. |
Not sure why you are eliminating the outermost if block. Here is the code written a different way. Maybe this is clearer.
|
So the real-world scenario here is that you have an application that makes these calls every so often. Normally they work, but sometimes there is a problem with the network for a period of time and you get a bunch of timeouts. Eventually the network goes back to normal, but these calls keep timing out. The application ends up in a bad state and the calls continually time out until the application is restarted.
To reproduce this issue, I set the timeout to something ridiculously small for the first 20 times we make this call. This way we get some timeouts. After those first 20 times we set the timeout back to the default value. It should work normally now; the fact that it failed 20 times previously should have no impact. But the calls continue to time out even after we revert to the default settings: each one takes the whole 5ish minutes and then times out. Because the first 20 calls timed out, the system is now in a bad state and all future calls time out. |
So if we write the repro code like below, it should count down from 20 to 0 and then, every 3 seconds, output 0 and grab the instance view. We don't output exceptions that happen while the timeout is set to 1 ms. When the timeout reverts to the default, the call should work. Instead, what we get is a countdown to zero and then timeout exceptions. All future calls time out even with the default 00:01:40 timeout.
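The code block that accompanied this comment is not preserved in this copy of the thread. What follows is only a rough sketch of the steps described above, not the poster's actual code. It assumes a `ClientSecretCredential` (the credential named in the reported exception), a new credential and `ArmClient` per call (the poster later confirms singletons were not used), and a 3-second pause on every iteration. The class name, the `BuildOptions` helper, and all placeholder identifiers are hypothetical.

```csharp
using System;
using System.Threading;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.Compute;

public static class TimeoutRepro
{
    // Hypothetical helper: a 1 ms network timeout while forced timeouts remain,
    // otherwise the SDK default (reported as 00:01:40 in this thread).
    public static ArmClientOptions BuildOptions(int remainingForcedTimeouts)
    {
        var options = new ArmClientOptions();
        if (remainingForcedTimeouts > 0)
        {
            options.Retry.NetworkTimeout = TimeSpan.FromMilliseconds(1);
        }
        return options;
    }

    public static void Main()
    {
        // Placeholder values (assumptions): substitute real identifiers.
        string tenantId = "<tenant-id>";
        string clientId = "<client-id>";
        string clientSecret = "<client-secret>";
        string subscriptionId = "<subscription-id>";
        string resourceGroup = "<resource-group>";
        string vmName = "<vm-name>";

        var vmId = VirtualMachineResource.CreateResourceIdentifier(
            subscriptionId, resourceGroup, vmName);

        int count = 20;
        while (true)
        {
            Console.WriteLine(count);
            bool forcedTimeout = count > 0;
            try
            {
                // New credential and client on every call (non-singleton usage).
                var credential = new ClientSecretCredential(tenantId, clientId, clientSecret);
                var client = new ArmClient(credential, subscriptionId, BuildOptions(count));
                var view = client.GetVirtualMachineResource(vmId).InstanceView();
                Console.WriteLine($"Instance view HTTP status: {view.GetRawResponse().Status}");
            }
            catch (Exception) when (forcedTimeout)
            {
                // Expected while the 1 ms timeout is in effect; not printed.
            }
            catch (Exception ex)
            {
                // Unexpected: the default timeout is in effect here.
                Console.WriteLine(ex);
            }

            if (count > 0) count--;
            Thread.Sleep(TimeSpan.FromSeconds(3));
        }
    }
}
```

With this shape, any exception printed after the countdown reaches 0 occurred under the default timeout, which is the behaviour the report describes.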
|
With that latest code block, I get the following output. (I added an additional console write to show that we are in the exception block regardless of the timeout setting.)
|
Interesting. Looks like you can't reproduce the issue. Are you using .net 6? My output obviously looks like this: And if I run it again with the count initially set to 0 then it looks like this: |
With count set to 1 or 5 I haven't been able to reproduce the issue. With count set to 10 I've been able to reproduce the issue the majority of the time but at least once or twice it didn't happen. Count of 20 seems to reproduce the issue every time for me. I tried it again using Azure.Identity 1.10.0-beta1. I was not able to reproduce the issue. Is there a fix in that version or does it just make it harder to reproduce the issue? |
I think it would be the newer Azure.Core that is brought in by that version of Azure.Identity. Can you try reverting that to 1.9.0 and only upgrading Azure.Core directly? https://www.nuget.org/packages/Azure.Core/1.34.0 |
I was able to reproduce the issue using the following packages:
[Informational] Azure-Identity: DefaultAzureCredential.GetToken invoked. Scopes: [ https://management.azure.com//.default ] ParentRequestId: d35e328c-61d9-4278-900f-c660bbc14700
[Informational] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:08:50Z - 5334c337-285c-437b-8ceb-780943a50e0d]
[Informational] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:08:50Z - 5334c337-285c-437b-8ceb-780943a50e0d] === Token Acquisition (ClientCredentialRequest) started:
[Error] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:08:50Z - 5334c337-285c-437b-8ceb-780943a50e0d] Exception type: Azure.Identity.CredentialUnavailableException
at Azure.Identity.ImdsManagedIdentitySource.HandleResponseAsync(Boolean async, TokenRequestContext context, Response response, CancellationToken cancellationToken)
Content:
Headers:
[Informational] Azure-Identity: VisualStudioCredential.GetToken invoked. Scopes: [ https://management.azure.com//.default ] ParentRequestId: d35e328c-61d9-4278-900f-c660bbc14700
---> (Inner Exception #2) System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.
---> (Inner Exception #3) System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.
0
[Informational] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:15:47Z - ca49b242-617a-4b52-a39b-efbf8305aa20]
[Informational] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:15:47Z - ca49b242-617a-4b52-a39b-efbf8305aa20] === Token Acquisition (ClientCredentialRequest) started:
[Error] Azure-Identity: False MSAL 4.49.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:15:47Z - ca49b242-617a-4b52-a39b-efbf8305aa20] Exception type: Azure.Identity.CredentialUnavailableException
at Azure.Identity.ImdsManagedIdentitySource.HandleResponseAsync(Boolean async, TokenRequestContext context, Response response, CancellationToken cancellationToken)
Content:
Headers:
[Informational] Azure-Identity: VisualStudioCredential.GetToken invoked. Scopes: [ https://management.azure.com//.default ] ParentRequestId: 30142602-07f2-4ad5-875d-eba49c25045a |
I was able to reproduce the issue with the beta version of Identity as well.
[Informational] Azure-Identity: DefaultAzureCredential.GetToken invoked. Scopes: [ https://management.azure.com//.default ] ParentRequestId: d5cde929-b9d7-4569-9357-8298ecfb6ca8
[Informational] Azure-Identity: False MSAL 4.54.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:32:07Z - 21b1ec5c-7a3f-4c01-97dc-26588147f083]
[Informational] Azure-Identity: False MSAL 4.54.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:32:07Z - 21b1ec5c-7a3f-4c01-97dc-26588147f083] === Token Acquisition (ClientCredentialRequest) started:
[Error] Azure-Identity: False MSAL 4.54.1.0 MSAL.NetCore .NET 6.0.18 Microsoft Windows 10.0.19044 [2023-07-19 15:32:07Z - 21b1ec5c-7a3f-4c01-97dc-26588147f083] Exception type: Azure.Identity.CredentialUnavailableException
at Azure.Identity.ImdsManagedIdentitySource.HandleResponseAsync(Boolean async, TokenRequestContext context, Response response, CancellationToken cancellationToken)
Content:
Headers:
[Informational] Azure-Identity: VisualStudioCredential.GetToken invoked. Scopes: [ https://management.azure.com//.default ] ParentRequestId: d5cde929-b9d7-4569-9357-8298ecfb6ca8
---> (Inner Exception #2) System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout.
---> (Inner Exception #3) System.Threading.Tasks.TaskCanceledException: The operation was cancelled because it exceeded the configured timeout of 0:01:40. Network timeout can be adjusted in ClientOptions.Retry.NetworkTimeout. |
Thanks - just to clarify, is there a config that you still cannot reproduce with? |
I have yet to reproduce the problem with .NET Core 3.1 or .NET 5. It's hard to say whether it's just easier to reproduce with .NET 6 or whether the problem is really specific to .NET 6. I'm going to try .NET 7 and .NET 8 shortly. Are you able to reproduce the problem at all with the different package versions? |
Sorry, I meant can you still reproduce with all the package version permutations we tried? I thought there was one combo that wouldn't reproduce for you even on net 6.0 |
I was just able to reproduce the issue with the following settings as well. I think that's just about every permutation now. Using newer versions of Core/Identity does seem to make it harder to reproduce, but not impossible.
|
I'm afraid that I'm unable to repro, even with the older Azure.Identity. Just out of curiosity, in your actual application, are you using singletons for the ArmClient and credential? |
We are not using singletons. When we first started investigating this we noticed that it's recommended to use singletons as a best practice. So that's something we've been planning on trying. We should actually be deploying that fix tomorrow morning if everything goes well. Another idea I had around this was that the ARMClients are using a singleton HttpClient instance. So if one failed call was affecting the next I thought maybe using a new HttpClient instance each time and disposing of it may help. We tried that yesterday and it did not work. Hopefully using a singleton ARMClient fixes the problem. |
Unless you are configuring a custom transport via ArmClientOptions, the singleton probably won't make a big difference with regard to the Http client behavior. Under the covers, we use a singleton HttpClient for all clients in the same process, unless a custom transport is provided. It will consume fewer resources, however. The single instance of HttpClient is exactly what should be happening to efficiently use the connection pool. See this blog post for details. |
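As a minimal sketch of what "configuring a custom transport via ArmClientOptions" can look like: the options below route requests through a caller-supplied `HttpClient` by wrapping it in `HttpClientTransport`, instead of the process-wide shared instance described above. The class and method names (`ArmTransportExample`, `WithCustomTransport`) are hypothetical and chosen only for illustration.

```csharp
using System.Net.Http;
using Azure.Core.Pipeline;
using Azure.ResourceManager;

public static class ArmTransportExample
{
    // Returns options whose pipeline sends requests through the supplied
    // HttpClient rather than the SDK's default shared one.
    public static ArmClientOptions WithCustomTransport(HttpClient httpClient)
    {
        var options = new ArmClientOptions();
        options.Transport = new HttpClientTransport(httpClient);
        return options;
    }
}
```

The resulting options can be passed to the `ArmClient` constructor that accepts an `ArmClientOptions` argument. The caller then owns the `HttpClient` lifetime, which is the trade-off of opting out of the shared instance.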
Hi @jfleuryStatcan, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you! |
It seems like using a singleton for ARMClient, ArmClientOptions and HttpClient solved our problems. At least we haven't had any exceptions or lockups in the last three weeks. |
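For readers wondering what that arrangement might look like, here is one possible sketch, not the code actually deployed. It keeps a single `HttpClient`, `ArmClientOptions`, and `ArmClient` for the lifetime of the process. The `SharedArm` class name, the use of `DefaultAzureCredential`, and reading the subscription from an `AZURE_SUBSCRIPTION_ID` environment variable are all assumptions.

```csharp
using System;
using System.Net.Http;
using Azure.Core.Pipeline;
using Azure.Identity;
using Azure.ResourceManager;

public static class SharedArm
{
    // One HttpClient for the whole process; never disposed per call.
    private static readonly HttpClient Http = new HttpClient();

    // One options instance wired to that HttpClient.
    private static readonly ArmClientOptions Options = new ArmClientOptions
    {
        Transport = new HttpClientTransport(Http)
    };

    // Lazy<T> is thread-safe by default, so the client is created once,
    // on first use, even under concurrent access.
    private static readonly Lazy<ArmClient> LazyClient = new Lazy<ArmClient>(() =>
        new ArmClient(
            new DefaultAzureCredential(),
            // Assumed variable name; a null value leaves the default subscription unset.
            Environment.GetEnvironmentVariable("AZURE_SUBSCRIPTION_ID"),
            Options));

    public static ArmClient Client => LazyClient.Value;
}
```

Callers would use `SharedArm.Client` everywhere instead of constructing a new `ArmClient` per request. In an application with dependency injection, registering the client with a singleton lifetime would serve the same purpose.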
Library name and version
Azure.ResourceManager 1.7.0
Describe the bug
After a certain number of timeout exceptions, the resource manager client ends up in a bad state where all subsequent calls time out even if the far end is available. The application needs to be restarted.
This seems to only happen when multiple ArmClients are created. If we pass in HttpClients to each one and dispose of them when we are done the problem disappears. This seems to have something to do with all Azure SDK clients, by default, sharing a single HttpClient instance.
Expected behavior
Subsequent calls work and are not impacted by previous timeouts.
Actual behavior
All calls time out until the application is restarted.
Here is the exception we get:
Exception: ClientSecretCredential authentication failed: Request to the endpoint timed out.
at Azure.Identity.CredentialDiagnosticScope.FailWrapAndThrow(Exception ex, String additionalMessage)
at Azure.Identity.ClientSecretCredential.GetToken(TokenRequestContext requestContext, CancellationToken cancellationToken)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.AccessTokenCache.GetHeaderValueFromCredentialAsync(TokenRequestContext context, Boolean async, CancellationToken cancellationToken)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.AccessTokenCache.GetHeaderValueAsync(HttpMessage message, TokenRequestContext context, Boolean async)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.AccessTokenCache.GetHeaderValueAsync(HttpMessage message, TokenRequestContext context, Boolean async)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.AuthorizeRequest(HttpMessage message)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.HttpPipelinePolicy.ProcessNext(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.RedirectPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
at Azure.Core.Pipeline.RedirectPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
at Azure.Core.Pipeline.RetryPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.HttpPipelineSynchronousPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.HttpPipelineSynchronousPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.HttpPipelineSynchronousPolicy.Process(HttpMessage message, ReadOnlyMemory`1 pipeline)
at Azure.Core.Pipeline.HttpPipeline.Send(HttpMessage message, CancellationToken cancellationToken)
at Azure.ResourceManager.Compute.VirtualMachinesRestOperations.InstanceView(String subscriptionId, String resourceGroupName, String vmName, CancellationToken cancellationToken)
at Azure.ResourceManager.Compute.VirtualMachineResource.InstanceView(CancellationToken cancellationToken)
Reproduction Steps
Environment
This was run on Windows 10 using .NET 6 and Visual Studio 17.1.1.