
azcosmos occasionally returning 403 Forbidden Connection insufficiently secured #19785

Closed
serbrech opened this issue Jan 11, 2023 · 42 comments

@serbrech
Member

Bug Report

  • package: /sdk/data/azcosmos
  • SDK version: latest
  • go version: 1.18.6

Every now and then, a call to Cosmos DB returns 403 Forbidden, with a message claiming that the client did not use the minimum TLS version:

GET https://XXXXX.documents.azure.com:443/dbs/xxx/colls/xxx/docs/1822b18eb972595bda1b797b332dff1b11567aaaba936ce75824bc0fefdd282e
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: adb35793-a4ca-481d-9cc6-dd5d0adf8eb5, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}
--------------------------------------------------------------------------------
, Dependency: Microsoft.DocumentDB, OriginError: GET https://xxxx.documents.azure.com:443/dbs/xxx/colls/xxxx/docs/1822b18eb972595bda1b797b332dff1b11567aaaba936ce75824bc0fefdd282e
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: adb35793-a4ca-481d-9cc6-dd5d0adf8eb5, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}
--------------------------------------------------------------------------------

Note that the error message includes a reference to the .NET DocumentDB SDK. We do not use .NET; we receive this error when talking to Cosmos DB from the Go SDK.

Go 1.18+ defaults to a minimum of TLS 1.2 in the HTTP stack, and we explicitly set it anyway.

Almost all calls from that same client succeed, yet a few calls a day fail with 403 without any change in configuration. This has happened in all regions since early December.

  • What did you expect or want to happen?

No minimum-TLS error; all calls are made over TLS 1.2 or higher and should succeed.

  • How can we reproduce it?

Run a service that issues calls to a DB continuously; it will eventually hit this error.

  • Anything we should know about your environment.

I don't think there is anything special about our environment.

@ghost ghost added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jan 11, 2023
@jhendrixMSFT jhendrixMSFT added Cosmos and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Jan 11, 2023
@ealsur
Member

ealsur commented Jan 11, 2023

Cosmos Go SDK has no custom HTTP configurations and uses the Azure SDK pipeline, I don't see what could be related to this from the Cosmos SDK point of view.

@jhendrixMSFT Is the TLS behavior something that the Go HTTP client sets?

@ealsur
Member

ealsur commented Jan 11, 2023

@serbrech You mentioned that this started on early december, was there any change related to the code itself? Was the Cosmos Go SDK updated or changed? Any configuration on the environment that might have changed?

If not code changed, then probably the root cause of this issue is not the application code from what I can understand?

@serbrech
Member Author

serbrech commented Jan 11, 2023

That's correct, yes: it uses the standard net/http client, which defaults to a minimum of TLS 1.2 since Go 1.18.
We also happen to manually configure minTLS to 1.2 anyway; it's hardcoded in our client configuration.

I reported this to the Cosmos DB service directly, and they bounced me back to reporting this to the SDK as a client issue, even though I can't see how this could happen occasionally and non-deterministically from the client side...

Note the error message logging a .NET SDK, as well as the message itself being poorly worded. It seems to be missing the version argument:

"Please use Tls SSL protocol or higher".

I think the message might be

"Please use Tls SSL protocol 1.2 or higher"

but the 1.2 argument is empty in the gateway code.

@ealsur
Member

ealsur commented Jan 11, 2023

the error message logging a .net SDK

This is because the error message comes from a .NET environment, and they are including the internal client identifier, Cosmos DB consumes the SDKs internally too. Your HTTP request is reaching Gateway (https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-connection-modes) and then internally being routed to the backend replicas through TCP.

I can't see how this could happen occasionally, and non-deterministically from the client side...

My same thought. Unless something on the environment is randomly removing TLS.

So there were no code or environment changes in December and it just started to happen?

@ealsur
Member

ealsur commented Jan 11, 2023

Could this be related: #19469 (comment)

@mattKorwel

I'm seeing the same behavior and it's not occasional; I'd say it's pretty consistent, at about 2 out of 10 calls. This is happening both in dev and in production (meaning I highly doubt it's the environment on our side).

@mattKorwel

Could this be related: #19469 (comment)

@ealsur the related issue seems unlikely to me. The call works, just not all the time. I did a loop upserting the same document, and out of 100 calls, 20 of them threw this error. If it was a config issue, I would expect none of them to work.

PS: this also just started recently, looking at our logs.

@ealsur
Member

ealsur commented Jan 13, 2023

@mattKorwel Is this something that started happening for you too without any client changes? Did you update the Go version by any chance? (that is one of the things we discovered on the other case). When was the date it started for you?

@mattKorwel

mattKorwel commented Jan 14, 2023

Friday, Dec 9th is the first time we see it in the logs. No config changes or Go version changes I can find. In fact, no deploys that week at all with the early holidays.

I did update to the newest version of the SDK today and am testing it now; I thought I wasn't seeing it anymore... though I haven't sent it to prod yet. As soon as I typed that, failures showed up. I jinxed it.

BTW, here is an ActivityId from a recent failure; I am hoping you have the ability to look up server-side logs based on that: 9c2bceab-0458-4ac4-9620-cd31993bbc55

@serbrech
Member Author

serbrech commented Jan 15, 2023

We started seeing it early December too. First occurrence in our logs was on the 3rd if I recall correctly from the investigation data.

@mattKorwel

@ealsur Any luck tracking this down? I'm seeing it ever more frequently. I've added retries specifically for this in most code spots, but it's quite impactful at this point.

@ealsur
Member

ealsur commented Jan 18, 2023

@mattKorwel No luck so far. This seems unrelated to the SDK; I have reported it to the service team to try to track down the source. So far it seems only Go customers are experiencing it.

@JasonQuinn

I've also started seeing this happen once or twice a day, in code which was deployed last month; it ran without seeing this issue until the 21st of January.

@ealsur
Member

ealsur commented Jan 23, 2023

@JasonQuinn Would you mind sharing 1 ActivityId from the failures you are seeing?

@JasonQuinn

@ealsur The activity id for one that happened at 00:09 UTC today was ecea9ef6-b5a3-4119-b0d6-b3ec3f1040eb.

The full error was:

[*exported.ResponseError]: GET https://xxxx.documents.azure.com:443/dbs/xxx/colls/xxx/docs/xxx
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: ecea9ef6-b5a3-4119-b0d6-b3ec3f1040eb, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}

@jim-minter
Member

Interesting that the error ^^ says documentdb-dotnet-sdk (i.e. not golang)
@serbrech

@serbrech
Member Author

serbrech commented Jan 25, 2023

@jim-minter
I had the same remark initially. It has been discussed internally. It's the gateway from DocumentDB that emits the error, adding metadata. I think that's an issue on its own, but according to them it's separate from the root cause.

@RobertGreyling

I came across this thread looking for a solution to the same error, which has been taking place in our production environment for approximately the last 2 days. Nothing in our production env has changed. We are using Dapr as part of our microservice architecture, which is built in Go, and we run our services on top of that, built in .NET Core 6. The production env has remained unchanged since late 2022, so we know there's no config/code change; we're getting the same error from two of our 26 microservices, and only occasionally, so it seems that Cosmos in Azure is rejecting these requests for some reason. Our errors all look very similar to this:

Dapr.DaprException: State operation failed: the Dapr endpoint indicated a failure. See InnerException for details.
 ---> Grpc.Core.RpcException: Status(StatusCode="Internal", Detail="fail to get 2df68788cd9142e6b5b453ff6a8585a9 from state store state-xxx: Forbidden, Connection is insufficiently secured. Please use Tls SSL protocol or higher
ActivityId: e3930f2d-96c9-4ed9-be90-e09098e060b6, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0")
   at Dapr.Client.DaprClientGrpc.GetStateAsync[TValue](String storeName, String key, Nullable`1 consistencyMode, IReadOnlyDictionary`2 metadata, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Dapr.Client.DaprClientGrpc.GetStateAsync[TValue](String storeName, String key, Nullable`1 consistencyMode, IReadOnlyDictionary`2 metadata, CancellationToken cancellationToken)
   at xxx.yyy.zzz.GetStateAsync[TValue](String storeName, String key, Nullable`1 consistencyMode, IReadOnlyDictionary`2 meta, CancellationToken cancellationToken)
   at xxx.yyy.zzz.Handlers.GetQuantityAvailabilityHandler.Handle(GetQuantityAvailabilityRequest handlerRequest, CancellationToken cancellationToken) in /home/Technical-Agent/xxx-yyy-deploys/master/xxx-yyy-Services-Availability/xxx.yyy.zzz/Handlers/QuantityAvailabilityHandlers/GetQuantityAvailabilityHandler.cs:line 24
   at MediatR.Pipeline.RequestExceptionProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at MediatR.Pipeline.RequestExceptionProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at MediatR.Pipeline.RequestExceptionActionProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at MediatR.Pipeline.RequestExceptionActionProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at MediatR.Pipeline.RequestPostProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at MediatR.Pipeline.RequestPreProcessorBehavior`2.Handle(TRequest request, CancellationToken cancellationToken, RequestHandlerDelegate`1 next)
   at xxx.yyy.zzz.Controllers.QuantityController.UpdateRemainingMethod(String id, Int32 remaining, Authentication authentication) in /home/Technical-Agent/xxx-yyy-deploys/master/xxx-yyy-Services-Availability/xxx.yyy.zzz/Controllers/QuantityController.cs:line 318
   at xxx.yyy.zzz.Controllers.QuantityController.SubUpdateRemaining(UpdateQuantityAvailabilityRemainingRequest request) in /home/Technical-Agent/xxx-yyy-deploys/master/xxx-yyy-Services-Availability/xxx.yyy.zzz/Controllers/QuantityController.cs:line 226
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfIActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|19_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
   at Dapr.CloudEventsMiddleware.ProcessBodyAsync(HttpContext httpContext, String charSet)
   at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
   at xxx.yyy.zzz`1.LogResponse(HttpContext context, LoggingEntity loggingData)
   at xxx.yyy.zzz`1.LogResponse(HttpContext context, LoggingEntity loggingData)
   at xxx.yyy.zzz`1.Invoke(HttpContext context)
   at xxx.yyy.zzz.Invoke(HttpContext context)
   at Serilog.AspNetCore.RequestLoggingMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Microsoft.AspNetCore.Builder.Extensions.UsePathBaseMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

@ealsur
Member

ealsur commented Jan 25, 2023

@RobertGreyling Which SDK is this using? Or is this a custom REST API wrapper in C#? From the ActivityId, it looks like you are using documentdb-go as the library, so not this repo's SDK?

@jim-minter Yes, this is an internal user agent.

@RobertGreyling

@ealsur - thanks for responding. This was the only place on the internet (so far) I could find even talking about this, but it did seem coincidental that this is being reported in a Go SDK, and our microservice architecture runs on Dapr, which is also built in Go. To support state calls to Cosmos, I'm assuming it is using that documentdb-go as a library of sorts. Under the covers in there somewhere is something triggering this problem; at least, that's what I'm gathering from all this.

We're currently running Dapr v1.8 in production and the latest is v1.9, so I may try a simple upgrade to see if that fixes it. But if this is one of the only places talking about this issue, then I imagine not many people even know about it, which doesn't give me much hope that v1.9 will solve it. It seems the problem may reside in a common Go lib that talks to Azure, but I haven't gone digging there yet; that's what I'll likely do if the upgrade doesn't solve it.

@serbrech
Member Author

@RobertGreyling Assuming you use the state store implementation for cosmosdb, it's using this same package:

https://github.com/dapr/components-contrib/blob/master/state/azure/cosmosdb/cosmosdb_query.go

@jim-minter
Member

@mattKorwel are you able to provide a simplified reproducer? I haven't yet been able to successfully create one.

@ealsur
Member

ealsur commented Jan 25, 2023

I am also working on trying to get a repro, but have been unsuccessful so far. @serbrech, based on the ActivityId, @RobertGreyling is using something that generates this user agent: documentdb-go/v1.3.1-0.20211026005403-13c3593b3c3a dapr-1.5.0, so my guess is this library: https://github.com/a8m/documentdb

I think the only common ground is the Go HttpClient.

@RobertGreyling

@serbrech Yes I can confirm we are using the common state store impl for cosmosdb.

@ealsur I have only supplied one ActivityId, but if you need them, I can supply hundreds more as they are happening quite frequently now, though in relatively low numbers when compared with overall successful traffic.

@ealsur
Member

ealsur commented Jan 26, 2023

@RobertGreyling one is enough for now, thanks.

@serbrech
Member Author

serbrech commented Jan 27, 2023

@mattKorwel Could you share the code that reproduces this with upserts?
I wrote a small program doing upserts and reads that ran all night without triggering it... not a single 403:

STATS: GET 200 OK:1166188 - POST 200 OK:1166189

Yet it happens all the time in prod for us still.

If that helps anyone work around it, I made the client retry on 403. It's awful and slow, but at least we don't return an error from our API. With this, we haven't seen a single failure in 3 weeks; it has always succeeded on the 2nd try.

	opt.Retry = policy.RetryOptions{
		StatusCodes: []int{
			// azcore defaults, see RetryOptions doc
			http.StatusRequestTimeout,      // 408
			http.StatusTooManyRequests,     // 429
			http.StatusInternalServerError, // 500
			http.StatusBadGateway,          // 502
			http.StatusServiceUnavailable,  // 503
			http.StatusGatewayTimeout,      // 504

			// adding 403 to work around random 403 from cosmosdb
			// see https://github.com/Azure/azure-sdk-for-go/issues/19785
			http.StatusForbidden, // 403
		},
		MaxRetries:    2,
		TryTimeout:    5 * time.Second,
		RetryDelay:    1 * time.Second,
		MaxRetryDelay: 2 * time.Second, // delay grows exponentially from RetryDelay (1s), capped at 2s
	}

@RobertGreyling

Update on this: I have raised an issue via our account manager with the Azure Cosmos product team; still awaiting an update from them as to the cause.

We've also since upgraded Dapr to the latest (v1.9.5) on the Kubernetes control plane, along with the Dapr client within the Docker images running there. I'm not sure if there is a later version of the Go Azure Cosmos lib being used, but at least everything of ours is on the latest versions.

The problem still persists: we get a few errors every hour, with thousands of the same calls successful in between, thereby slowly corrupting our data over time. We're using manual data fixes in the meantime, but this isn't tenable over the medium term.

I'll update when I have more from the Azure product team, but I'm not hopeful it can be rectified with a code fix from the Go side, as the default behaviour is to use at least TLS 1.2, which is exactly what the error claims is not happening.

Any insights I have missed or could try further would be much appreciated.

@ealsur
Member

ealsur commented Feb 6, 2023

@RobertGreyling The latest dapr library uses the SDK in this repository, which is experiencing the issue as reported in this thread. The issue seems unrelated to the library being used, and rather related to Go's HttpClient and something on the Cosmos DB Gateway/service.

I work on the product team; we have been trying to reproduce this for the past few weeks and have not yet been able to find a pattern. The team is still working on it.

@RobertGreyling

Thanks @ealsur - much appreciated. I've got my team putting in a crude "retry" option at the moment as mentioned by @serbrech and looking to deploy that as soon as possible.

This is a really bad problem to have when you're doing Event Sourcing on top of Cosmos and one call out of a thousand fails thereby corrupting the entire event stream from that point onward.

@ealsur - if you require more ActivityIds to investigate payloads from the Cosmos side, I'm happy to provide hundreds more if that helps reproduce the issue somehow. I struggle to see how it's a problem with the Go HttpClient, given that the same code sends a request that succeeds thousands of times and then fails once, with effectively the same payload each time.

@ealsur
Member

ealsur commented Feb 6, 2023

@RobertGreyling

I struggle to see how it's a problem with the Go HttpClient given that the same code is sending the request that succeeds thousands of times and then fails once with effectively the same payload on each request.

The only common ground between the SDK in this repo and the SDK where you initially reported the problem (https://github.com/a8m/documentdb) is that they both use Go's HttpClient to send the HTTP requests.

I am not saying there is something wrong in Go's HttpClient; what I am saying is that something in its behavior, under some circumstance (this thread mentions renegotiations, but I am not sure), conflicts with expectations on the Cosmos DB Gateway endpoint (the one receiving those HTTP requests). What it is, we don't know yet.

@RobertGreyling

Hi @ealsur - yes agreed the HttpClient is the common ground.

It's worth noting that (as in my original message) the version of the software we were using had not changed in many months on our deployment side, and we were getting zero failures of this kind; then, on the 24th of January, these error events started taking place.

In my head, that doesn't point to anything in the behaviour of the HttpClient, under any circumstances, primarily because those same circumstances (behaviour or otherwise) have been happening for months in a stable unchanging environment, and hence I came to the conclusion that something must have changed downstream from there.

We don't have access to how Cosmos or the Gateway change over time as that is internal to your team of course, but perhaps looking at what might have changed around those dates could point in the right direction?

I would imagine looking at the Cosmos gateway that comparing a successful ActivityId versus an unsuccessful ActivityId of exactly the same payload may help in determining where the problem lies. Such data will become available once this retry option is in place, but my guess is that your team may already have access to that sort of data.

I would also think that additional telemetry is being recorded relating to these failed requests attached to the ActivityIds where headers and protocols can be inspected to determine if the request truly was being made with something less than TLS 1.2, or if in fact the gateway rejected it regardless of its TLS level?

These are questions I cannot get answers to, but being on the inside, you may have better mileage.

@RobertGreyling

For the sake of completeness, we have this morning deployed a new version of our microservice that has a retry option for 403 responses specifically from state failures (gets, puts or posts) linked to this Go HttpClient.

We have since had two failed GETs (ActivityId: 918c25fa-cb80-413a-8d57-52358bbfe31b and f776b2d2-3b96-473b-b69e-b4db4f6a3641) within a few minutes of each other. The subsequent retries of both which happened milliseconds after each request, succeeded without any issue, and as such, we have avoided the subsequent data corruption for these calls.

Of course as has been mentioned on this thread before, this is not ideal, but at least it seems we are not suffering the consequences now - I will update if that changes.

@ealsur I hope those failed ActivityIds can be of some assistance to the internal team, given that an identical request directly afterward succeeded which should hopefully allow the team some guidance on reproducing the issue.

@ealsur
Member

ealsur commented Feb 7, 2023

In my head, that doesn't point to anything in the behaviour of the HttpClient, under any circumstances, primarily because those same circumstances (behaviour or otherwise) have been happening for months in a stable unchanging environment, and hence I came to the conclusion that something must have changed downstream from there.

There was definitely something that changed on the service side: some new validation over TLS that was not there before. Go is the only language reporting these TLS-related failures since that validation seems to have been added; my guess is there is something different that only happens in Go regarding TLS and does not replicate in other languages (.NET, Java, NodeJS, and Python have Cosmos SDKs and are not reporting this problem). Probably this behavior was always there, and the new validation simply surfaced it.

@RobertGreyling

@ealsur that makes sense, and if you've confirmed tighter validation constraints went live in Cosmos/Gateway over the last few weeks, then it all lines up.

In that case, there may be a potential fix that could find its way into the Go HttpClient once the root cause is found. I'm not sure this SDK is the correct project to be reflecting on that then, and instead perhaps an issue needs to be raised on https://github.com/a8m/documentdb if that is indeed the root library for this issue? I haven't been able to confirm that it is, or at least which version, but I note that others have mentioned it, including yourself.

The dapr team would also be interested in knowing about this, as anyone using Cosmos via dapr state management is not getting a reliable experience and possibly doesn't know about it yet.

@ealsur
Member

ealsur commented Feb 7, 2023

@RobertGreyling This repository is for the azcosmos library (the official Cosmos DB Go SDK), which is used in the latest version of dapr. I believe the dapr team is already aware of this.

@RobertGreyling

Update from the Azure Cosmos issue raised via support: they tell me that some configuration changes have been made on the Cosmos/Gateway side and they are continuing to monitor for 403 errors. I suspect you may have further insight on this @ealsur, but mentioning it just in case you're not on the internal chain of comms.

From our Production service logs, I'm happy to report that we've had no further "retries for 403 errors" attempted for the last 12 hours now which means we're no longer experiencing the issue retries were put in place for. We'll be monitoring through the day, but given that we've consistently had at least 3 or 4 per hour over the last two weeks, to have had zero in 12 hours sounds like tremendous progress.

Fingers crossed it stays that way - though I would be interested to know what the config change may have entailed.

@serbrech
Member Author

serbrech commented Feb 8, 2023

Can we have an RCA here, or will that be internal only?

@serbrech
Member Author

serbrech commented Feb 16, 2023

What is the fix deployment schedule?

I have not seen any changes from our logs' point of view; we still occasionally get the same TLS failure.

Was the configuration change applied just to your instance (disabling the minTLS validation)?
@ealsur could you chime in?

@RobertGreyling

That's strange, @serbrech; FYI, we've not had a single TLS error in production since Feb 8.

I was under the impression the fix was service-wide, but perhaps it's only been applied to our instance?

I'm still waiting for the RCA from my account manager/support ticket

@ealsur
Member

ealsur commented Feb 17, 2023

@serbrech I have no visibility into support engagements, and I am not aware of any service-wide fixes regarding this issue. My area of engagement is client SDKs; I don't know what changes could have been made or are planned on the service side, so I cannot comment on what change was done.

@ealsur
Member

ealsur commented Feb 21, 2023

I went and reached out to the team to get more clarity and details to share here. The team has identified a fix related to the HTTP/2 protocol, and it's in the final stages of testing. Once that is completed, it will roll out to all Gateway endpoints, but there is no fixed date for this yet as testing is underway (and we know from the experience on this thread how hard it can be to test/replicate). Will share more updates as it progresses.

kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this issue Apr 11, 2023

#### Summary

This is meant to address #9269 (comment).
On accounts with many subscriptions (e.g. ~1.8k) the discovery phase can error out if multiple concurrent jobs are used.
I tested with 1.8k simulated subscriptions and also hit a number of different network issues.

The Azure SDK already applies default retry logic, which does not include 403 (as reported in the issue). I don't think we should override it, see Azure/azure-sdk-for-go#19785 (comment)

@ealsur
Member

ealsur commented Jul 6, 2023

This issue has been fixed on the service and the fix has been widely deployed. Please re-open if the issue is still happening or arises again.

@ealsur ealsur closed this as completed Jul 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Oct 4, 2023

7 participants