azcosmos occasionally returning 403 Forbidden "Connection insufficiently secured" #19785
The Cosmos Go SDK has no custom HTTP configuration and uses the Azure SDK pipeline, so I don't see what could be related to this from the Cosmos SDK point of view. @jhendrixMSFT Is the TLS behavior something that the Go HTTP client sets?
@serbrech You mentioned that this started in early December. Was there any change related to the code itself? Was the Cosmos Go SDK updated or changed? Did any configuration in the environment change? If no code changed, then from what I can understand the root cause of this issue is probably not the application code.
That's correct, yes: it uses the standard HTTP client, which defaults to minTLS 1.2 since Go 1.18. I reported this to the Cosmos DB service directly, and they bounced me back to the SDK as a client issue, even though I can't see how this could happen occasionally and non-deterministically from the client side... Note the error message logging a .NET SDK, as well as the message itself being poorly worded; it seems to be missing the minimum TLS version value. I think the message might be intended to include it, but the 1.2 argument is empty in the gateway code.
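For anyone who wants to rule out the client-side TLS configuration explicitly, here is a minimal sketch of pinning TLS 1.2 on the transport handed to the SDK. It assumes azcosmos.ClientOptions embeds azcore.ClientOptions (as in current versions of this package); the helper name tls12ClientOptions is ours, not part of the SDK.

```go
package cosmostls

import (
	"crypto/tls"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// tls12ClientOptions builds azcosmos client options whose HTTP transport
// explicitly pins the minimum TLS version to 1.2 (already Go's default since
// 1.18, restated here only to rule out the client side). Pass the result to
// the azcosmos client constructor you use.
func tls12ClientOptions() *azcosmos.ClientOptions {
	httpClient := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
	return &azcosmos.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			// *http.Client satisfies the pipeline's Transporter interface.
			Transport: httpClient,
		},
	}
}
```

Even with the minimum version pinned like this, the reports later in the thread show the intermittent 403s continue, which points away from the client configuration.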
This is because the error message comes from a .NET environment and includes the internal client identifier; Cosmos DB consumes the SDKs internally too. Your HTTP request is reaching the Gateway (https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-connection-modes) and is then internally routed to the backend replicas over TCP.
My thoughts exactly, unless something in the environment is randomly stripping TLS. So there were no code or environment changes in December and it just started to happen?
Could this be related: #19469 (comment)?
I'm seeing the same behavior and it's not occasional; I'd say it's pretty consistent, at about 2 out of 10 calls. This is happening both in dev and in production (meaning I highly doubt it's the environment on our side).
@ealsur the related issue seems unlikely to me. The call works, just not all the time. I did a loop upserting the same document and out of 100 calls, 20 of them threw this error. If it was a config issue I would expect none of them to work. PS: this also just started recently, looking at our logs.
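For anyone trying to reproduce, here is a minimal sketch of the kind of loop described above, assuming an already-constructed *azcosmos.ContainerClient; the upsertLoop helper and its parameters are hypothetical, not part of the SDK.

```go
package cosmosrepro

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// upsertLoop repeatedly upserts the same document and counts how many calls
// come back as 403 Forbidden, mirroring the 100-call experiment above.
func upsertLoop(ctx context.Context, container *azcosmos.ContainerClient, pkValue string, doc []byte, calls int) int {
	forbidden := 0
	for i := 0; i < calls; i++ {
		_, err := container.UpsertItem(ctx, azcosmos.NewPartitionKeyString(pkValue), doc, nil)
		if err == nil {
			continue
		}
		var respErr *azcore.ResponseError
		if errors.As(err, &respErr) && respErr.StatusCode == http.StatusForbidden {
			forbidden++
			// The error body carries the ActivityId useful for server-side lookups.
			log.Printf("call %d returned 403: %v", i, err)
		} else {
			log.Printf("call %d failed with a different error: %v", i, err)
		}
	}
	return forbidden
}
```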
@mattKorwel Is this something that started happening for you too without any client changes? Did you update the Go version by any chance? (That is one of the things we discovered on the other case.) When did it start for you?
Friday, Dec 9th is the first time we see it in the logs. No config changes or Go version changes that I can find; in fact, no deploys that week at all with the early holidays. I did update to the newest version of the SDK today and am testing it now. BTW, here is an ActivityId from a recent failure; I am hoping you have the ability to look up server-side logs based on that?
We started seeing it in early December too. The first occurrence in our logs was on the 3rd, if I recall correctly from the investigation data.
@ealsur Any luck tracing this down? I'm seeing it ever more frequently. I've added retries specifically for this in most code paths, but it's quite impactful at this point.
@mattKorwel No luck so far. This seems unrelated to the SDK; I have reported it to the service team to try to track down the source. So far it seems only Go customers are experiencing it.
I've also started seeing this happen once or twice a day, in code which was deployed last month, so it ran without hitting this issue until the 21st of January.
@JasonQuinn Would you mind sharing 1 ActivityId from the failures you are seeing?
@ealsur The ActivityId for one that happened at 00:09 UTC today was ecea9ef6-b5a3-4119-b0d6-b3ec3f1040eb. The full error was:
Interesting that the error ^^ says documentdb-dotnet-sdk (i.e. not golang)
@jim-minter
I came across this thread looking for a solution to the same error, which has been taking place in our production environment for approximately the last 2 days. Nothing in our production env has changed. We are using DAPR as part of our microservice architecture, which is built in Go, but we're running our services on top of that, built in .NET Core 6. The production env has remained unchanged since late 2022, so we know there's no config/code change, and we're getting the same error from two of our 26 microservices, and only occasionally, so it seems that Cosmos in Azure is rejecting these requests for some reason. Our errors all look very similar to this:
@RobertGreyling Which SDK is this using? Or is this a custom REST API wrapper in C#? From the ActivityId, it looks like you are using
@jim-minter Yes, this is an internal user agent.
@ealsur - thanks for responding. This was the only place on the internet (so far) I could find even talking about this, but it did seem coincidental that this is being reported against a Go SDK and our microservice architecture is running on DAPR, which is also built using Go; in order to support state calls to Cosmos, I'm assuming it is using that. We're currently running on DAPR v1.8 in production and the latest is v1.9, so I may look at a simple upgrade to see if that fixes it, but if this is one of the only places talking about this issue, then I imagine not many people even know about it, which doesn't give me much hope that v1.9 will solve the issue. It seems the problem may reside in a common Go lib that talks to Azure, but I haven't gone digging there yet. That's what I'll likely do if the upgrade doesn't solve it.
@RobertGreyling Assuming you use the state store implementation for cosmosdb, it's using this same package:
@mattKorwel are you able to provide a simplified reproducer? I haven't yet been able to successfully create one.
I am also working on trying to get a repro, but have been unsuccessful. @serbrech, based on the ActivityId, @RobertGreyling is using something that generates this useragent:
I think the only common ground is the Go HttpClient.
@serbrech Yes, I can confirm we are using the common state store impl for cosmosdb. @ealsur I have only supplied one ActivityId, but if you need them, I can supply hundreds more as they are happening quite frequently now, though in relatively low numbers when compared with overall successful traffic.
@RobertGreyling one is enough for now, thanks.
@mattKorwel Could you share the code that reproduces this with upserts?
Yet it happens all the time in prod for us still. If it helps anyone to work around it, I made the client retry on 403. It's awful and slow, but at least we don't return an error from our API. With this, we haven't seen a single failure in 3 weeks; it has always succeeded on the 2nd try.
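For anyone wanting the same stopgap, here is a hedged sketch of what such a retry wrapper might look like against this SDK. retryOn403 is a hypothetical helper (not part of azcosmos) and assumes failed calls surface as *azcore.ResponseError, which is how this SDK reports non-2xx responses.

```go
package cosmosretry

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// retryOn403 runs op and retries it up to attempts times when the returned
// error is an *azcore.ResponseError with status 403 Forbidden. Success or any
// other kind of error is returned immediately so real failures are not masked.
func retryOn403(ctx context.Context, attempts int, op func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
		var respErr *azcore.ResponseError
		if !errors.As(err, &respErr) || respErr.StatusCode != http.StatusForbidden {
			return err // not the intermittent 403 discussed in this thread
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(200 * time.Millisecond): // small fixed backoff between tries
		}
	}
	return err
}
```

Each Cosmos call is then wrapped, e.g. `retryOn403(ctx, 3, func(ctx context.Context) error { _, err := container.UpsertItem(ctx, pk, doc, nil); return err })`. As noted above, this only hides the symptom; the root cause still sits between Go's HTTP client and the Gateway.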
Update on this: I have raised an issue via our account manager with the Azure Cosmos product team and am still awaiting an update from them as to the cause. We've also since upgraded dapr to the latest (v1.9.5) on the Kubernetes control plane, along with the dapr client within the Docker images running on there. I'm not sure if there is any later version of the Go Azure Cosmos lib being used, but at least everything of ours is on the latest versions. The problem still persists: we get a few errors every hour with thousands of the same calls succeeding in between, thereby slowly corrupting our data over time. We're applying manual data fixes in the meantime, but this isn't tenable over the medium term. I'll update when I have more from the Azure product team, but I'm not hopeful it can be rectified with a code fix on the Go side, as the default behaviour is to use at least TLS 1.2, which is exactly what the error says is not happening. Any insights I have missed or could try further would be much appreciated.
@RobertGreyling The latest dapr library uses the SDK in this repository, which is experiencing the issue as reported in this thread. The issue seems unrelated to the library being used, but rather related to Go's HttpClient and something on the Cosmos DB Gateway / Service. I work on the product team; we have been working on trying to reproduce this for the past weeks and have not yet been able to find a pattern. The team is still working on it.
Thanks @ealsur - much appreciated. I've got my team putting in a crude "retry" option at the moment, as mentioned by @serbrech, and looking to deploy that as soon as possible. This is a really bad problem to have when you're doing Event Sourcing on top of Cosmos and one call out of a thousand fails, thereby corrupting the entire event stream from that point onward. @ealsur - if you require more ActivityIds to investigate payloads from the Cosmos side, I'm happy to provide hundreds of them if that helps reproduce the issue somehow. I struggle to see how it's a problem with the Go HttpClient given that the same code sends a request that succeeds thousands of times and then fails once, with effectively the same payload on each request.
The only common ground between the SDK in this repo and the SDK where you initially reported the problem (https://github.com/a8m/documentdb) is that they both use Go's HttpClient to send the HTTP requests. I am not saying there is something wrong with Go's HttpClient; what I am saying is that something in its behavior, under some circumstance (this thread mentions renegotiations, but I am not sure), is conflicting with expectations on the Cosmos DB Gateway endpoint (the one receiving those HTTP requests). What it is, we don't know yet.
Hi @ealsur - yes, agreed, the HttpClient is the common ground. It's worth noting (as in my original message) that the version of the software we were using had not changed in many months on our deployment side and we were getting zero failures of this kind, but that on the 24th of January these error events started taking place. To my mind, that doesn't point to anything in the behaviour of the HttpClient, under any circumstances, primarily because those same circumstances (behaviour or otherwise) have been present for months in a stable, unchanging environment; hence I came to the conclusion that something must have changed downstream from there. We don't have access to how Cosmos or the Gateway change over time, as that is internal to your team of course, but perhaps looking at what might have changed around those dates could point in the right direction?
I would imagine that comparing, at the Cosmos Gateway, a successful ActivityId against an unsuccessful ActivityId for exactly the same payload may help determine where the problem lies. Such data will become available once this retry option is in place, but my guess is that your team may already have access to that sort of data. I would also think that additional telemetry is recorded for these failed requests, attached to the ActivityIds, where headers and protocols can be inspected to determine whether the request truly was made with something less than TLS 1.2, or whether in fact the gateway rejected it regardless of its TLS level. These are questions I cannot get answers to, but being on the inside, you may have better mileage.
For the sake of completeness, we have this morning deployed a new version of our microservice that has a retry option specifically for 403 responses from state failures (gets, puts or posts) linked to this Go HttpClient. We have since had two failed GETs that were caught by the retry. Of course, as has been mentioned on this thread before, this is not ideal, but at least it seems we are not suffering the consequences now - I will update if that changes. @ealsur I hope those failed GETs help with the investigation.
There was definitely something that changed on the service side: some new validation over TLS that was not there before. Go is the only language reporting these TLS-related failures since this validation seems to have been added; my guess is that there is something different that only happens in Go regarding TLS and does not replicate in other languages (.NET, Java, NodeJS, and Python have Cosmos SDKs and are not reporting this problem). Probably this behavior was always there, but the validation now brought it up.
@ealsur that makes sense, and if you've confirmed tighter validation constraints went live in Cosmos/Gateway over the last few weeks, then that makes sense too and it all lines up. In that case, there may be a potential fix that could find its way into the Go HttpClient once the root cause is found. I'm not sure this SDK is the correct project to be reflecting on that, then; perhaps instead an issue needs to be raised on https://github.com/a8m/documentdb if that is indeed the root library for this issue? I haven't been able to confirm that it is, or at least which version, but I note that others have mentioned it, including yourself. The dapr team would also be interested in knowing about this, as anyone using Cosmos via dapr state management is not getting a reliable experience and possibly doesn't know about it yet.
@RobertGreyling This repository is for the azcosmos library (the official Cosmos DB Go SDK), which is used in the latest version of dapr. I believe the dapr team is already aware of this.
Update from the Azure Cosmos issue raised via support: they tell me that some configuration changes have been made on the Cosmos/Gateway side and they are continuing to monitor for 403 errors. I suspect you may have further insight on this, @ealsur, but mentioning it just in case you're not on the internal chain of comms. From our production service logs, I'm happy to report that we've had no further retries for 403 errors attempted in the last 12 hours, which means we're no longer experiencing the issue the retries were put in place for. We'll be monitoring through the day, but given that we've consistently had at least 3 or 4 per hour over the last two weeks, to have had zero in 12 hours sounds like tremendous progress. Fingers crossed it stays that way - though I would be interested to know what the config change entailed.
Can we have an RCA here, or will that be internal only?
What is the fix deployment schedule? I have not seen any changes from our logs' point of view; we still occasionally get the same TLS failure. Was the configuration change made only on your instance (don't validate minTLS?)?
That's strange @serbrech - FYI we've not had a single TLS error in production since Feb 8. I was under the impression the fix was service-wide, but perhaps it's only been applied to our instance? I'm still waiting for the RCA from my account manager/support ticket.
@serbrech I have no visibility into support engagements, and I am not aware of any service-wide fixes regarding this issue. My area of engagement is the client SDKs; I don't know what changes have been made or are planned on the service side, so I cannot comment on what change was done.
I went and reached out to the team to get more clarity and details about the situation to share here. The team has identified a fix related to the HTTP/2 protocol, and it's in the final stages of testing. Once that is completed, it will roll out to all Gateway endpoints, but there is no fixed date for this yet as testing is underway (and we know, based on the experience in this thread, how hard it can be to test/replicate). Will share more updates as it progresses.
#### Summary
This is meant to address #9269 (comment). On accounts with many subscriptions (e.g. ~1.8k) the discovery phase can error out if multiple concurrent jobs are used. I tested with 1.8k simulated subscriptions and also hit a number of different network issues. The Azure SDK already applies default retry logic, which does not include 403 (as reported in the issue). I don't think we should override it, see Azure/azure-sdk-for-go#19785 (comment)
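For context on that last point, this is roughly what overriding the pipeline's retryable status codes would look like; it is shown only to illustrate the option the linked comment argues against. The helper name is hypothetical, it again assumes azcosmos.ClientOptions embeds azcore.ClientOptions, and the listed defaults reflect my understanding of azcore's retry policy (setting StatusCodes replaces, not extends, the defaults).

```go
package cosmosretrycfg

import (
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// clientOptionsRetrying403 builds client options whose retry policy also
// treats 403 as retryable, in addition to the usual transient status codes.
func clientOptionsRetrying403() *azcosmos.ClientOptions {
	return &azcosmos.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			Retry: policy.RetryOptions{
				StatusCodes: []int{
					http.StatusRequestTimeout,      // 408
					http.StatusTooManyRequests,     // 429
					http.StatusInternalServerError, // 500
					http.StatusBadGateway,          // 502
					http.StatusServiceUnavailable,  // 503
					http.StatusGatewayTimeout,      // 504
					http.StatusForbidden,           // 403: the intermittent TLS error from this thread
				},
			},
		},
	}
}
```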
This issue has been fixed on the service and the fix has been widely deployed. Please re-open if the issue is still happening or arises again.
Bug Report
Every now and then, a call to Cosmos DB will return 403 Forbidden, with a message that the client did not use the minimum TLS version:
Note that the error message includes a reference to the .NET DocumentDB SDK. We do not use .NET; we receive this error when talking to Cosmos DB from the Go SDK.
Go 1.18+ defaults to minTLS 1.2 in the HTTP stack, and we explicitly set it anyway.
Almost all calls from that same client succeed, and a few calls a day fail with 403 without any changes in configuration. This has been happening in all regions since early December.
Expected behavior: no min TLS error.
To reproduce: run a service that issues calls to a DB continuously; it will hit this error.
Environment: I don't think there is anything special about our environment.