all: DNS request amplification with default settings #7812
Comments
Thank you for your thoughtful and thorough report. We will investigate and report back.
cc @mohanli-ml This seems to affect clients that are using direct path.
@codyoss Note that we are not, to my knowledge, /using/ direct path; it's just that the codepath taken when it's disabled is different enough that the SRV records aren't looked up, which decreases the DNS lookups by 20.
A small update:
- Still need to explore this more, but it seems possible. I want to chat with some colleagues who know go-grpc better first, though.
- One of my other colleagues will update this thread regarding the direct path code.
I don't think we will move forward with this, but I will investigate a little bit more. I know this was attempted once in our node libraries, caused issues, and ended up getting reverted: googleapis/google-cloud-node#2213. I think tuning resolver confs, as you are doing, is indeed the best option right now for reducing sub-path queries.
I suspect the story in golang is a lot saner than in nodejs. For instance, here's a test in the stdlib looking up both
FYI @markdroth @ejona86 @apolcyn

Thanks for the report! This issue can happen when (1) the client attempts DirectPath with gRPCLB, AND (2) traffic falls back to CloudPath. Bigtable and Spanner clients attempt DirectPath with gRPCLB by default. In this case, the top gRPC channel is created with the grpclb service config: https://github.com/googleapis/google-api-go-client/blob/main/transport/grpc/dial.go#L183. Since DirectPath is still an experimental feature, it is only accessible to a few early-access customers, and other customers' traffic is expected to fall back to CloudPath. The fallback is achieved by not returning the gRPCLB server domain name in the SRV query. However, the grpclb service config will force a re-resolution every 30s if the channel fails to get the gRPCLB server domain name from the SRV query.

Based on your report, each re-resolution issues 18 queries (one SRV, one A, and one AAAA, each expanded to 6 names by the search suffix list), so the per-channel query QPS is 18/30 = 0.6. If your channel pool size is N, the total query QPS will be 0.6*N. For example, if your total query QPS is 100, then your channel pool size is ~167.

Based on this understanding, I have some responses to your report.
For the first issue, using FQDNs can reduce DNS queries by a constant factor and should indeed be a reasonable workaround for your use case. We could conceivably make this the default behavior of the client library, but that change may present some risk, so it would need more investigation.

For the constant re-resolution, we can avoid it by either disabling DirectPath (GOOGLE_CLOUD_DISABLE_DIRECT_PATH, as you mentioned) or attempting DirectPath with Traffic Director (GOOGLE_CLOUD_ENABLE_DIRECT_PATH_XDS). However, since DirectPath with Traffic Director is still under development, I think for now the best short-term workaround is disabling DirectPath.

BE CAREFUL: DirectPath with gRPCLB is expected to be deprecated and replaced by DirectPath with Traffic Director (https://github.com/googleapis/google-api-go-client/blob/main/transport/grpc/dial.go#L158-L186), but we are not yet ready to make DirectPath with Traffic Director the default DirectPath behavior. We expect to finish the migration this year, at which point this will no longer be an issue for DirectPath with Traffic Director. Disabling DirectPath also disables DirectPath with Traffic Director. That is fine for now, but once DirectPath with Traffic Director is ready we expect every Bigtable and Spanner customer to eventually use it by default, so remember to remove GOOGLE_CLOUD_DISABLE_DIRECT_PATH later. We will work on a proper long-term solution for this issue before DirectPath with Traffic Director is ready.
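For reference, a minimal sketch of the short-term workaround in Go. It assumes GOOGLE_CLOUD_DISABLE_DIRECT_PATH is read at dial time, so it must be set before the first client is created; setting it in the pod spec or deployment environment is equivalent. The project and instance names are placeholders.

```go
package main

import (
	"context"
	"log"
	"os"

	"cloud.google.com/go/bigtable"
)

func main() {
	// Short-term workaround discussed above: opt out of DirectPath with gRPCLB
	// so the grpclb service config (and its SRV lookups plus 30s re-resolution)
	// is never installed. Setting this in the deployment environment works too.
	os.Setenv("GOOGLE_CLOUD_DISABLE_DIRECT_PATH", "true")

	ctx := context.Background()
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```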
Thank you for the details, @mohanli-ml. I think with what you provided, the explanation of our measured behavior is complete.

This likely explains how extreme ours was in production before setting GOOGLE_CLOUD_DISABLE_DIRECT_PATH. We have tens of thousands of connections to bigtable in production, so the DNS QPS got extremely large.

Note that because golang's LookupSRV falls back to trying search suffixes /after/ the initial query fails (vs trying them before the non-rooted query), this query gets amplified by the search suffix list length (5 in kube, so 6 queries total) regardless of the ndots setting. As far as I can tell, there's no way to remove items from the search suffix list in kubernetes without nuking the entire dnsConfig, which requires you to put cluster-specific DNS server IP addresses in your pod config, which is not super fun from an infrastructure-as-code perspective.

This seems like a really unfortunate list of defaults that all interact to incrementally cause bad behavior. I think it's an important note that I don't know what direct path is, so having to explicitly disable it to get our DNS servers to not cost thousands of dollars per month serving NXDOMAINs seems to be on the non-ideal end of the spectrum.
It does not currently seem like there's a good way to externally override this. I definitely would not want to be in the business of maintaining hardcoded hostnames for Google services. Having an option to enable usage of fully qualified domain names in the client would be helpful, though not as good as having it as the default, given that there's no reason for the client to attempt to connect to local services under normal circumstances. I haven't tested this, but it seems like having a kubernetes deployment named

So in summary:
If you want to do this in your code, you always can with the client option to override the endpoint: https://pkg.go.dev/google.golang.org/api/option#WithEndpoint
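For example, a minimal sketch of pointing the Bigtable client at a rooted (fully qualified) endpoint so the resolver never applies the pod's search suffixes. The trailing dot and the "bigtable.googleapis.com:443" default are assumptions for illustration, not something verified in this thread.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// The trailing dot makes the name rooted, so the Go resolver should not
	// expand it with the search suffix list (and ndots no longer matters).
	// Whether every transport accepts the rooted form would need testing.
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance",
		option.WithEndpoint("bigtable.googleapis.com.:443"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```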
Just FYI for vitaminmoo, the 30s is a rate limit in the dns resolver. It makes sure to not re-resolve until it has been at least 30s since the last resolution. It isn't the cause of the re-resolution; something else is doing that.
@mohanli-ml, that doesn't sound right (it may be what happens, but it shouldn't happen). The grpclb policy itself should only cause re-resolution if there are no ordinary backend addresses and no grpclb addresses. That seems to be the case. And the resolver shouldn't consider it a failure either. That seems to be the case. And if either of these were the case, I'd expect it to cause exponential backoff (which looks like it would take 4 minutes before being discernible from 30s polling, as it starts at 1s and then is delayed to 30s because of the rate limit). There are ways to cause the continual refreshing with grpclb, but they are more like "SRV returns results, but we can't connect to them" or "grpclb server returns addresses, but we can't connect to them."
Looked into this a little bit yesterday and it seemed like sharing a global resolver was not going to be consistently helpful. The pure Go implementation of DNS resolution does not seem to cache, from my understanding -- but the cgo version does. So depending on how a user builds their code this would have no effect. Also, the default resolver for gRPC points back to the same resolver which is the standard default found in the http package. Even when the default is not used, gRPC sets "PreferGo" on the resolver: https://github.com/grpc/grpc-go/blob/39972fdd744873a2b829ab98962ab0e85800d591/internal/resolver/dns/dns_resolver.go#L106
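For context, a sketch of what a "PreferGo" lookup looks like at the net package level. This is only an illustration of the resolver selection the comment refers to, not gRPC's internal code, and the hostname is just an example.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// PreferGo forces the pure Go DNS client instead of the platform (cgo)
	// resolver, so any caching the platform resolver might do is bypassed.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "bigtable.googleapis.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(addrs)
}
```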
#9186 should have fixed this issue for most users. |
Client
I believe this affects everything, but am primarily testing with bigtable and have personally verified that spanner is also impacted
Environment
Kubernetes, including GKE
Go Environment
Code
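A minimal sketch of the kind of initialization described in this report (a Bigtable client with default settings). Project, instance, and table names are placeholders, and the pool size is spelled out only to make the multiplier visible (four is already the default).

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// Plain client construction with default settings is enough to trigger
	// the lookups described below; each connection in the pool resolves the
	// endpoint, so the pool size multiplies the A/AAAA queries.
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance",
		option.WithGRPCConnectionPool(4))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Opening a table and issuing any RPC exercises the connections.
	_ = client.Open("my-table")
}
```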
Expected behavior
A couple DNS requests are made to initiate connections
Actual behavior
70 extra DNS requests are made, and refreshed regularly. This number gets larger if option.WithGRPCConnectionPool is higher than the default of four
Screenshots
CPU usage of the node-local-dns daemonset before and after the mitigations described below were put in place, just before Apr 20
Additional context
In GKE with the default DNS policy (ndots 5, five search suffixes) and the default pool of four connections, the code above causes 75 DNS requests on initialization, only five of which are useful, and most of those are duplicated
Counts of lookups during initialization, from tcpdumping port 53 in a sidecar container:
- All SRV lookups, at least for bigtable, result in NXDomains 100% of the time (and as such are only cached by node-local-dns for two seconds)
- All DNS requests to .local or .internal domains are pointless, caused by the default ndots:5 and search suffix list, and also result in NXDomains
- There are 4x more valid DNS requests than necessary due to the default pool size; increasing the pool size from four increases the DNS requests further
- NXDomains are not generally cached heavily, so for large pool sizes and large numbers of pods the traffic generated can be immense, causing timeouts talking to kube-dns or high CPU usage in node-local-dns
- These queries are also repeated once per hour by default, though as mentioned below we were seeing it be /much/ more aggressive
In our production environment, this was causing node-local-dns to have high CPU usage (20-50 cores just for node-local-dns) and to log millions of timeouts talking to kube-dns in the form of
[ERROR] plugin/errors: 2 <hostname> A: read tcp <node IP: port>-><kubedns IP>:53: i/o timeout
as documented here, but happening with modern GKE. The factors at play appear to be:
As for mitigation, you can disable DirectPath (GOOGLE_CLOUD_DISABLE_DIRECT_PATH) and tune the resolver configuration so the search suffix list stops amplifying the lookups. With both of these set, DNS requests are still 2 * option.WithGRPCConnectionPool, due to the A and AAAA records each being queried per connection.

With our large connection pools in our production environment, we also saw constant re-resolution of all these names on a rather rapid cadence, causing up to hundreds or thousands of DNS requests per second per pod, indefinitely.
Proposed changes to solve this properly:
- Use fully qualified (rooted) domain names for the Google API endpoints by default, so the resolver search suffixes are never applied
- Cache or share DNS resolution across the connection pool instead of resolving per connection
Please let me know if a more thorough example of reproduction and measurement is desired