all: DNS request amplification with default settings #7812

Closed
vitaminmoo opened this issue Apr 21, 2023 · 11 comments
Labels
priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@vitaminmoo

Client

I believe this affects all clients, but I am primarily testing with bigtable and have personally verified that spanner is also impacted.

Environment

Kubernetes, including GKE

Go Environment

$ go version
go version go1.20.3 linux/amd64

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN="/workspace/gobin"
GOCACHE="/builder/home/.cache/go-build"
GOENV="/builder/home/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY="github.com/lytics"
GONOSUMDB="github.com/lytics"
GOOS="linux"
GOPATH="/go"
GOPRIVATE="github.com/<redacted>"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD="/workspace/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2036783142=/tmp/go-build -gno-record-gcc-switches"

Code

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()
	client, err := bigtable.NewClient(ctx, "project", "instance")
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	tbl := client.Open("table")
	for {
		_, err := tbl.SampleRowKeys(ctx)
		if err != nil {
			log.Fatalf("sampling: %v", err)
		}
		fmt.Println("sampled")
		time.Sleep(10 * time.Second)
	}
}

Expected behavior

A couple of DNS requests are made to initiate connections.

Actual behavior

70 extra DNS requests are made and refreshed regularly. This number grows if option.WithGRPCConnectionPool is set higher than the default of four.

Screenshots

[Screenshot: CPU usage of the node-local-dns daemonset before and after the mitigations described below were put in place just before Apr 20]

Additional context

In GKE with the default DNS policy (ndots:5, five search suffixes) and the default pool of four connections, initializing the code above causes 75 DNS requests, only five of which are useful, and most of those are duplicated.

Counts of lookups during initialization, from running tcpdump on port 53 in a sidecar container:

      4 SRV? _grpclb._tcp.bigtable.googleapis.com.svc.cluster.local.
      4 SRV? _grpclb._tcp.bigtable.googleapis.com.google.internal.
      4 SRV? _grpclb._tcp.bigtable.googleapis.com.default.svc.cluster.local.
      4 SRV? _grpclb._tcp.bigtable.googleapis.com.c.lyticsstaging.internal.
      4 SRV? _grpclb._tcp.bigtable.googleapis.com.cluster.local.
      4 SRV? _grpclb._tcp.bigtable.googleapis.com.
      4 A? bigtable.googleapis.com.svc.cluster.local.
      4 A? bigtable.googleapis.com.google.internal.
      4 A? bigtable.googleapis.com.default.svc.cluster.local.
      4 A? bigtable.googleapis.com.c.lyticsstaging.internal.
      4 A? bigtable.googleapis.com.cluster.local.
      4 A? bigtable.googleapis.com. // this is 4x larger than it should be
      4 AAAA? bigtable.googleapis.com.svc.cluster.local.
      4 AAAA? bigtable.googleapis.com.google.internal.
      4 AAAA? bigtable.googleapis.com.default.svc.cluster.local.
      4 AAAA? bigtable.googleapis.com.c.lyticsstaging.internal.
      4 AAAA? bigtable.googleapis.com.cluster.local.
      4 AAAA? bigtable.googleapis.com. // this is 4x larger than it should be
      1 PTR? 10.16.0.10.in-addr.arpa. // this is fine
      1 A? metadata.google.internal. // this is fine
      1 AAAA? metadata.google.internal. // this is fine

  • All SRV lookups, at least for bigtable, result in NXDOMAIN 100% of the time (and as such are only cached by node-local-dns for two seconds)
  • All DNS requests to .local or .internal domains are pointless, are caused by the default ndots:5 and search suffix list, and also result in NXDOMAIN
  • There are 4x more valid DNS requests than necessary due to the default pool size; increasing the pool size from four further increases the DNS requests

NXDOMAIN responses are not generally cached heavily, so with large pool sizes and large numbers of pods the traffic generated can be immense, causing timeouts talking to kube-dns or high CPU usage in node-local-dns.

These queries are also repeated once per hour by default, though as mentioned below we were seeing /much/ more aggressive re-resolution.

In our production environment, this was causing node-local-dns to have high CPU usage (20-50 cores just for node-local-dns) and to log millions of timeouts talking to kube-dns in the form of [ERROR] plugin/errors: 2 <hostname> A: read tcp <node IP:port>-><kubedns IP>:53: i/o timeout, as documented here, but happening with modern GKE.

The factors at play appear to be:

  • The local resolver cache isn't shared between connections, causing multiplication by the size of your connection pool (4 by default, but ours in production are much larger than that)
  • grpclb features are not disabled, causing SRV lookups that are never going to return anything
  • The API hostnames for some or all endpoints lack trailing periods to make them fully qualified, so the search suffixes get appended under Kubernetes' default ndots:5 config
  • Additionally, it looks like golang's LookupSRV tries the search suffixes after receiving an NXDOMAIN even if ndots is set to something sane like 1, so mitigating this outside of code changes is difficult (see the sketch after this list)
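
For illustration, the SRV amplification can be reproduced outside the client library with a plain stdlib lookup. This is a minimal sketch (not the exact code path gRPC's resolver uses), intended to be run while watching port 53 with tcpdump in a sidecar:

package main

import (
	"fmt"
	"net"
)

func main() {
	// The grpclb resolver issues an SRV query of the form _grpclb._tcp.<host>.
	// With a non-rooted name and Kubernetes' default resolv.conf, a failed
	// lookup is retried against every search suffix before NXDOMAIN is
	// returned, which is the amplification described above.
	cname, srvs, err := net.LookupSRV("grpclb", "tcp", "bigtable.googleapis.com")
	if err != nil {
		fmt.Println("SRV lookup failed (expected for most customers):", err)
		return
	}
	fmt.Println("cname:", cname, "records:", len(srvs))
}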

As for mitigation, you can:

  • Set ndots:1, which drops the non-SRV search suffix lookups
  • Set the undocumented GOOGLE_CLOUD_DISABLE_DIRECT_PATH environment variable, which causes a different resolver to be used that doesn't try grpclb lookups - this seems like a hack, though (an in-process sketch follows this list)
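
A minimal in-process sketch of the second mitigation, assuming the variable is consulted at dial time and that the value checked is the string "true" (setting it in the pod spec achieves the same thing):

package main

import (
	"context"
	"log"
	"os"

	"cloud.google.com/go/bigtable"
)

func main() {
	// Assumption: the transport checks this variable when the client dials,
	// so it has to be set before NewClient runs.
	os.Setenv("GOOGLE_CLOUD_DISABLE_DIRECT_PATH", "true")

	ctx := context.Background()
	client, err := bigtable.NewClient(ctx, "project", "instance")
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	defer client.Close()
	// ... use the client as usual ...
}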

With both of these set, DNS requests are still 2 * the option.WithGRPCConnectionPool size, due to the A and AAAA records each being queried per connection.
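
For completeness, a sketch of where that multiplier comes from: the pool size is set with option.WithGRPCConnectionPool, and each connection in the pool resolves the endpoint on its own (a pool of 4 is assumed here, matching the default):

package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// With the mitigations above in place, residual DNS traffic still scales
	// as roughly 2 * pool size, since every connection does its own A and
	// AAAA lookup.
	client, err := bigtable.NewClient(ctx, "project", "instance",
		option.WithGRPCConnectionPool(4))
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	defer client.Close()
}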

With our large connection pools in our production environment, we also saw constant re-resolution of all these names on a rather rapid cadence - causing up to hundreds or thousands of DNS requests per second per pod, indefinitely.

Proposed changes to solve this properly:

  • Share a global resolver cache process-wide so DNS lookups happen once per process, not once per connection
  • Disable grpclb features for services that do not implement them
  • Change hostnames to fully qualified with a trailing dot to avoid search suffix amplification

Please let me know if a more thorough example of reproduction and measurement is desired.

@vitaminmoo vitaminmoo added the triage me I really want to be triaged. label Apr 21, 2023
@codyoss codyoss added status: investigating The issue is under investigation, which is determined to be non-trivial. and removed triage me I really want to be triaged. labels Apr 21, 2023
@codyoss codyoss self-assigned this Apr 21, 2023
@codyoss
Member

codyoss commented Apr 21, 2023

Thank you for your thoughtful and thorough report. We will investigate and report back.

@codyoss
Member

codyoss commented Apr 21, 2023

cc @mohanli-ml This seems to affect clients that are using direct path

@vitaminmoo
Author

@codyoss note that we are not, to my knowledge, /using/ direct path; it's just that the codepath taken when it's disabled is different enough that the SRV records aren't looked up, which decreases the DNS lookups by 20.

@codyoss
Member

codyoss commented Apr 24, 2023

A small update:

Share a global resolver cache process-wide so DNS lookups happen once per process, not once per connection

Still need to explore this more, but it seems possible. I want to chat with some colleagues who know go-grpc better, though.

Disable grpclb features for services that do not implement them

One of my other colleagues will update on this thread related to direct path code.

Change hostnames to fully qualified with a trailing dot to avoid search suffix amplification

I don't think we will move forward with this, but will investigate a little bit more. I know this was attempted once in our node libraries, caused issues, and ended up getting reverted: googleapis/google-cloud-node#2213

I think tuning resolver confs, as you are doing, is indeed the best option right now for reducing sub-path queries.

@vitaminmoo
Author

vitaminmoo commented Apr 24, 2023

I don't think we will move forward with this, but will investigate a little bit more. I know this was attempted once in our node libraries and caused issues and ended up getting reverted. googleapis/google-cloud-node#2213

I suspect the story in golang is a lot saner than in nodejs. For instance, here's a test in the stdlib looking up both google.com and google.com. and expecting the result to be valid, which seems to indicate that this behavior is expected to work on all platforms: https://github.com/golang/go/blob/master/src/net/lookup_test.go#L388-L419

@mohanli-ml
Contributor

mohanli-ml commented Apr 26, 2023

FYI @markdroth @ejona86 @apolcyn

Thanks for the report! This issue can happen when: (1) the client attempts DirectPath with gRPCLB, AND (2) traffic falls back to CloudPath.

Bigtable and Spanner clients attempt DirectPath with gRPCLB by default. In this case, the top-level gRPC channel is created with the grpclb service config, https://github.com/googleapis/google-api-go-client/blob/main/transport/grpc/dial.go#L183.

Since DirectPath is still an experimental feature, it is only accessible to a few early-access customers, and other customers' traffic is expected to fall back to CloudPath. The fallback is achieved by not returning the gRPCLB server domain name in the SRV query.

However, the grpclb service config will force a re-resolution every 30s if the channel fails to get the gRPCLB server domain name from the SRV query. Based on your report, each re-resolution issues 18 queries (one SRV, one A, and one AAAA for each of the 6 search-suffix name variants), so the per-channel query QPS is 18/30 = 0.6.

If your channel pool size is N, the total query QPS will be 0.6*N. For example, if your total QPS is 100, then your channel pool size is ~167.

Based on this understanding, I have some responses about your report.

I believe this affects everything, but am primarily testing with bigtable and have personally verified that spanner is also impacted.
This issue should only affect Bigtable and Spanner, because only these two services attempt DirectPath by default.

With our large connection pools in our production environment, we also saw constant re-resolution of all these names on a rather rapid cadence - causing up to hundreds or thousands of DNS requests per second per pod, indefinitely.
There are two issues here:

  • Extra DNS queries due to a non-fully-qualified domain name (i.e., bigtable.googleapis.com instead of bigtable.googleapis.com.);
  • Constant re-resolution every 30s;

For the first issue, using FQDN can reduce DNS queries by a constant factor and should indeed be a reasonable workaround for your use case. We could conceivably make this the default behavior of the client library, but that change may present some risk so it would need more investigation.

For the constant re-resolution, we can avoid it by either disabling DirectPath (GOOGLE_CLOUD_DISABLE_DIRECT_PATH, as you mentioned) or attempting DirectPath with Traffic Director (GOOGLE_CLOUD_ENABLE_DIRECT_PATH_XDS). However, since DirectPath with Traffic Director is still under development, I think for now the best short-term workaround will be disabling DirectPath.

BE CAREFUL: DirectPath with gRPCLB is expected to be deprecated and replaced by DirectPath with Traffic Director, https://github.com/googleapis/google-api-go-client/blob/main/transport/grpc/dial.go#L158-L186. However, we are not ready to make DirectPath with Traffic Director the default DirectPath behavior. We expect to finish the migration this year, at which point this will no longer be an issue for DirectPath with Traffic Director. Disabling DirectPath means DirectPath with Traffic Director is also disabled. That is fine for now, but once DirectPath with Traffic Director is ready we expect every Bigtable and Spanner customer to eventually use it by default, so remember to remove GOOGLE_CLOUD_DISABLE_DIRECT_PATH later. We will work on a proper long-term solution for this issue before DirectPath with Traffic Director is ready.

@vitaminmoo
Author

vitaminmoo commented Apr 26, 2023

Thank you for the details, @mohanli-ml - I think with what you provided, the explanation of our measured behavior is complete.

Constant re-resolution every 30s;

This likely explains how extreme ours was in production before setting GOOGLE_CLOUD_DISABLE_DIRECT_PATH. We have tens of thousands of connections to bigtable in production, so the DNS QPS got extremely large.

Note that because golang's LookupSRV falls back to trying the search suffixes /after/ the initial query fails (vs. trying them before the non-rooted query), this query gets amplified by the length of the search suffix list (5 in kube, so 6 queries total) regardless of the ndots setting. As far as I can tell, there's no way to remove items from the search suffix list in kubernetes without nuking the entire dnsConfig, which requires you to put cluster-specific DNS server IP addresses in your pod config, which is not super fun from an infrastructure-as-code perspective.

This seems like a really unfortunate list of defaults that all interact together to incrementally cause bad behavior. I think it's an important note that I don't know what direct path is, so having to explicitly disable it to get our DNS servers to not cost thousands of dollars per month serving NXDOMAINs seems to be on the non-ideal end of the spectrum.

For the first issue, using FQDN can reduce DNS queries by a constant factor and should indeed be a reasonable workaround for your use case. We could conceivably make this the default behavior of the client library, but that change may present some risk so it would need more investigation.

It does not currently seem like there's a good way to externally override this. I definitely would not want to be in the business of maintaining hardcoded hostnames for Google services. Having an option to enable the use of fully qualified domain names in the client would be helpful, though not as good as having it as the default, given that there's no reason for the client to attempt to connect to local services under normal circumstances. I haven't tested this, but it seems like having a kubernetes service named bigtable in the same namespace would break the bigtable client entirely, as with the default ndots setting the local name would be preferred.

So in summary:

  • Our current mitigations are as good as it gets with the current client code
  • Kubernetes is probably not changing their defaults any time soon
  • GOOGLE_CLOUD_DISABLE_DIRECT_PATH is fine for now but will disable unreleased features that we may want in the future
  • Per-process DNS cache sharing and the use of fully qualified hostnames would be the largest real fixes (avoiding multiplication by connection pool size and by search suffix length), and are maybe possible, if somewhat scary (though the test I previously linked does seem to indicate that trailing dots are expected to work in golang)

@codyoss
Member

codyoss commented Apr 26, 2023

...the use of fully qualified hostnames would be the largest real fixes

If you want to do this in your own code, you always can with the client option that overrides the endpoint: https://pkg.go.dev/google.golang.org/api/option#WithEndpoint
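
Something like this, as an untested sketch (whether every layer of the stack accepts the trailing dot would need verification):

package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// Override the default endpoint (bigtable.googleapis.com:443) with a
	// rooted, trailing-dot name so the Kubernetes search suffixes are never
	// appended during resolution.
	client, err := bigtable.NewClient(ctx, "project", "instance",
		option.WithEndpoint("bigtable.googleapis.com.:443"))
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	defer client.Close()
}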

@ejona86

ejona86 commented Apr 27, 2023

Just FYI for vitaminmoo, the 30s is a rate limit in the dns resolver. It makes sure to not re-resolve until it has been at least 30s since the last resolution. It isn't the cause of the re-resolution; something else is doing that.

However, the grpclb service config will force a re-resolution every 30s if the channel fails to get gRPCLB server domain name from SRV query.

@mohanli-ml, that doesn't sound right (it may be what happens, but it shouldn't happen). The grpclb policy itself should only cause re-resolution if there are no ordinary backend addresses and no grpclb addresses. That seems to be the case. And the resolver shouldn't consider it a failure either. That seems to be the case. And if either of these were the case, I'd expect it to cause exponential backoff (which looks like it would take 4 minutes before being discernible from 30s polling, as it starts at 1s and then is delayed to 30s because of the rate limit).

There are ways to cause the continual refreshing with grpclb, but they are more like "SRV returns results, but we can't connect to them" or "grpclb server returns addresses, but we can't connect to them."

@codyoss codyoss added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 15, 2023
@codyoss
Member

codyoss commented Oct 4, 2023

Looked into this a little bit yesterday and it seemed like sharing a global resolver was not going to be consistently helpful. The pure Go implementation of DNS resolution does not seem to cache, from my understanding -- but the cgo version does. So depending on how a user builds their code this would have no effect. Also, the default resolver for gRPC points back to the same standard default resolver that the http package uses. Even when the default is not used, gRPC sets "PreferGo" on the resolver: https://github.com/grpc/grpc-go/blob/39972fdd744873a2b829ab98962ab0e85800d591/internal/resolver/dns/dns_resolver.go#L106
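
To illustrate the distinction, a small sketch of the two resolution modes in the standard library (the pure Go path is what gRPC pins via PreferGo; whether the cgo path benefits from caching depends on the system resolver, e.g. nscd or systemd-resolved):

package main

import (
	"context"
	"fmt"
	"log"
	"net"
)

func main() {
	ctx := context.Background()

	// Pure Go resolver: sends DNS queries itself and does not keep a cache.
	goResolver := &net.Resolver{PreferGo: true}
	addrs, err := goResolver.LookupHost(ctx, "bigtable.googleapis.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pure Go resolver:", addrs)

	// Default resolver: may route through the cgo/system resolver depending
	// on CGO_ENABLED, GODEBUG=netdns, and the platform, which is where any
	// caching would come from.
	addrs, err = net.DefaultResolver.LookupHost(ctx, "bigtable.googleapis.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("default resolver:", addrs)
}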

@codyoss codyoss added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Oct 17, 2023
@codyoss
Member

codyoss commented Apr 16, 2024

#9186 should have fixed this issue for most users.

@codyoss codyoss closed this as completed Apr 16, 2024