Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected DNS caching #726

Closed
paulgrav opened this issue Jul 30, 2018 · 19 comments · Fixed by #1612
Closed

Unexpected DNS caching #726

paulgrav opened this issue Jul 30, 2018 · 19 comments · Fixed by #1612

Comments

@paulgrav
Copy link

paulgrav commented Jul 30, 2018

I’d better like to understand K6s behaviour in relation to DNS caching.

I’m trying to run tests against a couple of load balanced K8S clusters. The failover is managed by a traffic manager so once the traffic manager has detected a failure in one of the clusters it will update the DNS records so that traffic only hits the healthy cluster. I’ve tried to reduce the problem, described below.

Steps to reproduce:

  1. Create a performance test to run for 5m against a particular endpoint. E.g., https://gist.github.com/paulgrav/a8b6d77639b0d54774b499c5dfd660c6
  2. Ensure that host.name points to a CNAME that returns HTML and has a TTL of 60s.
  3. Run the test.
  4. After one minute, update the CNAME for host.name to a domain that doesn’t resolve e.g., foo.bar.asdf

Actual results:

All HTTP calls in the test succeed despite the CNAME record being changed to point to a domain that doesn’t resolve.

Expected results:

After 60s but before the test ends, there should be some test failures because the HTTP requests fail.

@luizbafilho
Copy link
Contributor

Hi @paulgrav,

That is expected, we cache the DNS records for the whole period of the test in order to maximize the throughput. https://github.com/loadimpact/k6/blob/9531175bc4f5b525655db6c8f96865689bb215e7/lib/netext/dialer.go#L54

I guess it would be nice to have an option to set the DNS caching time, what do you think @na-- ?

@na--
Copy link
Member

na-- commented Jul 30, 2018

Yeah, caching DNS records forever doesn't seem like the most flexible solution. It may be fine for simple tests, but some flexibility would be best, even though it probably should only be exposed via a JS/envvar option, not a CLI flag, since I doubt it would be commonly used.

Not sure how easy it would be to implement, but is seems to me like the default k6 behavior shouldn't be to cache everything indefinitely, but to respect the TTL setting of DNS records... As @paulgrav mentions in the issue title, the current behavior may definitely be unexpected to users.

@paulgrav
Copy link
Author

Thanks for acknowledging this. Respecting the TTL would be ideal. Overriding the DNS caching via envvar would be better than the current hardcoded indefinite cache time.

@bookmoons
Copy link
Contributor

Looking into what it would take to do this.

Here's the situation:

  • Stdlib doesn't expose TTL.
  • I'm not finding a ready made caching resolver that respects TTL.

I propose to handle this by adding TTL support to an existing quality package then replacing the current resolver with that package.

domainr/dnsr seems to be the nicest. It has caching. It's well tested. It throws meaningful errors. The team is receptive. Don't think it will be hard to add.

@na--
Copy link
Member

na-- commented Jun 12, 2019

I'm not familiar with domainr/dnsr, but at a first glance, I don't see any issues with using it, if it had TTL support added. It depends on the miekg/dns DNS library, which is probably what will be using to implement #851 and #1011 anyway.

@bookmoons
Copy link
Contributor

OK, perfect.

@bookmoons
Copy link
Contributor

Started a submission to dnsr domainr/dnsr#32 to add TTL support.

@bookmoons
Copy link
Contributor

The dnsr team has kindly accepted the patch. Got it on my list to put together a k6 patch trying it as a replacement resolver.

@bookmoons
Copy link
Contributor

Realized something else is necessary. I have a question in with dnsr domainr/dnsr#35 about a second patch.

@bookmoons
Copy link
Contributor

I'm not hearing back from dnsr. They have their own efforts taking attention.

What would you guys think about forking dnsr and publishing a loadimpact resolver? There does seem to be a need for it. Had trouble finding a decent one when I did that search. The license is MIT.

I have a tested patch for this final feature ready to submit.

@na--
Copy link
Member

na-- commented Jul 8, 2019

I'd prefer that we avoid forking a project that's at least somewhat active... Given the way Go imports are specified by URLs, it's a bit of a PITA to temporarily fork a repo. I'm not sure if it would be better after #1070, but for now it's probably not worth it.

I haven't delved into the dnsr code or your patches there yet, but does your latest patch need to be in the repo itself? Can it work as a helper function in the k6 repo as a temporary workaround?

@bookmoons
Copy link
Contributor

Alright. I'll try pinging dnsr, maybe they just haven't seen it.

Tried to find something external first, but I don't think there's a way. It needs to use cache in a new way. The cache field is private so (I think?) there's no way to get at it from outside.

@bookmoons
Copy link
Contributor

Opened a preliminary submission for this #1085. Turned out there are a few things dnsr doesn't quite do. I've built around it to try to handle them. It puts more resolution logic outside the resolver than I really wanted. Won't blame you if you don't want it. But at least it gives you the option. Maybe things can be moved into the resolver over time.

@na--
Copy link
Member

na-- commented Jul 18, 2019

Thanks! I'll take a thorough look tomorrow or early next week.

@na-- na-- added this to the v0.26.0 milestone Aug 27, 2019
@na-- na-- modified the milestones: v0.26.0, v0.27.0 Oct 10, 2019
@gabrielqs
Copy link

Hi! Would this by any chance be available on 0.26? I'm experiencing it on a round-robin internal domain managed by route 53, TTLs aren't respected to a single node ends receiving all of the traffic from k6.

@na--
Copy link
Member

na-- commented Nov 20, 2019

@gabrielqs, sorry for the late response 😞 Unfortunately these fixes won't be in k6 v0.26.0, since we're in the final stages of testing and hoping to publicly release it by the end of this week. We're aiming to have all of the DNS fixes be part of the next k6 release though, so watch this space for updates! For context, the delay with fixing the DNS issues stems from the lack of suitable DNS resolving and caching libraries for our use cases in k6, so we basically had to write one, starting from a lower-level than expected...

@gabrielqs
Copy link

@na-- thanks for the info!

@sniku
Copy link
Collaborator

sniku commented Apr 30, 2020

I have an interesting observation regarding DNS caching.

When running a large scale test against a system hosted on AWS, the load-balancer will change IPs for the target system during autoscaling.

A "cold" LB will expose only 1 IP. When it detects a large load, it will start auto-scaling and will spin up additional LB machines, and expose additional IPs. The old IP from the "cold" LB will be removed after a while.

Since k6 cached the original IP and doesn't know about the new IPs, requests will start failing even though the target system isn't overloaded.

image

Errors like this are printer by k6 on the terminal when it can't connect to the old IP:

WARN[0057] Request Failed                                error="Get http://pawel.staging.loadimpact.com/static/logo.svg?url=v3: dial tcp 52.18.24.222:80: i/o timeout"

Another thing I have noticed is that k6 doesn't use all available IPs. It seems to be using up to 3 IPs.

In my test, there were 6 IPs available for the domain, but only 3 were used by k6.
image

@na-- na-- modified the milestones: v0.27.0, v0.28.0 May 21, 2020
imiric pushed a commit to imiric/k6 that referenced this issue Aug 19, 2020
This more or less reproduces the scenario described in grafana#726.
imiric pushed a commit to imiric/k6 that referenced this issue Aug 20, 2020
This more or less reproduces the scenario described in grafana#726.
@imiric
Copy link
Contributor

imiric commented Sep 9, 2020

While working on #1612 I stumbled upon this project: https://github.com/ncruces/go-dns

It's a very lightweight implementation of custom resolvers that works by overriding net.DefaultResolver. If nothing else it could serve as a reference for fixing these DNS issues in k6 with a less radical approach than writing custom resolvers from scratch, something we've made progress on by building on top of https://github.com/miekg/dns, which is a much more complex re-implementation of DNS in Go. The net.DefaultResolver approach would be much simpler while continuing to benefit from the stdlib DNS support.

If we need an abstraction for DNS messages, we could also use https://godoc.org/golang.org/x/net/dns/dnsmessage instead of miekg/dns. This package is used internally by the stdlib so it makes sense to reuse that in our code rather than bring in a large dependency like miekg/dns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants