Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ruler][DNS] Don't propagate no such host error if using default resolver #3257

Merged
merged 11 commits into from
Oct 15, 2020

Conversation

OGKevin
Copy link
Contributor

@OGKevin OGKevin commented Sep 29, 2020

Signed-off-by: Kevin Hellemun [email protected]

Fixes #3186

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@OGKevin
Copy link
Contributor Author

OGKevin commented Sep 29, 2020

Shall there be a debug log entry when this happens 🤔 ?

@OGKevin OGKevin marked this pull request as draft September 29, 2020 14:31
@OGKevin OGKevin marked this pull request as ready for review September 30, 2020 09:27
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

It's not the only a ruler who is using this, also Querier and other components and projects including Cortex.

I wonder if this makes sense, maybe putting it behind some option would do 🤔

No host found is when you cannot reach DNS host and this is definitely a configuration, server or just access failure. Imagine you are rolling to new cluster and somehow pod cannot each DNS server. By masking this error we will never know about this error right?

cc @pracucci @pstibrany

pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member

What exactly you want to achieve @OGKevin what's the use case? (:

@OGKevin
Copy link
Contributor Author

OGKevin commented Sep 30, 2020

AFAIK, Host not found is also returned when the DNS server is reachable. Have a look at this article for example.

➜  ~ host k3jhut933.messagebird.io
Host k3jhut933.messagebird.io not found: 3(NXDOMAIN)
➜  ~ host google.com  
google.com has address 172.217.17.78
google.com has IPv6 address 2a00:1450:4001:817::200e
google.com mail is handled by 30 alt2.aspmx.l.google.com.
google.com mail is handled by 20 alt1.aspmx.l.google.com.
google.com mail is handled by 10 aspmx.l.google.com.
google.com mail is handled by 50 alt4.aspmx.l.google.com.
google.com mail is handled by 40 alt3.aspmx.l.google.com.

The DNS server is working fine, the DNS just does not exits, hence the NXDOMAIN which in go results in a Host not found error.

The issue linked in #3186 has more context of the use case that we want to avoid. Thought, this use case is ruler specific 🤔 not sure how other components handle this.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I checked a bit and it looks like indeed IsNotFound is returned when EAI_NONAME is given by Linux resolver (getaddrinfo) so you might be right.

Still I would consider this as an error, potentially a configuration error case. (imagine you made a typo in crafting a DNS target for alertmananger).

I think the main question regarding your issue is really, should a ruler just mention the error and continue running or not. This decision might orthogonal to this change and in fact depends what it resolves for (key functionality or not) 🤔

Some more opinions welcome, but I would rather stop crashing ruler on fail like this on start, but still log & instrument it as failure overall.

pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
@OGKevin
Copy link
Contributor Author

OGKevin commented Sep 30, 2020

The problem might not be a config issue only. If Alertmanagers was up and running fine, and the DNS name becomes invalid because all pods are down. Ruler wil crashloop without making any config change to ruler.

So IMO Ruler should for sure not crash because Alertmanagers can't be resolved. Because ruler is not only responsible for evaluating alerting rules. I also agree that there should be a log entry saying that DNS resolution failed.

On the other issues, e.g. how it impacts the other components I can't make a clear assessment. What we could do indeed is, add a flag and set this to "true" by default for ruler, and false for the other components so that the behaviour only changes for Ruler.

@bwplotka
Copy link
Member

. Ruler wil crashloop without making any config change to ruler.

Hm, crashlooop will actually apply all config changes right?

@OGKevin
Copy link
Contributor Author

OGKevin commented Sep 30, 2020

Let me try to rephrase: ruler will crashloop not because of a config change made by a human to ruler. Ruler will go in a crashloop because a dependency, Alertmanager, is down and k8s service FQDN resolution returns a NXDOMAIN (or equivalent) and dies. Subsequently, ruler won't start up because of the DNS resolution failure and eventually ends up in a crash loop.

addDiscoveryGroups(g, amClient, alertmgrsDNSSDInterval)

thanos/cmd/thanos/rule.go

Lines 792 to 798 in e4941a5

g.Add(func() error {
return runutil.Repeat(interval, ctx.Done(), func() error {
return c.Resolve(ctx)
})
}, func(error) {
cancel()
})

@OGKevin OGKevin marked this pull request as draft September 30, 2020 13:33
@OGKevin OGKevin force-pushed the 3186-fix-default-dns-crash branch from 87462b9 to e30a981 Compare September 30, 2020 13:49
@OGKevin
Copy link
Contributor Author

OGKevin commented Oct 1, 2020

The link that is being reported as dead is not actually dead. I dont quite understand why e2e failed 🤔

@OGKevin OGKevin marked this pull request as ready for review October 1, 2020 08:31
@OGKevin OGKevin requested a review from bwplotka October 1, 2020 08:31
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to dns package looks good to me.

pkg/discovery/dns/provider.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for offline chat, I think this makes sense, but let's make it a normal behaviour.

Some tests would be nice as well (:

pkg/discovery/dns/provider.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
pkg/discovery/dns/provider.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
if dnsErr, ok := err.(*net.DNSError); !ok || !dnsErr.IsNotFound || s.returnErrOnNotFound {
return nil, errors.Wrapf(err, "lookup IP addresses %q", host)
}
level.Error(s.logger).Log("msg", "failed to lookup IP addresses", "host", host, "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's delete this and if only, let's have one for both miekg and Go DNS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I don't quite understand this comment 🤔. Can you elaborate a lillte bit more?

pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
Signed-off-by: Kevin Hellemun <[email protected]>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am fine with this consistency check, thanks!

@bwplotka
Copy link
Member

Can you rebase 🤗

@OGKevin OGKevin force-pushed the 3186-fix-default-dns-crash branch from a9489a3 to 19ab4a3 Compare October 15, 2020 08:50
@OGKevin
Copy link
Contributor Author

OGKevin commented Oct 15, 2020

@bwplotka I could not rebase as I merged master in this branch before. Rebasing now will cause a headache. Do you want me to squash this to 1 commit?

@bwplotka bwplotka merged commit 7a9a077 into thanos-io:master Oct 15, 2020
@bwplotka
Copy link
Member

Nah no need, GH squashes this. Thanks and sorry for long discussion (:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Thanos ruler dies on startup if AM DNS can't be resolved
3 participants