Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added logging for dns resolution error in the Query component as well #2525

Merged
merged 9 commits into from
May 18, 2020
Merged

Added logging for dns resolution error in the Query component as well #2525

merged 9 commits into from
May 18, 2020

Conversation

yashrsharma44
Copy link
Contributor

@yashrsharma44 yashrsharma44 commented Apr 26, 2020

Signed-off-by: Yash [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This PR addresses the issue #2404

Verification

I had run the changes, and the Query component also logs the failure of the dns resolution. Earlier, only the provider.go logged the issue, now both are logging the error.

yash@kmaster  thanos-io/thanos   query-component-issue  ./thanos query --store.sd-dns-resolver=miekgdns --store=dnssrv+_grpc._tcp.thanos-store-grpc.default.svc.cluster.local
level=info ts=2020-04-26T08:10:36.149858286Z caller=main.go:152 msg="Tracing will be disabled"
level=info ts=2020-04-26T08:10:36.150778772Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-04-26T08:10:36.151007023Z caller=query.go:408 msg="starting query node"
level=info ts=2020-04-26T08:10:36.151180057Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2020-04-26T08:10:36.151220838Z caller=http.go:56 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-04-26T08:10:36.151303554Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-04-26T08:10:36.151383662Z caller=grpc.go:106 service=gRPC/server component=query msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=error ts=2020-04-26T08:11:38.153041054Z caller=provider.go:127 msg="dns resolution failed" addr=dnssrv+_grpc._tcp.thanos-store-grpc.default.svc.cluster.local err="lookup SRV records \"_grpc._tcp.thanos-store-grpc.default.svc.cluster.local\": could not resolve \"_grpc._tcp.thanos-store-grpc.default.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.thanos-store-grpc.default.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 127.0.0.1 for _grpc._tcp.thanos-store-grpc.default.svc.cluster.local.: exchange: read udp 127.0.0.1:49740->127.0.0.1:53: i/o timeout"
level=error ts=2020-04-26T08:11:38.153395929Z caller=query.go:328 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.thanos-store-grpc.default.svc.cluster.local\": could not resolve \"_grpc._tcp.thanos-store-grpc.default.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.thanos-store-grpc.default.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 127.0.0.1 for _grpc._tcp.thanos-store-grpc.default.svc.cluster.local.: exchange: read udp 127.0.0.1:49740->127.0.0.1:53: i/o timeout"

Comments

I would like to discuss this approach, and see if this approach matches the requirements, as this is my first technical PR to Thanos! 🤓

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please refactor this to use errors.MultiError instead of an array of error? AFAIK we generally try to avoid using []error as you can see in the current code-base. That other type provides a nicer and unified format IMHO. The code would simpler as well since the for-loop would be contained inside of MultiError.Error() + we'd have one, simple message about all failures instead of spam :P

@yashrsharma44
Copy link
Contributor Author

Could we please refactor this to use errors.MultiError instead of an array of error? AFAIK we generally try to avoid using []error as you can see in the current code-base. That other type provides a nicer and unified format IMHO. The code would simpler as well since the for-loop would be contained inside of MultiError.Error() + we'd have one, simple message about all failures instead of spam :P

Thanks for the review! I guess that sounds elegant, as the error log certainly seems a bit clogged. Making the changes right away! 😄

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some comments and bug challenge =D

CHANGELOG.md Outdated
@@ -14,6 +14,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
### Fixed

- [#2501](https://github.com/thanos-io/thanos/pull/2501) Query: gracefully handle additional fields in `SeriesResponse` protobuf message that may be added in the future.
- [#2525](https://github.com/thanos-io/thanos/pull/2525) Query: Added logging for dns resolution error in the `Query` component.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So fixed or added? (; You put this in Fixed section , but we said "Added" here... ^^

I think it's fixed as we fix issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, some sloppy grammar from my side :P

@@ -304,7 +304,9 @@ func runQuery(
}
fileSDCache.Update(update)
stores.Update(ctxUpdate)
dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...))
if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil {
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error())
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logger will know what to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@@ -304,7 +304,9 @@ func runQuery(
}
fileSDCache.Update(update)
stores.Update(ctxUpdate)
dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...))
if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is quite big bug here, but CI should help you to find what.

I will not say exactly what's wrong for educational purpose, please try to get from CI what's wrong you should be able to quickly spot this! 🤗 Let me know if you are blocked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Challenge accepted!

@@ -748,8 +748,8 @@ func addDiscoveryGroups(g *run.Group, c *http_util.Client, interval time.Duratio

g.Add(func() error {
return runutil.Repeat(interval, ctx.Done(), func() error {
c.Resolve(ctx)
return nil
err := c.Resolve(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

// Fail in case no server address is resolved.
servers := c.dnsProvider.Addresses()
if len(servers) == 0 {
if len(servers) == 0 || err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to always check error, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, made a note.

resolvedAddrs := map[string][]string{}
errorList := tsdberrors.MultiError{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use to use just errs name for those

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -120,6 +123,7 @@ func (p *Provider) Resolve(ctx context.Context, addrs []string) {
resolved, err := p.resolver.Resolve(ctx, name, QType(qtype))
p.resolverLookupsCount.Inc()
if err != nil {
errorList.Add(err)
// The DNS resolution failed. Continue without modifying the old records.
p.resolverFailuresCount.Inc()
level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this wrong.

The problem is that we handle errors TWICE. One time here, then we pass errs somewhere. This will mean that we probably will log errors couple of times and just add confusion... Same with metric which is even more risky. Let's handle error just once. Either handle it all (just log), or pass further... not both! (:

What do you think?

Copy link
Contributor Author

@yashrsharma44 yashrsharma44 May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should not log error in the dns/provider.go, rather the error should be logged by the function requesting it, so that we can clearly log where the error is effecting which function.
So provider.go would return the failed list of errors and the requesting function would log all the errors.
Does it sound fair?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I forgot to mention that, I am not logging the error now in provider.go, rather I am returning it(in my current change that I will push), and the receiver would handle that error(by logging it).

Copy link
Contributor Author

@yashrsharma44 yashrsharma44 May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the caller ("requesting function") needs to act in specific way for those errors?

If we can log in the scope of the requesting function, that would be helpful in understanding which function request resulted in the error. Other than that, there is no specific behaviour that I am looking to intend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this, but still what would be the solid answer to the question:

If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?

(: ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?

As I have understood this question, I think we should return the error and the requesting function should log the error. If you have found any function which is doing both, then please highlight that, I might have missed that 😅

testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+a")))
testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+b")))
testutil.Equals(t, float64(1), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+c")))
if err := prv.Resolve(ctx, []string{"any+x"}); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this pattern?

With this if there is error on ALL resolves tests will be still green.
Why not using just testutil.Ok(t, prv.Resolve...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I didn't had a look at the testutil package, the construct is quite elegant.

pkg/http/http.go Outdated
@@ -163,7 +164,7 @@ func (c FileSDConfig) convert() (file.SDConfig, error) {
}

type AddressProvider interface {
Resolve(context.Context, []string)
Resolve(context.Context, []string) tsdberrors.MultiError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should never return concrete type of error in Go.

Suggested change
Resolve(context.Context, []string) tsdberrors.MultiError
Resolve(context.Context, []string) errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, reverted it.

pkg/http/http.go Outdated
func (c *Client) Resolve(ctx context.Context) {
c.provider.Resolve(ctx, append(c.fileSDCache.Addresses(), c.staticAddresses...))
func (c *Client) Resolve(ctx context.Context) tsdberrors.MultiError {
// Skip the error as it would be logged in the corresponding component.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why we return error then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what do you mean skip? We are not skipping , we are passing it further so handling properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed it.

@yashrsharma44
Copy link
Contributor Author

Thanks for the suggestion, I will make the changes 😄

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! Last suggestions and LGTM (:


// If some of the dns resolution fails, log the error.
if err := c.dnsProvider.Resolve(ctx, c.config.Addresses); err != nil {
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth mentioning addresses?

Suggested change
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err)
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "addresses", strings.Join(c.config.Addresses,","), "err", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that looks neat!

@@ -143,6 +147,11 @@ func (p *Provider) Resolve(ctx context.Context, addrs []string) {
p.resolverAddrs.Submit()

p.resolved = resolvedAddrs
if len(errs) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Power tip: we can do just returns. errs.Err() (: Without check! (check will be done on err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh great, let me add it up!

Signed-off-by: Yash <[email protected]>

Nitpicking

Signed-off-by: Yash <[email protected]>

More nitpicking

Signed-off-by: Yash <[email protected]>

Removed the errorlist from lock scope for faster access

Signed-off-by: Yash <[email protected]>

Added tsdb.MultiError for error handling

Signed-off-by: Yash <[email protected]>

Fixed some import issues

Signed-off-by: Yash <[email protected]>

Fixed linting

Signed-off-by: Yash <[email protected]>

Changelog

Signed-off-by: Yash <[email protected]>

Linting issues

Signed-off-by: Yash <[email protected]>

More linting issues

Signed-off-by: Yash <[email protected]>

fixed import error

Signed-off-by: Yash <[email protected]>

Reverted some changes

Signed-off-by: Yash <[email protected]>

Added some more error checks

Signed-off-by: Yash <[email protected]>

Fixed some failing tests

Signed-off-by: Yash <[email protected]>

Linting

Signed-off-by: Yash <[email protected]>

Removed extra error param for logging

Signed-off-by: Yash <[email protected]>

Changelog modification

Signed-off-by: Yash <[email protected]>

Fixed unused variable

Signed-off-by: Yash <[email protected]>

Logged failed dns resolution in the memcached_client

Signed-off-by: Yash <[email protected]>

Changed the errorList to errs

Signed-off-by: Yash <[email protected]>

Changed concrete error type to primitive one

Signed-off-by: Yash <[email protected]>

Added logging of dns resolution failure in `http.go`
Changed the error message in the `memcached_client.go`

Signed-off-by: Yash <[email protected]>

Removed logging of errors in . Now it would return error list, which is to be handled by the receiver

Signed-off-by: Yash <[email protected]>

Removed Err() method

Signed-off-by: Yash <[email protected]>

Fixed import errors

Signed-off-by: Yash <[email protected]>

Removed extra variables

Signed-off-by: Yash <[email protected]>

Linting

Signed-off-by: Yash <[email protected]>

Check to ensure that return error if list of error is non-empty

Signed-off-by: Yash <[email protected]>

Modified tests of provider_test.go

Signed-off-by: Yash <[email protected]>

more checks

Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, LGTM

@bwplotka bwplotka merged commit 960563c into thanos-io:master May 18, 2020
@bwplotka
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants