-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added logging for dns resolution error in the Query component as well #2525
Added logging for dns resolution error in the Query component as well #2525
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please refactor this to use errors.MultiError
instead of an array of error
? AFAIK we generally try to avoid using []error
as you can see in the current code-base. That other type provides a nicer and unified format IMHO. The code would simpler as well since the for-loop would be contained inside of MultiError.Error()
+ we'd have one, simple message about all failures instead of spam :P
Thanks for the review! I guess that sounds elegant, as the error log certainly seems a bit clogged. Making the changes right away! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Some comments and bug challenge =D
CHANGELOG.md
Outdated
@@ -14,6 +14,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel | |||
### Fixed | |||
|
|||
- [#2501](https://github.com/thanos-io/thanos/pull/2501) Query: gracefully handle additional fields in `SeriesResponse` protobuf message that may be added in the future. | |||
- [#2525](https://github.com/thanos-io/thanos/pull/2525) Query: Added logging for dns resolution error in the `Query` component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So fixed or added? (; You put this in Fixed
section , but we said "Added" here... ^^
I think it's fixed as we fix issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, some sloppy grammar from my side :P
cmd/thanos/query.go
Outdated
@@ -304,7 +304,9 @@ func runQuery( | |||
} | |||
fileSDCache.Update(update) | |||
stores.Update(ctxUpdate) | |||
dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)) | |||
if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil { | |||
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error()) | |
level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logger will know what to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
cmd/thanos/query.go
Outdated
@@ -304,7 +304,9 @@ func runQuery( | |||
} | |||
fileSDCache.Update(update) | |||
stores.Update(ctxUpdate) | |||
dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)) | |||
if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is quite big bug here, but CI should help you to find what.
I will not say exactly what's wrong for educational purpose, please try to get from CI what's wrong you should be able to quickly spot this! 🤗 Let me know if you are blocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Challenge accepted!
cmd/thanos/rule.go
Outdated
@@ -748,8 +748,8 @@ func addDiscoveryGroups(g *run.Group, c *http_util.Client, interval time.Duratio | |||
|
|||
g.Add(func() error { | |||
return runutil.Repeat(interval, ctx.Done(), func() error { | |||
c.Resolve(ctx) | |||
return nil | |||
err := c.Resolve(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! No need for separate variable though.
Please see https://thanos.io/contributing/coding-style-guide.md/#avoid-defining-variables-used-only-once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
pkg/cacheutil/memcached_client.go
Outdated
// Fail in case no server address is resolved. | ||
servers := c.dnsProvider.Addresses() | ||
if len(servers) == 0 { | ||
if len(servers) == 0 || err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to always check error, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, made a note.
pkg/discovery/dns/provider.go
Outdated
resolvedAddrs := map[string][]string{} | ||
errorList := tsdberrors.MultiError{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use to use just errs
name for those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
pkg/discovery/dns/provider.go
Outdated
@@ -120,6 +123,7 @@ func (p *Provider) Resolve(ctx context.Context, addrs []string) { | |||
resolved, err := p.resolver.Resolve(ctx, name, QType(qtype)) | |||
p.resolverLookupsCount.Inc() | |||
if err != nil { | |||
errorList.Add(err) | |||
// The DNS resolution failed. Continue without modifying the old records. | |||
p.resolverFailuresCount.Inc() | |||
level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this wrong.
The problem is that we handle errors TWICE. One time here, then we pass errs
somewhere. This will mean that we probably will log errors couple of times and just add confusion... Same with metric which is even more risky. Let's handle error just once. Either handle it all (just log), or pass further... not both! (:
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should not log error in the dns/provider.go
, rather the error should be logged by the function requesting it, so that we can clearly log where the error is effecting which function.
So provider.go
would return the failed list of errors and the requesting function would log all the errors.
Does it sound fair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so I forgot to mention that, I am not logging the error now in provider.go
, rather I am returning it(in my current change that I will push), and the receiver would handle that error(by logging it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the caller ("requesting function") needs to act in specific way for those errors?
If we can log in the scope of the requesting function, that would be helpful in understanding which function request resulted in the error. Other than that, there is no specific behaviour that I am looking to intend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this, but still what would be the solid answer to the question:
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
(: ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
As I have understood this question, I think we should return the error and the requesting function should log the error. If you have found any function which is doing both, then please highlight that, I might have missed that 😅
pkg/discovery/dns/provider_test.go
Outdated
testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+a"))) | ||
testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+b"))) | ||
testutil.Equals(t, float64(1), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+c"))) | ||
if err := prv.Resolve(ctx, []string{"any+x"}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this pattern?
With this if there is error on ALL resolves tests will be still green.
Why not using just testutil.Ok(t, prv.Resolve...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I didn't had a look at the testutil
package, the construct is quite elegant.
pkg/http/http.go
Outdated
@@ -163,7 +164,7 @@ func (c FileSDConfig) convert() (file.SDConfig, error) { | |||
} | |||
|
|||
type AddressProvider interface { | |||
Resolve(context.Context, []string) | |||
Resolve(context.Context, []string) tsdberrors.MultiError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should never return concrete type of error in Go.
Resolve(context.Context, []string) tsdberrors.MultiError | |
Resolve(context.Context, []string) errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, reverted it.
pkg/http/http.go
Outdated
func (c *Client) Resolve(ctx context.Context) { | ||
c.provider.Resolve(ctx, append(c.fileSDCache.Addresses(), c.staticAddresses...)) | ||
func (c *Client) Resolve(ctx context.Context) tsdberrors.MultiError { | ||
// Skip the error as it would be logged in the corresponding component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why we return error then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what do you mean skip? We are not skipping , we are passing it further so handling properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed it.
Thanks for the suggestion, I will make the changes 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing! Last suggestions and LGTM (:
pkg/cacheutil/memcached_client.go
Outdated
|
||
// If some of the dns resolution fails, log the error. | ||
if err := c.dnsProvider.Resolve(ctx, c.config.Addresses); err != nil { | ||
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth mentioning addresses?
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) | |
level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "addresses", strings.Join(c.config.Addresses,","), "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that looks neat!
pkg/discovery/dns/provider.go
Outdated
@@ -143,6 +147,11 @@ func (p *Provider) Resolve(ctx context.Context, addrs []string) { | |||
p.resolverAddrs.Submit() | |||
|
|||
p.resolved = resolvedAddrs | |||
if len(errs) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Power tip: we can do just returns. errs.Err()
(: Without check! (check will be done on err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh great, let me add it up!
Signed-off-by: Yash <[email protected]> Nitpicking Signed-off-by: Yash <[email protected]> More nitpicking Signed-off-by: Yash <[email protected]> Removed the errorlist from lock scope for faster access Signed-off-by: Yash <[email protected]> Added tsdb.MultiError for error handling Signed-off-by: Yash <[email protected]> Fixed some import issues Signed-off-by: Yash <[email protected]> Fixed linting Signed-off-by: Yash <[email protected]> Changelog Signed-off-by: Yash <[email protected]> Linting issues Signed-off-by: Yash <[email protected]> More linting issues Signed-off-by: Yash <[email protected]> fixed import error Signed-off-by: Yash <[email protected]> Reverted some changes Signed-off-by: Yash <[email protected]> Added some more error checks Signed-off-by: Yash <[email protected]> Fixed some failing tests Signed-off-by: Yash <[email protected]> Linting Signed-off-by: Yash <[email protected]> Removed extra error param for logging Signed-off-by: Yash <[email protected]> Changelog modification Signed-off-by: Yash <[email protected]> Fixed unused variable Signed-off-by: Yash <[email protected]> Logged failed dns resolution in the memcached_client Signed-off-by: Yash <[email protected]> Changed the errorList to errs Signed-off-by: Yash <[email protected]> Changed concrete error type to primitive one Signed-off-by: Yash <[email protected]> Added logging of dns resolution failure in `http.go` Changed the error message in the `memcached_client.go` Signed-off-by: Yash <[email protected]> Removed logging of errors in . Now it would return error list, which is to be handled by the receiver Signed-off-by: Yash <[email protected]> Removed Err() method Signed-off-by: Yash <[email protected]> Fixed import errors Signed-off-by: Yash <[email protected]> Removed extra variables Signed-off-by: Yash <[email protected]> Linting Signed-off-by: Yash <[email protected]> Check to ensure that return error if list of error is non-empty Signed-off-by: Yash <[email protected]> Modified tests of provider_test.go Signed-off-by: Yash <[email protected]> more checks Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
…rror Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
Signed-off-by: Yash <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, LGTM
Thanks! |
Signed-off-by: Yash [email protected]
Changes
This PR addresses the issue #2404
Verification
I had run the changes, and the Query component also logs the failure of the dns resolution. Earlier, only the
provider.go
logged the issue, now both are logging the error.Comments
I would like to discuss this approach, and see if this approach matches the requirements, as this is my first technical PR to Thanos! 🤓