Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Remote storages can warn rather than error on failure #1938

Merged
merged 6 commits into from
Sep 13, 2019

Conversation

arnikola
Copy link
Collaborator

What this PR does / why we need it:
Previously any failing remote storage calls would cause the entire fanout query to fail. This PR sets default remote storage options to warn users but keep going, with an option to override it and set remotes as necessary.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

) Remote {
if override != nil {
global = *override
}
Copy link
Collaborator

@robskillington robskillington Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit(ignore if you'd like): Any reason to do this in the constructor, than perhaps just in the for loop creating the remotes?

Had to follow to next method call to work out what the ... defaultBehavior, nil) meant (i.e. what is nil param used for).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah missed responding to this; tried it both ways and this looked a little cleaner

fmt.Fprintf(os.Stderr, "unable to load %s: %v", runOpts.ConfigFiles, err)
os.Exit(1)
}
fmt.Fprintf(os.Stdout, "using %s config files: %v", serviceName, runOpts.ConfigFiles)

fmt.Fprintf(os.Stdout, "using %s config files: %v",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I haven't seen this fmt.Printf before - should we perhaps not be printing out this out? Seems strange and don't think any of our other servers do it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was added to help debug an issue where the wrong config was getting picked up #1803 (fixing #1801)

If we don't need it, happy to get rid of it though

remotes, enabled, err := remoteClient(cfg, tagOptions, poolWrapper,
readWorkerPool, instrumentOptions)
opts := config.RemoteOptionsFromConfig(cfg.RPC)
lookback := *cfg.LookbackDuration
Copy link
Collaborator

@robskillington robskillington Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, isn't it safer to use cfg.LookbackDurationOrDefault() here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was pretty bleh about the way this whole thing is setup, but its guaranteed to be safe here since this happens just above: https://sourcegraph.com/github.com/m3db/m3@arnikola/remote-error-behavior/-/blob/src/query/server/server.go#L218:29

I didn't want to just pull it out since some stuff depends on that lookback duration being set; wouldn't mind refactoring it a bit but wanted to stay more focused here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although looks like I can use lookbackDuration here at least

import "fmt"

var (
// NB: Container behavior must not be defined in configs, so it does not count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Is it worth adding what a container behavior is? Haven't managed to grok what that is, might be worth mentioning here too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good. Might change name to composite or something, basically it's only for fanout.Storage

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

zap.Error(err),
zap.String("store", store.Name()),
zap.String("function", "FetchBlocks"))
}
Copy link
Collaborator

@robskillington robskillington Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this slightly more go like:

if warning, err := storage.IsWarning(store, err); warning {
	numWarning++
	s.instrumentOpts.Logger().Warn(
		"partial results: fanout to store returned warning",
		zap.Error(err),
		zap.String("store", store.Name()),
		zap.String("function", "FetchBlocks"))
		return
}

multiErr = multiErr.Add(err)
s.instrumentOpts.Logger().Warn(
	"fanout to store returned error",
	zap.Error(err),
	zap.String("store", store.Name()),
	zap.String("function", "FetchBlocks"))
return

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, looks a little weird to me but if that's the convention easy times

zap.String("function", "SearchSeries"))

return nil, err
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm thinking you're missing a "continue" for the warning case (since otherwise metrics = append(metrics, result.Metrics...) will likely be a nil pointer error?).

Also to make it more go like, maybe this:

if warning, err := storage.IsWarning(store, err); warning {
	s.instrumentOpts.Logger().Warn(
		"partial results: fanout to store returned warning",
		zap.Error(err),
		zap.String("store", store.Name()),
		zap.String("function", "SearchSeries"))
	continue
}


s.instrumentOpts.Logger().Warn(
	"fanout to store returned error",
	zap.Error(err),
	zap.String("store", store.Name()),
	zap.String("function", "SearchSeries"))
return nil, err

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch

zap.String("function", "CompleteTags"))

return nil, err
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing as above, think you're missing a continue and may as well do the de-nesting "Go like" pattern/thing I typed up above.

BehaviorWarn
// BehaviorContainer is for storages that contain substorages. It is necessary
// to look at the returned error to determine if it's a failing error or
// a warning error.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, maybe a link to this or just a brief duplicate-like explanation at the other definition would be good too.

@codecov
Copy link

codecov bot commented Sep 12, 2019

Codecov Report

Merging #1938 into master will decrease coverage by 5%.
The diff coverage is 42.9%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #1938     +/-   ##
=========================================
- Coverage    60.5%    55.5%   -5.1%     
=========================================
  Files        1105     1099      -6     
  Lines      104971   102648   -2323     
=========================================
- Hits        63604    57011   -6593     
- Misses      37102    41601   +4499     
+ Partials     4265     4036    -229
Flag Coverage Δ
#aggregator 69.1% <ø> (-10.7%) ⬇️
#cluster 44.7% <ø> (ø) ⬆️
#collector 47.9% <ø> (-15.8%) ⬇️
#dbnode 60.6% <ø> (-1.1%) ⬇️
#m3em 59.6% <ø> (ø) ⬆️
#m3ninx 60.9% <ø> (+2.1%) ⬆️
#m3nsch 51.1% <ø> (ø) ⬆️
#metrics 17.7% <ø> (ø) ⬆️
#msg 74.7% <ø> (ø) ⬆️
#query 45.8% <42.9%> (-21.8%) ⬇️
#x 74.7% <ø> (+6.7%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80377b0...06a1075. Read the comment docs.


// Remote is the configuration for a single remote host.
type Remote struct {
// RemoteConfiguration is the configuration for a single remote host.
Copy link
Contributor

@pavelnikolov pavelnikolov Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why do you need this? It's already in the config package and in the config.go file. So you end up with config.RemoteConfiguration vs config.Remote. I prefer the shorter version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly to avoid conflicts with the new more concrete Remote type; ideally we shouldn't be using RemoteConfiguration anywhere outside this package, but we have to make it public for yaml parsing

)

// Remote is an option for remote storage.
type Remote struct {
Copy link
Contributor

@pavelnikolov pavelnikolov Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why do you have both Remote and RemoteConfiguration?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So RemoteConfiguration is the parsed yaml, Remote is the parsed structure

@@ -450,6 +458,11 @@ type RPCConfiguration struct {
// RemoteListenAddresses will only allow for a single remote zone to be used.
RemoteListenAddresses []string `yaml:"remoteListenAddresses"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is RemoteListenAddresses deprecated in favour of Remotes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah; it used to limit to a single zone; with this PR, can technically do both but would recommend sticking to Remotes

At some point soon we'll update docs as well as our example configs to mirror this

Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arnikola arnikola merged commit 5898e6a into master Sep 13, 2019
@arnikola arnikola deleted the arnikola/remote-error-behavior branch September 13, 2019 01:21
// ErrorBehavior overrides the default error behavior for this host.
//
// NB: defaults to warning on error.
ErrorBehavior *storage.ErrorBehavior `yaml:"errorBehavior"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ErrorBehavior is uint8 how do you serialize it to yaml? Shouldn't you have some string representation instead of numbers? Also, I saw that the constants are using iota assignment which means that if in future someone adds more values, these need to always be appended at the end. Otherwise, existing apps might break.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we've defined a custom yaml unmarshall here, which allows us to use the plaintext version of the desired behavior in the yaml. The documentation is definitely missing at the moment, but you can look at some of the new tests added in remote_options_test.go to see how the options will look in yaml (in all honesty you'd probably be fine with just taking the default approach which sets all remotes to BehaviorWarn)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants