diagnostics: refactor build-and-run for clarity #17857
Conversation
force-pushed from 1839d10 to 871a337
so. many. flakes. i can't even.
infra bug should be resolved
@sosiouxme I will have a look at this in the afternoon while @juanvallejo is out. You can tag the master team in the future for CLI reviews.
@mfojtik that would be much appreciated.
force-pushed from 09fd06d to 3f10acb
@openshift/sig-master a review here would be appreciated; the first commit is #17773, which ought to be ready to merge, so just look at the second commit.
@juanvallejo can you please help review?
/refresh
pkg/oc/admin/diagnostics/config.go
Outdated
			foundPath = path
		}
	}
	if foundPath != "" {
nit: len(foundPath) == 0
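For readers outside the Go codebase: the string-comparison and len-based checks are interchangeable, so the nit is purely a style preference. A tiny illustrative snippet (not from the PR) showing the two forms agree:

```go
package main

import "fmt"

func main() {
	for _, foundPath := range []string{"", "/home/user/.kube/config"} {
		// Both conditions test for a non-empty string; picking one over the
		// other does not change behavior.
		byComparison := foundPath != ""
		byLength := len(foundPath) != 0
		fmt.Printf("%q: comparison=%v len=%v\n", foundPath, byComparison, byLength)
	}
}
```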
pkg/oc/admin/diagnostics/config.go
Outdated
		}
	}
	if foundPath != "" {
		if confFlagValue != "" && confFlagValue != foundPath {
same nit as above for confFlagValue
pkg/oc/admin/diagnostics/config.go
Outdated
		}
	}

	if o.canOpenConfigFile(path, errmsg) && foundPath == "" {
nit: len(foundPath) == 0
pkg/oc/admin/diagnostics/config.go
Outdated
	if foundPath != "" {
		if confFlagValue != "" && confFlagValue != foundPath {
			// found config but not where --config said
			o.Logger().Error("DCli1001", fmt.Sprintf(`
Should this be a fatal error? If a user provides an explicit flag for a config file location, but we instead infer and find it elsewhere, is it safe to assume we are using a configuration that the user expects?
I get the sense that we may be doing too much for the user in this case.
As a user, I would probably be fine with this failing out if I provide a --config value and it does not point to an actual config, and only have the actual configuration location discovered if I don't explicitly provide this flag.
The effect is that all diagnostics requiring the client config are skipped. The other ones can still run, and you get a bad exit code and this message about your options to get client config working. I don't think it's necessary to actually halt execution at this point, though I could easily see that argument.
Originally I was trying to think of users who don't even know where their kubeconfig is...
Originally I was trying to think of users who don't even know where their kubeconfig is
That is a fair point. Would it make sense to provide a list of found locations containing a kubeconfig in this case?
It does do that (although it's a short list), but does not actually use any found kubeconfig, as that crosses the line (IMHO) into "too helpful".
pkg/oc/admin/diagnostics/pod.go
Outdated
	errors := []error{}
	diagnostics := []types.Diagnostic{}
// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
consider renaming to just RunDiagnostics
Even though it's ultimately calling something else named RunDiagnostics? After building them...
Yeah, it would follow the pattern set in other cmds more closely, and it would be easier to think of this func as a wrapper for util.RunDiagnostics.
pkg/oc/admin/diagnostics/pod.go
Outdated
// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
	var fatal error
	var diagnostics []types.Diagnostic

	func() { // don't trust discovery/build of diagnostics; wrap panic nicely in case of developer error
this looks a bit unusual - is this in case of a runtime error?
Yes. My thinking was that when a user runs a diagnostic, they're probably already facing a frustrating problem, and the most frustrating thing in the world would be to have the diagnostic tool itself completely bomb out. And given the system being diagnosed is probably broken, the likelihood of unexpected conditions leading to panics is much higher than during "normal" operation where most diagnostic development occurs.
Since individual diagnostics do orthogonal things, it seemed advisable to recover from panics and at least give the user something beyond just an infuriatingly obtuse stack trace. And still run the diagnostics that didn't crash.
This has been this way since the beginning of diagnostics (it's not unique to the pod diagnostics), and deads wasn't super fond of it then either, but it still seems right to me.
Maybe you could set this up not here, but rather inside the commandRunFunc; that way each command would not have to do it? Or generally at a higher level. 👍 for the reasons above.
To me, that doesn't quite match the conceptual level of where the problem is being isolated. Diagnostics come in several "areas" (avoiding overloading the terms "class", "kind", "type") -- client, cluster, host, etc. -- and the build routine for each has its own panic recovery; so one failing would still allow other areas to build and run. Additionally there is a panic recovery around the run of each individual diagnostic -- again, to maximize what can run in a diagnostic situation. Kicking it up to the command level would leave little improvement over just letting the panic halt execution entirely.
I can think of some further refactoring to get rid of the "areas" and put the panic isolation at clearer points, but I don't want this PR to wait on that.
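To make the isolation pattern under discussion concrete, here is a minimal, self-contained sketch (the helper name, area names, and error text are illustrative, not the actual origin code): each area's build step runs inside a function with a deferred recover, so a panic in one area is reported as an ordinary error while the remaining areas still build and run.

```go
package main

import "fmt"

// buildAreaSafely is a hypothetical helper: it runs one area's build step and
// converts any panic into an error instead of crashing the whole tool.
func buildAreaSafely(area string, build func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("building %q diagnostics panicked (likely developer error): %v", area, r)
		}
	}()
	return build()
}

func main() {
	builds := []struct {
		area  string
		build func() error
	}{
		{"client", func() error { return nil }},
		{"cluster", func() error { panic("unexpected nil config") }},
		{"host", func() error { return nil }},
	}
	for _, b := range builds {
		if err := buildAreaSafely(b.area, b.build); err != nil {
			fmt.Println("skipping area:", err) // report the failure, keep running the other areas
			continue
		}
		fmt.Println("built area:", b.area)
	}
}
```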
	}()
}

	return errorCount > 0, nil, warnCount, errorCount
	if runCount == 0 {
		return fmt.Errorf("Requested diagnostic(s) skipped; nothing to run. See --help and consider setting flags or providing config to enable running.")
+1
LGTM cc @soltysh for /lgtm
I've added several more nits that can be addressed as a follow-up. This needs to wait for #17773 to merge first, though, so maybe you could address them in the meantime.
/lgtm
pkg/oc/admin/diagnostics/config.go
Outdated
// Attempt to open file at path as client config
// If there is a problem and errmsg is set, log an error
func (o DiagnosticsOptions) canOpenConfigFile(path string, errmsg string) bool {
	var file *os.File
nit:
var (
	file *os.File
	err  error
)
pkg/oc/admin/diagnostics/config.go
Outdated
	} else {
		o.Logger().Error("DCli1008", fmt.Sprintf("%sbut there was an error opening it:\n%#v", errmsg, err))
	}
	if file != nil { // it is open for reading
General rule of thumb is to fail early; in other words, this can be rewritten as:
if file == nil {
	return false
}
// it is open for reading
...
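As a self-contained illustration of the early-return style being suggested (the function below is a simplified stand-in, not the actual canOpenConfigFile): handle the failure case first and return, so the happy path reads without nesting.

```go
package main

import (
	"fmt"
	"os"
)

// canOpen is a simplified stand-in showing the guard-clause shape: check and
// report the failure case up front, then fall through to the success path.
func canOpen(path string) bool {
	file, err := os.Open(path)
	if err != nil {
		fmt.Printf("could not open %s: %v\n", path, err)
		return false
	}
	defer file.Close()
	// it is open for reading
	return true
}

func main() {
	fmt.Println(canOpen("/etc/hostname"))
	fmt.Println(canOpen("/no/such/path"))
}
```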
Adds the ability to specify parameters for individual diagnostics on the command line (without proliferating flags). Addresses openshift#14640
Improve the legibility of the code that builds and runs diagnostics.

The main confusion was the need to track and report the number of diagnostic errors and warnings versus problems that halt execution prematurely, and the need to return a correct status code at completion. In the end it seemed simplest to just have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.

Main diagnostics as well as pod diagnostics are now implemented more uniformly.
force-pushed from 3f10acb to 8059482
Rebased and addressed review comments. Thanks!
/retest
Tests: 0 failed / 610 succeeded
not even clear to me what failed, much less how it failed :( /retest
/lgtm
/approve
The only thing I can't approve is man, which I'm addressing in #18267. I'm approving this as is.
[APPROVALNOTIFIER] This PR is APPROVED
Approval requirements bypassed by manually added approval. This pull-request has been approved by: soltysh, sosiouxme
The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198).
Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

AppCreate diagnostic

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.
https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:
- [x] Create and cleanup project
- [x] Deploy and cleanup app
- [x] Wait for app to start
- [x] Test ability to connect to app via service
- [x] Test that app responds correctly
- [x] Test ability to connect via route
- [x] Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):
- [ ] Run a build to completion
- [ ] Test ability to attach storage
- [ ] Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.
This builds on #17773 which is the source of the first commit. Look at the second commit for the new changes.
Improve the legibility of the code that builds and runs diagnostics.
The main confusion was the need to track and report the number of diagnostic errors and warnings versus problems that halt execution prematurely and the need to return a correct status code at completion. In the end it seemed simplest to just have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.
As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.
Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
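A rough, self-contained sketch of the shape this description suggests (the logger and function names are illustrative, not the actual origin types): the runner tallies diagnostic warnings and errors through the logger, while the function itself returns an error only for problems that prevent running at all.

```go
package main

import (
	"errors"
	"fmt"
)

// logger is an illustrative stand-in that tracks how many diagnostic-level
// warnings and errors were reported while running.
type logger struct {
	warnings, errorsSeen int
}

func (l *logger) Warn(msg string)  { l.warnings++; fmt.Println("WARN:", msg) }
func (l *logger) Error(msg string) { l.errorsSeen++; fmt.Println("ERROR:", msg) }

// runDiagnostics returns an error only for build/run failures; per-diagnostic
// findings go through the logger and shape the exit code afterwards.
func runDiagnostics(log *logger, diags []func(*logger)) error {
	if len(diags) == 0 {
		return errors.New("requested diagnostic(s) skipped; nothing to run")
	}
	for _, d := range diags {
		d(log)
	}
	return nil
}

func main() {
	log := &logger{}
	diags := []func(*logger){
		func(l *logger) { l.Warn("node has low disk space") },
		func(l *logger) { l.Error("cannot reach registry") },
	}
	if err := runDiagnostics(log, diags); err != nil {
		fmt.Println("fatal:", err)
		return
	}
	// The exit code can be derived from these tallies rather than threading
	// counts through every function signature.
	fmt.Printf("diagnostics complete: %d warnings, %d errors\n", log.warnings, log.errorsSeen)
}
```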