
diagnostics: refactor build-and-run for clarity #17857

Conversation

@sosiouxme (Member) commented Dec 18, 2017

This builds on #17773, which is the source of the first commit; look at the second commit for the new changes.


Improve the legibility of the code that builds and runs diagnostics.

The main source of confusion was the need to track and report the number of diagnostic errors and warnings, as distinct from problems that halt execution prematurely, while still returning a correct status code at completion. In the end it seemed simplest to have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.

Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
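
For the record, a minimal sketch of this approach (not the PR's actual code; countingLogger and runDiagnostics are hypothetical names): the logger counts the diagnostic errors and warnings it prints, the run function returns only fatal build/run errors, and the command derives its exit status from the counts afterwards.

package main

import (
    "fmt"
    "os"
)

// countingLogger tracks how many diagnostic-level errors and warnings were reported.
type countingLogger struct {
    errors, warnings int
}

func (l *countingLogger) Error(id, msg string) {
    l.errors++
    fmt.Fprintf(os.Stderr, "ERROR %s: %s\n", id, msg)
}

func (l *countingLogger) Warning(id, msg string) {
    l.warnings++
    fmt.Fprintf(os.Stderr, "WARN  %s: %s\n", id, msg)
}

// runDiagnostics returns an error only when diagnostics cannot be built or run at all;
// individual findings are reported through the logger instead.
func runDiagnostics(log *countingLogger) error {
    // ... build and execute diagnostics, calling log.Error / log.Warning ...
    return nil
}

func main() {
    log := &countingLogger{}
    if err := runDiagnostics(log); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(255) // fatal build/run problem
    }
    fmt.Printf("Summary: %d errors, %d warnings\n", log.errors, log.warnings)
    if log.errors > 0 {
        os.Exit(1) // non-zero exit when any diagnostic reported an error
    }
}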

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 18, 2017
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch 2 times, most recently from 1839d10 to 871a337 on December 19, 2017 11:54
@sosiouxme (Member, Author) commented Dec 19, 2017

apparent flakes: #17875, #17769, and #17556
/retest

@sosiouxme (Member, Author)

so. many. flakes. i can't even.
/retest

@sosiouxme (Member, Author)

infra bug should be resolved
/retest

@mfojtik (Contributor) commented Jan 4, 2018

@sosiouxme I will have a look at this in the afternoon while @juanvallejo is out. You can tag the master team in the future for CLI reviews.

@sosiouxme (Member, Author)

@mfojtik that would be much appreciated.

@sosiouxme sosiouxme mentioned this pull request Jan 8, 2018
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch 2 times, most recently from 09fd06d to 3f10acb on January 10, 2018 03:26
@sosiouxme (Member, Author)

@openshift/sig-master a review here would be appreciated; the first commit is #17773, which ought to be ready to merge, so just look at the second commit.

@mfojtik (Contributor) commented Jan 10, 2018

@juanvallejo can you please help review?

@stevekuznetsov (Contributor)

/refresh

foundPath = path
}
}
if foundPath != "" {
Contributor

nit: len(foundPath) == 0

}
}
if foundPath != "" {
if confFlagValue != "" && confFlagValue != foundPath {
Contributor

same nit as above for confFlagValue

}
}

if o.canOpenConfigFile(path, errmsg) && foundPath == "" {
Contributor

nit: len(foundPath) == 0

if foundPath != "" {
if confFlagValue != "" && confFlagValue != foundPath {
// found config but not where --config said
o.Logger().Error("DCli1001", fmt.Sprintf(`
Contributor

Should this be a fatal error? If a user provides an explicit flag for a config file location but we instead infer and find it elsewhere, is it safe to assume we are using a configuration that the user expects?

I get the sense that we may be doing too much for the user in this case.
As a user, I would probably be fine with this failing if I provide a --config value that does not point to an actual config, and only have the actual configuration location discovered when I don't explicitly provide this flag.

Member Author

The effect is that all diagnostics requiring the client config are skipped. The others can still run, and you get a bad exit code plus this message about your options for getting a client config working. I don't think it's necessary to actually halt execution at this point, though I could easily see that argument.

Member Author

Originally I was trying to think of users who don't even know where their kubeconfig is...

Contributor

> Originally I was trying to think of users who don't even know where their kubeconfig is

That is a fair point. Would it make sense to provide a list of found locations containing a kubeconfig in this case?

Member Author

It does do that (although it's a short list), but does not actually use any found kubeconfig, as that crosses the line (IMHO) into "too helpful".
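
To illustrate, a hedged sketch of that behavior (not the repository's implementation; the candidate list and message text are assumptions): check a short list of standard locations, mention any kubeconfig found there, but never silently adopt one.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// reportFoundKubeconfigs mentions client configs found in standard locations
// without actually using them, leaving the choice to the user.
func reportFoundKubeconfigs(confFlagValue string) {
    candidates := []string{
        os.Getenv("KUBECONFIG"),
        filepath.Join(os.Getenv("HOME"), ".kube", "config"),
    }
    for _, path := range candidates {
        if path == "" || path == confFlagValue {
            continue
        }
        if _, err := os.Stat(path); err == nil {
            fmt.Printf("Note: a client config exists at %s but is not being used; pass --config=%s to use it\n", path, path)
        }
    }
}

func main() {
    reportFoundKubeconfigs("")
}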

errors := []error{}
diagnostics := []types.Diagnostic{}
// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
Contributor

consider renaming to just RunDiagnostics

Member Author

Even though it's ultimately calling something else named RunDiagnostics? After building them...

Contributor

Yeah, it would follow the pattern set in other cmds more closely, and it would be easier to think of this func as a wrapper for util.RunDiagnostics.
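
The wrapper shape being suggested would look roughly like this (a sketch only; buildPodDiagnostics and the util.RunDiagnostics signature are assumptions, not the repository's exact API):

// RunDiagnostics assembles the pod diagnostics from the options and delegates
// execution to the shared helper, mirroring the pattern used by other commands.
func (o PodDiagnosticsOptions) RunDiagnostics() error {
    diagnostics, err := o.buildPodDiagnostics() // hypothetical build step
    if err != nil {
        return err
    }
    return util.RunDiagnostics(o.Logger(), diagnostics)
}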

// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
var fatal error
var diagnostics []types.Diagnostic

func() { // don't trust discovery/build of diagnostics; wrap panic nicely in case of developer error
Contributor

this looks a bit unusual - is this in case of a runtime error?

@sosiouxme (Member, Author) Jan 10, 2018

Yes. My thinking was that when a user runs a diagnostic, they're probably already facing a frustrating problem, and the most frustrating thing in the world would be to have the diagnostic tool itself completely bomb out. And given the system being diagnosed is probably broken, the likelihood of unexpected conditions leading to panics is much higher than during "normal" operation where most diagnostic development occurs.

Since individual diagnostics do orthogonal things, it seemed advisable to recover from panics and at least give the user something beyond just an infuriatingly obtuse stack trace. And still run the diagnostics that didn't crash.

This has been this way since the beginning of diagnostics (it's not unique to the pod diagnostics), and deads wasn't super fond of it then either, but it still seems right to me.
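
For illustration, a minimal, self-contained sketch of the recover pattern (the Diagnostic interface here is a stand-in for the repository's types.Diagnostic, and the error message is illustrative):

package main

import "fmt"

// Diagnostic is a stand-in for the repository's types.Diagnostic interface.
type Diagnostic interface {
    Name() string
    Check() error
}

// buildDiagnosticsSafely runs the build step inside an anonymous func so that a
// panic during construction becomes a readable fatal error instead of a bare
// stack trace, and the tool can keep going with whatever else it can do.
func buildDiagnosticsSafely(build func() ([]Diagnostic, error)) (diagnostics []Diagnostic, fatal error) {
    func() {
        defer func() {
            if r := recover(); r != nil {
                fatal = fmt.Errorf("building diagnostics panicked; this is a bug, please report it: %v", r)
            }
        }()
        diagnostics, fatal = build()
    }()
    return diagnostics, fatal
}

func main() {
    _, err := buildDiagnosticsSafely(func() ([]Diagnostic, error) {
        panic("developer error while assembling a diagnostic")
    })
    fmt.Println(err) // the panic surfaces as an error; execution continues
}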

Contributor

Maybe you could set this up not here, but rather inside the commandRunFunc, so that each command would not have to do it? Or generally at a higher level. 👍 for the reasons above.

Member Author

To me, that doesn't quite match the conceptual level of where the problem is being isolated. Diagnostics come in several "areas" (avoiding overloading the terms "class", "kind", "type") -- client, cluster, host, etc. -- and the build routine for each has its own panic recovery; so one failing would still allow other areas to build and run. Additionally there is a panic recovery around the run of each individual diagnostic -- again, to maximize what can run in a diagnostic situation. Kicking it up to the command level would leave little improvement over just letting the panic halt execution entirely.

I can think of some further refactoring to get rid of the "areas" and put the panic isolation at clearer points, but I don't want this PR to wait on that.
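
A corresponding sketch of the per-diagnostic isolation (again illustrative; it reuses the stand-in Diagnostic interface from the snippet above, and the report callback is an assumption):

// runEachDiagnostic runs every diagnostic behind its own recover, so one
// panicking diagnostic cannot prevent the others from running.
func runEachDiagnostic(diagnostics []Diagnostic, report func(id, msg string)) {
    for _, d := range diagnostics {
        func() {
            defer func() {
                if r := recover(); r != nil {
                    report("DiagPanic", fmt.Sprintf("diagnostic %s panicked: %v", d.Name(), r))
                }
            }()
            if err := d.Check(); err != nil {
                report("DiagError", fmt.Sprintf("diagnostic %s reported: %v", d.Name(), err))
            }
        }()
    }
}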

}()
}

return errorCount > 0, nil, warnCount, errorCount
if runCount == 0 {
return fmt.Errorf("Requested diagnostic(s) skipped; nothing to run. See --help and consider setting flags or providing config to enable running.")
Contributor

+1

@juanvallejo (Contributor)

LGTM

cc @soltysh for /lgtm

@soltysh (Contributor) left a comment

I've added several more nits that can be addressed as a follow-up. This needs to wait for #17773 to merge first anyway, so maybe you could address them in the meantime.

/lgtm

// Attempt to open file at path as client config
// If there is a problem and errmsg is set, log an error
func (o DiagnosticsOptions) canOpenConfigFile(path string, errmsg string) bool {
var file *os.File
Contributor

nit:

var (
    file *os.File  
    err error
)

} else {
o.Logger().Error("DCli1008", fmt.Sprintf("%sbut there was an error opening it:\n%#v", errmsg, err))
}
if file != nil { // it is open for reading
Contributor

General rule of thumb is to fail early; in other words, this can be rewritten as:

if file == nil {
    return false
}
// it is open for reading 
...
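
For illustration, the whole helper restructured along that fail-early suggestion might look roughly like this (a simplified sketch, not the actual patch; only the error ID and message already visible in the diff are carried over):

// canOpenConfigFile attempts to open the file at path as a client config.
// If there is a problem and errmsg is set, an error is logged.
func (o DiagnosticsOptions) canOpenConfigFile(path string, errmsg string) bool {
    if len(path) == 0 {
        return false
    }
    file, err := os.Open(path)
    if err != nil {
        if len(errmsg) > 0 {
            o.Logger().Error("DCli1008", fmt.Sprintf("%sbut there was an error opening it:\n%#v", errmsg, err))
        }
        return false
    }
    defer file.Close()
    // it is open for reading; validating the contents as a client config would follow here
    return true
}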


@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2018
Adds the ability to specify parameters for individual diagnostics on the
command line (without proliferating flags).

Addresses openshift#14640

Improve the legibility of the code that builds and runs diagnostics.

The main confusion was the need to track and report the number of
diagnostic errors and warnings versus problems that halt execution
prematurely and the need to return a correct status code at completion.
In the end it seemed simplest to just have the logger report how many
diagnostic errors and warnings were seen, leaving function signatures to
return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early
check to see if there is a client config, and concluded it was confusing
and unnecessary for it to be a diagnostic, so I refactored it away.

Main diagnostics as well as pod diagnostics are now implemented more uniformly.
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch from 3f10acb to 8059482 on January 22, 2018 03:24
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2018
@sosiouxme (Member, Author)

Rebased and addressed review comments. Thanks!

@0xmichalis (Contributor)

/retest

@sosiouxme (Member, Author)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17857/test_pull_request_origin_extended_conformance_gce/14722/

Tests | 0 failed / 610 succeeded

not even clear to me what failed, much less how it failed :(

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2018
@soltysh (Contributor) left a comment

/lgtm
/approve

@soltysh (Contributor) commented Jan 24, 2018

The only thing I can't approve is man, which I'm addressing in #18267. I'm approving this as is.

@soltysh soltysh added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: soltysh, sosiouxme

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot (Contributor)

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198).

@openshift-merge-robot openshift-merge-robot merged commit ee846d3 into openshift:master Jan 24, 2018
@openshift-ci-robot commented Jan 24, 2018

@sosiouxme: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/cmd
Commit: 8059482
Rerun command: /test cmd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sosiouxme sosiouxme deleted the 20171217-diagnostic-refactor-summary branch January 24, 2018 21:21
openshift-merge-robot added a commit that referenced this pull request Feb 17, 2018
Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

AppCreate diagnostic

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.

https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:
- [x] Create and cleanup project
- [x] Deploy and cleanup app
- [x] Wait for app to start
- [x] Test ability to connect to app via service
- [x] Test that app responds correctly
- [x] Test ability to connect via route
- [x] Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):
- [ ] Run a build to completion
- [ ] Test ability to attach storage
- [ ] Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.