
diagnostics: refactor build-and-run for clarity #17857

Conversation

@sosiouxme (Member) commented Dec 18, 2017

This builds on #17773, which is the source of the first commit; look at the second commit for the new changes.


Improve the legibility of the code that builds and runs diagnostics.

The main source of confusion was the need to track and report the number of diagnostic errors and warnings, as distinct from problems that halt execution prematurely, while still returning a correct status code at completion. In the end it seemed simplest to have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.

Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
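
For the record, a minimal sketch of this approach (not the PR's actual code; countingLogger and runDiagnostics are hypothetical names): the logger counts the diagnostic errors and warnings it prints, the run function returns only fatal build/run errors, and the command derives its exit status from the counts afterwards.

package main

import (
    "fmt"
    "os"
)

// countingLogger tracks how many diagnostic-level errors and warnings were reported.
type countingLogger struct {
    errors, warnings int
}

func (l *countingLogger) Error(id, msg string) {
    l.errors++
    fmt.Fprintf(os.Stderr, "ERROR %s: %s\n", id, msg)
}

func (l *countingLogger) Warning(id, msg string) {
    l.warnings++
    fmt.Fprintf(os.Stderr, "WARN  %s: %s\n", id, msg)
}

// runDiagnostics returns an error only when diagnostics cannot be built or run at all;
// individual findings are reported through the logger instead.
func runDiagnostics(log *countingLogger) error {
    // ... build and execute diagnostics, calling log.Error / log.Warning ...
    return nil
}

func main() {
    log := &countingLogger{}
    if err := runDiagnostics(log); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(255) // fatal build/run problem
    }
    fmt.Printf("Summary: %d errors, %d warnings\n", log.errors, log.warnings)
    if log.errors > 0 {
        os.Exit(1) // non-zero exit when any diagnostic reported an error
    }
}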

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 18, 2017
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch 2 times, most recently from 1839d10 to 871a337 on December 19, 2017 11:54
@sosiouxme (Member, Author) commented Dec 19, 2017

apparent flakes: #17875, #17769, and #17556
/retest

@sosiouxme (Member, Author)

so. many. flakes. i can't even.
/retest

@sosiouxme (Member, Author)

infra bug should be resolved
/retest

@mfojtik (Contributor) commented Jan 4, 2018

@sosiouxme I will have a look at this in the afternoon while @juanvallejo is out. You can tag the master team in the future for CLI reviews.

@sosiouxme (Member, Author)

@mfojtik that would be much appreciated.

@sosiouxme sosiouxme mentioned this pull request Jan 8, 2018
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch 2 times, most recently from 09fd06d to 3f10acb on January 10, 2018 03:26
@sosiouxme (Member, Author)

@openshift/sig-master a review here would be appreciated; the first commit is #17773, which ought to be ready to merge, so just look at the second commit.

@mfojtik (Contributor) commented Jan 10, 2018

@juanvallejo can you please help review?

@stevekuznetsov (Contributor)

/refresh

foundPath = path
}
}
if foundPath != "" {
Contributor

nit: len(foundPath) == 0

}
}
if foundPath != "" {
if confFlagValue != "" && confFlagValue != foundPath {
Contributor

same nit as above for confFlagValue

}
}

if o.canOpenConfigFile(path, errmsg) && foundPath == "" {
Contributor

nit: len(foundPath) == 0

if foundPath != "" {
if confFlagValue != "" && confFlagValue != foundPath {
// found config but not where --config said
o.Logger().Error("DCli1001", fmt.Sprintf(`
Contributor

Should this be a fatal error? If a user provides an explicit flag for a config file location but we instead infer and find it elsewhere, is it safe to assume we are using a configuration that the user expects?

I get the sense that we may be doing too much for the user in this case.
As a user, I would probably be fine with this failing if I provide a --config value that does not point to an actual config, and only have the actual configuration location discovered when I don't explicitly provide this flag.

Member Author

The effect is that all diagnostics requiring the client config are skipped. The others can still run, and you get a bad exit code plus this message about your options for getting a client config working. I don't think it's necessary to actually halt execution at this point, though I could easily see that argument.

Member Author

Originally I was trying to think of users who don't even know where their kubeconfig is...

Contributor

> Originally I was trying to think of users who don't even know where their kubeconfig is

That is a fair point. Would it make sense to provide a list of found locations containing a kubeconfig in this case?

Member Author

It does do that (although it's a short list), but does not actually use any found kubeconfig, as that crosses the line (IMHO) into "too helpful".
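
To illustrate, a hedged sketch of that behavior (not the repository's implementation; the candidate list and message text are assumptions): check a short list of standard locations, mention any kubeconfig found there, but never silently adopt one.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// reportFoundKubeconfigs mentions client configs found in standard locations
// without actually using them, leaving the choice to the user.
func reportFoundKubeconfigs(confFlagValue string) {
    candidates := []string{
        os.Getenv("KUBECONFIG"),
        filepath.Join(os.Getenv("HOME"), ".kube", "config"),
    }
    for _, path := range candidates {
        if path == "" || path == confFlagValue {
            continue
        }
        if _, err := os.Stat(path); err == nil {
            fmt.Printf("Note: a client config exists at %s but is not being used; pass --config=%s to use it\n", path, path)
        }
    }
}

func main() {
    reportFoundKubeconfigs("")
}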

errors := []error{}
diagnostics := []types.Diagnostic{}
// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
Contributor

consider renaming to just RunDiagnostics

Member Author

Even though it's ultimately calling something else named RunDiagnostics? After building them...

Contributor

Yeah, it would follow the pattern set in other cmds more closely, and it would be easier to think of this func as a wrapper for util.RunDiagnostics.
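
The wrapper shape being suggested would look roughly like this (a sketch only; buildPodDiagnostics and the util.RunDiagnostics signature are assumptions, not the repository's exact API):

// RunDiagnostics assembles the pod diagnostics from the options and delegates
// execution to the shared helper, mirroring the pattern used by other commands.
func (o PodDiagnosticsOptions) RunDiagnostics() error {
    diagnostics, err := o.buildPodDiagnostics() // hypothetical build step
    if err != nil {
        return err
    }
    return util.RunDiagnostics(o.Logger(), diagnostics)
}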

// BuildAndRunDiagnostics builds diagnostics based on the options and executes them, returning fatal error(s) only.
func (o PodDiagnosticsOptions) BuildAndRunDiagnostics() error {
var fatal error
var diagnostics []types.Diagnostic

func() { // don't trust discovery/build of diagnostics; wrap panic nicely in case of developer error
Contributor

this looks a bit unusual - is this in case of a runtime error?

@sosiouxme (Member, Author) Jan 10, 2018

Yes. My thinking was that when a user runs a diagnostic, they're probably already facing a frustrating problem, and the most frustrating thing in the world would be to have the diagnostic tool itself completely bomb out. And given the system being diagnosed is probably broken, the likelihood of unexpected conditions leading to panics is much higher than during "normal" operation where most diagnostic development occurs.

Since individual diagnostics do orthogonal things, it seemed advisable to recover from panics and at least give the user something beyond just an infuriatingly obtuse stack trace. And still run the diagnostics that didn't crash.

This has been this way since the beginning of diagnostics (it's not unique to the pod diagnostics), and deads wasn't super fond of it then either, but it still seems right to me.
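
For illustration, a minimal, self-contained sketch of the recover pattern (the Diagnostic interface here is a stand-in for the repository's types.Diagnostic, and the error message is illustrative):

package main

import "fmt"

// Diagnostic is a stand-in for the repository's types.Diagnostic interface.
type Diagnostic interface {
    Name() string
    Check() error
}

// buildDiagnosticsSafely runs the build step inside an anonymous func so that a
// panic during construction becomes a readable fatal error instead of a bare
// stack trace, and the tool can keep going with whatever else it can do.
func buildDiagnosticsSafely(build func() ([]Diagnostic, error)) (diagnostics []Diagnostic, fatal error) {
    func() {
        defer func() {
            if r := recover(); r != nil {
                fatal = fmt.Errorf("building diagnostics panicked; this is a bug, please report it: %v", r)
            }
        }()
        diagnostics, fatal = build()
    }()
    return diagnostics, fatal
}

func main() {
    _, err := buildDiagnosticsSafely(func() ([]Diagnostic, error) {
        panic("developer error while assembling a diagnostic")
    })
    fmt.Println(err) // the panic surfaces as an error; execution continues
}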

Contributor

Maybe you could set this up not here, but rather inside the commandRunFunc, so that each command would not have to do it? Or generally at a higher level. 👍 for the reasons above.

Member Author

To me, that doesn't quite match the conceptual level of where the problem is being isolated. Diagnostics come in several "areas" (avoiding overloading the terms "class", "kind", "type") -- client, cluster, host, etc. -- and the build routine for each has its own panic recovery; so one failing would still allow other areas to build and run. Additionally there is a panic recovery around the run of each individual diagnostic -- again, to maximize what can run in a diagnostic situation. Kicking it up to the command level would leave little improvement over just letting the panic halt execution entirely.

I can think of some further refactoring to get rid of the "areas" and put the panic isolation at clearer points, but I don't want this PR to wait on that.
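
A corresponding sketch of the per-diagnostic isolation (again illustrative; it reuses the stand-in Diagnostic interface from the snippet above, and the report callback is an assumption):

// runEachDiagnostic runs every diagnostic behind its own recover, so one
// panicking diagnostic cannot prevent the others from running.
func runEachDiagnostic(diagnostics []Diagnostic, report func(id, msg string)) {
    for _, d := range diagnostics {
        func() {
            defer func() {
                if r := recover(); r != nil {
                    report("DiagPanic", fmt.Sprintf("diagnostic %s panicked: %v", d.Name(), r))
                }
            }()
            if err := d.Check(); err != nil {
                report("DiagError", fmt.Sprintf("diagnostic %s reported: %v", d.Name(), err))
            }
        }()
    }
}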

}()
}

return errorCount > 0, nil, warnCount, errorCount
if runCount == 0 {
return fmt.Errorf("Requested diagnostic(s) skipped; nothing to run. See --help and consider setting flags or providing config to enable running.")
Contributor

+1

@juanvallejo (Contributor)

LGTM

cc @soltysh for /lgtm

@soltysh (Contributor) left a comment

I've added several more nits that can be addressed as a follow-up. This needs to wait for #17773 to merge first anyway, so maybe you could address them in the meantime.

/lgtm

// Attempt to open file at path as client config
// If there is a problem and errmsg is set, log an error
func (o DiagnosticsOptions) canOpenConfigFile(path string, errmsg string) bool {
var file *os.File
Contributor

nit:

var (
    file *os.File  
    err error
)

} else {
o.Logger().Error("DCli1008", fmt.Sprintf("%sbut there was an error opening it:\n%#v", errmsg, err))
}
if file != nil { // it is open for reading
Contributor

General rule of thumb is to fail early; in other words, this can be rewritten as:

if file == nil {
    return false
}
// it is open for reading 
...
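
For illustration, the whole helper restructured along that fail-early suggestion might look roughly like this (a simplified sketch, not the actual patch; only the error ID and message already visible in the diff are carried over):

// canOpenConfigFile attempts to open the file at path as a client config.
// If there is a problem and errmsg is set, an error is logged.
func (o DiagnosticsOptions) canOpenConfigFile(path string, errmsg string) bool {
    if len(path) == 0 {
        return false
    }
    file, err := os.Open(path)
    if err != nil {
        if len(errmsg) > 0 {
            o.Logger().Error("DCli1008", fmt.Sprintf("%sbut there was an error opening it:\n%#v", errmsg, err))
        }
        return false
    }
    defer file.Close()
    // it is open for reading; validating the contents as a client config would follow here
    return true
}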


@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2018
Adds the ability to specify parameters for individual diagnostics on the
command line (without proliferating flags).

Addresses openshift#14640

Improve the legibility of the code that builds and runs diagnostics.

The main confusion was the need to track and report the number of
diagnostic errors and warnings versus problems that halt execution
prematurely and the need to return a correct status code at completion.
In the end it seemed simplest to just have the logger report how many
diagnostic errors and warnings were seen, leaving function signatures to
return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early
check to see if there is a client config, and concluded it was confusing
and unnecessary for it to be a diagnostic, so I refactored it away.

Main diagnostics as well as pod diagnostics are now implemented more uniformly.
@sosiouxme sosiouxme force-pushed the 20171217-diagnostic-refactor-summary branch from 3f10acb to 8059482 on January 22, 2018 03:24
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2018
@sosiouxme (Member, Author)

Rebased and addressed review comments. Thanks!

@0xmichalis (Contributor)

/retest

@sosiouxme (Member, Author)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17857/test_pull_request_origin_extended_conformance_gce/14722/

Tests | 0 failed / 610 succeeded

not even clear to me what failed, much less how it failed :(

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2018
@soltysh (Contributor) left a comment

/lgtm
/approve

@soltysh (Contributor) commented Jan 24, 2018

The only thing I can't approve is man, which I'm addressing in #18267. I'm approving this as is.

@soltysh soltysh added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: soltysh, sosiouxme

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot (Contributor)

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198).

@openshift-merge-robot openshift-merge-robot merged commit ee846d3 into openshift:master Jan 24, 2018
@openshift-ci-robot commented Jan 24, 2018

@sosiouxme: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/cmd
Commit: 8059482
Rerun command: /test cmd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sosiouxme sosiouxme deleted the 20171217-diagnostic-refactor-summary branch January 24, 2018 21:21
openshift-merge-robot added a commit that referenced this pull request Feb 17, 2018
Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

AppCreate diagnostic

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.

https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:
- [x] Create and cleanup project
- [x] Deploy and cleanup app
- [x] Wait for app to start
- [x] Test ability to connect to app via service
- [x] Test that app responds correctly
- [x] Test ability to connect via route
- [x] Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):
- [ ] Run a build to completion
- [ ] Test ability to attach storage
- [ ] Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.