Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/openshift-install/gather: Recognize "connection refused" #3615

Conversation

rna-afk
Copy link
Contributor

@rna-afk rna-afk commented May 18, 2020

Before this commit, bootstrap machines that failed to come up would
look like 1:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, that last error will look like:

  level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 3.84.188.207:22: connect: connection refused"

without the unrelated (to this failure mode) distraction about SSH
keys.

Updated the commit to match with the latest changes.
Clone of #2810

if errno, ok := errors.Unwrap(err).(syscall.Errno); ok && errno == syscall.ECONNREFUSED {
return errors.Wrap(err, "failed to connect to the bootstrap machine")
} else if len(gatherBootstrapOpts.sshKeys) == 0 {
return errors.Wrap(err, "failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's better use to the Is function introduced in go1.13, see Overview in https://golang.org/pkg/errors/#Unwrap

github.com/pkg/errors which we use in installer is transparent replacement for std lib errors.

@@ -145,8 +146,14 @@ func logGatherBootstrap(bootstrap string, port int, masters []string, directory
if err != nil && strings.Contains(err.Error(), "ssh: handshake failed: ssh: unable to authenticate") {
return errors.Wrap(err, "failed to create SSH client, ensure the private key is added to your authentication agent (ssh-agent) or specified with the --key parameter")
} else if err != nil {
if errors.Is(err, syscall.ECONNREFUSED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire block https://github.com/openshift/installer/pull/3615/files#diff-be4790dcfa83ca14e7942b4c3b50c5f9R146-R156 has multi if else for err, can we standardize to using errors.Is for connectionrefused and unable to authenitcate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried checking for the root error type for the unable to authenticate scenario but the root error is fmt.Errorf which is a basic string error. Not sure if errors.Is will work for that scenario. I checked their test files for the function and the strings.Contains function is being used. Will work to make the code a little cleaner.

@cgwalters
Copy link
Member

Picking this as a victim PR to
/test e2e-azure
since it clearly can't break it and I want to check general Azure health for #3613

@rna-afk rna-afk force-pushed the gather-ssh-recognize-connection-refused branch from 4e0f4cf to e51cdd7 Compare May 19, 2020 13:54
Comment on lines 147 to 149
if strings.Contains(err.Error(), "ssh: handshake failed: ssh: unable to authenticate") {
return errors.Wrap(err, "failed to create SSH client, ensure the private key is added to your authentication agent (ssh-agent) or specified with the --key parameter")
} else if errors.Is(err, syscall.ECONNREFUSED) {
return errors.Wrap(err, "failed to connect to the bootstrap machine")
} else if len(gatherBootstrapOpts.sshKeys) == 0 {
return errors.Wrap(err, "failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think it doesn't have to be if else if, it can be written as blocks of if that return.

  2. https://github.com/golang/crypto/blob/master/ssh/client_auth.go#L75 , https://github.com/golang/crypto/blob/master/ssh/client.go#L85 is where the authentication error is coming from and it doesn't look like it supports Unwrap so strings contains will have to do for now.

  3. for unauthenticated error there are 2 cases,

or something like that.

  1. len(gatherBootstrapOpts.sshKeys) == 0
    I don't know what this case helps us with.

  2. https://github.com/openshift/installer/blob/master/pkg/gather/ssh/agent.go#L19
    uses ssh agent when no keys are provided, but that doesn't match the doc in https://github.com/openshift/installer/blob/master/docs/user/troubleshootingbootstrap.md#authenticating-to-bootstrap-host-for-ipi
    also b91707e#diff-be4790dcfa83ca14e7942b4c3b50c5f9 made it so that there is always one key passed to this list making it this ssh agent not used at all.. that would be bad.. maybe we fix that too, drop the len(keys) == 0

Copy link
Contributor

@patrickdillon patrickdillon May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(gatherBootstrapOpts.sshKeys) == 0
I don't know what this case helps us with

I added that case in 00307d1 and I think it is a good idea to remove it in this PR. The generated SSH key solves the problem.

Copy link
Contributor Author

@rna-afk rna-afk May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for unauthenticated error there are 2 cases,

Determining which case happened seems to be a little hard based off of only the error message. https://github.com/openshift/installer/blob/master/pkg/gather/ssh/agent.go#L19 checks if the env is set and if it fails to dial, it goes with the other way of pulling the ssh keys from local. Any error message returned after that is in a general way and does not give details as to what case happened. I may be wrong.

The way I see it, I need to check if the env is set and if it dials right again like the above link, in this place here

if err != nil && strings.Contains(err.Error(), "ssh: handshake failed: ssh: unable to authenticate") {
which I think is repetitive and might take up time. Let me know if this is a viable option or if there is a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function that has the information what was used to create the SSH Agent for authentication is the ssh.NewClient and i think we should look at how that function that provide us the context on which one was used so that we can decide the next steps when we fail to authenticate.

maybe we try that?

@patrickdillon
Copy link
Contributor

Fixes #3188

@@ -36,6 +37,12 @@ func NewClient(user, address string, keys []string) (*ssh.Client, error) {
HostKeyCallback: ssh.InsecureIgnoreHostKey(),
})
if err != nil {
if strings.Contains(err.Error(), "ssh: handshake failed: ssh: unable to authenticate") {
if agentType == "agent" {
return nil, errors.Wrap(err, "failed to use pre-existing agent, try adding keys using the --keys option")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, errors.Wrap(err, "failed to use pre-existing agent, try adding keys using the --keys option")
return nil, errors.Wrap(err, "failed to use pre-existing agent, make sure the appropriate keys exist in the agent for authentication")

if agentType == "agent" {
return nil, errors.Wrap(err, "failed to use pre-existing agent, try adding keys using the --keys option")
}
return nil, errors.Wrap(err, "failed to use keys, ensure the private key is added to your authentication agent (ssh-agent) or specified with the --key parameter")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, errors.Wrap(err, "failed to use keys, ensure the private key is added to your authentication agent (ssh-agent) or specified with the --key parameter")
return nil, errors.Wrap(err, "failed to use the provided keys for authentication")

@abhinavdahiya abhinavdahiya added the version/4.6 Tracking changes that should end up in 4.6 release label May 28, 2020
Before this commit, bootstrap machines that failed to come up would
look like [1]:

  level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443..."
  level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-6266tp8r-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.221.214.197:6443: connect: connection refused"
  level=info msg="Pulling debug logs from the bootstrap machine"
  level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 3.84.188.207:22: connect: connection refused"
  level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

With this commit, that last error will look like:

  level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 3.84.188.207:22: connect: connection refused"

without the unrelated (to this failure mode) distraction about SSH
keys.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12076

Updated the commit to match with the latest changes.
@rna-afk rna-afk force-pushed the gather-ssh-recognize-connection-refused branch from b84c448 to 97bb8a8 Compare May 28, 2020 21:09
@abhinavdahiya
Copy link
Contributor

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2020
@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

@abhinavdahiya: Overrode contexts on behalf of abhinavdahiya: ci/prow/e2e-aws

In response to this:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3615/pull-ci-openshift-installer-master-e2e-aws/11950#1:build-log.txt%3A37321

the failing test is an e2e, known broken https://coreos.slack.com/archives/CNHC2DK2M/p1591025728087100

/override ci/prow/e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@rna-afk: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure 4e0f4cf05e13b77b0375ac27942510e8b4a38ed8 link /test e2e-azure
ci/prow/e2e-aws-scaleup-rhel7 97bb8a8 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 334ea15 into openshift:master Jun 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. version/4.6 Tracking changes that should end up in 4.6 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants