Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: replace fatal logging by propagating errors #72703

Merged

Conversation

healthy-pod
Copy link
Contributor

Fatal logging prevented errors from being properly propagated back
to the binary error wrapper.

Release note: None

@healthy-pod healthy-pod requested a review from a team as a code owner November 12, 2021 19:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@healthy-pod healthy-pod requested review from tbg and rail November 12, 2021 19:06
@healthy-pod
Copy link
Contributor Author

Related to #72635 (not a fix for it but will help with debugging the issue)

@healthy-pod healthy-pod force-pushed the roachprod-remove-fatal-logging branch 4 times, most recently from a455554 to 0db537b Compare November 12, 2021 23:29
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great! I like that you took the time to improve some of the existing error messages. This is going to help a lot making failure logs more permeable for everyone at CRL.

I left what feels like a lot of comments, but they are basically all the same thing, which is to scope errors as tightly as possible. In particular for very long blocks of code, it's difficult for a reviewer to spot the return err that we inevitably expect to see and the absence of which would imply a bug). I.e. instead of

err := foo(func() {
/* long block of code here */
}

return err

I greatly prefer

if err := foo(...); err != nil {
  return err
}

as I have more confidence that this is just doing vanilla Go-style error handling without any additional logic slipped in.

I tried to point out all places in the code where this applies. Not all of them are "bad" but as a reviewer I find it hard to shut up sometimes. As long as you address all of the "large code block" examples (c.Parallel mostly) I am not going to insist on all of the "small" trivial ones but you might as well do them too if you don't mind.

Reviewed 3 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @rail)


pkg/roachprod/roachprod.go, line 166 at r2 (raw file):

	}

	err := c.Prepare(clusterSettings)

nit: if err := ...


pkg/roachprod/roachprod.go, line 416 at r2 (raw file):

	} else {
		var err error
		err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {
if err := c.Parallel(...); err != nil {
  return nil, err
}

pkg/roachprod/roachprod.go, line 741 at r2 (raw file):

	} else {
		var err error
		err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {

if err := ...


pkg/roachprod/roachprod.go, line 1027 at r2 (raw file):

	}
	err = c.Wipe(false)
	if err != nil {

if err := ...


pkg/roachprod/install/cluster_synced.go, line 275 at r2 (raw file):

		display += " and waiting"
	}
	err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {

return c.Parallel(...


pkg/roachprod/install/cluster_synced.go, line 326 at r2 (raw file):

	display := fmt.Sprintf("%s: wiping", c.Name)
	err := c.Stop(9, true /* wait */)
	if err != nil {

if err :=


pkg/roachprod/install/cluster_synced.go, line 329 at r2 (raw file):

		return err
	}
	err = c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {

return c.Parallel


pkg/roachprod/install/cluster_synced.go, line 364 at r2 (raw file):

	display := fmt.Sprintf("%s: status", c.Name)
	results := make([]string, len(c.Nodes))
	err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {

return c.Parallel(


pkg/roachprod/install/cluster_synced.go, line 602 at r2 (raw file):

	errs := make([]error, len(nodes))
	results := make([]string, len(nodes))
	err := c.Parallel(display, len(nodes), 0, func(i int) ([]byte, error) {

if err := ... ; err != nil { return nil, err }


pkg/roachprod/install/cluster_synced.go, line 660 at r2 (raw file):

	})

	if !stream {

Do we really want to do this on error?


pkg/roachprod/install/cluster_synced.go, line 677 at r2 (raw file):

	display := fmt.Sprintf("%s: waiting for nodes to start", c.Name)
	errs := make([]error, len(c.Nodes))
	err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 741 at r2 (raw file):

	// cluster in order to allow inter-node ssh.
	var sshTar []byte
	err := c.Parallel("generating ssh key", 1, 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 775 at r2 (raw file):

	// Skip the first node which is where we generated the key.
	nodes := c.Nodes[1:]
	err = c.Parallel("distributing ssh key", len(nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 798 at r2 (raw file):

	// (which is used in jepsen tests).
	ips := make([]string, len(c.Nodes), len(c.Nodes)*2)
	err = c.Parallel("retrieving hosts", len(c.Nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 820 at r2 (raw file):

	}
	var knownHostsData []byte
	err = c.Parallel("scanning hosts", 1, 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 863 at r2 (raw file):

	}

	err = c.Parallel("distributing known_hosts", len(c.Nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 913 at r2 (raw file):

		// gce and the shared user on aws) as well as to the shared user on both
		// platforms.
		err = c.Parallel("adding additional authorized keys", len(c.Nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 967 at r2 (raw file):

	var existsErr error
	display := fmt.Sprintf("%s: checking certs", c.Name)
	err := c.Parallel(display, 1, 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 992 at r2 (raw file):

	if !c.IsLocal() {
		ips = make([]string, len(nodes))
		err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 1003 at r2 (raw file):

	// Generate the ca, client and node certificates on the first node.
	err = c.Parallel(display, 1, 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 1095 at r2 (raw file):

	display = c.Name + ": distributing certs"
	nodes = nodes[1:]
	err = c.Parallel(display, len(nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cluster_synced.go, line 1699 at r2 (raw file):

	if haveErr {
		return errors.Newf("get %s failed", src)

This is odd, if haveErr is true, don't we have an error somewhere? Why are we swallowing it and printing just a more unhelpful error? Same applies to all of the other if haveErr branches in this code.


pkg/roachprod/install/cluster_synced.go, line 1718 at r2 (raw file):

func (c *SyncedCluster) pghosts(nodes []int) (map[int]string, error) {
	ips := make([]string, len(nodes))
	err := c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {

ditto


pkg/roachprod/install/cockroach.go, line 146 at r2 (raw file):

func (r Cockroach) Start(c *SyncedCluster, extraArgs []string) error {
	h := &crdbInstallHelper{c: c, r: r}
	err := h.distributeCerts()

ditto


pkg/roachprod/install/cockroach.go, line 158 at r2 (raw file):

	fmt.Printf("%s: starting nodes\n", c.Name)
	err = c.Parallel("", len(nodes), parallelism, func(nodeIdx int) ([]byte, error) {

ditto


pkg/roachprod/install/cockroach.go, line 229 at r2 (raw file):

		clusterSettingsOut, err := h.setClusterSettings(nodeIdx)
		if err != nil {
			return nil, errors.WithDetail(err, "unable to set cluster settings")

I don't think you want to use WithDetail anywhere in this code. The "detail" will not actually be printed by default. I think you would just want errors.Wrapf here and across the diff.


pkg/roachprod/install/cockroach.go, line 319 at r2 (raw file):

	display := fmt.Sprintf("%s: executing sql", c.Name)
	err := c.Parallel(display, len(c.Nodes), 0, func(nodeIdx int) ([]byte, error) {

ditto


pkg/roachprod/install/cockroach.go, line 714 at r2 (raw file):

	for _, node := range h.c.ServerNodes() {
		if node == 1 && h.c.Secure {
			err := h.c.DistributeCerts()

ditto

Fatal logging prevented errors from being properly propagated back
to the binary error wrapper.

Release note: None
@healthy-pod healthy-pod force-pushed the roachprod-remove-fatal-logging branch from 0db537b to 3d141a8 Compare November 16, 2021 04:19
Copy link
Contributor Author

@healthy-pod healthy-pod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod, @rail, and @tbg)


pkg/roachprod/install/cluster_synced.go, line 1699 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is odd, if haveErr is true, don't we have an error somewhere? Why are we swallowing it and printing just a more unhelpful error? Same applies to all of the other if haveErr branches in this code.

Absolutely, I should make that change in #72473 as I am trying to keep this PR simple and about removing calls to log.Fatal


pkg/roachprod/install/cockroach.go, line 229 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't think you want to use WithDetail anywhere in this code. The "detail" will not actually be printed by default. I think you would just want errors.Wrapf here and across the diff.

Done

@healthy-pod
Copy link
Contributor Author

TFTR!

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rail)

Copy link
Member

@rail rail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @healthy-pod)

@healthy-pod
Copy link
Contributor Author

bors r=[tbg,rail]

@craig
Copy link
Contributor

craig bot commented Nov 16, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Nov 16, 2021

Build succeeded:

@craig craig bot merged commit 525322a into cockroachdb:master Nov 16, 2021
@healthy-pod healthy-pod deleted the roachprod-remove-fatal-logging branch September 20, 2022 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants