roachprod: replace fatal logging by propagating errors #72703
Conversation
Related to #72635 (not a fix for it, but will help with debugging the issue)
Force-pushed from a455554 to 0db537b
Thank you, this is great! I like that you took the time to improve some of the existing error messages. This is going to help a lot in making failure logs easier to parse for everyone at CRL.
I left what feels like a lot of comments, but they are basically all the same thing, which is to scope errors as tightly as possible. In particular, for very long blocks of code, it's difficult for a reviewer to spot the return err that we inevitably expect to see (and the absence of which would imply a bug). I.e. instead of
err := foo(func() {
    /* long block of code here */
})
return err
I greatly prefer
if err := foo(...); err != nil {
    return err
}
as I have more confidence that this is just doing vanilla Go-style error handling without any additional logic slipped in.
I tried to point out all places in the code where this applies. Not all of them are "bad", but as a reviewer I find it hard to shut up sometimes. As long as you address all of the "large code block" examples (c.Parallel mostly) I am not going to insist on all of the "small" trivial ones, but you might as well do them too if you don't mind.
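To make the c.Parallel cases concrete, here is a minimal, self-contained sketch of the two shapes discussed above. The parallel helper and the statusLoose/statusTight functions are hypothetical stand-ins, not roachprod code; the point is only how the error scoping reads to a reviewer.

package main

import (
	"errors"
	"fmt"
)

// parallel is a hypothetical stand-in for SyncedCluster.Parallel: it runs fn
// for each of n indices and returns the first error encountered.
func parallel(display string, n int, fn func(i int) ([]byte, error)) error {
	for i := 0; i < n; i++ {
		if _, err := fn(i); err != nil {
			return err
		}
	}
	return nil
}

// statusLoose mirrors the discouraged pattern: the error from the long
// closure is assigned to a variable and returned much later, so the
// final "return err" is easy to miss when reading the block.
func statusLoose(nodes []string) error {
	err := parallel("status", len(nodes), func(i int) ([]byte, error) {
		if nodes[i] == "" {
			return nil, errors.New("empty node name")
		}
		// ... imagine many more lines here ...
		return []byte(nodes[i]), nil
	})
	return err
}

// statusTight mirrors the preferred pattern: the call and the error check
// are scoped together, so it reads as vanilla Go-style error handling.
func statusTight(nodes []string) error {
	if err := parallel("status", len(nodes), func(i int) ([]byte, error) {
		if nodes[i] == "" {
			return nil, errors.New("empty node name")
		}
		return []byte(nodes[i]), nil
	}); err != nil {
		return err
	}
	return nil
}

func main() {
	fmt.Println(statusLoose([]string{"n1", ""}))
	fmt.Println(statusTight([]string{"n1", "n2"}))
}

Where the call is the last statement in the function, simply returning c.Parallel(...) directly (as suggested in several comments below) is tighter still.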
Reviewed 3 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @rail)
pkg/roachprod/roachprod.go, line 166 at r2 (raw file):
}
err := c.Prepare(clusterSettings)
nit: if err := ...
pkg/roachprod/roachprod.go, line 416 at r2 (raw file):
} else {
    var err error
    err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {
if err := c.Parallel(...); err != nil {
    return nil, err
}
pkg/roachprod/roachprod.go, line 741 at r2 (raw file):
} else {
    var err error
    err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {
if err := ...
pkg/roachprod/roachprod.go, line 1027 at r2 (raw file):
}
err = c.Wipe(false)
if err != nil {
if err := ...
pkg/roachprod/install/cluster_synced.go, line 275 at r2 (raw file):
display += " and waiting" } err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {
return c.Parallel(...
pkg/roachprod/install/cluster_synced.go, line 326 at r2 (raw file):
display := fmt.Sprintf("%s: wiping", c.Name) err := c.Stop(9, true /* wait */) if err != nil {
if err :=
pkg/roachprod/install/cluster_synced.go, line 329 at r2 (raw file):
    return err
}
err = c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {
return c.Parallel
pkg/roachprod/install/cluster_synced.go, line 364 at r2 (raw file):
display := fmt.Sprintf("%s: status", c.Name) results := make([]string, len(c.Nodes)) err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {
return c.Parallel(
pkg/roachprod/install/cluster_synced.go, line 602 at r2 (raw file):
errs := make([]error, len(nodes))
results := make([]string, len(nodes))
err := c.Parallel(display, len(nodes), 0, func(i int) ([]byte, error) {
if err := ... ; err != nil { return nil, err }
pkg/roachprod/install/cluster_synced.go, line 660 at r2 (raw file):
})
if !stream {
Do we really want to do this on error?
pkg/roachprod/install/cluster_synced.go, line 677 at r2 (raw file):
display := fmt.Sprintf("%s: waiting for nodes to start", c.Name) errs := make([]error, len(c.Nodes)) err := c.Parallel(display, len(c.Nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 741 at r2 (raw file):
// cluster in order to allow inter-node ssh.
var sshTar []byte
err := c.Parallel("generating ssh key", 1, 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 775 at r2 (raw file):
// Skip the first node which is where we generated the key.
nodes := c.Nodes[1:]
err = c.Parallel("distributing ssh key", len(nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 798 at r2 (raw file):
// (which is used in jepsen tests).
ips := make([]string, len(c.Nodes), len(c.Nodes)*2)
err = c.Parallel("retrieving hosts", len(c.Nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 820 at r2 (raw file):
}
var knownHostsData []byte
err = c.Parallel("scanning hosts", 1, 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 863 at r2 (raw file):
} err = c.Parallel("distributing known_hosts", len(c.Nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 913 at r2 (raw file):
// gce and the shared user on aws) as well as to the shared user on both
// platforms.
err = c.Parallel("adding additional authorized keys", len(c.Nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 967 at r2 (raw file):
var existsErr error
display := fmt.Sprintf("%s: checking certs", c.Name)
err := c.Parallel(display, 1, 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 992 at r2 (raw file):
if !c.IsLocal() {
    ips = make([]string, len(nodes))
    err = c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 1003 at r2 (raw file):
// Generate the ca, client and node certificates on the first node.
err = c.Parallel(display, 1, 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 1095 at r2 (raw file):
display = c.Name + ": distributing certs" nodes = nodes[1:] err = c.Parallel(display, len(nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cluster_synced.go, line 1699 at r2 (raw file):
if haveErr {
    return errors.Newf("get %s failed", src)
This is odd: if haveErr is true, don't we have an error somewhere? Why are we swallowing it and printing just a more unhelpful error? Same applies to all of the other if haveErr branches in this code.
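One way to address this, sketched below under the assumption that the per-node errors are already collected in an errs slice (as the surrounding code does), is to combine them with the github.com/cockroachdb/errors helpers instead of replacing them with a generic message. The getFiles name and its arguments are hypothetical.

package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// getFiles is a hypothetical reduction of the `get` logic under discussion:
// each node records its failure in errs, and instead of collapsing everything
// into a bare "get failed" message, the per-node errors are combined and
// returned so the caller sees the real cause.
func getFiles(src string, errs []error) error {
	var combined error
	for i, err := range errs {
		if err != nil {
			combined = errors.CombineErrors(combined,
				errors.Wrapf(err, "node %d", i+1))
		}
	}
	if combined != nil {
		return errors.Wrapf(combined, "get %s failed", src)
	}
	return nil
}

func main() {
	errs := []error{nil, errors.New("connection refused")}
	fmt.Println(getFiles("logs/cockroach.log", errs))
}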
pkg/roachprod/install/cluster_synced.go, line 1718 at r2 (raw file):
func (c *SyncedCluster) pghosts(nodes []int) (map[int]string, error) {
    ips := make([]string, len(nodes))
    err := c.Parallel("", len(nodes), 0, func(i int) ([]byte, error) {
ditto
pkg/roachprod/install/cockroach.go, line 146 at r2 (raw file):
func (r Cockroach) Start(c *SyncedCluster, extraArgs []string) error {
    h := &crdbInstallHelper{c: c, r: r}
    err := h.distributeCerts()
ditto
pkg/roachprod/install/cockroach.go, line 158 at r2 (raw file):
fmt.Printf("%s: starting nodes\n", c.Name) err = c.Parallel("", len(nodes), parallelism, func(nodeIdx int) ([]byte, error) {
ditto
pkg/roachprod/install/cockroach.go, line 229 at r2 (raw file):
clusterSettingsOut, err := h.setClusterSettings(nodeIdx)
if err != nil {
    return nil, errors.WithDetail(err, "unable to set cluster settings")
I don't think you want to use WithDetail anywhere in this code. The "detail" will not actually be printed by default. I think you would just want errors.Wrapf here and across the diff.
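A small sketch of the difference, assuming the github.com/cockroachdb/errors package that roachprod already uses: WithDetail only surfaces its text in verbose (%+v) output, while Wrapf prefixes the main error message.

package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

func main() {
	base := errors.New("connection refused")

	// errors.WithDetail attaches text that is only shown in verbose
	// renderings (e.g. %+v), so a plain print of the error hides it.
	withDetail := errors.WithDetail(base, "unable to set cluster settings")
	fmt.Printf("%v\n", withDetail) // connection refused

	// errors.Wrapf puts the context into the main error message, which is
	// what the review asks for here.
	wrapped := errors.Wrapf(base, "unable to set cluster settings")
	fmt.Printf("%v\n", wrapped) // unable to set cluster settings: connection refused
}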
pkg/roachprod/install/cockroach.go, line 319 at r2 (raw file):
display := fmt.Sprintf("%s: executing sql", c.Name) err := c.Parallel(display, len(c.Nodes), 0, func(nodeIdx int) ([]byte, error) {
ditto
pkg/roachprod/install/cockroach.go, line 714 at r2 (raw file):
for _, node := range h.c.ServerNodes() {
    if node == 1 && h.c.Secure {
        err := h.c.DistributeCerts()
ditto
Fatal logging prevented errors from being properly propagated back to the binary error wrapper. Release note: None
Force-pushed from 0db537b to 3d141a8
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod, @rail, and @tbg)
pkg/roachprod/install/cluster_synced.go, line 1699 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is odd: if haveErr is true, don't we have an error somewhere? Why are we swallowing it and printing just a more unhelpful error? Same applies to all of the other if haveErr branches in this code.
Absolutely. I should make that change in #72473, as I am trying to keep this PR simple and focused on removing calls to log.Fatal.
pkg/roachprod/install/cockroach.go, line 229 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't think you want to use WithDetail anywhere in this code. The "detail" will not actually be printed by default. I think you would just want errors.Wrapf here and across the diff.
Done
TFTR!
Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rail)
Reviewed 6 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @healthy-pod)
bors r=[tbg,rail]
Build failed (retrying...)
Build succeeded
Fatal logging prevented errors from being properly propagated back
to the binary error wrapper.
Release note: None