acceptance: fix spurious docker test failure `unexpected extra event &{0 die} (after [])` #86862

knz · 2022-08-25T12:31:27Z

Needed for #86049.

The root cause of the issue is that the docker container for the DB nodes were crashing, because the test framework was incorrectly deleting the data/log directory before shutting down the nodes.

However, this root cause was mostly hidden because of another mistake: the assertion that checks the health of the nodes was not doing its job properly in most cases. So it believed the nodes were healthy when in fact they were not.

This PR fixes both problems.

Informs #58955.
Informs #58951.
Informs #58218.

Release justification: non-production code changes

cockroach-teamcity · 2022-08-25T12:31:42Z

This change is

knz · 2022-08-25T12:39:49Z

As far as I can see, this problem has existed for a very long time. Throughout most of its existence, it was lucky that 15ms was nearly never enough in Assert to retrieve the docker failure event from the node container. I guess the docker controller has become faster in recent versions.

Release justification: non-production code changes Release note: None

Prior to this change, the `defer` for the directory deletion could run before the docker shutdown sequence. This was incorrect, and could cause the docker container to fail too early. The result was that the docker controller would report a `die` event before the test was expecting it (in the `Assert` method), and the assert would fail. This commit fixes it. Release justification: non-production code changes Release note: None

knz · 2022-08-25T13:27:46Z

NB CI failing due to #86838

rickystewart · 2022-08-25T15:30:39Z

Awesome, it's great to finally see this fixed.

renatolabs

Thanks for fixing this! 🎉 Left a few minor comments.

Reviewed 1 of 1 files at r1, 3 of 3 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 1 of 1 files at r12, 2 of 2 files at r13, 1 of 1 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)

pkg/acceptance/cluster/dockercluster.go line 591 at r10 (raw file):

func (l *DockerCluster) monitor(ctx context.Context, monitorDone chan struct{}) {
	defer func() { close(monitorDone) }()

Nit: why not defer close(monitorDone)?

pkg/acceptance/cluster/dockercluster.go line 650 at r10 (raw file):

	l.monitorCtx, l.monitorCtxCancelFunc = context.WithCancel(context.Background())
	l.monitorDone = make(chan struct{})
	go l.monitor(ctx, l.monitorDone)

Nit: it would make sense if monitorDone were always associated with the monitor() call. In other words, not passing a channel to the monitor function and closing monitorDone there instead.

pkg/acceptance/cluster/dockercluster.go line 684 at r14 (raw file):

		// The expected event channel is buffered and should contain
		// all expected events already.
		const almostZero = 15 * time.Millisecond

Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?

pkg/acceptance/cluster/dockercluster.go line 689 at r14 (raw file):

			break
		}
		t.Logf("expecting event: %v", exp)

Could we change these log messages to use %#v instead, for a more readable output?

The cluster stop() method was not waiting for the monitor goroutine to actually shut down before it would go on to kill the nodes ungracefully. The result could be spurious `die` events in the final assertion. Release justification: non-production code changes Release note: None

This clarifies that the container is not expected to run anymore at the end of a wait. Release justification: non-production code changes Release note: None

(Since the latter depend on the former) Also, this preserves the Nodes slice, so it can be inspected with a debugger. Release justification: non-production code changes Release note: None

Prior to this patch, the docker volume directory was deleted before the cluster nodes were shut down, which could cause them to crash abnormally and have the Assert method return an unexpected failure, even when the test would otherwise succeed. This patch fixes it. Release justification: non-production code changes Release note: None

Prior to this patch, the Assert code was reading events from the docker event channel too fast, possibly missing some events in the process. It is because the events were read too fast that the problem fixed in the previous commit did not always result in a test crash/failure (i.e. the assert was too fast to do its job properly in most cases). By slowing down the reading of events from docker, we can surface logic error in the surrounding test framework more clearly. Release justification: non-production code changes Release note: None

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @renatolabs, and @tbg)

pkg/acceptance/cluster/dockercluster.go line 591 at r10 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: why not defer close(monitorDone)?

sure why not

pkg/acceptance/cluster/dockercluster.go line 684 at r14 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?

If we specify zero, the timeout branch will be taken and the channel will not be read, resulting in incorrect behavior.

pkg/acceptance/cluster/dockercluster.go line 689 at r14 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Could we change these log messages to use %#v instead, for a more readable output?

In general I'd be friendly to this idea, but given how simple the struct is, it's not worth it here.

knz · 2022-08-25T16:19:55Z

bors r=rickystewart,renatolabs

craig · 2022-08-25T20:42:37Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2022-08-25T22:21:24Z

Build succeeded:

Bazel Essential CI (Cockroach)

knz requested review from rickystewart, tbg and a team August 25, 2022 12:31

knz mentioned this pull request Aug 25, 2022

*: upgrade to go 1.19.1 #86049

Merged

15 tasks

knz force-pushed the 20220825-acceptance branch from f82c8de to d7c3f28 Compare August 25, 2022 12:38

knz force-pushed the 20220825-acceptance branch from d7c3f28 to 56ce2eb Compare August 25, 2022 13:15

knz added 2 commits August 25, 2022 13:22

acceptance: link failing TestDockerCSharp to the proper issue

07c45b1

Release justification: non-production code changes Release note: None

knz force-pushed the 20220825-acceptance branch from 56ce2eb to ca82cb6 Compare August 25, 2022 13:22

knz requested a review from renatolabs August 25, 2022 14:46

rickystewart approved these changes Aug 25, 2022

View reviewed changes

renatolabs approved these changes Aug 25, 2022

View reviewed changes

knz added 5 commits August 25, 2022 18:18

acceptance/cluster: simplify the interface

45de70e

This clarifies that the container is not expected to run anymore at the end of a wait. Release justification: non-production code changes Release note: None

acceptance: remove the volume ctn after the node containers

551c2e1

(Since the latter depend on the former) Also, this preserves the Nodes slice, so it can be inspected with a debugger. Release justification: non-production code changes Release note: None

knz commented Aug 25, 2022

View reviewed changes

knz force-pushed the 20220825-acceptance branch from ca82cb6 to 9b4fd92 Compare August 25, 2022 16:19

craig bot merged commit 6b3cd4b into cockroachdb:master Aug 25, 2022

knz deleted the 20220825-acceptance branch September 6, 2022 08:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

acceptance: fix spurious docker test failure `unexpected extra event &{0 die} (after [])` #86862

acceptance: fix spurious docker test failure `unexpected extra event &{0 die} (after [])` #86862

knz commented Aug 25, 2022 •

edited

Loading

cockroach-teamcity commented Aug 25, 2022

knz commented Aug 25, 2022

knz commented Aug 25, 2022

rickystewart commented Aug 25, 2022

renatolabs left a comment

knz left a comment

knz commented Aug 25, 2022

craig bot commented Aug 25, 2022

craig bot commented Aug 25, 2022

acceptance: fix spurious docker test failure unexpected extra event &{0 die} (after []) #86862

acceptance: fix spurious docker test failure unexpected extra event &{0 die} (after []) #86862

Conversation

knz commented Aug 25, 2022 • edited Loading

cockroach-teamcity commented Aug 25, 2022

knz commented Aug 25, 2022

knz commented Aug 25, 2022

rickystewart commented Aug 25, 2022

renatolabs left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz commented Aug 25, 2022

craig bot commented Aug 25, 2022

craig bot commented Aug 25, 2022

acceptance: fix spurious docker test failure `unexpected extra event &{0 die} (after [])` #86862

acceptance: fix spurious docker test failure `unexpected extra event &{0 die} (after [])` #86862

knz commented Aug 25, 2022 •

edited

Loading