Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acceptance: fix spurious docker test failure unexpected extra event &{0 die} (after []) #86862

Merged
merged 7 commits into from
Aug 25, 2022

Conversation

knz
Copy link
Contributor

@knz knz commented Aug 25, 2022

Needed for #86049.

The root cause of the issue is that the docker container for the DB nodes were crashing, because the test framework was incorrectly deleting the data/log directory before shutting down the nodes.

However, this root cause was mostly hidden because of another mistake: the assertion that checks the health of the nodes was not doing its job properly in most cases. So it believed the nodes were healthy when in fact they were not.

This PR fixes both problems.

Informs #58955.
Informs #58951.
Informs #58218.

Release justification: non-production code changes

@knz knz requested review from rickystewart, tbg and a team August 25, 2022 12:31
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz mentioned this pull request Aug 25, 2022
15 tasks
@knz knz force-pushed the 20220825-acceptance branch from f82c8de to d7c3f28 Compare August 25, 2022 12:38
@knz
Copy link
Contributor Author

knz commented Aug 25, 2022

As far as I can see, this problem has existed for a very long time. Throughout most of its existence, it was lucky that 15ms was nearly never enough in Assert to retrieve the docker failure event from the node container. I guess the docker controller has become faster in recent versions.

@knz knz force-pushed the 20220825-acceptance branch from d7c3f28 to 56ce2eb Compare August 25, 2022 13:15
knz added 2 commits August 25, 2022 13:22
Release justification: non-production code changes

Release note: None
Prior to this change, the `defer` for the directory deletion could
run before the docker shutdown sequence. This was incorrect,
and could cause the docker container to fail too early.

The result was that the docker controller would report a `die` event
before the test was expecting it (in the `Assert` method), and the
assert would fail.

This commit fixes it.

Release justification: non-production code changes

Release note: None
@knz knz force-pushed the 20220825-acceptance branch from 56ce2eb to ca82cb6 Compare August 25, 2022 13:22
@knz
Copy link
Contributor Author

knz commented Aug 25, 2022

NB CI failing due to #86838

@knz knz requested a review from renatolabs August 25, 2022 14:46
@rickystewart
Copy link
Collaborator

Awesome, it's great to finally see this fixed.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! 🎉 Left a few minor comments.

Reviewed 1 of 1 files at r1, 3 of 3 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 1 of 1 files at r12, 2 of 2 files at r13, 1 of 1 files at r14.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/acceptance/cluster/dockercluster.go line 591 at r10 (raw file):

func (l *DockerCluster) monitor(ctx context.Context, monitorDone chan struct{}) {
	defer func() { close(monitorDone) }()

Nit: why not defer close(monitorDone)?


pkg/acceptance/cluster/dockercluster.go line 650 at r10 (raw file):

	l.monitorCtx, l.monitorCtxCancelFunc = context.WithCancel(context.Background())
	l.monitorDone = make(chan struct{})
	go l.monitor(ctx, l.monitorDone)

Nit: it would make sense if monitorDone were always associated with the monitor() call. In other words, not passing a channel to the monitor function and closing monitorDone there instead.


pkg/acceptance/cluster/dockercluster.go line 684 at r14 (raw file):

		// The expected event channel is buffered and should contain
		// all expected events already.
		const almostZero = 15 * time.Millisecond

Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?


pkg/acceptance/cluster/dockercluster.go line 689 at r14 (raw file):

			break
		}
		t.Logf("expecting event: %v", exp)

Could we change these log messages to use %#v instead, for a more readable output?

knz added 5 commits August 25, 2022 18:18
The cluster stop() method was not waiting for the monitor
goroutine to actually shut down before it would go on
to kill the nodes ungracefully. The result could be spurious
`die` events in the final assertion.

Release justification: non-production code changes

Release note: None
This clarifies that the container is not expected
to run anymore at the end of a wait.

Release justification: non-production code changes

Release note: None
(Since the latter depend on the former)

Also, this preserves the Nodes slice, so it can be inspected
with a debugger.

Release justification: non-production code changes

Release note: None
Prior to this patch, the docker volume directory was deleted before
the cluster nodes were shut down, which could cause them to crash
abnormally and have the Assert method return an unexpected failure,
even when the test would otherwise succeed.

This patch fixes it.

Release justification: non-production code changes

Release note: None
Prior to this patch, the Assert code was reading events from
the docker event channel too fast, possibly missing some events
in the process.

It is because the events were read too fast that the
problem fixed in the previous commit did not always result
in a test crash/failure (i.e. the assert was too fast to
do its job properly in most cases).

By slowing down the reading of events from docker, we can
surface logic error in the surrounding test framework
more clearly.

Release justification: non-production code changes

Release note: None
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @renatolabs, and @tbg)


pkg/acceptance/cluster/dockercluster.go line 591 at r10 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: why not defer close(monitorDone)?

sure why not


pkg/acceptance/cluster/dockercluster.go line 684 at r14 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?

If we specify zero, the timeout branch will be taken and the channel will not be read, resulting in incorrect behavior.


pkg/acceptance/cluster/dockercluster.go line 689 at r14 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Could we change these log messages to use %#v instead, for a more readable output?

In general I'd be friendly to this idea, but given how simple the struct is, it's not worth it here.

@knz knz force-pushed the 20220825-acceptance branch from ca82cb6 to 9b4fd92 Compare August 25, 2022 16:19
@knz
Copy link
Contributor Author

knz commented Aug 25, 2022

bors r=rickystewart,renatolabs

@craig
Copy link
Contributor

craig bot commented Aug 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 25, 2022

Build succeeded:

@craig craig bot merged commit 6b3cd4b into cockroachdb:master Aug 25, 2022
@knz knz deleted the 20220825-acceptance branch September 6, 2022 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants