-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
acceptance: fix spurious docker test failure unexpected extra event &{0 die} (after [])
#86862
Conversation
f82c8de
to
d7c3f28
Compare
As far as I can see, this problem has existed for a very long time. Throughout most of its existence, it was lucky that 15ms was nearly never enough in Assert to retrieve the docker failure event from the node container. I guess the docker controller has become faster in recent versions. |
d7c3f28
to
56ce2eb
Compare
Release justification: non-production code changes Release note: None
Prior to this change, the `defer` for the directory deletion could run before the docker shutdown sequence. This was incorrect, and could cause the docker container to fail too early. The result was that the docker controller would report a `die` event before the test was expecting it (in the `Assert` method), and the assert would fail. This commit fixes it. Release justification: non-production code changes Release note: None
56ce2eb
to
ca82cb6
Compare
NB CI failing due to #86838 |
Awesome, it's great to finally see this fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this! 🎉 Left a few minor comments.
Reviewed 1 of 1 files at r1, 3 of 3 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 2 of 2 files at r11, 1 of 1 files at r12, 2 of 2 files at r13, 1 of 1 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/acceptance/cluster/dockercluster.go
line 591 at r10 (raw file):
func (l *DockerCluster) monitor(ctx context.Context, monitorDone chan struct{}) { defer func() { close(monitorDone) }()
Nit: why not defer close(monitorDone)
?
pkg/acceptance/cluster/dockercluster.go
line 650 at r10 (raw file):
l.monitorCtx, l.monitorCtxCancelFunc = context.WithCancel(context.Background()) l.monitorDone = make(chan struct{}) go l.monitor(ctx, l.monitorDone)
Nit: it would make sense if monitorDone
were always associated with the monitor()
call. In other words, not passing a channel to the monitor
function and closing monitorDone
there instead.
pkg/acceptance/cluster/dockercluster.go
line 684 at r14 (raw file):
// The expected event channel is buffered and should contain // all expected events already. const almostZero = 15 * time.Millisecond
Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?
pkg/acceptance/cluster/dockercluster.go
line 689 at r14 (raw file):
break } t.Logf("expecting event: %v", exp)
Could we change these log messages to use %#v
instead, for a more readable output?
The cluster stop() method was not waiting for the monitor goroutine to actually shut down before it would go on to kill the nodes ungracefully. The result could be spurious `die` events in the final assertion. Release justification: non-production code changes Release note: None
This clarifies that the container is not expected to run anymore at the end of a wait. Release justification: non-production code changes Release note: None
(Since the latter depend on the former) Also, this preserves the Nodes slice, so it can be inspected with a debugger. Release justification: non-production code changes Release note: None
Prior to this patch, the docker volume directory was deleted before the cluster nodes were shut down, which could cause them to crash abnormally and have the Assert method return an unexpected failure, even when the test would otherwise succeed. This patch fixes it. Release justification: non-production code changes Release note: None
Prior to this patch, the Assert code was reading events from the docker event channel too fast, possibly missing some events in the process. It is because the events were read too fast that the problem fixed in the previous commit did not always result in a test crash/failure (i.e. the assert was too fast to do its job properly in most cases). By slowing down the reading of events from docker, we can surface logic error in the surrounding test framework more clearly. Release justification: non-production code changes Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @renatolabs, and @tbg)
pkg/acceptance/cluster/dockercluster.go
line 591 at r10 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: why not
defer close(monitorDone)
?
sure why not
pkg/acceptance/cluster/dockercluster.go
line 684 at r14 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Curious: if we expect the channel to contain all expected events at this point, why do we wait 15ms?
If we specify zero, the timeout branch will be taken and the channel will not be read, resulting in incorrect behavior.
pkg/acceptance/cluster/dockercluster.go
line 689 at r14 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Could we change these log messages to use
%#v
instead, for a more readable output?
In general I'd be friendly to this idea, but given how simple the struct is, it's not worth it here.
ca82cb6
to
9b4fd92
Compare
bors r=rickystewart,renatolabs |
Build failed (retrying...): |
Build succeeded: |
Needed for #86049.
The root cause of the issue is that the docker container for the DB nodes were crashing, because the test framework was incorrectly deleting the data/log directory before shutting down the nodes.
However, this root cause was mostly hidden because of another mistake: the assertion that checks the health of the nodes was not doing its job properly in most cases. So it believed the nodes were healthy when in fact they were not.
This PR fixes both problems.
Informs #58955.
Informs #58951.
Informs #58218.
Release justification: non-production code changes