functional: improve TestReconfigureServer test #1554

tixxdz · 2016-04-15T13:44:30Z

Remove the sleep call that allowed us to produce reliable tests, it
turns out it may fail in rare cases. I suspect this due to journald
delays and how things are serialized and started when running the test.

Instead, start a normal unit then unload it to reproduce a real behaviour
then follow up with the SIGHUG on fleetd.

If the "Reloading configuration" message do not show up on the logs, then
we do not fail since the signal was received and processed, instead
continue with the "list-units" check, if it succeed then the test should have
passed at this point. However, we continue and check the logs that we ignored
previously since the test also checks for the
"Failed serving HTTP on listener" which informs us that fleetd failed.
So we check the logs again if we find them we proceed with that message
and check it, otherwise we skip the test at this point. We do not want
developers to have failing functional tests results due to obscure
delays. However the test succeed now and is more reliable. I have been running
it for more than 1 hour now (more than 100 times), it did not fail and it was
never skipped.

dongsupark · 2016-04-15T13:57:07Z

Looks good. There are several typos like "sighug", but other than that it looks fine to me.

Remove the sleep call that allowed us to produce reliable tests, it turns out it may fail in rare cases. I suspect this due to journald delays and how things are serialized and started when running the test. Instead, start a normal unit then unload it to reproduce a real behaviour then follow up with the SIGHUP on fleetd. If the "Reloading configuration" message do not show up on the logs, then we do not fail since the signal was received and processed, instead continue with the "list-units" check, if it succeed then the test should have passed at this point. However, we continue and check the logs that we ignored previously since the test also checks for the "Failed serving HTTP on listener" which informs us that fleetd failed. So we check the logs again if we find them we proceed with that message and check it, otherwise we skip the test at this point. We do not want developers to have failing functional tests results due to obscure delays. However the test succeed now and is more reliable. I have been running it for more than 1 hour now (more than 100 times), it did not fail and it was never skipped.

antrik · 2016-04-15T14:40:31Z

functional/server_test.go

-	// Without this sleep, the entire fleetd test always ends up succeeding
-	// no matter whether SIGHUP came or not.
-	_, _ = cluster.MemberCommand(m0, "sh", "-c", `'sleep 2'`)
-
 	err = waitForFleetdSocket(cluster, m0)
 	if err != nil {
 		t.Fatalf("Failed to get a list of fleetd sockets: %v", err)
 	}


Considering that you now ensure fleetd is fully functional by executing the commands below, I don't think there is any point in keeping this check here?...

Please submit another PR and make sure that it works, my change was to remove the sleep and make the test reliable so other PRs can be merged.

antrik · 2016-04-15T15:08:15Z

@tixxdz as a general remark on procedure, I don't think it's a good idea to merge major changes like this hastily without time for a detailed review. If a test is problematic because of the intermittent failures, it's probably better to disable it temporarily to take out the urgency, and follow up with a separate PR for fixing the test that can be reviewed without hurry :-)

tixxdz · 2016-04-15T15:17:55Z

@tixxdz as a general remark on procedure, I don't think it's a good idea to merge major changes like this hastily without time for a detailed review. If a test is problematic because of the intermittent failures, it's probably better to disable it temporarily to take out the urgency, and follow up with a separate PR for fixing the test that can be reviewed without hurry :-)

That's precisely what this PR does, it tries its best to test the behaviour, and at the end if it can't then it will skip it which is perfectly sane from my point of view, and this is not a major change this is improving functional test which was succeeding 90%, I tried to improve it then if we can't determine we skip which didn't happen in my tests, but better safe than sorry. Now it would be really much productive if you could please open a new PR and investigate this and probably remove that t.Skip() call.

Thank you!

tixxdz force-pushed the tixxdz/server-reconfigure-improvements branch from 60d51f7 to 9b173e1 Compare April 15, 2016 14:00

tixxdz merged commit 071230a into coreos:master Apr 15, 2016

antrik reviewed Apr 15, 2016
View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

functional: improve TestReconfigureServer test #1554

functional: improve TestReconfigureServer test #1554

tixxdz commented Apr 15, 2016

dongsupark commented Apr 15, 2016

antrik Apr 15, 2016

tixxdz Apr 15, 2016

antrik commented Apr 15, 2016

tixxdz commented Apr 15, 2016

functional: improve TestReconfigureServer test #1554

functional: improve TestReconfigureServer test #1554

Conversation

tixxdz commented Apr 15, 2016

dongsupark commented Apr 15, 2016

antrik Apr 15, 2016

Choose a reason for hiding this comment

tixxdz Apr 15, 2016

Choose a reason for hiding this comment

antrik commented Apr 15, 2016

tixxdz commented Apr 15, 2016