`kola run` should allow for exiting zero even if tests failed #1059

jlebon · 2019-09-17T18:56:54Z

Right now, there's no practical way to distinguish a test failure from a Go panic or a similar error. It would be really nice to have something like an `--exit-zero-on-test-failures` switch so we can tell those two cases apart.

There are some gray areas, of course (see the prior discussion in coreos/fedora-coreos-pipeline#114 (comment)), but we can at least start with the obvious cases (e.g. not passing the right `kola run` switches).

arithx · 2019-09-17T19:06:18Z

Thanks for writing this up, somehow it seems to have slipped off my radar.

ajeddeloh · 2019-09-17T19:39:17Z

just spitballing: could assign different exit codes for different failures. 1 for panic, 2 for test failure, etc

jlebon · 2019-09-17T19:49:56Z

To clarify, I think we're mostly interested in distinguishing between two broad categories: test failures and "everything else". Being able to tell kola not to exit nonzero on the former means a shell script with `set -e` will naturally error out if any other type of failure occurs.

jlebon · 2019-09-26T20:12:45Z

On this topic, noticed this today:

```
kola -b fcos -p qemu-unpriv --qemu-image builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz --output-dir tmp/kola run
=== RUN   coreos.auth.verify
=== RUN   coreos.ignition.groups
=== RUN   coreos.selinux.enforce
=== RUN   coreos.ignition.once
=== RUN   systemd.sysusers.gshadow
=== RUN   rpmostree.status
=== RUN   fcos.python
=== RUN   fcos.basic
=== RUN   coreos.ignition.sethostname
=== RUN   rhcos.selinux.boolean.persist
=== RUN   coreos.selinux.boolean
=== RUN   rpmostree.upgrade-rollback
--- FAIL: rhcos.selinux.boolean.persist (0.04s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.selinux.boolean (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: fcos.basic (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.auth.verify (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.selinux.enforce (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: fcos.python (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: rpmostree.status (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: rpmostree.upgrade-rollback (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.once (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.sethostname (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: systemd.sysusers.gshadow (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.groups (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
```

I guess as it's set up right now, a missing qcow2 translates to failing tests, whereas I would've expected kola to error out upfront before running even one test.

arithx · 2019-09-27T03:10:10Z

I know I've talked with various people outside of this thread about this in the past, but before I dive into a response, let me give a dump of relevant information. Feel free to reach out if you want to dig into any of these topics:

The way the runner currently works, very few things are performed outside of individual tests. Essentially, the runner just constructs the platform Flight and a list of tests from the glob, and starts them.
Setup tasks (machine creation, initial validations, connecting via SSH) are run as part of the test, as are the destruction / cleanup routines.
As such, the only real failure states that could be caught and identified as not being part of a test failure would be:
1. Flight creation (not machine / cluster creation)
2. Test filtering
3. Creation of diagnostics / TAP files
Because of daisy-chained error results, telling platform errors apart from legitimate test failures would require completely reworking the error handling of the entire project (and even then I'm not sure I'd feel comfortable saying that the platform error isn't a legitimate failure in some cases).
I'd probably lean more towards what @ajeddeloh mentioned earlier in the thread and provide a distinct error code for test failure vs. general failure.

> I guess as it's set up right now, a missing qcow2 translates to failing tests. Whereas I would've expected kola to error out upfront before even running one test.

Are you advocating for adding platform-specific checks inside the runner? I'm not sure I agree with that. The Platform/API code knows how to interact with the specific platform to validate that parameters are correct and exist, while the runner just knows how to create Tests and run them. Machine creation is essentially a setup task for the test; I don't see this failure as any different from a test failing because its setup required connecting to a database that wasn't routable. Yes, in this specific case `--qemu-image` happens to be a file on the local filesystem, but I don't think we should special-case the platform just because of that. It is essentially the same as kola being given an incorrect AMI ID or an invalid region + machine type configuration.

arithx · 2020-01-09T18:07:03Z

Closed in #1153

Now that coreos/mantle#1059 is fixed (see coreos/mantle#1153), we can use the new `--no-test-exit-error` switch to be more strict about kola error handling. This way, we immediately fail the build if something fundamental went wrong with kola.

jlebon mentioned this issue Sep 17, 2019

Jenkinsfiles: don't ignore errors from kola command coreos/fedora-coreos-pipeline#138

Closed

arithx added component/kola kind/enhancement labels Sep 17, 2019

arithx closed this as completed Jan 9, 2020

jlebon mentioned this issue Jan 15, 2020

Jenkinsfile: stop ignoring kola errors coreos/fedora-coreos-pipeline#188

Merged
