This repository has been archived by the owner on May 7, 2021. It is now read-only.

kola run should allow for exiting zero even if tests failed #1059

Closed
jlebon opened this issue Sep 17, 2019 · 6 comments

Comments

@jlebon
Member

jlebon commented Sep 17, 2019

Right now, there's no practical way to distinguish a test failure from a golang panic or some other kind of error. It would be really nice to have something like --exit-zero-on-test-failures so we can tell those two cases apart.

There are some gray areas of course (see prior discussions in coreos/fedora-coreos-pipeline#114 (comment)), though we can at least start with the obvious stuff (e.g. not passing the right kola run switches).

@arithx
Contributor

arithx commented Sep 17, 2019

Thanks for writing this up; somehow it seems to have slipped off my radar.

@ajeddeloh
Contributor

Just spitballing: we could assign different exit codes to different failures, e.g. 1 for a panic, 2 for a test failure, etc.
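
As a rough illustration of that idea, here is how a calling script might branch on distinct exit codes. The values 1 and 2 simply follow the suggestion above and are not codes kola is known to use; `classify_exit` and `fake_runner` are hypothetical helpers, with `fake_runner` standing in for the real test runner.

```shell
#!/bin/sh
# Hypothetical convention taken from the suggestion above:
# 1 = panic or other general error, 2 = test failure.
classify_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "general-error" ;;
        2) echo "test-failure" ;;
        *) echo "unknown" ;;
    esac
}

# Stand-in for a test runner: it exits with whatever code it is given.
fake_runner() {
    return "$1"
}

# Capture the exit code without letting a nonzero status abort the script.
rc=0
fake_runner 2 || rc=$?
echo "runner exited $rc: $(classify_exit "$rc")"
```

A CI job could then mark the build as unstable for `test-failure` and as failed for anything else.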

@jlebon
Member Author

jlebon commented Sep 17, 2019

To clarify, I think we're mostly interested in discerning between two broad categories: test failures, and "everything else". Being able to tell kola to not exit nonzero on the former means we can let a shell script with set -e just naturally error out if any other type of failure occurs.
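
A minimal sketch of that pattern follows. `fake_runner` is a stand-in that only simulates outcomes and is not real kola; the flag name is the one proposed at the top of this issue, and `STEP` and `run_step` are made up for the demonstration.

```shell
#!/bin/sh
# Body of a CI step. fake_runner is a stand-in, NOT real kola; its first
# argument selects the simulated outcome.
STEP='
set -e
fake_runner() {
    case "$1" in
        pass) return 0 ;;
        test-failure)
            # Proposed behaviour: test failures do not affect the exit code.
            if [ "$2" = "--exit-zero-on-test-failures" ]; then return 0; fi
            return 1 ;;
        *) return 1 ;;  # panic, wrong switches, and so on
    esac
}
fake_runner "$1" --exit-zero-on-test-failures
echo "runner finished; inspecting test results"
'

# Run the step in a fresh shell so that set -e applies in full.
run_step() {
    sh -c "$STEP" step "$1"
}

for outcome in pass test-failure panic; do
    if run_step "$outcome" >/dev/null; then
        echo "$outcome: step continued"
    else
        echo "$outcome: step aborted by set -e"
    fi
done
```

With the flag, only the simulated panic stops the step; a test failure lets the script carry on to whatever inspects the test results.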

@jlebon
Member Author

jlebon commented Sep 26, 2019

On this topic, noticed this today:

kola -b fcos -p qemu-unpriv --qemu-image builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz --output-dir tmp/kola run
=== RUN   coreos.auth.verify
=== RUN   coreos.ignition.groups
=== RUN   coreos.selinux.enforce
=== RUN   coreos.ignition.once
=== RUN   systemd.sysusers.gshadow
=== RUN   rpmostree.status
=== RUN   fcos.python
=== RUN   fcos.basic
=== RUN   coreos.ignition.sethostname
=== RUN   rhcos.selinux.boolean.persist
=== RUN   coreos.selinux.boolean
=== RUN   rpmostree.upgrade-rollback
--- FAIL: rhcos.selinux.boolean.persist (0.04s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.selinux.boolean (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: fcos.basic (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.auth.verify (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.selinux.enforce (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: fcos.python (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: rpmostree.status (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: rpmostree.upgrade-rollback (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.once (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.sethostname (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: systemd.sysusers.gshadow (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory
--- FAIL: coreos.ignition.groups (0.03s)
        harness.go:507: Cluster failed starting machines: lstat /home/jenkins/workspace/fedora-coreos-config_PR-176-5KUJYYU75I7NAEJRFK3ZQBTT6ZKISHT6E4MEAV62GX3G6FCWWNCA/cosa/builds/30.20190920.dev.0/x86_64/fedora-coreos-30.20190920.dev.0-qemu.qcow2.xz: no such file or directory

I guess as it's set up right now, a missing qcow2 translates to failing tests. Whereas I would've expected kola to error out upfront before even running one test.
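
One way to get that fail-fast behaviour without changing kola is a check in the calling script. The sketch below is only that: `require_image` and the default path are hypothetical, and the check only makes sense on platforms where the image is a local file.

```shell
#!/bin/sh
# Caller-side check, not part of kola: fail fast if the image file is missing.
require_image() {
    if [ ! -f "$1" ]; then
        echo "image not found: $1" >&2
        return 1
    fi
}

# Hypothetical default path; override it with the IMAGE environment variable.
image=${IMAGE:-builds/latest/x86_64/example-qemu.qcow2.xz}

if require_image "$image"; then
    echo "image present, starting test run"
else
    echo "skipping test run"
fi
```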

@arithx
Contributor

arithx commented Sep 27, 2019

I know I've talked with various people outside of this thread about this in the past, but before I dive into a response, let me give a dump of relevant information. Feel free to reach out if you want to dig further into any of these topics:

  • The way the runner currently works, very few things are performed outside of individual tests. Essentially, the runner just constructs the platform Flight, builds a list of tests from the glob, and starts them.
  • Setup tasks (machine creation, initial validations, connecting via SSH) are run as part of the test, as are the destruction / cleanup routines.
  • As such, the only real failure states that could be caught and identified as not being part of a test failure would be:
    1. Flight creation (not machine / cluster creation)
    2. Test filtering
    3. Creation of diagnostics / TAP files
  • Because of daisy-chained error results, distinguishing platform errors from legitimate test failures would require completely reworking the error handling of the entire project (and even then, I'm not sure I'd feel comfortable saying that a platform error isn't a legitimate failure in some cases).
  • I'd probably lean more towards what @ajeddeloh mentioned earlier in the thread and provide a distinct error code for test failures vs. general failures.
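
To make the split above concrete, here is a small shell sketch of a runner with that shape. It is not kola's Go code: the exit code values, `phase`, `run_test`, `run_suite`, and the `FAIL_PHASE` / `FAIL_TEST` variables are all hypothetical and exist only to simulate failures.

```shell
#!/bin/sh
# Hypothetical exit codes; the thread does not settle on actual values.
EXIT_GENERAL=1
EXIT_TESTS=2

# Stand-in for a runner phase: succeeds unless FAIL_PHASE names it.
phase() { [ "${FAIL_PHASE:-}" != "$1" ]; }

# Stand-in for a test. Machine setup and teardown happen inside the test,
# so a platform error shows up as an ordinary test failure.
run_test() { [ "${FAIL_TEST:-}" != "$1" ]; }

run_suite() {
    phase flight  || return "$EXIT_GENERAL"  # 1. Flight creation
    phase filter  || return "$EXIT_GENERAL"  # 2. Test filtering
    failed=0
    for t in "$@"; do
        run_test "$t" || failed=1
    done
    phase reports || return "$EXIT_GENERAL"  # 3. Diagnostics / TAP files
    [ "$failed" -eq 0 ] || return "$EXIT_TESTS"
}

# Simulate one failing test and report the resulting exit code.
FAIL_TEST=fcos.basic
rc=0
run_suite fcos.basic fcos.python || rc=$?
echo "exit code: $rc"
```

Only the three phases outside the tests can produce the general code; everything that goes wrong inside a test, including machine creation, ends up as the test-failure code.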

> I guess as it's set up right now, a missing qcow2 translates to failing tests. Whereas I would've expected kola to error out upfront before even running one test.

Are you advocating for adding platform-specific checks inside the runner? I'm not sure I agree with that. The Platform/API code knows how to interact with the specific platform to validate that parameters are correct and exist, while the runner just knows how to create Tests and run them. Machine creation is essentially a setup task for the test; I don't see this failure as any different from a test failing because its setup required connecting to a database that wasn't routable. Yes, in this specific case qemu-image happens to be a file on the local filesystem, but I don't think we should special-case the platform just because of that. This is essentially the same as kola being given an incorrect AMI ID or an invalid region + machine type configuration.

@arithx
Contributor

arithx commented Jan 9, 2020

Closed in #1153

arithx closed this as completed Jan 9, 2020
jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Jan 15, 2020
Now that coreos/mantle#1059 is fixed (see
coreos/mantle#1153), we can use the new
`--no-test-exit-error` switch to be more strict about kola error
handling.

This way, we immediately fail the build if something fundamental went
wrong with kola.
jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Jan 16, 2020
Now that coreos/mantle#1059 is fixed (see
coreos/mantle#1153), we can use the new
`--no-test-exit-error` switch to be more strict about kola error
handling.

This way, we immediately fail the build if something fundamental went
wrong with kola.