kola --parallel broken with qemu #1160
I don't think this is actually specific to qemu of course. Probably we just don't notice it as much. Hum...could this have something to do with subtests?
Ah I see, this code came from coreos/mantle@d5f50b5 - it was forked from the golang upstream, so of course they're going to be all "channels solve everything!". Ah, and golang/go@f04d583#diff-70e628298261565d825f7199d13042f2 looks relevant here. And it's Go so of course there's a pile of race conditions with unstructured goroutines, like
Blah.
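For flavor, here is a generic illustration of the kind of unstructured-goroutine race being complained about. This is hypothetical code, not the mantle harness: goroutines fired off with no join point, mutating shared state without synchronization.

```go
package main

import "fmt"

// Illustrative only, not the actual mantle/harness code: nothing waits for
// the goroutines, and the shared counter is written without synchronization,
// so both the final value and whether the work ran at all are undefined.
func main() {
	launched := 0
	for i := 0; i < 8; i++ {
		go func() {
			launched++ // data race: unsynchronized read-modify-write
		}()
	}
	fmt.Println("launched:", launched) // may print anything from 0 to 8; `go run -race` flags it
}
```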
Notably the
Yeah, it's not pretty unfortunately; the runner readme might help a bit.
So...hum. How about
Flights can have multiple clusters, and clusters can have multiple machines. If you're looking to keep to a particular maximum instance count, it'll be very difficult, as individual tests can spawn additional machines beyond the declared amount (this is done in tests like the NFS ones to let them create machines with different configs).
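To make that concrete, here is a much-simplified, hypothetical sketch (none of these types are mantle's actual API) of why a cap computed from the machines a test declares up front undercounts what actually runs:

```go
package main

import "fmt"

// Hypothetical sketch, not mantle's real API: a test is handed a cluster and
// may create more machines at runtime, so the only reliable place to enforce
// a limit is at machine-creation time, not from the declared count.
type cluster struct{ machines int }

func (c *cluster) NewMachine() { c.machines++ }

// Modeled loosely on the NFS-style tests mentioned above: one declared
// machine, plus extra machines spawned by the test with different configs.
func runNFSLikeTest(c *cluster) {
	c.NewMachine() // the declared machine
	c.NewMachine() // NFS server with its own config
	c.NewMachine() // client helper
}

func main() {
	c := &cluster{}
	runNFSLikeTest(c)
	fmt.Println("actually spawned:", c.machines) // 3, even though 1 was declared
}
```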
Hmm, weird. I'm not even seeing that. I just did a simple
Also seeing this with e.g.
You see ones in
OK. Well, we can add at least
Or a generic
This is only implemented for qemu at the moment, though it'd be a mostly mechanical change to propagate it to the other providers. For our pipeline testing, we need a hard cap on the number of qemu instances we spawn, otherwise we can go over the RAM allocated to the pod. Actually, the FCOS pipeline today doesn't impose a hard cap, and my test pipeline in the coreosci (nested GCP virt) ended up bringing down the node via the OOM killer. There were a few bugs here: first, we were leaking the spawned qemu instance. We also need to invoke `Wait()` synchronously in destruction. Then, add a dependency on the `golang/x/semaphore` library, and use it to implement a max limit. Closes: https://github.com/coreos/mantle/issues/1157
PR in coreos/mantle#1161
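Roughly, the cap could look like the sketch below. This assumes the library meant by `golang/x/semaphore` is `golang.org/x/sync/semaphore`; the wrapper function, the limit of 4, and the qemu binary name are all hypothetical, not the actual PR code.

```go
package main

import (
	"context"
	"os/exec"

	"golang.org/x/sync/semaphore"
)

// Hypothetical wrapper, not the actual mantle change: every qemu spawn takes
// a slot from a weighted semaphore, so no more than maxInstances processes
// can be alive at once, no matter how many tests ask for machines.
const maxInstances = 4 // arbitrary value for the sketch

var qemuSem = semaphore.NewWeighted(maxInstances)

func spawnQemu(ctx context.Context, args ...string) (*exec.Cmd, func() error, error) {
	// Block until a slot frees up (or the context is cancelled).
	if err := qemuSem.Acquire(ctx, 1); err != nil {
		return nil, nil, err
	}
	cmd := exec.CommandContext(ctx, "qemu-system-x86_64", args...)
	if err := cmd.Start(); err != nil {
		qemuSem.Release(1)
		return nil, nil, err
	}
	// Callers must invoke destroy(); it reaps the child synchronously
	// (the Wait() fix) and only then returns the semaphore slot.
	destroy := func() error {
		_ = cmd.Process.Kill()
		werr := cmd.Wait()
		qemuSem.Release(1)
		return werr
	}
	return cmd, destroy, nil
}

func main() {} // real callers would live in the kola harness
```

Passing the caller's context to `Acquire` means a cancelled run doesn't sit forever waiting for a free slot.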
OK so yeah we were definitely leaking qemu instances. And that may have been what was provoking a kernel bug at least on RHEL7 that was taking down the RHCOS pipeline.
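In miniature, the leak looks something like this (hypothetical stand-in code, not the kola teardown path; `sleep` plays the role of qemu):

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Stand-in for spawning a qemu instance.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Buggy teardown: kill the child but never Wait() for it. The process
	// lingers as a <defunct> zombie (or, if the kill is skipped too, keeps
	// running and holding RAM), which is exactly a leaked instance.
	_ = cmd.Process.Kill()

	// The fix from the PR description: Wait() synchronously in destruction,
	// so the child is reaped before the caller carries on.
	if err := cmd.Wait(); err != nil {
		log.Printf("child reaped: %v", err) // typically "signal: killed"
	}
}
```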
Tested locally on x86, everything works as expected for any given
In my test FCOS pipeline, one of the nodes ran OOM. Looking at it, we're spawning too many qemu instances. If I do `cosa kola run -- --parallel 2` locally, I get 8 qemu processes. Yet I only see 1 concurrent qemu process if I don't specify `--parallel`. So the bug seems to occur when parallel > 1?

I started reading the harness code around parallelism but my eyes glazed over - so many goroutines and barriers and mutexes and lions and tigers and bears...