
kola --parallel broken with qemu #1160

Closed
cgwalters opened this issue Jan 10, 2020 · 12 comments · May be fixed by coreos/mantle#1161

@cgwalters (Member)
In my test FCOS pipeline, one of the nodes ran OOM. Looking at it, we're spawning too many qemu instances. If I do cosa kola run -- --parallel 2 locally, I get 8 qemu processes.

Yet I only see 1 concurrent qemu process if I don't specify --parallel. So the bug seems to occur when parallel > 1?

I started reading the harness code around parallelism but my eyes glazed over - so many goroutines and barriers and mutexes and lions and tigers and bears...

@cgwalters (Member Author)

I don't think this is actually specific to qemu of course. Probably we just don't notice it as much.

Hum...could this have something to do with subtests?

@cgwalters (Member Author)

Ah, I see: this code came from coreos/mantle@d5f50b5 - it was forked from upstream Go's testing code, so of course they're going to be all "channels solve everything!".

Ah, and golang/go@f04d583#diff-70e628298261565d825f7199d13042f2 looks relevant here.

And it's Go, so of course there's a pile of race conditions with unstructured goroutines.

Blah.

@arithx (Contributor)

arithx commented Jan 10, 2020

> If I do cosa kola run -- --parallel 2 locally, I get 8 qemu processes.

Notably, the parallel flag controls test count, not machine count, so you could theoretically end up with a much larger machine count than the value of the parallel flag if the tests in question require multiple machines (though in practice I don't believe we have a test with more than 3 machines, and it might be fewer now after dropping the legacy CL tests).

> I started reading the harness code around parallelism but my eyes glazed over - so many goroutines and barriers and mutexes and lions and tigers and bears...

Yeah, it's not pretty, unfortunately; the runner README might help a bit.

@cgwalters (Member Author)

cgwalters commented Jan 10, 2020

So...hum. How about --max-flights N which ensures that at most N Flight instances are created at a time?

@arithx (Contributor)

arithx commented Jan 10, 2020

> So...hum. How about --max-flights N which ensures that at most N Flight instances are created at a time?

Flights can have multiple clusters, and clusters can have multiple machines. If you're looking to keep to a particular maximum instance count, it'll be very difficult, as individual tests can spawn additional machines beyond the declared amount (tests like the NFS ones do this to create machines with different configs).

@jlebon (Member)

jlebon commented Jan 13, 2020

> Yet, I only see 1 concurrent if I don't specify --parallel. So the bug seems to occur when parallel > 1?

Hmm, weird. I'm not even seeing that. I just did a simple cosa kola and ended up with 19 qemu processes! Looking closely, it seems like the qemu processes aren't actually being reaped after their associated tests complete.

Also seeing this with e.g. --parallel 4.

@cgwalters (Member Author)

> Looking closely it seems like the qemu processes aren't actually being reaped after their associated tests complete.

You see ones in Z state? I'm not seeing that here.

@cgwalters (Member Author)

cgwalters commented Jan 13, 2020

> Flights can have multiple clusters, and clusters can have multiple machines. If you're looking to keep to a particular maximum instance count it'll be very difficult as individual tests can spawn up additional machines outside of the declared amount (this is done in tests like the NFS ones to allow them to create machines with different configs)

OK. Well, we can at least add --qemu-max-machines, right? And implement it with a semaphore inside unprivqemu.

@cgwalters (Member Author)

Or a generic --max-machines would be straightforward too, I think - just a mechanical change.

cgwalters referenced this issue in cgwalters/mantle Jan 14, 2020
This is only implemented for qemu at the moment, though it'd
be a mostly mechanical change to propagate it to the other
providers.

For our pipeline testing, we need to have a hard cap on the number
of qemu instances we spawn, otherwise we can go over the RAM
allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and
my test pipeline in the coreosci (nested GCP virt) ended up bringing
down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned
qemu instance.  We also need to invoke `Wait()` synchronously in
destruction.

Then, add a dependency on the `golang/x/semaphore` library, and
use it to implement a max limit.

Closes: https://github.com/coreos/mantle/issues/1157
@cgwalters (Member Author)

PR in coreos/mantle#1161

@cgwalters (Member Author)

OK, so yeah, we were definitely leaking qemu instances. And that may have been what was provoking a kernel bug, at least on RHEL7, that was taking down the RHCOS pipeline.

cgwalters referenced this issue in cgwalters/mantle Jan 15, 2020
@cgwalters cgwalters transferred this issue from coreos/mantle Feb 27, 2020
@nikita-dubrovskii nikita-dubrovskii self-assigned this Oct 12, 2023
@nikita-dubrovskii (Contributor)

Tested locally on x86, everything works as expected for any given N:

# cosa kola run -- --parallel N 
# pidof qemu-system-x86_64 | wc -w
N
