
kola --parallel broken with qemu #1160

Closed
cgwalters opened this issue Jan 10, 2020 · 12 comments · May be fixed by coreos/mantle#1161

@cgwalters (Member)
In my test FCOS pipeline, one of the nodes ran OOM. Looking at it, we're spawning too many qemu instances. If I do cosa kola run -- --parallel 2 locally, I get 8 qemu processes.

Yet I only see 1 concurrent qemu process if I don't specify --parallel. So the bug seems to occur when parallel > 1?

I started reading the harness code around parallelism but my eyes glazed over - so many goroutines and barriers and mutexes and lions and tigers and bears...

@cgwalters (Member Author)

I don't think this is actually specific to qemu of course. Probably we just don't notice it as much.

Hum...could this have something to do with subtests?

@cgwalters (Member Author)

Ah, I see: this code came from coreos/mantle@d5f50b5 - it was forked from upstream Go's testing code, so of course they're going to be all "channels solve everything!".

Ah, and golang/go@f04d583#diff-70e628298261565d825f7199d13042f2 looks relevant here.

And it's Go, so of course there's a pile of race conditions with unstructured goroutines.

Blah.

@arithx (Contributor)

arithx commented Jan 10, 2020

> If I do cosa kola run -- --parallel 2 locally, I get 8 qemu processes.

Notably, the parallel flag controls test count, not machine count, so you could theoretically end up with a much larger machine count than the value of the parallel flag if the tests in question require multiple machines (though in practice I don't believe we have a test with more than 3 machines, and it might be fewer now after dropping the legacy CL tests).

> I started reading the harness code around parallelism but my eyes glazed over - so many goroutines and barriers and mutexes and lions and tigers and bears...

Yeah, it's not pretty, unfortunately; the runner README might help a bit.

@cgwalters (Member Author)

cgwalters commented Jan 10, 2020

So...hum. How about --max-flights N which ensures that at most N Flight instances are created at a time?

@arithx (Contributor)

arithx commented Jan 10, 2020

> So...hum. How about --max-flights N which ensures that at most N Flight instances are created at a time?

Flights can have multiple clusters, and clusters can have multiple machines. If you're looking to keep to a particular maximum instance count, it'll be very difficult, as individual tests can spawn additional machines beyond the declared amount (tests like the NFS ones do this to create machines with different configs).

@jlebon (Member)

jlebon commented Jan 13, 2020

> Yet, I only see 1 concurrent if I don't specify --parallel. So the bug seems to occur when parallel > 1?

Hmm, weird. I'm not even seeing that. I just did a simple cosa kola and ended up with 19 qemu processes! Looking closely, it seems like the qemu processes aren't actually being reaped after their associated tests complete.

Also seeing this with e.g. --parallel 4.

@cgwalters (Member Author)

> Looking closely it seems like the qemu processes aren't actually being reaped after their associated tests complete.

You see ones in Z state? I'm not seeing that here.

@cgwalters (Member Author)

cgwalters commented Jan 13, 2020

> Flights can have multiple clusters, and clusters can have multiple machines. If you're looking to keep to a particular maximum instance count it'll be very difficult as individual tests can spawn up additional machines outside of the declared amount (this is done in tests like the NFS ones to allow them to create machines with different configs)

OK. Well, we can at least add --qemu-max-machines, right? And implement it with a semaphore inside unprivqemu.

@cgwalters (Member Author)

Or a generic --max-machines would be straightforward too, I think - just a mechanical change.

cgwalters referenced this issue in cgwalters/mantle Jan 14, 2020
This is only implemented for qemu at the moment, though it'd
be a mostly mechanical change to propagate it to the other
providers.

For our pipeline testing, we need to have a hard cap on the number
of qemu instances we spawn, otherwise we can go over the RAM
allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and
my test pipeline in the coreosci (nested GCP virt) ended up bringing
down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned
qemu instance.  We also need to invoke `Wait()` synchronously in
destruction.

Then, add a dependency on the `golang/x/semaphore` library, and
use it to implement a max limit.

Closes: https://github.com/coreos/mantle/issues/1157
@cgwalters (Member Author)

PR in coreos/mantle#1161

@cgwalters (Member Author)

OK, so yeah, we were definitely leaking qemu instances. And that may have been what was provoking a kernel bug, at least on RHEL7, that was taking down the RHCOS pipeline.

cgwalters referenced this issue in cgwalters/mantle Jan 15, 2020
@cgwalters cgwalters transferred this issue from coreos/mantle Feb 27, 2020
@nikita-dubrovskii nikita-dubrovskii self-assigned this Oct 12, 2023
@nikita-dubrovskii (Contributor)

Tested locally on x86, everything works as expected for any given N:

# cosa kola run -- --parallel N 
# pidof qemu-system-x86_64 | wc -w
N
