strange hanging behaviour with pmap #5065
Comments
@mattjj any ideas what this could be, and if there's something I can do to force …
I guess this doesn't receive attention because it's too long and it's not clear where the problem stems from. I'm closing.
We would like to look into it, but we're busy, and anything you can do to minimize the problem would help us out!
Rereading this, I don't think there's enough information to debug it. Can you give us something that we can run? I suspect you can strip out most of the math; it's probably only the communication structure that's relevant.
@hawkinsp Hi Peter, I wasn't able to isolate it because of #5117. I tried to isolate a small part of the code that replicated the problem. I would try to use …
This is an issue for me as well. I'd appreciate any help resolving it.
It seems like when `pmap` is called repeatedly it hangs at some point. This point seems to be deterministic.

context

I'm using `pmap` (not `soft_pmap`) to distribute tasks over devices. I've created a `chunked_pmap` to do this:
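Roughly, the helper does something like the following (a minimal sketch only, assuming a plain Python loop over `chunksize`-sized chunks of the leading axis; the real `chunked_pmap` may differ in its details):

```python
import jax
import jax.numpy as jnp

def chunked_pmap(f, *args, chunksize=1):
    """Apply pmap(f) to the leading axis of *args, chunksize rows at a time."""
    pmapped_f = jax.pmap(f)
    n = args[0].shape[0]  # leading-axis length, assumed equal for all args
    results = []
    for start in range(0, n, chunksize):
        # Each pmap call handles `chunksize` problems, one per device.
        chunk = [a[start:start + chunksize] for a in args]
        results.append(pmapped_f(*chunk))
    # Stitch the per-chunk outputs back together along the leading axis.
    return jnp.concatenate(results, axis=0)
```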
problem specifics

I am using `chunked_pmap` to distribute moderate workloads to devices. I have a function `unconstrained_solve` and arguments for it that get distributed by calling `chunked_pmap` on them (see the sketch after this paragraph). When I run this with `chunksize=1` (so that it's just sequentially running problems), it hangs at chunk index 223 (out of many more).
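The call looks roughly like this (the argument names `data` and `init_params` are placeholders, not the actual arrays from my setup):

```python
# Distribute unconstrained_solve over the leading axis of the arguments,
# `chunksize` problems per pmap call (chunksize=1 here).
results = chunked_pmap(unconstrained_solve, data, init_params, chunksize=1)
```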
diagnosis effort

Here I describe diagnosis angles and seek your input and help.

Is it a data/function problem or a `pmap` problem?

It seems to be a `pmap` problem. The problems being distributed by `pmap` run fine on their own: I have run `unconstrained_solve` on each slice of data and it computes nominally, i.e. a plain sequential loop over the slices (sketched below) works fine as expected.
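A check along these lines completes without any hang (again using the placeholder argument names from above):

```python
# Run the solve sequentially, one slice at a time, with no pmap involved.
for i in range(data.shape[0]):
    result = unconstrained_solve(data[i], init_params[i])
```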
Is it dependent on `chunksize`?

The point at which it hangs is dependent on `chunksize`. I have 8 cores on my CPU, and here's what I get when I try different `chunksize`s:

- When `chunksize=1` it hangs on slice 223:224.
- When `chunksize=2` it hangs on slice 226:228.
- When `chunksize=3` it hangs on slice 315:318.
- When `chunksize=4` it hangs on slice 232:236.
- When `chunksize=5` it hangs on slice 235:240.
- When `chunksize=6` it hangs on slice 408:414.
- When `chunksize=7` it hangs on slice 441:448.
- When `chunksize=8` it hangs on slice 472:480.
Is there something special about those slices where it hangs?
I tried starting the distribution at later slices, but before the problematic slices (a sketch of the offset start is below), and I get these peculiar behaviours:

With `chunksize=1` the problem was at slice 223:224, so I tried starting at slice 221:222. This goes past 223 and hangs at 570:571. However, if I start at 222:223 (one later) then it makes it to 569:570 (one behind).
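The "starting later" experiment amounts to dropping the first slices of every argument before distributing, e.g. (placeholder names again):

```python
# Start the distribution at slice 221 instead of 0.
start = 221
results = chunked_pmap(unconstrained_solve, data[start:], init_params[start:],
                       chunksize=1)
```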
Other information

These numbers are consistent: I've run the same code ~10 times and get the same hanging slices, so it seems to be deterministic (less likely to be a race condition).

The functions being distributed take approximately 0.1 seconds each to run.