
WrappingEllipsoid.compute_enlargement not returning for some MPI processes in parallel #73

Open
lazygun37 opened this issue Aug 28, 2022 · 4 comments

@lazygun37

  • UltraNest version: 3.4.6
  • Python version: 3.8
  • Operating System: Ubuntu 20.04

Description

I'm using UltraNest in parallel mode to fit a (non-Gaussian) mixture model to a set of 1-D data points. The weights of the mixture components are constrained to sum to unity, so I'm using a Dirichlet prior. Things work fine when I run this on a single processor, but when I run it on, say, 10 processors, the program falls over.

What I Did

I currently run my code via openmpi:
mpirun.openmpi -np 10 --hostfile hostfile ./DD-SD-SUP_vectorized_fitfix_dirichlet.py

hostfile contains just a single line:
localhost slots=10

My machine has 12 physical cores, so this should be fine. Note that I've also tried mpich, with the same results.

The result of this is 10x the following:
Traceback (most recent call last):
  File "./DD-SD-SUP_vectorized_fitfix_dirichlet.py", line 1024, in <module>
    result = sampler.run(min_num_live_points=400)
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 2226, in run
    for result in self.run_iter(
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 2438, in run_iter
    region_fresh = self._update_region(
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 1998, in _update_region
    f = np.max(recv_enlarge)
  File "<__array_function__ internals>", line 180, in amax
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2791, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I've been able to trace the issue down a lot more, though. The underlying problem seems to be that the np.max command fails because recv_enlarge isn't the simple 1-D array containing 10 numbers that it should be. Instead, the entries that correspond to certain processes contain -- bizarrely -- the priors that were calculated by prior_transform. So the shape of recv_enlarge is wrong, and hence the error.

Tracking this down further shows that the "gather" command doesn't return correctly for some processes (and for some reason doesn't block progress as it should). This in turn can be traced to tregion.compute_enlargement not returning at all for some processes.
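For reference, here is a minimal mpi4py sketch (my own illustration, not the UltraNest source; the numbers are made up) of what I would expect around that gather: every rank contributes a single enlargement factor, so recv_enlarge should end up as a flat list of 10 floats that np.max can reduce.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# stand-in for the per-process enlargement factor
enlarge = 1.0 + 0.01 * comm.rank

# root collects one float per rank, then shares the list with everyone
recv_enlarge = comm.gather(enlarge, root=0)
recv_enlarge = comm.bcast(recv_enlarge, root=0)

# with 10 ranks this is a list of 10 floats and the reduction is well defined;
# in my runs, some entries are instead whole arrays from prior_transform
f = np.max(recv_enlarge)
print(comm.rank, f)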

What's even more crazy is this: if I make prior_transform return the array that was passed into it -- i.e. untransformed random numbers -- the above problems do not occur. But I have checked that the transformed arrays that are actually returned have the same shape, and reasonable entries, in all cases.

@JohannesBuchner
Owner

JohannesBuchner commented Aug 29, 2022

Thanks for reporting this and tracking it down.

On the last part: did you return the passed array object itself, or a copy of it? It may make a difference (a reference to the same object vs. a new object).

Can you print what recv_enlarge contains exactly?

This is very odd indeed. Is it possible that the MPI commands somehow got out of sync? I wonder if it would be possible to make a concise test case that triggers the bug, and report it upstream.

Maybe it could help to print out the shape of the MPI arrays passed before every MPI call, to find out which code segment injects the prior values that are then received elsewhere?
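Something along these lines might do it (a hypothetical helper, not existing UltraNest code; the label strings are just illustrative):

import numpy as np

def debug_mpi_payload(comm, label, obj):
    # print which rank is about to send what (shape for arrays, repr otherwise)
    desc = np.shape(obj) if isinstance(obj, np.ndarray) else repr(obj)
    print("rank %d: %s -> %s" % (comm.rank, label, desc), flush=True)
    return obj

# e.g. wrapped around the object passed to each gather/bcast call:
#   recv_enlarge = self.comm.gather(debug_mpi_payload(self.comm, "enlarge", enlarge), root=0)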

Which MPI implementation are you using? To circumvent the bug, maybe try switching to another implementation. Edit: I just saw that you already use openmpi and have tried mpich.

@lazygun37
Author

Good suggestion about copy vs actual passed array. It turns out the issue is maybe even more bizarre than I thought. After messing around with things for a bit, I ended up with this sort of structure in my prior_transform:

def prior_transform(uni_rands):

    # ... body of the function transforms the uniform random numbers into
    #     Dirichlet-distributed numbers and stuffs them into an array called par ...

    assert np.shape(par) == np.shape(uni_rands)

    w = 1.e-7
    par2 = (1.0 - w)*par + w*uni_rands

    return par2

It appears that this always works fine for "sufficiently large" w -- but that can be really small. In fact, it's worked even for w = 0!

Yet if I simply set par2 = par, it never works, and as near as I can tell, even setting just par2 = 1.0*par never works -- even though presumably that should be the same as w = 0.

I have no idea what's going on there, but I guess I at least have some sort of workaround now -- i.e. for sufficiently small w, I guess I don't really care...
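On your copy-vs-reference question: if I understand it correctly, the distinction is object identity rather than values, roughly like this (a toy illustration, not my actual code):

import numpy as np

par = np.array([0.2, 0.3, 0.5])      # pretend output of the Dirichlet transform

alias = par                          # "par2 = par": the very same object
copy = par.copy()                    # a new array with identical values

print(alias is par)                  # True
print(copy is par)                   # False
print(np.array_equal(copy, par))     # True

So maybe returning par.copy() at the end of prior_transform would be a cleaner thing to test than the (1 - w) blend; I haven't tried that yet.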

I still hope to take a look at your other suggestions as well. Any other ideas would be hugely welcome, of course.

@lazygun37
Author

By the way: did you have any thoughts on what could cause tregion.compute_enlargement to fail to return for some processes? I'm pretty sure that is the underlying issue. E.g. if I stick an explicit comm.Barrier() in before the gather/bcast stuff, I can prevent the errors from happening, at the expense of the code hanging. And the hang is definitely because some processes never get out of the tregion.compute_enlargement call.
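One way I can think of to localise it further is to bracket the call with per-rank prints, roughly like this (a debugging sketch only; traced is my own hypothetical helper, and monkey-patching the bound method is just for diagnosis):

import functools
from mpi4py import MPI

def traced(label, fn):
    # wrap a callable so each MPI rank prints when it enters and leaves it
    rank = MPI.COMM_WORLD.rank
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print("rank %d: entering %s" % (rank, label), flush=True)
        out = fn(*args, **kwargs)
        print("rank %d: returned from %s" % (rank, label), flush=True)
        return out
    return wrapper

# e.g. just before the existing call site in _update_region:
#   tregion.compute_enlargement = traced("compute_enlargement", tregion.compute_enlargement)

The idea is that any rank stuck inside the call would print "entering" but never "returned".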

@JohannesBuchner
Owner

Can you find out what this line

self.build_tregion = not is_affine_transform(active_u, active_v)

is doing in each process?

The behaviour could be explained if it is set differently in some processes; then the if active_p is None or not self.build_tregion: branch would be entered inconsistently.
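Something like this next to that line should make a mismatch obvious (assuming the sampler's MPI communicator is reachable as self.comm; the attribute names here are only illustrative):

flags = self.comm.allgather(self.build_tregion)
if self.comm.rank == 0:
    print("build_tregion per rank:", flags, flush=True)

If the flags disagree, forcing them into agreement, e.g. self.build_tregion = any(flags), would at least keep the ranks in sync as a test (not a proper fix).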
