Parallelization only part-time #204
Hi, my guess is that some of your external function calls take much longer than others, and thus the parallelized for loop waits until all calls have finished.
OK, so do all 8 runs always have to be completed before the next 8 are started? Is it possible that it does not block and the next ones already start? I have found: "OpenMP will automatically wait for all threads to finish before execution continues."
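For reference, a tiny OpenMP toy program (nothing libcmaes-specific; the sleep durations are arbitrary stand-ins for evaluations of very different lengths) that shows the implicit barrier the quote refers to: the statement after the loop only runs once every iteration, including the slowest one, has finished.

```cpp
// Toy illustration: an OpenMP for loop has an implicit barrier at its end.
#include <chrono>
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  omp_set_num_threads(8);
  auto t0 = std::chrono::steady_clock::now();
#pragma omp parallel for
  for (int i = 0; i < 8; ++i) {
    // stand-in for a fitness evaluation with a highly variable duration
    std::this_thread::sleep_for(std::chrono::milliseconds(100 * (i + 1)));
  }
  // reached only after all 8 iterations are done (~800 ms here), even though
  // most threads were idle long before that
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "wall time: "
            << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
            << " ms\n";
  return 0;
}
```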
If you'd like to reacquire the threads so that they can be used somehow by the still ongoing eval function calls, you may want to try using
Inside the FitFunc, execution is blocked until the return value is calculated. A thread is started inside the FitFunc, but the FitFunc blocks until that thread is finished and its return value is computed. So the FitFunc is not bypassed and always delivers its return value. But I think I found the reason: it's related to the value of lambda. So now I think the effect has to do with the CMA-ES algorithm (the value of lambda) and not with OpenMP.
Well, show us some code.
The code doesn't matter, because anything that takes some time will show this effect. You can put a random sleep into the function (e.g. one second to two minutes) and the same thing will happen. I think the problem is that the LIBCMAES algorithm wants to finish the full number of lambda evaluations before going any further. It seems to block until a population is completely done. In the last runs of a population only a few threads, or just one, are busy until a new population can start. This effect can be seen in the graphic above and prevents the CPU from being fully used, which slows down the total optimization time very much. As an example, there is another optimization algorithm, Simulated Annealing, that also uses OpenMP but doesn't block anything. Would it be possible to design CMA-ES so that the multithreading is not limited? LIBCMAES is a very effective algorithm and can find the global optimum in a small number of steps, but this advantage is nullified because the parallelization is only part-time.
No.
Without knowledge of the implementation details (that is, without knowing how simple, complicated or impossible it may be to implement), I see, off the top of my head, two "simple" ways to go about this:
In principle, CMA-ES is robust to small changes in lambda. A note on the above figure: to me it suggests (if my interpretation is correct) that the loss in time is less than a factor of two: the reddish area is more than half of the white area.
I guess that theoretically (and @nikohansen can validate/invalidate), you should be able to hold on to the values of points for which the simulation takes longer, and reinject them back in when their f-value becomes available. If this is a valid option, then you may be able to hack this from the libcmaes code without too much trouble, approximately like this:
In all cases, we can only guide you here with directions on how to hack libcmaes so that it best serves your purpose; you'll have to code it up.
I hope I got it. So if something takes too long, it'll continue independently, and to keep the for-loop in libcmaes going I have to return a pseudo-value, right? When the run is finished in the background, I will overwrite this pseudo value in the libcmaes pool, right? How can I access the pool? And is there an id to identify the correct entry?
You'd need to build a pool outside of libcmaes: pool <-> fitfunc <-> libcmaes. You'd have to maintain whatever id you need. It's sketchy but it may work.

@nikohansen hello! What would it mean for CMA-ES to deal with asynchronous f-values, meaning having CMA-ES sample a bunch of points but getting the point values back asynchronously? The distribution at time t would use whatever points are available, possibly from past sampling rounds, but every sampled point would eventually be considered. Would that hamper convergence?
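To make that idea concrete, here is a rough, hypothetical sketch of such an external pool. None of the names below (ResultPool, slow_eval, the 1e9 placeholder) are libcmaes API, and the harder part, reinjecting the delayed f-values into the optimizer, is not shown.

```cpp
#include <chrono>
#include <future>
#include <map>
#include <mutex>
#include <vector>

#include "cmaes.h"  // for the FitFunc typedef
using namespace libcmaes;

// Stand-in for the expensive simulation.
static double slow_eval(std::vector<double> x) {
  double val = 0.0;
  for (double xi : x) val += xi * xi;
  return val;
}

// Thread-safe pool of asynchronous evaluations, keyed by the parameter vector.
class ResultPool {
public:
  // If the evaluation of x has finished, write it to value and return true.
  // Otherwise launch it (once) in the background and return false.
  bool lookup_or_launch(const std::vector<double>& x, double& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = done_.find(x);
    if (it != done_.end()) { value = it->second; return true; }
    auto p = pending_.find(x);
    if (p == pending_.end()) {
      pending_[x] = std::async(std::launch::async, slow_eval, x);
    } else if (p->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
      value = done_[x] = p->second.get();
      pending_.erase(p);
      return true;
    }
    return false;
  }
private:
  std::mutex mutex_;
  std::map<std::vector<double>, double> done_;
  std::map<std::vector<double>, std::future<double>> pending_;
};

static ResultPool pool;

// Fitness function that never blocks on a slow evaluation: it returns a pseudo
// value immediately; the real value becomes available on a later lookup.
FitFunc fitfunc = [](const double* x, const int N) {
  std::vector<double> xv(x, x + N);
  double value;
  if (pool.lookup_or_launch(xv, value))
    return value;
  return 1e9;  // placeholder while the background evaluation is still running
};
```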
OK, I can start a separate, independent thread in FitFunc that gets the parameters, and then have FitFunc return a pseudo value so that the libcmaes loop does not block. Something like: FitFunc fsphere = [](const double *x, const int N) { /* start thread(), return pseudo value */ };
@beniz Hi! Right, I didn't think about this. It is possible. It is hard to predict how reliable it is (it should usually work, but what do I know). What needs to be available is something akin to an ask-and-tell interface. If I find a simple way to mimic the situation in a test function, I may check out how robust the Python module is when faced with such delayed evaluations.
I don't see how one can get around this easily. If 500 CPUs are populated from the current distribution, the update will reflect the current distribution for 500 evaluations. It seems impossible to mimic 10 iterations without updating in between the samplings. I usually consider running several independent optimizations in parallel in this scenario. If setting lambda=500 is not a viable option, it is hard to see how to fully exploit the available CPUs otherwise.
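For illustration, a minimal sketch of the "several independent optimizations in parallel" option. It assumes that separate CMAParameters/cmaes<>() instances can safely run in different threads at the same time, which is worth verifying against libcmaes; the sphere objective, dimension and number of runs are placeholders.

```cpp
#include <iostream>
#include <thread>
#include <vector>

#include "cmaes.h"
using namespace libcmaes;

// Placeholder objective; in practice this is the expensive fitness function.
FitFunc fsphere = [](const double* x, const int N) {
  double val = 0.0;
  for (int i = 0; i < N; ++i) val += x[i] * x[i];
  return val;
};

int main() {
  const int nruns = 4;  // number of independent runs, chosen arbitrarily
  const int dim = 10;   // placeholder dimension
  std::vector<double> best(nruns, 0.0);
  std::vector<std::thread> workers;
  for (int r = 0; r < nruns; ++r) {
    workers.emplace_back([&best, r, dim]() {
      std::vector<double> x0(dim, 1.0);
      CMAParameters<> cmaparams(x0, 0.1);  // each run has its own parameter object
      CMASolutions cmasols = cmaes<>(fsphere, cmaparams);
      best[r] = cmasols.best_candidate().get_fvalue();
    });
  }
  for (auto& w : workers) w.join();
  for (int r = 0; r < nruns; ++r)
    std::cout << "run " << r << " best f-value: " << best[r] << "\n";
  return 0;
}
```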
A small update: if older solutions are fed back, it seems advisable to turn off active CMA (which updates the covariance matrix from bad solutions with negative weights), for understandable reasons. This is the code I used to make a quick experiment:
```python
import numpy as np
import cma


class LaggingFitness:
    """delivers X, F like the input argument to tell"""

    def __init__(self, f, fraction, X):
        """`fraction` is the fraction of solutions with delayed evaluation"""
        self.f = f
        self.fraction = fraction
        self.X = [np.array(x) for x in X]

    def __call__(self, X):
        """return X_out, F, where X_out != X"""
        Xout = []
        lenX = len(X)
        while len(Xout) < self.fraction * lenX:
            # TODO: we may want to choose the delay by fitness value rather than uniform
            Xout += [self.X.pop(np.random.randint(len(self.X)))]
        while len(Xout) < lenX:
            Xout += [X.pop(np.random.randint(len(X)))]
        self.X += X  # keep the rest for later
        return Xout, [self.f(x) for x in Xout]


es = cma.CMAEvolutionStrategy(11 * [1], 1, {'ftarget': 1e-9, 'CMA_active': False})
fit = LaggingFitness(cma.ff.elli, 0.8, es.ask() + es.ask()[:1])
while not es.stop():
    X = es.ask()
    X, F = fit(X)
    es.tell(X, F)
    es.disp()
    es.logger.add()
cma.plot()
```
Is your fitness function itself parallelizable? I went that route with OpenMP and turned multithreading off for CMA-ES. That way the level of parallelism would not be dependent on the number of samples taken every iteration.
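A minimal sketch of that alternative, assuming the per-candidate work can be split into independent chunks (the inner loop below is only a stand-in for a real workload):

```cpp
#include <vector>
#include <omp.h>

#include "cmaes.h"
using namespace libcmaes;

// The objective itself parallelizes its inner work with OpenMP, so the degree of
// parallelism no longer depends on how many candidates libcmaes evaluates per
// iteration. The chunked loop is only a placeholder for a splittable workload.
FitFunc parallel_objective = [](const double* x, const int N) {
  double val = 0.0;
#pragma omp parallel for reduction(+:val)
  for (int i = 0; i < N; ++i) {
    val += x[i] * x[i];  // stand-in for one independent chunk of the evaluation
  }
  return val;
};

int main() {
  omp_set_dynamic(false);
  omp_set_num_threads(8);

  std::vector<double> x0(10, 1.0);
  CMAParameters<> cmaparams(x0, 0.1);
  cmaparams.set_mt_feval(false);  // single-threaded candidate loop; the threads live inside the objective
  CMASolutions cmasols = cmaes<>(parallel_objective, cmaparams);
  return cmasols.run_status();
}
```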
Yes, that's true. If the number of CPU cores becomes too high in relation to lambda, then it will not work; maybe from about a factor of 3 on it will not make sense anymore. If you increase lambda, the CPU usage is better, but then the total time will be longer.
Yes, that would be a good idea, but in my case it is not so easy to parallelize the function itself, because the data is processed serially.
Hello,
When I want multithreaded parallel function evaluations on 8 threads, I do this:
```cpp
omp_set_dynamic(false);        // omp setting
omp_set_num_threads(8);        // omp setting
cmaparams.set_mt_feval(true);  // libcmaes setting
```
This should activate the following in esostrategy.cc:
```cpp
#pragma omp parallel for if (_parameters._mt_feval)
```
I have very expensive functions, and some parameter sets take much longer to evaluate than others. The problem is that often not all threads are used. Overall, this makes the optimization take much longer, because the CPU is often used at only 1/8 of its capacity. So the effect of the parallelization is only part-time.
It gives the impression that the parallelization is interrupted as soon as a "slow function execution" occurs somewhere in the overall process, as if the execution of new threads had to wait.
What is the reason for this, and is there a possibility that all 8 threads are always used so that the CPU runs at full capacity? This would significantly accelerate the total optimization process.
Greetings