Wasserstein on Ubuntu 18 within Docker - Linux on Power (RHEL 7) #29
Thanks for the report! @yossibokor may have some insight.
Yes, thanks for the report! @IBMRK, computing the Wasserstein distance between two persistence diagrams is computationally expensive, particularly as the number of points increases. Given the size of your data sets, I am not surprised by the increased run time. To rule out other issues, could you check how long it takes to calculate the (2,2)-Wasserstein distance between either BC_Sky_COM or BC_NL_50G and the persistence diagram containing a single point, (1,1)?
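As a sanity check for the single-point test suggested above, here is a minimal Python sketch (not Eirene code). It assumes the standard definition of the Wasserstein distance between persistence diagrams, in which any point may be matched to the diagonal. Since (1,1) lies on the diagonal, matching it costs nothing, so the distance reduces to the combined cost of sending every point of the other diagram to its nearest diagonal point. The function name, the example diagram, and the reading of (p, q) as "exponent p, ground norm q" are assumptions made for illustration; whether Eirene's wasserstein_distance follows exactly this convention is not confirmed here.

```python
import math


def wasserstein_to_diagonal_point(diagram, p=2.0, q=2.0):
    """(p, q)-Wasserstein distance from `diagram` to a diagram holding a
    single diagonal point such as (1, 1), under the standard definition.

    Assumes p is finite; q may be math.inf.
    """
    total = 0.0
    for birth, death in diagram:
        gap = abs(death - birth)
        # q-norm distance from (birth, death) to the nearest diagonal point,
        # which is the midpoint ((birth + death) / 2, (birth + death) / 2).
        if math.isinf(q):
            dist = gap / 2.0
        else:
            dist = gap * 2.0 ** (1.0 / q - 1.0)
        total += dist ** p
    return total ** (1.0 / p)


if __name__ == "__main__":
    # Hypothetical diagram of (birth, death) pairs, for illustration only.
    example = [(0.0, 2.0), (1.0, 3.0)]
    print(wasserstein_to_diagonal_point(example))
```

If the value returned by the library for the same diagram differs noticeably from this closed form, that would point to a difference in conventions or to a problem in the setup rather than to plain slowness.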
Thanks, Yossi, I started a run as you've suggested. Will keep you posted on the results. A few questions please (out of interest):
1.) Is there any way to peek into the running process to see what it's doing? I would think that Linux "strace" could be used on the running "Julia" process, but then it would be helpful to know something of the program flow ...
2.) Is there any way to 'estimate' (very back-of-envelope) how much RAM each of the Eirene functions Homology/Betticurve/Wasserstein will take based on the input (N x M) size? This could help a lot in my upcoming experiments ... naturally, (where possible) I'd like to push to as large a data set as possible, (hopefully) without having to wait a few hours/days only to find out that it crashed "Julia".
3.) Is there any way to parallelize the Wasserstein function (split at the beginning, process in parallel, then join results from each part to finalise at the end)? I'd be keen to hear your thoughts on this in particular ... I'm sure this is a bit of a Holy Grail and you might be getting asked this a lot by us newbies.
Great! @Eetion might have more to say in response to your questions, but here is at least the start of some answers. @Chr1sWilliams might have some things to say as well.
2.) I can't speak for anything other than wasserstein_distance, but to calculate the Wasserstein distance, we create a (size(D_1)+size(D_2))x(size(D_1)+size(D_2)) matrix, so memory grows quadratically with the total number of points. Depending on your point of view, a lot?
3.) For the (inf, q)-Wasserstein distance, @Chr1sWilliams and I are aware of a method for optimising the computation, but we have implemented more general Wasserstein distances. As yet, there is no way to parallelise the calculation of the Wasserstein distance as implemented here.
If anything else comes up, please let us know.
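To make the memory answer concrete, here is a back-of-envelope sketch in Python. It assumes the (size(D_1)+size(D_2))x(size(D_1)+size(D_2)) matrix is stored densely with 8 bytes per entry (64-bit floats); that storage format is an assumption, not something confirmed from the Eirene source. The function names and the example sizes are hypothetical, and the estimate ignores any extra workspace the matching step itself may need.

```python
def cost_matrix_bytes(n_points_1, n_points_2, bytes_per_entry=8):
    """Bytes for a dense square matrix with side n_points_1 + n_points_2.

    bytes_per_entry=8 assumes 64-bit floats (an assumption).
    """
    side = n_points_1 + n_points_2
    return side * side * bytes_per_entry


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical diagram sizes, for illustration only.
    for n1, n2 in [(1_000, 1_000), (10_000, 10_000), (50_000, 50_000)]:
        gib = bytes_to_gib(cost_matrix_bytes(n1, n2))
        print(f"{n1} + {n2} points: about {gib:.2f} GiB for the matrix alone")
```

Because the side length is the sum of the two diagram sizes, doubling the number of points in both diagrams quadruples the memory needed for the matrix.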
Hi, some feedback: the program is still running and progressing fine, just slowly. Judging by the current runtime, I expect it will need a few more days. I will report back with results as soon as they are available, thanks.
Hello all, and apologies for the long delay. As regards your list, IBMRK,
Thanks!
I'm running Eirene on Linux on Power PPC, based on Julia 1.41 which I compiled from source (what a mission!).
This is my high-level configuration (happy to share more detail as you may need):
Hardware: Power 9 CPU x 20, 150GB RAM
OS: RHEL Openshift OS (Linux Kernel 3.10)
Software: Docker 18.03
Docker Container: Ubuntu 18.04 (64bit), Julia 1.41 (compiled from source), Eirene
Beyond the above, I need to share a caveat early because I'm not sure if my issue is related to this aspect:
When I compiled Julia v1.41, HDF5 (precompiled Julia library v0.13.x) was 'broken', so I had to downgrade it (based on an obscure note that recommended that v0.11.x works). Not sure if this is relevant to my issue ... I'm actually planning a separate test (yet to do) of my workflow (below) on INTEL (with the standard HDF5, and smaller data set because I have much less RAM available on my laptop).
Now to the issue.
I've written a rudimentary Eirene program (just for testing) in the following way:
BTW, I have tested this workflow for smaller data sets (25x9, 50x9 instead of 100x9 above) and the programs ran quickly (5-12 minutes) and terminated without errors, providing the Wasserstein distance, which is the desired output result.
The problem is that Step 7 for the 100x9 data set above ran for over 2 days before I eventually killed it (suspecting a problem). What I did notice was that only 1 of our CPUs was running "julia" at 100% (overall the machine was 97% idle).
I hope I've managed to share at least a high-level picture of the problem.
I'd appreciate it if you could let me know of any suggestions. I'm happy to provide any additional info you think would help.