vtk multiprocessing issues #258
@doutriaux1 I was able to run your test to completion on my machine. It hogged my machine pretty badly for a couple of minutes. I think the main bottleneck is opening so many windows. OSs were not designed to have hundreds of windows open at the same time. This is why OSMesa works much better - it does not open a window.
@danlipsa updating my test as follows on ubuntu:

```python
import sys

import cdat_info
import cdms2
import vcs

if len(sys.argv) > 1:
    N = int(sys.argv[1])
else:
    N = 120

def myplot(index):
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    gm = vcs.createisofill()
    gm2 = vcs.createisoline()
    x = vcs.init()
    x.plot(s, gm)
    x.plot(s, gm2)
    x.clear()
    x.png("pngs/out_%i.png" % index)
    x.clear()
    x.close()
    del x

from dask.distributed import Client
import dask.bag
import dask.multiprocessing

bag = dask.bag.from_sequence(range(N))
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()
print "Done"
```

Log before hanging:
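The hang pattern above suggests state leaking between plots inside long-lived worker processes. A stdlib-only sketch of one mitigation - recycling each worker after a single task via multiprocessing's `maxtasksperchild` - with a hypothetical `plot_one` standing in for `myplot`:

```python
import multiprocessing

def plot_one(index):
    # Hypothetical stand-in for myplot: the real version would open a
    # vcs canvas, plot, save a PNG, and close the canvas.
    return index * index

def run_all(n, workers=4):
    # maxtasksperchild=1 retires each worker after a single task, so any
    # state VTK/vcs leaves behind dies with that process instead of
    # leaking into the next plot. The explicit "fork" context keeps the
    # sketch self-contained on Linux.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=workers, maxtasksperchild=1) as pool:
        return pool.map(plot_one, range(n))
```

This trades process-startup overhead for isolation; if the crashes stop under this scheme, shared in-process state becomes the prime suspect.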
Does not hang with the partition bit, though. @zshaheen is looking into clear/close/del(canvas) to see if it helps.
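The "partition bit" amounts to bounding how many plots are in flight at once. A minimal stdlib sketch of the same idea, independent of dask (names are hypothetical):

```python
def chunks(seq, size):
    # Yield successive fixed-size batches; processing one batch at a
    # time bounds how many canvases ever exist simultaneously.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# e.g. 20 plot indices in batches of 8
batches = list(chunks(list(range(20)), 8))
```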
I still think that there's some memory corruption somewhere; look at @zshaheen's original error.
@doutriaux1 Linux, on-screen (nvidia card/driver). Your error seems like a graphics driver issue - nvidia is the way to go on Linux.
nvidia may not be the way to go on remote servers. But then it would be foolish to open multiple windows on a remote server. Except maybe before going on a long vacation.
I agree with @jypeter; besides, the original issue was spotted when using
@doutriaux1 @jypeter I agree with you guys, osmesa has its uses. My point was that if you want hardware acceleration on Linux you should choose nvidia, as they have the most stable Linux drivers of all the card manufacturers. Regarding the diags bug, I would try to save the data for all plots and rerun the failing plot by itself. I have the feeling that this is where the problem is, rather than the parallelism.
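@danlipsa's save-and-replay suggestion could look like this (a sketch; the helper names and directory are hypothetical, and pickle stands in for whatever format the real masked arrays need):

```python
import os
import pickle

def save_plot_input(index, data, outdir="debug_inputs"):
    # Persist exactly what the plot call received; if plot `index`
    # crashes in a parallel run, reload this file and replot it
    # serially to see whether the data or the parallelism is at fault.
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, "input_%i.pkl" % index)
    with open(path, "wb") as fh:
        pickle.dump(data, fh)
    return path

def load_plot_input(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)
```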
@danlipsa I still think it's parallelism. For example, if I run the test suite pretty much anywhere (my redhat6, travisci, circleci) with the full number of cpus, I get segmentation faults or some sort of errors; rerunning only the failed tests makes them pass. There's some sort of vcs combination that makes it crash. @zshaheen is trying to isolate the perfect-storm combination.
@doutriaux1 I see. Are they the same tests that randomly fail? How many of them fail? The printout on the acme_diags issue looks like a 'vcs does not like the data' issue rather than a memory access issue. Does the error always manifest itself that way? We could add some debug code at that location to learn more about the data that vcs gets.
@danlipsa yes, that's what really worries me: it seems that running it in parallel somehow tweaks the data used by VTK! Hence my strong inclination toward some kind of memory shared by multiple VTK instances across processes, or just bad memory access. I'll run the test suite in full multiple times and see if it's always the same ones that fail - good idea. Will report soon.
@danlipsa ping. |
@doutriaux1 I think we need more information to be able to reproduce the problem. The way to move forward is to add some print statements in vcs when running the acme_diags test to see why we get the assert error.
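The debug print could be a compact fingerprint of the array logged just before vcs gets it, so a serial-vs-parallel diff shows whether the data itself changed (`describe_data` is a hypothetical helper, using plain nested lists in place of the real masked arrays):

```python
import math

def describe_data(rows):
    # Summarize a 2-D array right before plotting; if any value here
    # differs between serial and parallel runs, the data was corrupted
    # before vcs ever saw it.
    flat = [float(v) for row in rows for v in row]
    finite = [v for v in flat if not math.isnan(v)]
    return {
        "n": len(flat),
        "nan": len(flat) - len(finite),
        "min": min(finite),
        "max": max(finite),
    }
```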
Can we run the acme_diags tests on our local machine? |
@danlipsa We'll work on creating the scripts and instructions for you to recreate the issue. |
@zshaheen instructions are pasted below:

Installing the diagnostics software
The following must be done on acme1.llnl.gov, since that's where the data defined in the attached Python file is located.
Running the software
Since the problem is non-deterministic, we recommend running it ~10 times and viewing the results. A working run should take about 10 minutes and will have one of the last lines be:
When it fails, it’ll usually break within 3 minutes and the last lines will be something like this:
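Since the failure is non-deterministic, the ~10 runs can be scripted; a sketch below (the actual command is whatever the attached multiprocessing.bash invokes - the `"true"` placeholder is just for illustration):

```python
import subprocess

def run_repeatedly(cmd, times=10):
    # Run the same command repeatedly and collect return codes;
    # any nonzero code marks a reproduction of the crash.
    return [subprocess.call(cmd) for _ in range(times)]

if __name__ == "__main__":
    # Placeholder command; substitute the real diagnostics invocation.
    print(run_repeatedly(["true"], times=3))
```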
multiprocessing.bash
multiprocessing-fix.py:

```python
reference_data_path = '/p/cscratch/acme/data/obs_for_acme_diags/'
test_data_path = '/p/cscratch/acme/data/test_model_data_for_acme_diags/'
test_name = '20161118.beta0.FC5COSP.ne30_ne30.edison'
sets = ['zonal_mean_xy', 'zonal_mean_2d', 'lat_lon', 'polar', 'cosp_histogram']
backend = 'vcs'  # 'mpl' is for the matplotlib plots.
results_dir = 'multiprocess_fix_03_01_2018'  # name of folder where all results will be stored
multiprocessing = True
num_workers = 32
debug = True
granulate = []
```
@danlipsa running this with mesa appears fine, but using a large number (10000) I start to get LLVM errors.
Running this with non-mesa on my ubuntu leads to all windows going to the screen (never deleted). And the windows do stay up if you CTRL+C kill the script.
This is related to: E3SM-Project/e3sm_diags#88