
vtk multiprocessing issues #258

Open
doutriaux1 opened this issue Sep 14, 2017 · 17 comments
Labels
bug, high (Highest priority issue)

Comments

@doutriaux1
Contributor

doutriaux1 commented Sep 14, 2017

import sys

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing
from dask.distributed import Client  # not used in this snippet

# Number of plots to generate; each index wraps around the 120 time steps in clt.nc.
if len(sys.argv) > 1:
    N = int(sys.argv[1])
else:
    N = 120


def myplot(index):
    # Each task opens the sample data, plots one time step with isofill and
    # isoline graphics methods on its own canvas, and writes a PNG.
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    gm = vcs.createisofill()
    gm2 = vcs.createisoline()
    x = vcs.init()
    x.plot(s, gm)
    x.plot(s, gm2)
    x.clear()
    x.png("pngs/out_%i.png" % index)


# Run the plots in parallel with dask's multiprocessing scheduler.
bag = dask.bag.from_sequence(range(N))
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()

print("Done")

@danlipsa running this with mesa appears fine, but with a large number of plots (10000) I start to get LLVM errors.

Running this without mesa on my ubuntu machine sends all the windows to the screen (they are never deleted), and the windows stay up even if you kill the script with CTRL+C.

This is related to: E3SM-Project/e3sm_diags#88

@danlipsa
Contributor

@doutriaux1 I was able to run your test to completion on my machine. It hogged my machine pretty badly for a couple of minutes. I think the main bottleneck is opening so many windows: operating systems were not designed to have hundreds of windows open at the same time. This is why OSMesa works much better - it does not open a window.
To address this (see the combined sketch below):

  1. Add an x.close() at the end of your myplot function.
  2. Use npartitions to control how many windows are open at the same time. I got a significant speedup with
    bag = dask.bag.from_sequence(range(N), npartitions=4), but you can play with other numbers.
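
For reference, here is a minimal sketch of the script above with both suggestions applied (illustrative only, not a tested fix; npartitions=4 is just a starting value to experiment with):

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing

N = 120  # number of plots, as in the original script

def myplot(index):
    # One plot per task: open the data, render, save the PNG, then close the window.
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    x = vcs.init()
    x.plot(s, vcs.createisofill())
    x.plot(s, vcs.createisoline())
    x.png("pngs/out_%i.png" % index)
    x.clear()
    x.close()  # suggestion 1: release the window instead of leaving it open

# Suggestion 2: npartitions bounds how many tasks (and thus windows) run at once.
bag = dask.bag.from_sequence(range(N), npartitions=4)
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()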

@doutriaux1
Contributor Author

@danlipsa I updated my test on ubuntu as follows.
Running it with N=120 it doesn't come back. Will try the partition thing next.
Also, are you running on mac or linux? mesa or non-mesa?

import sys

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing
from dask.distributed import Client  # not used in this snippet

if len(sys.argv) > 1:
    N = int(sys.argv[1])
else:
    N = 120


def myplot(index):
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    gm = vcs.createisofill()
    gm2 = vcs.createisoline()
    x = vcs.init()
    x.plot(s, gm)
    x.plot(s, gm2)
    x.clear()
    x.png("pngs/out_%i.png" % index)
    x.clear()
    x.close()
    del(x)


bag = dask.bag.from_sequence(range(N))
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()

print("Done")

log before hanging:

radeon: The kernel rejected CS, see dmesg for more information (-22).
[the line above is repeated 22 times in total]
radeon: Failed to allocate a buffer:
radeon:    size      : 140574720 bytes
radeon:    alignment : 32768 bytes
radeon:    domains   : 4
radeon:    flags     : 20

@doutriaux1
Contributor Author

It does not hang with the npartitions change, though. @zshaheen is looking into clear()/close()/del(canvas) to see if that helps.

@doutriaux1
Contributor Author

I still think that there's some memory corruption somewhere; look at @zshaheen's original error.

@danlipsa
Contributor

@doutriaux1 Linux, on-screen (nvidia card/driver). Your error seems like a graphics driver issue - nvidia is the way to go on linux.

@jypeter
Member

jypeter commented Sep 15, 2017

nvidia may not be the way to go on the remote servers. But then it would be foolish to open multiple windows on a remote server. Except maybe before going on a long vacation

@doutriaux1
Contributor Author

I agree with @jypeter. Besides, the original issue was spotted when using mesalib, so I really think it has something to do with VTK or vcs. Also, running acme_diags with matplotlib is fine.

@danlipsa
Contributor

@doutriaux1 @jypeter I agree with you guys, osmesa has its uses. My point was that if you want hardware acceleration on linux, you should choose nvidia, as they have the most stable linux drivers of all the card manufacturers. Regarding the diags bug, I would try to save the data for all plots and rerun the failing plot by itself. I have the feeling that this is where the problem is, rather than the parallelism.
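
For illustration, one way to capture the data is to dump each slab to a NetCDF file right before it is plotted, so a failing case can be replayed serially afterwards. This is only a rough sketch, and save_slab_for_replay is a hypothetical helper, not part of acme_diags or vcs:

import os

import cdms2


def save_slab_for_replay(s, index, outdir="debug_slabs"):
    # Write the exact variable that is about to be plotted; if a task crashes,
    # its inputs are left behind and can be re-plotted outside of dask.
    try:
        os.makedirs(outdir)
    except OSError:
        pass  # the directory may already exist (several workers run at once)
    out = cdms2.open(os.path.join(outdir, "slab_%i.nc" % index), "w")
    out.write(s)
    out.close()

Calling this just before each x.plot(...) would leave behind the inputs of whichever plot fails.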

@doutriaux1
Contributor Author

@danlipsa I still think it's parallelism. For example, if I run the test suite pretty much anywhere (my redhat6, travisci, circleci) with the full number of cpus, I get segmentation faults or some other errors; rerunning only the failed tests makes them pass. There's some sort of vcs combination that makes it crash. @zshaheen is trying to isolate the perfect-storm combination.

@danlipsa
Contributor

@doutriaux1 I see. Are they the same tests that randomly fail? How many of them fail? The printout on the acme_diags issue looks like a 'vcs does not like the data' issue rather than a memory access issue. Does the error always manifest itself that way? We could add some debug code at that location to learn more about the data that vcs gets.

@doutriaux1
Contributor Author

@danlipsa yes, that's what really worries me: it seems that running it in parallel somehow tweaks the data used by VTK! Hence my strong inclination toward some kind of memory shared by multiple VTK instances across processes, or just bad memory access. I'll run the test suite in full multiple times and see if it's always the same ones that fail, good idea. Will report soon.
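
For the record, tallying failures over repeated runs could be as simple as the sketch below. run_full_suite.sh is a placeholder for whatever command actually launches the vcs test suite; the point is only to keep one log per run so the failing tests can be compared afterwards:

import subprocess

SUITE_CMD = ["bash", "run_full_suite.sh"]  # placeholder for the real test-suite command

for run in range(10):
    log = "suite_run_%02d.log" % run
    with open(log, "w") as out:
        # Capture stdout and stderr of each full run; the logs can then be
        # grepped to see whether the same tests fail every time.
        ret = subprocess.call(SUITE_CMD, stdout=out, stderr=subprocess.STDOUT)
    print("run %d exited with code %d (log: %s)" % (run, ret, log))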

@doutriaux1
Contributor Author

@danlipsa ping.

@danlipsa
Contributor

@doutriaux1 I think we need more information here to be able to reproduce the problem. The way to move forward is to add some print statements in vcs when running the acme_diags test, to see why we get the assert error.
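
As an illustration only (a sketch, not anything shipped with vcs): one low-effort way to get those prints without editing the installed package is to wrap vcs.vcs2vtk.getWrappedBounds, the function whose assert fails in the traceback pasted later in this thread, and log its inputs before delegating to the original:

import functools

import vcs.vcs2vtk as vcs2vtk

_original_getWrappedBounds = vcs2vtk.getWrappedBounds


@functools.wraps(_original_getWrappedBounds)
def _logged_getWrappedBounds(*args, **kwargs):
    # Print the raw inputs so that, when the assert fires, the offending
    # bounds are already in that worker's log.
    print("getWrappedBounds args=%r kwargs=%r" % (args, kwargs))
    return _original_getWrappedBounds(*args, **kwargs)


# Rebind the module-level name; callers inside vcs2vtk resolve it at call time.
vcs2vtk.getWrappedBounds = _logged_getWrappedBounds

With multiprocessing, the patch has to run inside each worker process (for example, near the top of the code each worker imports), otherwise only the parent process picks it up.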

@danlipsa
Contributor

Can we run the acme_diags tests on our local machine?

@zshaheen

@danlipsa We'll work on creating the scripts and instructions for you to recreate the issue.

@doutriaux1
Contributor Author

doutriaux1 commented Mar 2, 2018

@zshaheen's instructions are pasted below:

Installing the diagnostics software

The following must be done on acme1.llnl.gov, since that’s where the data defined in the attached Python file is located.
If you have any issues running the software, or any environment issues, let us know.

  1. Make sure you’re using the latest version of Anaconda
    conda update conda

  2. Get the *.yml file to create the env
    wget https://raw.githubusercontent.com/ACMEClimate/acme_diags/master/conda/acme_diags_env_dev.yml

  3. Remove any cached Anaconda packages
    conda clean --all

  4. Create the env
    conda env create -f acme_diags_env_dev.yml
    source activate acme_diags_env_dev

  5. Get the latest code, check out the broken_multiprocessing branch, and install it
    git clone https://github.com/ACME-Climate/acme_diags.git
    cd acme_diags
    git checkout broken_multiprocessing
    python setup.py install

Running the software

Since the problem is non-deterministic, we recommend running it ~10 times and viewing the results.
Use the attached bash script to run the diags software, with the attached parameter file, 10 times:
bash -x multiprocess-fix.sh
Then, view the results like so:
tail mp_results*

A working run should take about 10 minutes, and one of the last lines will be:

...
1.08137564507 0.912
0.951984539914 0.852
1.30706979881 0.982
Viewer HTML generated at /export/shaheen2/multiprocess_fix_02_28_2018/viewer/index.html

When it fails, it’ll usually break within 3 minutes and the last lines will be something like this:

...
  File "/export/shaheen2/anaconda2/envs/acme_diags_env_dev_anaconda_channel_first/lib/python2.7/site-packages/vcs/vcs2vtk.py", line 547, in genGrid
    wc, [xm, xM, ym, yM], wrap))
  File "/export/shaheen2/anaconda2/envs/acme_diags_env_dev_anaconda_channel_first/lib/python2.7/site-packages/vcs/vcs2vtk.py", line 2017, in getWrappedBounds
    assert (x1 < x2 and y1 < y2)
AssertionError

@doutriaux1
Contributor Author

multiprocessing.bash

acme_diags_driver.py -p multiprocess-fix.py &> mp_results1.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results2.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results3.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results4.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results5.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results6.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results7.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results8.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results9.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results10.txt

multiprocessing-fix.py

reference_data_path = '/p/cscratch/acme/data/obs_for_acme_diags/'
test_data_path = '/p/cscratch/acme/data/test_model_data_for_acme_diags/'

test_name = '20161118.beta0.FC5COSP.ne30_ne30.edison'

sets = ['zonal_mean_xy', 'zonal_mean_2d', 'lat_lon', 'polar', 'cosp_histogram']


backend = 'vcs'  # 'mpl' is for the matplotlib plots.

results_dir = 'multiprocess_fix_03_01_2018'  # name of folder where all results will be stored

multiprocessing = True
num_workers = 32

debug = True

granulate = []

@doutriaux1 doutriaux1 modified the milestones: 3.0, Next Release Mar 29, 2018
@doutriaux1 doutriaux1 modified the milestones: 8.1, 8.2 Mar 27, 2019