
vtk multiprocessing issues #258

Open
doutriaux1 opened this issue Sep 14, 2017 · 17 comments
Labels
bug, high (Highest priority issue)

Comments

@doutriaux1
Contributor

doutriaux1 commented Sep 14, 2017

import sys

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing
from dask.distributed import Client  # not used in this snippet

# Number of plots to generate; each index wraps around the 120 time steps in clt.nc.
if len(sys.argv) > 1:
    N = int(sys.argv[1])
else:
    N = 120


def myplot(index):
    # Each task opens the sample data, plots one time step with isofill and
    # isoline graphics methods on its own canvas, and writes a PNG.
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    gm = vcs.createisofill()
    gm2 = vcs.createisoline()
    x = vcs.init()
    x.plot(s, gm)
    x.plot(s, gm2)
    x.clear()
    x.png("pngs/out_%i.png" % index)


# Run the plots in parallel with dask's multiprocessing scheduler.
bag = dask.bag.from_sequence(range(N))
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()

print("Done")

@danlipsa running this with mesa appears fine, but with a large number of plots (10000) I start to get LLVM errors.

Running this without mesa on my ubuntu machine sends all the windows to the screen (they are never deleted), and the windows stay up even if you kill the script with CTRL+C.

This is related to: E3SM-Project/e3sm_diags#88

@danlipsa
Contributor

@doutriaux1 I was able to run your test to completion on my machine. It hogged my machine pretty badly for a couple of minutes. I think the main bottleneck is opening so many windows: operating systems were not designed to have hundreds of windows open at the same time. This is why OSMesa works much better - it does not open a window.
To address this (see the combined sketch below):

  1. Add an x.close() at the end of your myplot function.
  2. Use npartitions to control how many windows are open at the same time. I got a significant speedup with
    bag = dask.bag.from_sequence(range(N), npartitions=4), but you can play with other numbers.
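
For reference, here is a minimal sketch of the script above with both suggestions applied (illustrative only, not a tested fix; npartitions=4 is just a starting value to experiment with):

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing

N = 120  # number of plots, as in the original script

def myplot(index):
    # One plot per task: open the data, render, save the PNG, then close the window.
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    x = vcs.init()
    x.plot(s, vcs.createisofill())
    x.plot(s, vcs.createisoline())
    x.png("pngs/out_%i.png" % index)
    x.clear()
    x.close()  # suggestion 1: release the window instead of leaving it open

# Suggestion 2: npartitions bounds how many tasks (and thus windows) run at once.
bag = dask.bag.from_sequence(range(N), npartitions=4)
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()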

@doutriaux1
Contributor Author

@danlipsa I updated my test on ubuntu as follows.
Running it with N=120 it doesn't come back. Will try the partition thing next.
Also, are you running on mac or linux? mesa or non-mesa?

import sys

import cdat_info
import cdms2
import vcs

import dask.bag
import dask.multiprocessing
from dask.distributed import Client  # not used in this snippet

if len(sys.argv) > 1:
    N = int(sys.argv[1])
else:
    N = 120


def myplot(index):
    f = cdms2.open(cdat_info.get_sampledata_path() + "/clt.nc")
    s = f["clt"][index % 120]
    gm = vcs.createisofill()
    gm2 = vcs.createisoline()
    x = vcs.init()
    x.plot(s, gm)
    x.plot(s, gm2)
    x.clear()
    x.png("pngs/out_%i.png" % index)
    x.clear()
    x.close()
    del(x)


bag = dask.bag.from_sequence(range(N))
with dask.set_options(get=dask.multiprocessing.get):
    results = bag.map(myplot).compute()

print("Done")

log before hanging:

radeon: The kernel rejected CS, see dmesg for more information (-22).
[the line above is repeated 22 times in total]
radeon: Failed to allocate a buffer:
radeon:    size      : 140574720 bytes
radeon:    alignment : 32768 bytes
radeon:    domains   : 4
radeon:    flags     : 20

@doutriaux1
Contributor Author

It does not hang with the npartitions change, though. @zshaheen is looking into clear()/close()/del(canvas) to see if that helps.

@doutriaux1
Contributor Author

I still think that there's some memory corruption somewhere; look at @zshaheen's original error.

@danlipsa
Contributor

@doutriaux1 Linux, on-screen (nvidia card/driver). Your error seems like a graphics driver issue - nvidia is the way to go on linux.

@jypeter
Member

jypeter commented Sep 15, 2017

nvidia may not be the way to go on the remote servers. But then it would be foolish to open multiple windows on a remote server. Except maybe before going on a long vacation

@doutriaux1
Contributor Author

I agree with @jypeter. Besides, the original issue was spotted when using mesalib, so I really think it has something to do with VTK or vcs. Also, running acme_diags with matplotlib is fine.

@danlipsa
Contributor

@doutriaux1 @jypeter I agree with you guys, osmesa has its uses. My point was that if you want hardware acceleration on linux, you should choose nvidia, as they have the most stable linux drivers of all the card manufacturers. Regarding the diags bug, I would try to save the data for all plots and rerun the failing plot by itself. I have the feeling that this is where the problem is, rather than the parallelism.
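
For illustration, one way to capture the data is to dump each slab to a NetCDF file right before it is plotted, so a failing case can be replayed serially afterwards. This is only a rough sketch, and save_slab_for_replay is a hypothetical helper, not part of acme_diags or vcs:

import os

import cdms2


def save_slab_for_replay(s, index, outdir="debug_slabs"):
    # Write the exact variable that is about to be plotted; if a task crashes,
    # its inputs are left behind and can be re-plotted outside of dask.
    try:
        os.makedirs(outdir)
    except OSError:
        pass  # the directory may already exist (several workers run at once)
    out = cdms2.open(os.path.join(outdir, "slab_%i.nc" % index), "w")
    out.write(s)
    out.close()

Calling this just before each x.plot(...) would leave behind the inputs of whichever plot fails.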

@doutriaux1
Contributor Author

@danlipsa I still think it's parallelism. For example, if I run the test suite pretty much anywhere (my redhat6, travisci, circleci) with the full number of cpus, I get segmentation faults or some other errors; rerunning only the failed tests makes them pass. There's some sort of vcs combination that makes it crash. @zshaheen is trying to isolate the perfect-storm combination.

@danlipsa
Contributor

@doutriaux1 I see. Are they the same tests that randomly fail? How many of them fail? The printout on the acme_diags issue looks like a 'vcs does not like the data' issue rather than a memory access issue. Does the error always manifest itself that way? We could add some debug code at that location to learn more about the data that vcs gets.

@doutriaux1
Contributor Author

@danlipsa yes, that's what really worries me: it seems that running it in parallel somehow tweaks the data used by VTK! Hence my strong inclination toward some kind of memory shared by multiple VTK instances across processes, or just bad memory access. I'll run the test suite in full multiple times and see if it's always the same ones that fail, good idea. Will report soon.
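
For the record, tallying failures over repeated runs could be as simple as the sketch below. run_full_suite.sh is a placeholder for whatever command actually launches the vcs test suite; the point is only to keep one log per run so the failing tests can be compared afterwards:

import subprocess

SUITE_CMD = ["bash", "run_full_suite.sh"]  # placeholder for the real test-suite command

for run in range(10):
    log = "suite_run_%02d.log" % run
    with open(log, "w") as out:
        # Capture stdout and stderr of each full run; the logs can then be
        # grepped to see whether the same tests fail every time.
        ret = subprocess.call(SUITE_CMD, stdout=out, stderr=subprocess.STDOUT)
    print("run %d exited with code %d (log: %s)" % (run, ret, log))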

@doutriaux1
Contributor Author

@danlipsa ping.

@danlipsa
Contributor

@doutriaux1 I think we need more information here to be able to reproduce the problem. The way to move forward is to add some print statements in vcs when running the acme_diags test, to see why we get the assert error.
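
As an illustration only (a sketch, not anything shipped with vcs): one low-effort way to get those prints without editing the installed package is to wrap vcs.vcs2vtk.getWrappedBounds, the function whose assert fails in the traceback pasted later in this thread, and log its inputs before delegating to the original:

import functools

import vcs.vcs2vtk as vcs2vtk

_original_getWrappedBounds = vcs2vtk.getWrappedBounds


@functools.wraps(_original_getWrappedBounds)
def _logged_getWrappedBounds(*args, **kwargs):
    # Print the raw inputs so that, when the assert fires, the offending
    # bounds are already in that worker's log.
    print("getWrappedBounds args=%r kwargs=%r" % (args, kwargs))
    return _original_getWrappedBounds(*args, **kwargs)


# Rebind the module-level name; callers inside vcs2vtk resolve it at call time.
vcs2vtk.getWrappedBounds = _logged_getWrappedBounds

With multiprocessing, the patch has to run inside each worker process (for example, near the top of the code each worker imports), otherwise only the parent process picks it up.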

@danlipsa
Contributor

Can we run the acme_diags tests on our local machine?

@zshaheen

@danlipsa We'll work on creating the scripts and instructions for you to recreate the issue.

@doutriaux1
Contributor Author

doutriaux1 commented Mar 2, 2018

@zshaheen's instructions are pasted below:

Installing the diagnostics software

The following must be done on acme1.llnl.gov, since that’s where the data defined in the attached Python file is located.
If you have any issues running the software, or any environment issues, let us know.

  1. Make sure you’re using the latest version of Anaconda
    conda update conda

  2. Get the *.yml file to create the env
    wget https://raw.githubusercontent.com/ACMEClimate/acme_diags/master/conda/acme_diags_env_dev.yml

  3. Remove any cached Anaconda packages
    conda clean --all

  4. Create the env
    conda env create -f acme_diags_env_dev.yml
    source activate acme_diags_env_dev

  5. Get the latest code, check out the broken_multiprocessing branch, and install it
    git clone https://github.com/ACME-Climate/acme_diags.git
    cd acme_diags
    git checkout broken_multiprocessing
    python setup.py install

Running the software

Since the problem is non-deterministic, we recommend running it ~10 times and viewing the results.
Use the attached bash script to run the diags software, with the attached parameter file, 10 times:
bash -x multiprocess-fix.sh
Then, view the results like so:
tail mp_results*

A working run should take about 10 minutes, and one of the last lines will be:

...
1.08137564507 0.912
0.951984539914 0.852
1.30706979881 0.982
Viewer HTML generated at /export/shaheen2/multiprocess_fix_02_28_2018/viewer/index.html

When it fails, it’ll usually break within 3 minutes and the last lines will be something like this:

...
  File "/export/shaheen2/anaconda2/envs/acme_diags_env_dev_anaconda_channel_first/lib/python2.7/site-packages/vcs/vcs2vtk.py", line 547, in genGrid
    wc, [xm, xM, ym, yM], wrap))
  File "/export/shaheen2/anaconda2/envs/acme_diags_env_dev_anaconda_channel_first/lib/python2.7/site-packages/vcs/vcs2vtk.py", line 2017, in getWrappedBounds
    assert (x1 < x2 and y1 < y2)
AssertionError

@doutriaux1
Contributor Author

multiprocessing.bash

acme_diags_driver.py -p multiprocess-fix.py &> mp_results1.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results2.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results3.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results4.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results5.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results6.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results7.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results8.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results9.txt
acme_diags_driver.py -p multiprocess-fix.py &> mp_results10.txt

multiprocessing-fix.py

reference_data_path = '/p/cscratch/acme/data/obs_for_acme_diags/'
test_data_path = '/p/cscratch/acme/data/test_model_data_for_acme_diags/'

test_name = '20161118.beta0.FC5COSP.ne30_ne30.edison'

sets = ['zonal_mean_xy', 'zonal_mean_2d', 'lat_lon', 'polar', 'cosp_histogram']


backend = 'vcs'  # 'mpl' is for the matplotlib plots.

results_dir = 'multiprocess_fix_03_01_2018'  # name of folder where all results will be stored

multiprocessing = True
num_workers = 32

debug = True

granulate = []

@doutriaux1 doutriaux1 modified the milestones: 3.0, Next Release Mar 29, 2018
@doutriaux1 doutriaux1 modified the milestones: 8.1, 8.2 Mar 27, 2019