ADIOS crash on taurus #2861
Comments
offline discussion with @psychocoderHPC: Either setting I will test setting no aggregators. @BeyondEspresso How could you allocate 128 aggregators if you only used 28 GPUs? |
Yes, in order to aggregate output from N GPUs (devices) over M<N aggregators, the combined write from several devices (depending on the N/M ratio) needs to fit in the host RAM of the nodes that ADIOS selects for aggregation (Chapter 6.1.5 in the ADIOS manual). If a cluster is designed with too little host RAM (e.g. not a good multiple of the device RAM), aggregation might not be possible. E.g. on Titan the ratio is 6:32 GByte (1:4), which works well for us. An alternative on such systems is to perform (off-node) staging, but we are not fully there yet (MA thesis for ADIOS2 staging starting soon with @franzpoeschel, when we switch output to openPMD-api). On which queue of Taurus do you run, how much host RAM and how many GPUs of which kind are used per node, and how many GPUs did you use in the run you describe ( That said, for the low number of nodes on Taurus you can probably skip aggregation, as it should not bring much benefit besides loading the FS with slightly fewer files, which are not a lot here (see the manual above again for detailed reasoning; aggregation is usually needed for a few thousand devices and more). |
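The N/M argument above can be turned into a rough back-of-the-envelope check. The following is only a sketch of the reasoning in this comment, not of how ADIOS actually buffers data internally: the function names and the per-device dump size are hypothetical, and it assumes devices are spread evenly over the aggregators.

```python
import math


def devices_per_aggregator(num_devices: int, num_aggregators: int) -> int:
    """Worst-case number of devices whose output lands on one aggregator."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    # Assumption: zero, negative, or at least as many aggregators as devices
    # all mean "no effective aggregation", i.e. one device per writer.
    if num_aggregators <= 0 or num_aggregators >= num_devices:
        return 1
    return math.ceil(num_devices / num_aggregators)


def aggregation_buffer_gb(num_devices: int, num_aggregators: int,
                          dump_gb_per_device: float) -> float:
    """Host RAM (GB) one aggregator needs to hold the combined write."""
    return devices_per_aggregator(num_devices, num_aggregators) * dump_gb_per_device


def aggregation_fits(num_devices: int, num_aggregators: int,
                     dump_gb_per_device: float, free_host_ram_gb: float) -> bool:
    """True if the combined write fits into the free host RAM of a node."""
    needed = aggregation_buffer_gb(num_devices, num_aggregators, dump_gb_per_device)
    return needed <= free_host_ram_gb


if __name__ == "__main__":
    # Hypothetical numbers: 28 devices, 7 aggregators, 6 GB written per device.
    print(aggregation_buffer_gb(28, 7, 6.0))   # 4 devices per aggregator -> 24.0
    print(aggregation_fits(28, 7, 6.0, 32.0))  # True
    print(aggregation_fits(28, 2, 6.0, 32.0))  # 14 devices per aggregator -> False
```

With more aggregators than devices (e.g. 128 aggregators for 28 GPUs), the sketch falls back to one device per writer, which is the "no aggregation" case discussed here.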
I use Rerunning with no aggregators set led to the same error. What still confuses me: they only want to allocate |
Just to summarize in device-speech: "256 NVIDIA Tesla K80 GPUs in 64 nodes" means in ZIH's docs actually 2x K80 per node means
|
I want to add to this issue. I try to run a simulation on the taurus My simulation runs fine with HDF5 output, but is terribly slow since the scratch filesystem is mounted via NFS only. Thanks to compression, I thought I could speed things up at least a little by using ADIOS. Therefore, I changed my
# ADIOS params
TBG_adios_agg="0"
TBG_adios_ost="32"
TBG_adios_transport_params="'stripe_count=4;stripe_size=1048576;block_size=1048576'"
TBG_adios_compression="'blosc:threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd'"
TBG_adios_additional_params="--adios.aggregators !TBG_adios_agg \
--adios.ost !TBG_adios_ost \
--adios.transport-params !TBG_adios_transport_params \
--adios.compression !TBG_adios_compression \
--adios.disable-meta 1"
But the simulation fails when writing the initial output at timestep 0. The error message is
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[taurusml6:116624] *** Process received signal ***
[taurusml6:116624] Signal: Aborted (6)
[taurusml6:116624] Signal code: (-6)
[taurusml6:116624] [ 0] [0x2000000504d8]
[taurusml6:116624] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x200000da1f94]
[taurusml6:116624] [ 2] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200000b3f7c4]
[taurusml6:116624] [ 3] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(+0xba524)[0x200000b3a524]
[taurusml6:116624] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x200000b3a5e0]
[taurusml6:116624] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x200000b3aa90]
[taurusml6:116624] [ 6] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znwm+0xa4)[0x200000b3b504]
[taurusml6:116624] [ 7] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_Znam+0x18)[0x200000b3b5f8]
[taurusml6:116624] [ 8] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter16CallWriteSpeciesINS_7plugins4misc13SpeciesFilterINS_9ParticlesIN5pmacc11compileTime6StringIJLc98ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEEN5boost3mpl6vectorINS_24placeholder_definition3914particlePusherINS_9particles6pusher12AccelerationENS7_24placeholder_definition2213pmacc_isAliasEEENS_24placeholder_definition385shapeINSG_6shapes3P4SESK_EENS_24placeholder_definition4513interpolationINS_28FieldToParticleInterpolationISP_NS_30AssignedTrilinearInterpolationEEESK_EENS_24placeholder_definition467currentINS_13currentSolver3EmZISP_EESK_EENS_24placeholder_definition5212densityRatioINS_25placeholder_definition14117DensityRatioBunchESK_EENS_24placeholder_definition509massRatioINS_25placeholder_definition13918MassRatioElectronsESK_EENS_24placeholder_definition5111chargeRatioINS_25placeholder_definition14020ChargeRatioElectronsESK_EEN4mpl_2naES1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_S1J_EENSC_6v_itemINS_24placeholder_definition309weightingENS1L_INS_24placeholder_definition288momentumENS1L_INS_24placeholder_definition258positionINS_24placeholder_definition2712position_picESK_EENSC_7vector0IS1J_EELi0EEELi0EEELi0EEEEENSG_6filter3AllEEEEclINS7_9DataSpaceILj2EEEEEvRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS2E_EEPNS0_12ThreadParamsET_+0x718)[0x11489a18]
[taurusml6:116624] [ 9] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter10writeAdiosEPvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x8d84)[0x11672324]
[taurusml6:116624] [10] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter8dumpDataEj+0xca8)[0x116c1a98]
[taurusml6:116624] [11] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu5adios11ADIOSWriter6notifyEj+0x130)[0x116c2720]
[taurusml6:116624] [12] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN5pmacc16SimulationHelperILj2EE11dumpOneStepEj+0x174)[0x1136f0e4]
[taurusml6:116624] [13] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN5pmacc16SimulationHelperILj2EE15startSimulationEv+0x310)[0x115c8d30]
[taurusml6:116624] [14] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0x12c)[0x115c971c]
[taurusml6:116624] [15] /scratch/p_electron/2019-02_Bunch-through-foil/runs/005_mini-example-relocated-bunch-no-png-adios/input/bin/picongpu(main+0x118)[0x111d6ad8]
[taurusml6:116624] [16] /usr/lib64/libc.so.6(+0x25100)[0x200000d85100]
[taurusml6:116624] [17] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x200000d852f4]
[taurusml6:116624] *** End of error message ***
srun: error: taurusml6: task 11: Killed
Since memory was already discussed, here is some information:
CPUS FREE_MEM MEMORY GRES
176 350655 254000 gpu:6
Honestly, I do not know where to start. Is this a configuration error, something wrong with the ADIOS library, a cluster or a PIConGPU problem? In order to exclude random errors, I resubmitted the same simulation twice. Interestingly, both simulations crashed without giving any reasonable error message. In both cases the
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) modenv/ml
Module libpng/1.6.34-GCCcore-7.3.0, git/2.18.0-GCCcore-6.4.0, CMake/3.11.4-GCCcore-7.3.0, fosscuda/2018b and 20 dependencies unloaded.
Module fosscuda/2018b and 13 dependencies loaded.
Module CMake/3.11.4-GCCcore-7.3.0 and 1 dependency loaded.
The following have been reloaded with a version change:
1) GCCcore/7.3.0 => GCCcore/6.4.0
2) ncurses/6.1-GCCcore-7.3.0 => ncurses/6.0-GCCcore-6.4.0
3) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-6.4.0
Module git/2.18.0-GCCcore-6.4.0 and 9 dependencies loaded.
The following have been reloaded with a version change:
1) GCCcore/6.4.0 => GCCcore/7.3.0
2) zlib/1.2.11-GCCcore-6.4.0 => zlib/1.2.11-GCCcore-7.3.0
Module libpng/1.6.34-GCCcore-7.3.0 and 2 dependencies loaded.
srun: error: taurusml4: task 8: Killed
srun: Terminating job step 8271608.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8271608.1 ON taurusml3 CANCELLED AT 2019-04-12T11:49:22 ***
srun: error: taurusml3: task 0: Killed
srun: error: taurusml4: tasks 9-10: Killed
srun: error: taurusml7: tasks 18,20-23: Killed
srun: error: taurusml7: task 19: Killed
srun: error: taurusml4: task 7: Killed
srun: error: taurusml4: task 6: Killed
srun: error: taurusml4: task 11: Killed
srun: error: Timed out waiting for job step to complete
srun: error: taurusml6: task 12: Killed
while the
PIConGPU: 0.5.0-dev
Build-Type: Release
Third party:
OS: Linux-4.14.0-49.13.1.el7a.ppc64le
arch: ppc64le
CXX: GNU (7.3.0)
CMake: 3.11.4
CUDA: 9.2.148
mallocMC: 2.3.0
Boost: 1.68.0
MPI:
standard: 3.1
flavor: OpenMPI (3.1.1)
PNGwriter: 0.7.0
libSplash: 1.7.0 (Format 4.0)
ADIOS: 1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider2XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00056 ? 1
PIConGPUVerbose PHYSICS(1) | species b: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0999333
PIConGPUVerbose PHYSICS(1) | species H: omega_p * dt <= 0.1 ? 0.00233215
PIConGPUVerbose PHYSICS(1) | species C: omega_p * dt <= 0.1 ? 0.00403957
PIConGPUVerbose PHYSICS(1) | species N: omega_p * dt <= 0.1 ? 0.00436214
PIConGPUVerbose PHYSICS(1) | species O: omega_p * dt <= 0.1 ? 0.00468138
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 704.846
PIConGPUVerbose PHYSICS(1) | macro particles per device: 1562880000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 25.6359
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 2.67558e-18
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 8.0212e-10
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.33527e-29
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 4.10732e-18
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 6.3706e+14
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 2.12501e+06
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.09884e-12
initialization time: 35sec 978msec = 35 sec
Wait a second, now I see that the modules are loaded and unloaded in different versions, which do not seem to fit together since they were compiled with different GCC versions. I will investigate... |
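A mixed toolchain like the one noticed here can be spotted mechanically by scanning the module output for `GCCcore` versions. This is a minimal sketch that assumes the EasyBuild-style naming visible in the log above (`GCCcore/7.3.0`, `...-GCCcore-6.4.0`); the helper names are made up for illustration.

```python
import re
from typing import List

# Matches both "GCCcore/7.3.0" and the "-GCCcore-7.3.0" suffix of module names.
_GCCCORE = re.compile(r"GCCcore[-/](\d+(?:\.\d+)+)")


def gcccore_versions(module_log: str) -> List[str]:
    """Return the distinct GCCcore versions mentioned in the log (string-sorted)."""
    return sorted(set(_GCCCORE.findall(module_log)))


def has_mixed_toolchains(module_log: str) -> bool:
    """True if modules built against more than one GCCcore version appear."""
    return len(gcccore_versions(module_log)) > 1


if __name__ == "__main__":
    # Lines copied from the job output quoted above.
    sample = """\
Module CMake/3.11.4-GCCcore-7.3.0 and 1 dependency loaded.
The following have been reloaded with a version change:
1) GCCcore/7.3.0 => GCCcore/6.4.0
Module git/2.18.0-GCCcore-6.4.0 and 9 dependencies loaded.
"""
    print(gcccore_versions(sample))
    print(has_mixed_toolchains(sample))
```

A hit does not prove that the binaries are incompatible. It only flags that the environment should be checked before looking for errors elsewhere.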
The Furthermore, I recompiled all libraries which are not provided by the module system, see install script at (https://gist.github.com/steindev/cc02eae81f465833afa27fc8880f3473), recompiled PIConGPU, and ran the simulation again. Unfortunately, it still crashes on write of time step zero, although I have a new runtime error message.
I have tried this several times on different nodes. The error persists. However, I am wondering why this error only occurs when using ADIOS output? I started the same simulation with HDF5 output and the new libraries as well yesterday and it runs fine. |
Just as a stupid guess: can you check that your ADIOS build used the correct MPI? I once had some ADIOS + MPI problem on a Juelich machine and had to set values of |
Thanks for the hint. I added a definition of the The situation got worse. Instead of receiving the The only thing I am left with is the The following is for your reference:
|
Could still be a memory issue. Both the first error
and getting killed by the watchdog daemon could imply you are running out of host-side RAM. Can you try just using half the number of GPUs per node to verify? Also, avoid going wild on |
@ax3l: Great intuition that this could be a memory issue! Yesterday, as a first attempt, I doubled the number of GPUs for the simulation, which allowed it to run for 7.25 hours, although it then crashed with the known errors:
Now at least ADIOS reports that it needs more memory. I have a few more questions:
Could you please explain? Second, is there something in the ADIOS configuration (e.g. Aggregators, OSTs, transport params etc. see above) that I can adjust to reduce ADIOS' memory footprint per node? Also, currently the |
@steindev Do you use aggregator with adios? If so please remove the aggregation since the memory on the host is not large enough for this feature. |
😞 |
@psychocoderHPC I use |
Update: Unfortunately, the simulation using half the GPUs per node crashed after about 40 min due to a node failure. It was automatically resubmitted But the same happened (at least) two more times. I canceled the simulation since I am not sure whether these node failures are caused by my PIConGPU simulation. I am wondering whether it is possible that somebody else received one of the remaining GPUs on the node and the jobs interfered. I added |
@steindev What is the current status on this issue with the |
Update-2
seems to be correct. A simulation using only half of the GPUs per node, i.e. 3 instead of 6, ran without any error message. Therefore, it seems like the
was triggered by too little memory on the host side. For the record, in order to run with half the GPUs per node, the usage of What actually did the trick was to set Furthermore, the taurus admin has updated OpenMPI to version 3.1.4 by now. So it is possible that future errors due to a lack of host-side memory will present a different error message. |
By now the simulation runs using 4 GPUs per node. I have experienced no further problems during I/O so far. I think we can close the issue. |
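For reference, the effect of using fewer GPUs per node can be quantified from the node information quoted earlier in the thread (`MEMORY 254000`, `gpu:6`). The sketch below assumes that the `MEMORY` column is given in megabytes, as is usual for Slurm's `sinfo`, and that host RAM is shared evenly between the ranks on a node; the function name is hypothetical.

```python
def host_ram_per_gpu_mb(node_memory_mb: int, gpus_per_node: int) -> int:
    """Host RAM (MB, rounded down) available to each GPU rank on one node."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return node_memory_mb // gpus_per_node


if __name__ == "__main__":
    NODE_MEMORY_MB = 254000  # MEMORY column from the thread (assumed to be MB)
    for gpus in (6, 4, 3):
        share = host_ram_per_gpu_mb(NODE_MEMORY_MB, gpus)
        print(f"{gpus} GPUs per node: {share} MB host RAM per GPU")
```

Going from 6 to 3 GPUs per node doubles the host-side share per rank (about 42 GB to about 85 GB), and 4 GPUs per node gives roughly 63.5 GB per rank.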
Add a getNode alias for gpu2 partition equipped with k80. Add comment to V100.tpl for taurus ml partition equipped with V100 in order to provide a solution for issues due to too few hostmemory on the taurusml nodes. See ComputationalRadiationPhysics#2861. Add --cpus-per-task to V100_picongpu.profile.example
Add a getNode alias to `k80_picongpu.profile.example` requesting a node from the gpu2-interactive partition equipped with k80. Add comment to `V100.tpl` for taurus ml partition equipped with V100 gpus in order to provide a solution for issues due to too few hostmemory on the taurusml nodes. See ComputationalRadiationPhysics#2861. Add `--cpus-per-task` to `V100_picongpu.profile.example` and do module switch.
I recently switched from HDF5 to ADIOS output (for data dumps and checkpoints) on the taurus system. The code compiles flawlessly and also starts to run. However, 45 minutes into the simulation, `picongpu` crashes with the following error:

Due to the error, I switched back to HDF5 and that runs flawlessly.

I built ADIOS myself following the instructions of the `dev` version of readthedocs. I am running, however, on the `0.4.2` version. (Might this already cause the error?) I installed c-blosc and adios as described in the manual.

The version output of `picongpu -v` is:

`ldd` shows no missing libraries.

My ADIOS OST setup for the run is:

Might the error be caused by the unspecified `adios.transport-params`? For the runs by @BeyondEspresso on the V100, this setup worked.