
BackTransformDiagnostics: position offset does not have a consistent number of macroparticles #3389

Closed
RemiLehe opened this issue Sep 12, 2022 · 31 comments · Fixed by #3657
Labels: bug, bug: affects latest release, component: boosted frame, component: diagnostics


@RemiLehe (Member)

When running this input file:
inputs.txt
it seems that the metadata of the BackTransformed diagnostics is inconsistent. For instance, running the following Python code:

from openpmd_viewer import OpenPMDTimeSeries
ts = OpenPMDTimeSeries('./diags/diag_btd/')
ts.get_particle(['x'], iteration=3)

results in the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-4c6f54023500> in <module>
----> 1 ts.get_particle(['x'], iteration=3)

~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/main.py in get_particle(self, var_list, species, t, iteration, select, plot, nbins, plot_range, use_field_mesh, histogram_deposition, **kw)
    272         data_list = []
    273         for quantity in var_list:
--> 274             data_list.append( self.data_reader.read_species_data(
    275                 iteration, species, quantity, self.extensions))
    276         # Apply selection if needed

~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/data_reader.py in read_species_data(self, iteration, species, record_comp, extensions)
    281                     filename, iteration, species, record_comp, extensions )
    282         elif self.backend == 'openpmd-api':
--> 283             return io_reader.read_species_data(
    284                     self.series, iteration, species, record_comp, extensions )
    285 

~/miniconda3/lib/python3.9/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py in read_species_data(series, iteration, species_name, component_name, extensions)
     87     if component_name in ['x', 'y', 'z']:
     88         offset = get_data(series, species['positionOffset'][component_name])
---> 89         data += offset
     90     # - Return momentum in normalized units
     91     elif component_name in ['ux', 'uy', 'uz' ]:

ValueError: operands could not be broadcast together with shapes (41886,) (41894,) (41886,) 
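
The mismatch is between the position dataset and the positionOffset record. A quick way to inspect the raw extents is with openpmd-api directly; a minimal sketch, assuming the species is named 'beam' and an ADIOS2 file pattern (adjust both as needed):

import openpmd_api as io

series = io.Series("./diags/diag_btd/openpmd_%T.bp", io.Access.read_only)
beam = series.iterations[3].particles["beam"]
# compare the extents that openpmd_viewer tries to broadcast together
n_pos = beam["position"]["x"].shape[0]
n_off = beam["positionOffset"]["x"].shape[0]
print(f"position/x: {n_pos}  positionOffset/x: {n_off}")  # here: 41886 vs. 41894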
ax3l added the labels bug, bug: affects latest release, component: diagnostics, and component: boosted frame on Sep 20, 2022
RTSandberg reopened this on Sep 23, 2022
@RTSandberg (Member)

I still see this issue running WarpX on Summit, even with the new fix:

Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 17:18:21) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from openpmd_viewer.addons import LpaDiagnostics
>>> ts = LpaDiagnostics('diags/diag_lab')
>>> ii = 20
>>> x, z, uz = ts.get_particle(species='beam', iteration=ts.iterations[ii], var_list = ['x', 'z','uz'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/main.py", line 274, in get_particle
    data_list.append( self.data_reader.read_species_data(
  File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/data_reader.py", line 283, in read_species_data
    return io_reader.read_species_data(
  File "/ccs/home/rsandberg/.conda/envs/opmd_movie/lib/python3.10/site-packages/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py", line 89, in read_species_data
    data += offset
ValueError: operands could not be broadcast together with shapes (49989,) (50000,) (49989,) 

I rebuilt with the latest branch as of today, 9/23/2022. The input file is attached:
inputs.txt

@RemiLehe (Member Author) commented Sep 26, 2022

Thanks for reporting this @RTSandberg.

I just tried the above script with a recent version of WarpX (git hash 6febc63b4d58) on 4 GPUs and was not able to reproduce the bug. (In other words, the code runs fine and I am able to run the above Python post-processing code without any error.)

Could you double check this issue by re-running it with the latest version of WarpX?

Never mind: I was actually able to reproduce this bug when running on 6 GPUs.

@RemiLehe (Member Author)

I took a closer look at this issue (by adding lots of Print statements!), and the results are perplexing:

  • It seems that openPMD does call resetDataset with size 50000 (not 49989) when it last touches iteration 20, based on adding the following Print statement
amrex::Print() << "SetupPos: size " << np << std::endl;

here: https://github.com/ECP-WarpX/WarpX/blob/development/Source/Diagnostics/WarpXOpenPMD.cpp#L985
and

amrex::Print() << "resetDataset " << comp << std::endl;

here: https://github.com/ECP-WarpX/WarpX/blob/development/Source/Diagnostics/WarpXOpenPMD.cpp#L992

  • It also seems that we do call storeChunk immediately afterwards.

@ax3l (Member) commented Sep 29, 2022

Let's try amrex::AllPrint() and see if all ranks reach the location of resetDataset and also call it with the same values.

@RemiLehe (Member Author)

Thanks @ax3l!
I just checked: all MPI ranks are indeed calling resetDataset with the right size (50000). I am still confused as to why the array does not seem to have the right size when reading it.

@RemiLehe (Member Author)

Note that this error still occurs even when turning off load-balancing by commenting out this line:

#algo.load_balance_intervals = 200

@RemiLehe (Member Author) commented Oct 3, 2022

It seems that when I do:

diag_lab.fields_to_plot = none

then the problem disappears.

However, I am a bit confused: why would turning off the fields fix the particle metadata?

@RevathiJambunathan (Member)

Thank you for this hint and test.
I wonder if, when we call Redistribute, the bounds set by the fields MultiFab are removing some particles.
I will test this and report back here.

@RemiLehe (Member Author) commented Oct 3, 2022

@RevathiJambunathan That's a good suggestion, but it does not seem that there is any issue with the Redistribute: as described above, adding Print statements shows that we are making the right openPMD calls with the right sizes.
It seems that the problem is at a lower level...

@RTSandberg (Member) commented Nov 2, 2022

I have another input deck that triggers this issue when running on 1 node, 6 GPUs on Summit:
ts = LpaDiagnostics('./diags/diag_btd/')
ts.get_particle(species='beam',var_list=['w','x'],iteration=ts.iterations[1])
Output:
[array([0.12483018, 0.12483018, 0.12483018, ..., 0.12483018, 0.12483018, 0.12483018]), array([ 6.60118937e-07, -3.75790632e-07, 6.18016430e-07, ..., 2.38995954e-07, -2.38995954e-07, -2.38995954e-07])]
ts.get_particle(species='beam',var_list=['w','x'],iteration=ts.iterations[2])
Output:
...

File ~/src/openPMD-viewer/openpmd_viewer/openpmd_timeseries/data_reader/io_reader/particle_reader.py:89, in read_species_data(series, iteration, species_name, component_name, extensions)
     87 if component_name in ['x', 'y', 'z']:
     88     offset = get_data(series, species['positionOffset'][component_name])
---> 89     data += offset
     90 # - Return momentum in normalized units
     91 elif component_name in ['ux', 'uy', 'uz' ]:

ValueError: operands could not be broadcast together with shapes (49992,) (50000,) (49992,) 

inputs.txt

@RemiLehe (Member Author)

On one of the production simulations, I just noticed that using:

<diag>.openpmd_backend = h5

circumvents the problem.

@RTSandberg Are you able to reproduce this observation with the above input deck?

@ax3l (Member) commented Nov 14, 2022

Thanks Remi!
@RTSandberg, as a work-around for ADIOS2, can you please check whether using two diagnostics (one with only particles, one with only fields) works as well?

I'll take this on in the second half of the week with the input set reported by Ryan:
#3389 (comment)

Update: I could not dig into this during the week; I will pick it up after my vacation.

@RemiLehe (Member Author) commented Nov 14, 2022

@ax3l Yes, that work-around did work in my case.
Thanks for the suggestion.

Here is the corresponding input script:
inputs.txt

So, in summary, there seem to be two potential work-arounds (sketched below):

  • splitting the diagnostics into a fields-specific and a particle-specific diagnostic
  • switching to HDF5
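
As a sketch, the corresponding inputs changes could look as follows; the diagnostic names here are illustrative, and parameter spellings should be checked against the WarpX documentation:

# work-around 1: split into a particle-only and a field-only BTD diagnostic
diagnostics.diags_names = diag_btd_particles diag_btd_fields
diag_btd_particles.diag_type = BackTransformed
diag_btd_particles.fields_to_plot = none
diag_btd_fields.diag_type = BackTransformed
diag_btd_fields.write_species = 0

# work-around 2: switch the openPMD backend of the combined diagnostic to HDF5
diag_btd.openpmd_backend = h5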

@RTSandberg (Member)

The HDF5 work-around works for me. At one time I thought I had a case where the split-diagnostic work-around didn't work, but at present it works in every case I have tried lately.

@RTSandberg (Member) commented Nov 30, 2022

I haven't observed this issue when using HDF5. However, with ADIOS the split-diagnostic work-around is not guaranteed; I still get incomplete particle metadata sometimes.

from openpmd_viewer.addons import LpaDiagnostics
btd = LpaDiagnostics('diags/diag_btd')
btd.get_particle(iteration=0, var_list=['id','x','y','z','ux','uy','uz'])

results in

ValueError: operands could not be broadcast together with shapes (49988,) (49992,) (49988,)

Here are the batch and input scripts for running on 1 node, 6 GPUs on Summit:
summit_batch.txt
inputs.txt

It is very interesting to note that this issue does not arise if the field BTD is removed, for example by changing the line
diagnostics.diags_names = diag diag_btd fld_btd
to
diagnostics.diags_names = diag diag_btd

I observed both behaviors on Perlmutter as well.

@RevathiJambunathan (Member)

That is indeed quite weird!
Thanks for the input file, @RTSandberg. I can take a look at this.

@ax3l (Member) commented Dec 16, 2022

I am running an even smaller version of the above inputs on my laptop: inputs.txt

cmake -S . -B build -DWarpX_PSATD=ON -DWarpX_DIMS=RZ
cmake --build build -j 3

This finishes after 512 steps but does not, e.g. for lab station 1, write all the meta-data for particles (e.g., charge & mass) properly, even though it does shut down cleanly.

That means WarpXOpenPMDPlot::SetConstParticleRecordsEDPIC is not properly called, which means the control variable is_last_flush_to_step is not properly tracked, which in turn means isLastBTDFlush is not passed to the I/O backend when it is time.

@RevathiJambunathan (Member)

Ah, I see, thanks. I can trace this with the input file you shared and see what's happening with isLastBTDFlush.

@RevathiJambunathan (Member) commented Dec 16, 2022

@ax3l I used your input file and ran this locally.

At the 512th timestep, the snapshots are not yet full, and hence isLastBTDFlush will not be set to 1.
This is also indicated by the warning at the beginning of the output, which suggests that the snapshots will only be full around the 2836th timestep.

**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* --> [!  ] [BTD] [raised twice]
*
*     Simulation might not run long enough to fill all BTD snapshots.
*     Final step: 512
*     Last BTD snapshot fills around step: 2836
*     @ Raised by: ALL
*
********************************************************************************

So I ran the simulation up to 2850 timesteps, and I am able to visualize the data using openPMD-viewer:
[image: openPMD-viewer plot of the lab-frame beam data]

@ax3l (Member) commented Dec 19, 2022

I think that's an orthogonal issue to fix: we should close out BTD steps (finalize particle meta-data and zero out partly filled fields) at simulation end.

With our restart strategy, we would copy the partly filled lab stations into checkpoints and keep them open (un-finalized) in case we want to continue.

For now, I will continue focusing on the above issue posted by Ryan.

@ax3l (Member) commented Dec 19, 2022

#3389 (comment)

@RTSandberg I cannot reproduce the issue with the latest openPMD-viewer + openPMD-api when running on CPU (it works without an error). I will try again on GPU on Summit...

@ax3l (Member) commented Dec 19, 2022

Can reproduce on Summit 👍

bpls -l diag_btd/openpmd_000000.bp
  uint64_t  /data/0/particles/beam/id          {49712} = 0 / 0
  double    /data/0/particles/beam/momentum/x  {49712} = 0 / 0
  double    /data/0/particles/beam/momentum/y  {49712} = 0 / 0
  double    /data/0/particles/beam/momentum/z  {49712} = 0 / 0
  double    /data/0/particles/beam/position/x  {49712} = 0 / 0
  double    /data/0/particles/beam/position/y  {49712} = 0 / 0
  double    /data/0/particles/beam/position/z  {49712} = 0 / 0
  double    /data/0/particles/beam/theta       {49712} = 0 / 0
  double    /data/0/particles/beam/weighting   {49712} = 0 / 0
bpls -A -l diag_btd/openpmd_000000.bp
  string    /basePath                                             attr   = "/data/%T/"
  uint8_t   /data/0/closed                                        attr   = 1
  double    /data/0/dt                                            attr   = 1
  uint32_t  /data/0/particles/beam/charge/macroWeighted           attr   = 0
  uint64_t  /data/0/particles/beam/charge/shape                   attr   = 49992
  float     /data/0/particles/beam/charge/timeOffset              attr   = 0
  double    /data/0/particles/beam/charge/unitDimension           attr   = {0, 0, 1, 1, 0, 0, 0}
  double    /data/0/particles/beam/charge/unitSI                  attr   = 1
  double    /data/0/particles/beam/charge/value                   attr   = -1.60218e-19
  double    /data/0/particles/beam/charge/weightingPower          attr   = 1
  string    /data/0/particles/beam/currentDeposition              attr   = "directMorseNielson"
  uint32_t  /data/0/particles/beam/id/macroWeighted               attr   = 0
  float     /data/0/particles/beam/id/timeOffset                  attr   = 0
  double    /data/0/particles/beam/id/unitDimension               attr   = {0, 0, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/id/unitSI                      attr   = 1
  double    /data/0/particles/beam/id/weightingPower              attr   = 0
  uint32_t  /data/0/particles/beam/mass/macroWeighted             attr   = 0
  uint64_t  /data/0/particles/beam/mass/shape                     attr   = 49992
  float     /data/0/particles/beam/mass/timeOffset                attr   = 0
  double    /data/0/particles/beam/mass/unitDimension             attr   = {0, 1, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/mass/unitSI                    attr   = 1
  double    /data/0/particles/beam/mass/value                     attr   = 9.10938e-31
  double    /data/0/particles/beam/mass/weightingPower            attr   = 1
  uint32_t  /data/0/particles/beam/momentum/macroWeighted         attr   = 0
  float     /data/0/particles/beam/momentum/timeOffset            attr   = 0
  double    /data/0/particles/beam/momentum/unitDimension         attr   = {1, 1, -1, 0, 0, 0, 0}
  double    /data/0/particles/beam/momentum/weightingPower        attr   = 1
  double    /data/0/particles/beam/momentum/x/unitSI              attr   = 1
  double    /data/0/particles/beam/momentum/y/unitSI              attr   = 1
  double    /data/0/particles/beam/momentum/z/unitSI              attr   = 1
  string    /data/0/particles/beam/particleInterpolation          attr   = "momentumConserving"
  string    /data/0/particles/beam/particlePush                   attr   = "Vay"
  double    /data/0/particles/beam/particleShape                  attr   = 3
  double    /data/0/particles/beam/particleShapes                 attr   = {3, 3}
  string    /data/0/particles/beam/particleSmoothing              attr   = "none"
  uint32_t  /data/0/particles/beam/position/macroWeighted         attr   = 0
  float     /data/0/particles/beam/position/timeOffset            attr   = 0
  double    /data/0/particles/beam/position/unitDimension         attr   = {1, 0, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/position/weightingPower        attr   = 0
  double    /data/0/particles/beam/position/x/unitSI              attr   = 1
  double    /data/0/particles/beam/position/y/unitSI              attr   = 1
  double    /data/0/particles/beam/position/z/unitSI              attr   = 1
  uint32_t  /data/0/particles/beam/positionOffset/macroWeighted   attr   = 0
  float     /data/0/particles/beam/positionOffset/timeOffset      attr   = 0
  double    /data/0/particles/beam/positionOffset/unitDimension   attr   = {1, 0, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/positionOffset/weightingPower  attr   = 0
  uint64_t  /data/0/particles/beam/positionOffset/x/shape         attr   = 49992
  double    /data/0/particles/beam/positionOffset/x/unitSI        attr   = 1
  double    /data/0/particles/beam/positionOffset/x/value         attr   = 0
  uint64_t  /data/0/particles/beam/positionOffset/y/shape         attr   = 49992
  double    /data/0/particles/beam/positionOffset/y/unitSI        attr   = 1
  double    /data/0/particles/beam/positionOffset/y/value         attr   = 0
  uint64_t  /data/0/particles/beam/positionOffset/z/shape         attr   = 49992
  double    /data/0/particles/beam/positionOffset/z/unitSI        attr   = 1
  double    /data/0/particles/beam/positionOffset/z/value         attr   = 0
  uint32_t  /data/0/particles/beam/theta/macroWeighted            attr   = 0
  float     /data/0/particles/beam/theta/timeOffset               attr   = 0
  double    /data/0/particles/beam/theta/unitDimension            attr   = {0, 0, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/theta/unitSI                   attr   = 1
  double    /data/0/particles/beam/theta/weightingPower           attr   = 0
  uint32_t  /data/0/particles/beam/weighting/macroWeighted        attr   = 1
  float     /data/0/particles/beam/weighting/timeOffset           attr   = 0
  double    /data/0/particles/beam/weighting/unitDimension        attr   = {0, 0, 0, 0, 0, 0, 0}
  double    /data/0/particles/beam/weighting/unitSI               attr   = 1
  double    /data/0/particles/beam/weighting/weightingPower       attr   = 1
  double    /data/0/time                                          attr   = 0
  double    /data/0/timeUnitSI                                    attr   = 1
  string    /date                                                 attr   = "2022-12-19 10:15:36 -0500"
  string    /iterationEncoding                                    attr   = "fileBased"
  string    /iterationFormat                                      attr   = "openpmd_%06T"
  string    /meshesPath                                           attr   = "fields/"
  string    /openPMD                                              attr   = "1.1.0"
  uint32_t  /openPMDextension                                     attr   = 1
  string    /particlesPath                                        attr   = "particles/"
  string    /software                                             attr   = "WarpX"
  string    /softwareVersion                                      attr   = "22.12-7-gc3eb6ea1efef"
  uint64_t  __openPMD_internal/openPMD2_adios2_schema             attr   = 0
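
Note the inconsistency above: the particle datasets have extent 49712, while the constant-record shape attributes (charge, mass, positionOffset) say 49992. Such a check can be automated over all records; a minimal sketch with openpmd-api, again assuming the species name 'beam':

import openpmd_api as io

series = io.Series("diag_btd/openpmd_%T.bp", io.Access.read_only)
for index, it in series.iterations.items():
    beam = it.particles["beam"]
    # extent of every record component, including constant records such as charge/mass
    extents = {f"{rname}/{cname}": comp.shape[0]
               for rname, rec in beam.items()
               for cname, comp in rec.items()}
    if len(set(extents.values())) > 1:
        print(f"iteration {index}: inconsistent extents: {extents}")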

@ax3l (Member) commented Dec 19, 2022

Since this shows up in the first lab station, we can simplify to:

...
diagnostics.diags_names = diag_btd fld_btd  # was: diag diag_btd fld_btd
...
diag_btd.num_snapshots_lab = 1  # was: 100
fld_btd.num_snapshots_lab = 1  # was: 100
...

and run faster: inputs.txt

The problem disappears if I remove fld_btd, so there is some side-effect going on here.

It seems important to run with 6 MPI ranks (and/or with CUDA) to trigger the issue with these inputs.

@RTSandberg (Member)

I am generally able to use the particle BTD in RZ with ADIOS as the openPMD backend without issue, as long as I don't also write field output (i.e., I set fields_to_plot = none). I just found an instance where this configuration still leads to inconsistent numbers of macroparticles. It is difficult to reproduce and this is not a great example; I record it here just as evidence that the ADIOS + fields_to_plot = none work-around is not guaranteed to work.

inputs.txt

@RTSandberg (Member)

I met with @RemiLehe and @n01r to discuss the issue further. We realized that it is not necessarily the metadata that is inconsistent, but that the particle arrays themselves are corrupted. For example, if we run

for ii, iteration in enumerate(ts_l.iterations):
    beam_ws, = ts_l.get_particle(iteration=iteration, var_list=['w'], species='beam')
    print(f'iteration {iteration}: # particles = {len(beam_ws)}')

we get the output

iteration 0: # particles = 15239
iteration 1: # particles = 49478
iteration 2: # particles = 50000
iteration 3: # particles = 50000
iteration 4: # particles = 50000
iteration 5: # particles = 50000
iteration 6: # particles = 50000
iteration 7: # particles = 1375

input and batch scripts:
batch.txt
inputs.txt

@ax3l (Member) commented Jan 18, 2023

Yes, that is correct. I forgot to post this here and only mentioned it on Slack:
In my debugging, for one corrupted BTD step that I saw, the last resizing and storeChunk calls for particle data are made, but the data does not end up on disk.

That means the meta-data that is set in that last append of particles is correct, but the variable (the data) is too small.
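
This is consistent with the general openpmd-api contract: storeChunk only registers a write, and nothing is committed until the next flush. A minimal, WarpX-independent sketch of that contract (using a mesh record for brevity):

import numpy as np
import openpmd_api as io

series = io.Series("flushTest_%T.h5", io.Access.create)
rho = series.iterations[0].meshes["rho"][io.Mesh_Record_Component.SCALAR]

data = np.arange(100, dtype=np.float64)
rho.reset_dataset(io.Dataset(data.dtype, data.shape))
rho.store_chunk(data)  # only registers the write; the buffer must stay alive ...
series.flush()         # ... until this flush, which actually commits the chunk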

@guj (Contributor) commented Jan 19, 2023

Since this shows up in the first lab station, we can simplify to:

...
diagnostics.diags_names = diag_btd fld_btd  # was: diag diag_btd fld_btd
...
diag_btd.num_snapshots_lab = 1  # was: 100
fld_btd.num_snapshots_lab = 1  # was: 100
...

and run faster: inputs.txt

The problem disappears if I remove fld_btd, so there is some side-effect going on here.

Seems important to run it from 6 MPI ranks (and/or with CUDA) to trigger with the inputs.

I used this input file, and I saw that the number of particles is consistent:

../bpls -l diag_btd/openpmd_000000.bp
uint64_t /data/0/particles/beam/id {49992} = 1 / 50000
double /data/0/particles/beam/momentum/x {49992} = -1.17892e-19 / 1.17892e-19
double /data/0/particles/beam/momentum/y {49992} = -1.10237e-19 / 1.10237e-19
double /data/0/particles/beam/momentum/z {49992} = 4.15973e-19 / 6.32712e-19
double /data/0/particles/beam/position/x {49992} = -7.29641e-05 / 7.29641e-05
double /data/0/particles/beam/position/y {49992} = -8.36205e-05 / 8.36205e-05
double /data/0/particles/beam/position/z {49992} = -0.000179486 / -1.4209e-05
double /data/0/particles/beam/theta {49992} = -3.14159 / 3.14159
double /data/0/particles/beam/weighting {49992} = 0.12483 / 0.12483

../bpls -A -l diag_btd/openpmd_000000.bp |grep shape
uint64_t /data/0/particles/beam/charge/shape attr = 49992
uint64_t /data/0/particles/beam/mass/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/x/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/y/shape attr = 49992
uint64_t /data/0/particles/beam/positionOffset/z/shape attr = 49992

@ax3l (Member) commented Jan 31, 2023

Testing a fix in #3657

@ax3l (Member) commented Feb 1, 2023

Idea for a fast test case of the issue, if we need it again (a sketch follows the list):

  • simulation box filled to 1/3 with particles: the lower 2/3 has no particles
  • 2 MPI ranks, distributed along x
  • BTD diagnostics for particles
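
A hedged sketch of such a deck, under the assumption that a boosted-frame setup with a single particle-only BTD snapshot suffices (all names and values below are illustrative, not a tested input file):

max_step = 100
amr.n_cell = 32 32 64
geometry.dims = 3
geometry.prob_lo = -20.e-6 -20.e-6 -40.e-6
geometry.prob_hi =  20.e-6  20.e-6  20.e-6
warpx.numprocs = 2 1 1                    # 2 MPI ranks, distributed along x
warpx.gamma_boost = 10.
warpx.boost_direction = z

particles.species_names = beam
beam.species_type = electron
beam.injection_style = NUniformPerCell
beam.num_particles_per_cell_each_dim = 1 1 1
beam.profile = constant
beam.density = 1.e20
beam.zmin = 0.                            # fill only the upper 1/3 of the box in z;
beam.zmax = 20.e-6                        # the lower 2/3 has no particles
beam.momentum_distribution_type = at_rest

diagnostics.diags_names = diag_btd
diag_btd.diag_type = BackTransformed
diag_btd.num_snapshots_lab = 1
diag_btd.dt_snapshots_lab = 1.e-13
diag_btd.fields_to_plot = none            # particles only
diag_btd.format = openpmd
diag_btd.openpmd_backend = bp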

@pnorbert commented Feb 3, 2023

Please note that ADIOS2 master now has support for Joined Arrays (in BP4 only, for now), where you don't need to specify Shape on the writing side (nor Starts), and the reader will put all blocks into a Global Array. This is basically Local Arrays whose blocks are joined together in one dimension to form a Global Array, which is a good match for this use case.
ornladios/ADIOS2#3466

@ax3l (Member) commented Feb 4, 2023

Thanks a lot, Norbert, I'll check this out next week!
