
Data partitioning mechanism used in E3SM #6701

Open
wkliao opened this issue Oct 18, 2024 · 15 comments

@wkliao
Member

wkliao commented Oct 18, 2024

I am a developer of PnetCDF and have created the E3SM-IO benchmark to study the
I/O performance of E3SM on DOE parallel computers. The benchmark has also been
used to help Scorpio improve its design.

However, E3SM-IO currently includes only 3 cases, namely the F, G, and I cases.
All three require an input file containing pre-generated data partitioning patterns,
i.e. file offset-length pairs per MPI process (such offset-length pairs are also
referred to as 'decomposition maps' in Scorpio). Relying on pre-generated
map files prevents testing other E3SM problem domain sizes.

I wonder if I can obtain some assistance from the E3SM team to help me understand
the data partitioning mechanism used in the E3SM codes. My goal is to add a function to
E3SM-IO that can generate the decomposition maps at run time, given any
problem domain size.

@philipwjones
Contributor

@wkliao Not sure you really want to take this on, but... I have code to do online partitioning of the ocean meshes (and also code to compute the offsets for parallel I/O) that we wrote for the new Omega model and that matches the MPAS-Ocean decompositions. It requires linking to and calling Metis routines. The cells are partitioned first, and the edge/vertex locations are partitioned based on the cell partitioning. The MPAS-seaice partitioning is more complicated because they do some further balancing after the Metis partitioning, and I can't help with that. You'd end up with a good-sized chunk of code to maintain.
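
For readers unfamiliar with the Metis step described above, the sketch below shows roughly what such a cell partitioner could look like. This is not Omega's code: the CSR arrays xadj/adjncy are assumed to come from a graph.info-style cell adjacency list, and the edge/vertex ownership rule is only indicated in a comment.

/* Sketch only: partition ocean cells with Metis, then let edges/vertices
 * inherit ownership from an adjacent cell, as described above.
 * xadj/adjncy hold the cell adjacency graph in CSR form. */
#include <metis.h>
#include <stdio.h>

int partition_cells(idx_t ncells, idx_t *xadj, idx_t *adjncy,
                    idx_t nparts, idx_t *part /* out: owner of each cell */)
{
    idx_t ncon = 1;                      /* one balance constraint: cell count */
    idx_t objval;                        /* edge cut returned by Metis */
    idx_t options[METIS_NOPTIONS];
    METIS_SetDefaultOptions(options);

    int rc = METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                                 NULL, NULL, NULL,    /* no weights */
                                 &nparts, NULL, NULL, /* uniform targets */
                                 options, &objval, part);
    if (rc != METIS_OK) fprintf(stderr, "Metis returned %d\n", rc);
    return rc;
}

/* Hypothetical follow-up: owner(edge) = part[cellsOnEdge[edge][0]], i.e. an
 * edge goes to the owner of its first adjacent cell. */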

@rljacob
Member

rljacob commented Oct 21, 2024

Are the "pre-generated data partitioning patterns" you refer to files like "mpas-o.graph.info.230422.part"? Because an I case should not require any of those.

@wkliao
Member Author

wkliao commented Oct 21, 2024

Hi, @rljacob
The "data partitioning patterns" I was referring to are I/O patterns, which
describe the file offsets each MPI process writes to. PIO uses the term
decomposition.
A decomposition contains P lists of array indices, where P is the number
of MPI processes. Given the array sizes, those indices can be transformed
into file offsets.
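
As a toy illustration of that last step, a helper like the hypothetical indices_to_offsets below turns one rank's list of 1-based linearized array indices into byte offsets, assuming a fixed-size variable whose data section begins at a known file offset:

/* Toy sketch (not Scorpio or PnetCDF code): map decomposition indices to
 * byte offsets for a fixed-size variable. */
void indices_to_offsets(long long var_begin,      /* file offset of the variable's data */
                        int elem_size,            /* bytes per element */
                        const long long *idx,     /* 1-based linearized indices */
                        int nidx,
                        long long *off)           /* out: byte offsets in the file */
{
    for (int i = 0; i < nidx; i++)
        off[i] = var_begin + (idx[i] - 1) * (long long)elem_size;
}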

My understanding of the I case is that there are 5 such I/O decompositions and
a total of 560 variables. 546 of them are partitioned among all MPI processes,
each using one of the 5 decompositions. Those decompositions were provided
to me by the Scorpio developer team as inputs to my E3SM-IO benchmark.
I was told those decompositions were created as a byproduct of an E3SM
production run.

Hi, @philipwjones
If I understand correctly, the domain decomposition in MPAS-Ocean takes an
input in graph form and partitions it among MPI processes. I assume the
partitioned subgraphs are then embedded into a 2D Euclidean space. The file
offsets are generated from this transformed Euclidean space, because all
variables stored in the output netCDF files are multi-dimensional arrays.

I am hoping to be able to generate such file offset lists through a library
function call (an on-line approach). I am also fine with a utility program that can
generate the file offsets into a file off-line.

I do not intend to obtain a library or utility program that can generate the exact
decompositions used by all E3SM models. Instead, I am looking for a mechanism
that can generate a similar degree of file offset non-contiguity, given a problem
size and the number of MPI processes. Depending on E3SM production runs
to create the decompositions is not feasible, as the map files can be too big for
large-scale runs.

I have learned that in E3SM the file offsets written by each process are highly
non-contiguous, which makes it very challenging to achieve good I/O
performance. My goal is to investigate the performance issues in the I/O
software stack (PnetCDF and MPI-IO) and develop better I/O solutions for
E3SM and its related software.
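
To make that access pattern concrete, here is a small self-contained sketch (a made-up toy decomposition, not one from E3SM) in which every rank writes many one-element, non-contiguous segments of a single variable through PnetCDF's varn API in one collective call:

/* Toy example: rank r owns elements r, r+4*nprocs, r+8*nprocs, ... of a 1-D
 * variable, i.e. many tiny, widely separated segments in the file. */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimid, varid, nreq = 64;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Offset len = (MPI_Offset)nprocs * nreq * 4;
    ncmpi_create(MPI_COMM_WORLD, "toy.nc", NC_CLOBBER|NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "n", len, &dimid);
    ncmpi_def_var(ncid, "v", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    MPI_Offset **starts = malloc(nreq * sizeof(MPI_Offset*));
    MPI_Offset **counts = malloc(nreq * sizeof(MPI_Offset*));
    double *buf = malloc(nreq * sizeof(double));
    for (int i = 0; i < nreq; i++) {
        starts[i] = malloc(sizeof(MPI_Offset));
        counts[i] = malloc(sizeof(MPI_Offset));
        starts[i][0] = rank + (MPI_Offset)i * nprocs * 4;  /* scattered offsets */
        counts[i][0] = 1;                                  /* one element each */
        buf[i] = (double)rank;
    }
    ncmpi_put_varn_double_all(ncid, varid, nreq, starts, counts, buf);
    ncmpi_close(ncid);

    for (int i = 0; i < nreq; i++) { free(starts[i]); free(counts[i]); }
    free(starts); free(counts); free(buf);
    MPI_Finalize();
    return 0;
}

Behind a call like this, PnetCDF flattens the requests and MPI-IO's collective buffering redistributes the data to a few aggregator processes, which is where the non-contiguity hurts.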

@dqwu
Contributor

dqwu commented Oct 21, 2024

@wkliao
It seems that the E3SM code uses a method called 'Cubed Sphere' to generate decomposition maps for F cases. So, you would like to extract this algorithm from E3SM and directly use it in your E3SM-IO program to generate the same decomposition maps internally. With this approach, E3SM-IO would no longer need to read large decomposition map files. Is my understanding correct?
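
To illustrate the kind of run-time generator being discussed (this is not the actual EAM algorithm; my understanding is that EAM lays elements out along a space-filling curve, which is what produces the non-contiguous offsets), a sketch that only gets the column count right and uses a plain block distribution as a placeholder might look like:

/* Sketch only. For a cubed-sphere spectral-element grid with ne elements per
 * cube edge and np GLL points per element edge, the number of unique physics
 * columns is ncol = 6*ne*ne*(np-1)*(np-1) + 2 (e.g. ne=30, np=4 -> 48602).
 * A block distribution stands in for the real layout. */
#include <stdlib.h>

long long *gen_column_map(int ne, int np, int nprocs, int rank, int *nlocal)
{
    long long ncol  = 6LL * ne * ne * (np - 1) * (np - 1) + 2;
    long long chunk = ncol / nprocs, rem = ncol % nprocs;
    long long start = rank * chunk + (rank < rem ? rank : rem);
    *nlocal = (int)(chunk + (rank < rem ? 1 : 0));

    long long *map = malloc(*nlocal * sizeof(long long));
    for (int i = 0; i < *nlocal; i++)
        map[i] = start + i + 1;          /* 1-based indices, Scorpio-style */
    return map;
}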

@jayeshkrishna
Contributor

jayeshkrishna commented Oct 21, 2024

@wkliao : Different model components in E3SM use different grids (which result in different I/O decompositions and I/O patterns) for their computation. Maybe we can start by adding support for the ATM/EAM grid (cubed-sphere grid) in the E3SM I/O benchmark code first and then proceed with other components (it might be tricky with some components that use other software to efficiently partition the grids across processes). Also note that within the same component the model variables can be written out with different decompositions, so it is not enough to just capture the grid; any load-balancing mechanism used must be captured as well.

How soon do you need these changes in the benchmark? We can work on enhancing the benchmark to support/simulate different grids. But, as @philipwjones noted above, please keep in mind that this would increase the amount and complexity of the code in the I/O benchmark.

@jayeshkrishna
Contributor

Another thing to keep in mind is that, from the E3SM perspective, we are mostly interested in I/O performance for a limited set of configuration settings (grid resolution/configuration, number of MPI processes, model output configuration), which can currently be analyzed by reading the I/O decomposition files dumped out by SCORPIO (something the E3SM I/O benchmark tool already supports).

@wkliao
Member Author

wkliao commented Oct 21, 2024

@jayeshkrishna
Starting with the ATM/EAM grid sounds fine to me.

@dqwu and I have been investigating an MPI-IO hang we encountered on
Aurora, Frontier, and Perlmutter when running the F case on 21600 processes.
I think the problem can also occur at other run scales and problem sizes,
but without a map generator we could not test those cases.

@jayeshkrishna
Contributor

OK. For now, to debug the hang issue, @dqwu can you generate the I/O decomposition maps with a lower-resolution run (ne256?) and use them with the I/O benchmark code?

@dqwu
Contributor

dqwu commented Oct 21, 2024

@wkliao Since your alltomany.c test can indeed reproduce the hanging issue on Aurora (with len set to 1200), I think maybe only the F case dataset (with a large number of small non-contiguous sub-array requests) of E3SM-IO can reproduce the hanging issue.

@dqwu
Contributor

dqwu commented Oct 21, 2024

@wkliao Also, since your column_wise.c can also reproduce the hanging issue with a large len arg, I think you can use it as a special map generator as well.

@dqwu
Contributor

dqwu commented Oct 21, 2024

OK. For now, to debug the hang issue, @dqwu can you generate the I/O decomposition maps with a lower-resolution run (ne256?) and use them with the I/O benchmark code?

@jayeshkrishna To debug the hang issue, we already have some pure MPI programs (with or without I/O) as simple reproducers.

@wkliao
Member Author

wkliao commented Oct 21, 2024

My guess is that the hang may also happen at other problem sizes or run scales.
Further testing is needed to know, which makes the generator more critical.

Based on my understanding of ROMIO, I developed 'alltomany.c' and
'column_wise.c' to create a communication pattern close to the one resulting
from running the F case, but they are not exactly the same. I hope the fix
(by the libfabric folks) for the two test programs will also fix the
real I/O of E3SM.
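
For context, here is a rough sketch of the communication pattern I take alltomany.c to exercise (this is a paraphrase, not the actual test): every rank posts nonblocking sends to a small set of 'aggregator' ranks, the aggregators post matching receives from every rank, and all ranks then block in MPI_Waitall, the call the hang is observed in. The one-aggregator-per-104-ranks spacing is an assumption, not taken from the test.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nprocs, len = 1200;                        /* bytes per message */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int naggr   = (nprocs / 104 > 0) ? nprocs / 104 : 1; /* number of aggregators */
    int stride  = nprocs / naggr;                        /* aggregator spacing */
    int is_aggr = (rank % stride == 0 && rank / stride < naggr);

    char *sbuf = calloc((size_t)naggr, len);             /* one message per aggregator */
    char *rbuf = is_aggr ? calloc((size_t)nprocs, len) : NULL;

    int nreq = naggr + (is_aggr ? nprocs : 0), k = 0;
    MPI_Request *req = malloc(nreq * sizeof(MPI_Request));

    if (is_aggr)                                         /* aggregators receive from everyone */
        for (int i = 0; i < nprocs; i++)
            MPI_Irecv(rbuf + (size_t)i * len, len, MPI_BYTE, i, 0,
                      MPI_COMM_WORLD, &req[k++]);
    for (int i = 0; i < naggr; i++)                      /* everyone sends to each aggregator */
        MPI_Isend(sbuf + (size_t)i * len, len, MPI_BYTE, i * stride, 0,
                  MPI_COMM_WORLD, &req[k++]);

    MPI_Waitall(k, req, MPI_STATUSES_IGNORE);            /* the reported hang is here */

    free(sbuf); free(rbuf); free(req);
    MPI_Finalize();
    return 0;
}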

@dqwu
Contributor

dqwu commented Oct 21, 2024

My guess is that the hang may also happen at other problem sizes or run scales. Further testing is needed to know, which makes the generator more critical.

Based on my understanding of ROMIO, I developed 'alltomany.c' and 'column_wise.c' to create a communication pattern close to the one resulting from running the F case, but they are not exactly the same. I hope the fix (by the libfabric folks) for the two test programs will also fix the real I/O of E3SM.

@wkliao Also note that since the BOX rearranger is used by default (the PnetCDF library receives contiguous sub-array requests from SCORPIO), the real I/O of E3SM might not be able to reproduce this hanging issue.
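
For contrast with the varn-style sketch earlier in this thread, with the BOX rearranger each I/O task ends up issuing roughly one contiguous sub-array request per variable, along the lines of the illustration below (my sketch, not SCORPIO code; the 1-D layout and block ownership are assumptions):

#include <pnetcdf.h>

/* One contiguous block per I/O task: this rank owns
 * [block_start, block_start + block_len) of a 1-D variable. */
int write_contig_block(int ncid, int varid, MPI_Offset block_start,
                       MPI_Offset block_len, const double *buf)
{
    MPI_Offset start[1] = { block_start };
    MPI_Offset count[1] = { block_len };
    return ncmpi_put_vara_double_all(ncid, varid, start, count, buf);
}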

@ndkeen
Contributor

ndkeen commented Oct 22, 2024

Is there an issue open for the hang? This would at least be an easy experiment to see if it makes a difference:
#6687

@dqwu
Contributor

dqwu commented Oct 22, 2024

Is there an issue open for the hang? This would at least be an easy experiment to see if it makes a difference: #6687

@ndkeen The hang is reproducible on Aurora but not on Perlmutter, and I have created a ticket (RITM0381937) with ALCF.

Steps to reproduce the hang on Aurora:

[Download the reproduction code]
wget https://raw.githubusercontent.com/Parallel-NetCDF/E3SM-IO/refs/heads/one_map/tests/alltomany.c

[Compile the program]
mpicc alltomany.c

[Create a job script]
Create a PBS job script (e.g., test_alltomany_pbs.sh) as follows:

#!/bin/bash -e

#PBS -A XXXXXX
#PBS -N test_alltomany_21600p
#PBS -l select=208
#PBS -l walltime=0:05:00
#PBS -k doe
#PBS -l place=scatter
#PBS -q EarlyAppAccess

export FI_MR_CACHE_MONITOR=disabled

export FI_UNIVERSE_SIZE=21600

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

cd $PBS_O_WORKDIR

date

mpiexec -np 21600 -ppn 104 ./a.out -v -s -n 1 -l 1200 -r 104 -m 128

date

[Submit the job]
qsub test_alltomany_pbs.sh

The hang occurs in an MPI_Waitall call. Changing FI_MR_CACHE_MONITOR from disabled to kdreg2 does not help either.
