P3 lookup text file is being read by all MPI ranks -- can cause issue with filesystems #6654

Open
ndkeen opened this issue Oct 1, 2024 · 12 comments
Labels
eam EAMxx PRs focused on capabilities for EAMxx help wanted input file inputdata Changes affecting inputdata collection on blues Machine Files pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments

@ndkeen
Contributor

ndkeen commented Oct 1, 2024

For cases that use P3 (note I assumed there were such cases in E3SM, but I'm not currently finding any in the set I've been testing...), we are reading a small text file in a way that does not scale in parallel: every MPI rank reads the same file.
I was surprised to find we are doing this; surely it was a mistake, as this is never a good idea.
While the file is small, it still causes issues with the filesystems, and NERSC admins are noticing.
It could also cause a slowdown (or even a stall/hang).

I have been testing a quick fix that has rank 0 read the file and broadcast the data to the other ranks. It seems to be BFB (bit-for-bit), but it will need work to implement properly.
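For illustration only, the read-on-one-rank-then-broadcast pattern might look like the sketch below. This is not the code from the branch under test. It assumes the table is plain whitespace-separated floats, and the names `read_table_on_root`, `get_comm`, and `SerialComm` are hypothetical. It uses mpi4py's `Get_rank` and `bcast` when mpi4py is installed and falls back to a single-rank stand-in otherwise.

```python
import os
import tempfile


class SerialComm:
    """Single-rank stand-in (hypothetical) so the sketch runs without MPI."""

    def Get_rank(self):
        return 0

    def bcast(self, obj, root=0):
        return obj


def get_comm():
    """Return MPI.COMM_WORLD if mpi4py is available, else the stand-in."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD
    except ImportError:
        return SerialComm()


def read_table_on_root(path, comm=None, root=0):
    """Parse the file on `root` only; every rank returns the same list."""
    if comm is None:
        comm = get_comm()
    values = None
    if comm.Get_rank() == root:
        # Only the root rank touches the filesystem.
        with open(path) as f:
            values = [float(tok) for line in f for tok in line.split()]
    # Collective call: non-root ranks receive root's parsed values.
    return comm.bcast(values, root=root)


if __name__ == "__main__":
    comm = get_comm()
    # Tiny demo file (made-up contents), written by rank 0 only.
    path = os.path.join(tempfile.gettempdir(), "demo_lookup_table.dat")
    if comm.Get_rank() == 0:
        with open(path, "w") as f:
            f.write("1.0 2.5\n3.0e-2 4.0\n")
    table = read_table_on_root(path, comm)
    print(f"rank {comm.Get_rank()} holds {len(table)} values")
```

Because only the root rank opens the file, the number of open/read requests reaching the filesystem stays constant as the rank count grows. The broadcast moves the data over the interconnect instead.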

NERSC has even suggested we move our inputdata from CFS (which uses DVS) to scratch (Lustre).
They have said we can have scratch space that is not purged for this purpose.
In general, I've been testing the performance of reading from scratch and it seems about the same, but if the sole reason for moving is to avoid complications such as this, hopefully we can just fix the code.

I also made an issue in scream (will link), as the same problem exists there, but the implementation of the fix may be slightly different.

@ndkeen ndkeen added Machine Files inputdata Changes affecting inputdata collection on blues pm-cpu Perlmutter at NERSC (CPU-only nodes) input file labels Oct 1, 2024
@mahf708
Contributor

mahf708 commented Oct 1, 2024

xref #5953. I hope we can fix that too when we fix this. In general, I find the P3 table business to be quite complicated, and it has caused a few issues for me and others.

@mahf708 mahf708 added EAMxx PRs focused on capabilities for EAMxx eam labels Oct 1, 2024
@ndkeen
Contributor Author

ndkeen commented Oct 1, 2024

Is this data something that could be placed directly on the heap at init time?

@mahf708
Contributor

mahf708 commented Oct 1, 2024

Yes; also my understanding is that the whole table could actually be calculated at runtime ...

@ndkeen
Contributor Author

ndkeen commented Oct 1, 2024

Oh, that would be even better. The file in question is, I think, this one. It's 3.2MB and 31k lines (of text floats).

perlmutter-login37% ls -lh /global/cfs/cdirs/e3sm/inputdata/atm/scream/tables/p3_lookup_table_1.dat-v4.1.1
-rw-rw-r--+ 1 ndk ndk 3.2M Sep 28 11:37 /global/cfs/cdirs/e3sm/inputdata/atm/scream/tables/p3_lookup_table_1.dat-v4.1.1
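To see why a file this small can still matter, here is a rough back-of-the-envelope sketch of the aggregate read volume when every rank reads the file versus when only one rank does. The rank counts are hypothetical job sizes, not measurements, and the calculation ignores caching and any filesystem-specific behavior.

```python
def aggregate_read_bytes(file_bytes, n_ranks, root_only=False):
    """Total bytes requested from the filesystem for one table read."""
    readers = 1 if root_only else n_ranks
    return file_bytes * readers


FILE_BYTES = 3.2 * 1024**2  # the 3.2M reported by `ls -lh`

for n_ranks in (128, 1024, 8192):  # hypothetical job sizes
    every = aggregate_read_bytes(FILE_BYTES, n_ranks)
    one = aggregate_read_bytes(FILE_BYTES, n_ranks, root_only=True)
    print(
        f"{n_ranks:5d} ranks: {every / 1024**3:6.2f} GiB if every rank reads, "
        f"{one / 1024**2:.1f} MiB if only one rank reads"
    )
```

The number of simultaneous open requests grows the same way, which is likely the more relevant load for a small file.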

@erinethomas
Contributor

erinethomas commented Oct 3, 2024

A very similar issue is occurring with E3SM+WW3 coupled simulations on Perlmutter. It takes over 45 minutes to initialize the wave model with the current settings. The time WW3 spends reading the input file also scales linearly with the number of nodes requested: double the number of nodes = twice the initialization time. This is very problematic. The only solution I have found is moving the file that WAVEWATCHIII reads out of the CFS directory.

@ndkeen
Contributor Author

ndkeen commented Oct 4, 2024

Erin: For this issue, I just want to find a solution for a specific place in the code where it's clear we are having all ranks read the same file. In general, that's a bad idea, but in practice we might only run into problems with large MPI jobs. It sounds like what you are seeing is general slowness of reading files from CFS, which could be caused by different things, though if you do know that all ranks are reading it serially, you certainly want to fix that. Reading from scratch would still have the same problem. It might be better to create a new issue to describe the problem you are seeing and how to reproduce it.

@erinethomas
Contributor

hi Noel - Thanks for the feedback - I can make a new issue. Reading the file from scratch does NOT seem to have the same issue. so it seems to be just CFS.

@mahf708
Contributor

mahf708 commented Oct 4, 2024

hi Noel - Thanks for the feedback - I can make a new issue. Reading the file from scratch does NOT seem to have the same issue. so it seems to be just CFS.

I too would be very curious to see your example, so I hope you could open an issue and point us to the code and reproducer. I assume you don't see any issue on chrysalis or any other hpc, right?

@mahf708
Contributor

mahf708 commented Oct 4, 2024

@erinethomas when you have a chance, could you please test with the file on CFS but change the type to cdf5? Long story short, I had a similar issue (but for me, it was getting completely stuck reading the file), and when I moved the file from CFS to SCRATCH, it worked. When I changed the file type from classic to cdf5, it also worked.

@erinethomas
Contributor

@mahf708 - the file is an ASCII text file... I will open a new issue soon to discuss the specifics further.

@sarats
Member

sarats commented Oct 8, 2024

NERSC has even suggested we move our inputdata from CFS (which uses DVS) to scratch (Lustre).
They have said we can have scratch space that is not purged for this purpose.

I think this is good to do as Lustre is better suited for this purpose.

@ndkeen
Contributor Author

ndkeen commented Nov 14, 2024

A branch that reads the P3 lookup table on one rank and broadcasts it to the others is here:
ndk/p3/read-txt-table-with-1rank


4 participants