Corrupted data using parallel hdf5 #12718

tpadioleau · 2024-07-27T14:21:10Z

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Using spack 0.22.1

Please describe the system on which you are running

Operating system/version: Ubuntu 20.04
Computer hardware: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Details of the problem

I am using parallel hdf5 to write a 2D distributed array. If I pass a cartesian communicator to hdf5, I sometimes notice that the dataset in the hdf5 file is corrupted when using 3 processes. You can find attached (hdf5_reproducer.tar.gz) a small reproducer in C (< 100 LOC) with a hdf5 file I got running the reproducer. You will also find the result of the ompi_info command.

Without understanding the logic behind it, I also noticed several situations where I never seem to get corrupted data:

- requiring MPI_THREAD_MULTIPLE during MPI initialization,
- passing a non-cartesian communicator,
- using another MPI implementation like MPICH.

Thank you,
Thomas


edgargabriel · 2024-07-27T14:35:18Z

@tpadioleau thank you for the bug report, what file system is this on? I will have a look in the next few days, and a reproducer is definitely super helpful

edgargabriel · 2024-07-27T14:42:48Z

we do have an optimization in the code specifically for cartesian communicators, I am wondering whether something in that logic is slightly off for 3 processes, which is a bit of an unusual number for cartesian communicators.

tpadioleau · 2024-07-27T14:54:11Z

> @tpadioleau thank you for the bug report, what file system is this on? I will have a look in the next few days, and a reproducer is definitely super helpful

I have just edited the issue to add the missing archive. I am working on my laptop, no parallel filesystem. I can also mention that I was not able to reproduce the error on this supercomputer https://mesocentre.pages.centralesupelec.fr/user_doc/ruche/01_cluster_overview with Open MPI.

> we do have an optimization in the code specifically for cartesian communicators, I am wondering whether something in that logic is slightly off for 3 processes, which is a bit of an unusual number for cartesian communicators.

I could also try with 4 processes and it also gives corrupted results after a few attempts.

edgargabriel · 2024-08-05T20:19:42Z

I did some preliminary analysis of this issue, and I am not yet sure what to make of it. I ran the test code with 3 processes on my local workstation, using the romio component as a reference and all relevant collective components of ompio for comparison. According to h5diff, the output files are always identical, so either they are all wrong (including romio on Open MPI) or all correct. Please note that they do not match the sample h5 file that was in the tar file; I am not sure whether that was supposed to be an example of the correct output or of an erroneous one.

$ mpirun --mca io romio341 -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.romio
$ mpirun --mca fcoll individual -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.individual
$ h5diff -v distributed_array.h5.individual distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

$ mpirun --mca fcoll vulcan -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.vulcan
$ h5diff -v distributed_array.h5.vulcan distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

$ mpirun --mca fcoll dynamic_gen2 -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.dynamic_gen2
$ h5diff -v distributed_array.h5.dynamic_gen2 distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

tpadioleau · 2024-08-06T08:22:39Z

Thank you for your time. The difficulty is that I also get the correct result from time to time. The sample h5 in the tar is an example of an erroneous result.

If you are interested, I can try to generate a Docker image to get closer to my environment.

tpadioleau · 2024-08-06T09:19:50Z

Here is an archive environment.zip that contains a Dockerfile that was generated (slightly modified to create a toto user) from a spack.yaml environment.

Inside the container and mounting the directory that contains the reproducer in /src, I build with cmake -B build -S /src && cmake --build build. Then repeating multiple times the commands rm -f distributed_array.h5 && mpirun -np 3 build/main && h5dump distributed_array.h5, I see that the content changes.

The expected result with 3 mpi processes should look like this

HDF5 "distributed_array.h5" {
GROUP "/" {
   DATASET "distributed_array" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 3, 5 ) / ( 3, 5 ) }
      DATA {
      (0,0): 1, 1, 1, 1, 1,
      (1,0): 2, 2, 2, 2, 2,
      (2,0): 3, 3, 3, 3, 3
      }
   }
}
}

edgargabriel · 2024-08-06T15:37:36Z

Ok, I can confirm that inside of the docker image I can reproduce the issue with the fcoll/vulcan component. Using the other fcoll components (i.e. individual, dynamic_gen2) produces the correct output. So the question is why is that occurring. I noticed that the precompiled Open MPI library in the docker image is configured quite differently than what I usually do, I am wondering whether one of them is contributing to this (e.g. --disable-builtin-atomics, --without-cma, --enable-mpi1-compatibility). I will look into this later this week.

edgargabriel · 2024-08-06T23:04:02Z

I know what is triggering the issue. I just need to decide whether an if-statement in the code is erroneous or whether I need to add some locking protection around a particular write operation. Both make the test pass reliably, but since the code that includes the if-statement in question was written many years ago, I don't remember all the details (which would be important to decide whether the if-statement is erroneous or not). Either way, it is a legitimate bug, not a fluke or configure option issue.

tpadioleau · 2024-08-07T08:41:50Z

That is good news, thank you!

edgargabriel · 2024-08-07T14:22:08Z

luckily the commit message from 5 years ago was helpful, the if-statement is correct in that it does what it was supposed to do.

tpadioleau · 2024-08-09T07:03:40Z

Do you know if the bug can affect other communication/write operations ?

edgargabriel · 2024-08-09T12:05:11Z

yes, it could, but it depends on the file system how likely it is. I will have a fix ready either later today or tomorrow, and I will backport it to both 5.0.x and 4.1.x series

The fs/ufs component by default disabled all file locking before read/write operations (except for NFS file systems). This was based on the assumption that the file system itself performs the required locking, so we do not have to add to it. This assumption is incorrect when using data sieving. In data sieving, the code 'ignores' small gaps when writing to a file and instead performs a read-modify-write sequence itself for performance reasons. The problem, however, is that even within a collective operation not all aggregators might want to use data sieving. Hence, enabling locking just for the data-sieving routines is insufficient; all processes have to perform the locking.

Therefore, our two options are: a) disable write data-sieving by default, or b) enable range-locking by default. After some testing, I think enabling range-locking by default is the safer and better approach. It doesn't seem to show any significant performance impact on my test systems.

Note that on Lustre file systems we can keep the default of no locking as far as I can see, since the collective algorithm used by Lustre is unlikely to produce this pattern. I did, however, add an mca parameter that allows us to control the locking algorithm used by the Lustre component as well, in case we need to change that for a particular use case or platform.

Fixes Issue open-mpi#12718
Signed-off-by: Edgar Gabriel <[email protected]>

edgargabriel · 2024-08-13T20:37:23Z

@tpadioleau I filed a PR that fixes the issue. I spent quite some time thinking about the issue and the various options, and I am 99% sure that real application scenarios will not hit this problem. Part of the reason why you saw this error is that the data volume is so small that it all ended up in a single file system block, which caused the inconsistency. In a real-life application scenario, where data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

Same commit message as above (cherry picked from commit c697f28).

tpadioleau · 2024-08-27T08:07:18Z

> @tpadioleau I filed a PR that fixes the issue. I spent quite some time thinking about the issue and the various options, and I am 99% sure that real application scenarios will not hit this problem. Part of the reason why you saw this error is that the data volume is so small that it all ended up in a single file system block, which caused the inconsistency. In a real-life application scenario, where data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

You may be right, I only noticed this issue when developing on my laptop hence with small test cases. Thank you again for your time!

edgargabriel self-assigned this Jul 27, 2024

wenduwan added Target: main Target: v5.0.x labels Jul 30, 2024

edgargabriel added the bug label Aug 6, 2024

wenduwan added the Target: v4.1.x label Aug 13, 2024

edgargabriel mentioned this issue Aug 13, 2024

fs/ufs: change default locking protocol #12756

Merged

This was referenced Aug 14, 2024

fs/ufs: change default locking protocol -v5.0.x #12759

Merged

fs/ufs: change default locking protocol - v4.1.x #12760

Merged
