
Corrupted data using parallel hdf5 #12718

Open
tpadioleau opened this issue Jul 27, 2024 · 14 comments

Comments

@tpadioleau

tpadioleau commented Jul 27, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Using spack 0.22.1

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04
  • Computer hardware: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Details of the problem

I am using parallel HDF5 to write a 2D distributed array. When I pass a cartesian communicator to HDF5, I sometimes see a corrupted dataset in the resulting HDF5 file when running with 3 processes. Attached (hdf5_reproducer.tar.gz) is a small reproducer in C (< 100 LOC), together with an HDF5 file produced by one such run. You will also find the output of the ompi_info command.
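For context, the write pattern described above (each rank writing one row of a small 2D dataset through a cartesian communicator handed to HDF5) can be sketched roughly as follows. This is an illustrative sketch, not the attached reproducer; all names are made up, and it requires an MPI-enabled HDF5 build (compile with mpicc, link against parallel HDF5).

```c
/* Illustrative sketch of the reported pattern, NOT the attached reproducer:
   each of N ranks writes one row of an N x 5 integer dataset, passing a
   1D cartesian communicator to HDF5 via H5Pset_fapl_mpio. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1D cartesian communicator -- the configuration that triggers the report. */
    int dims[1] = { size }, periods[1] = { 0 };
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, cart, MPI_INFO_NULL);   /* hand the cart comm to HDF5 */
    hid_t file = H5Fcreate("distributed_array.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    hsize_t gdims[2] = { (hsize_t)size, 5 };
    hid_t fspace = H5Screate_simple(2, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "distributed_array", H5T_NATIVE_INT,
                            fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own row and writes it collectively. */
    hsize_t start[2] = { (hsize_t)rank, 0 }, count[2] = { 1, 5 };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);
    int row[5] = { rank + 1, rank + 1, rank + 1, rank + 1, rank + 1 };

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* collective MPI-IO write */
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset);
    H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

With 3 ranks this should produce the 3 x 5 dataset of row values 1, 2, 3 shown later in the thread.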

Without understanding the logic behind it, I also noticed several situations in which I never seem to get corrupted data:

  • requiring MPI_THREAD_MULTIPLE during MPI initialization,
  • passing a non-cartesian communicator,
  • using another MPI implementation such as MPICH.

Thank you,
Thomas

@edgargabriel
Member

@tpadioleau thank you for the bug report. What file system is this on? I will have a look in the next few days; a reproducer is definitely super helpful.

@edgargabriel edgargabriel self-assigned this Jul 27, 2024
@edgargabriel
Member

We do have an optimization in the code specifically for cartesian communicators. I am wondering whether something in that logic is slightly off for 3 processes, which is a bit of an unusual number for cartesian communicators.

@tpadioleau
Author

tpadioleau commented Jul 27, 2024

> @tpadioleau thank you for the bug report. What file system is this on? I will have a look in the next few days; a reproducer is definitely super helpful.

I have just edited the issue to add the missing archive. I am working on my laptop, with no parallel file system. I can also mention that I was not able to reproduce the error with Open MPI on this supercomputer: https://mesocentre.pages.centralesupelec.fr/user_doc/ruche/01_cluster_overview

> We do have an optimization in the code specifically for cartesian communicators. I am wondering whether something in that logic is slightly off for 3 processes, which is a bit of an unusual number for cartesian communicators.

I tried with 4 processes as well, and it also gives corrupted results after a few attempts.

@edgargabriel
Member

edgargabriel commented Aug 5, 2024

I did some preliminary analysis of this issue, and I am not yet sure what to make of it. I ran the test code with 3 processes on my local workstation, using the romio component as a reference and all relevant collective components of ompio for comparison. According to h5diff, the output files are always identical, so either they are all wrong (including romio on Open MPI) or all correct. Note that they do not match the sample h5 file in the tar file; I am not sure whether that was supposed to be an example of correct output or of an erroneous one.

$ mpirun --mca io romio341 -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.romio
$ mpirun --mca fcoll individual -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.individual
$ h5diff -v distributed_array.h5.individual distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

$ mpirun --mca fcoll vulcan -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.vulcan
$ h5diff -v distributed_array.h5.vulcan distributed_array.h5.romio
file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found


$ mpirun --mca fcoll dynamic_gen2 -np 3 ./a.out
$ mv distributed_array.h5 distributed_array.h5.dynamic_gen2
$ h5diff -v distributed_array.h5.dynamic_gen2 distributed_array.h5.romio

file1     file2
---------------------------------------
    x      x    /              
    x      x    /distributed_array

group  : </> and </>
0 differences found
dataset: </distributed_array> and </distributed_array>
0 differences found

@tpadioleau
Author

tpadioleau commented Aug 6, 2024

Thank you for your time. The difficulty is that I also get the correct result from time to time. The sample h5 file in the tar is an example of an erroneous result.

If you are interested, I can try to generate a Docker image to get closer to my environment?

@tpadioleau
Author

Here is an archive, environment.zip, that contains a Dockerfile generated (slightly modified to create a toto user) from a spack.yaml environment.

Inside the container, mounting the directory that contains the reproducer at /src, I build with cmake -B build -S /src && cmake --build build. Then, repeating the commands rm -f distributed_array.h5 && mpirun -np 3 build/main && h5dump distributed_array.h5 multiple times, I see that the content changes.

The expected result with 3 MPI processes should look like this:

HDF5 "distributed_array.h5" {
GROUP "/" {
   DATASET "distributed_array" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 3, 5 ) / ( 3, 5 ) }
      DATA {
      (0,0): 1, 1, 1, 1, 1,
      (1,0): 2, 2, 2, 2, 2,
      (2,0): 3, 3, 3, 3, 3
      }
   }
}
}

@edgargabriel
Member

edgargabriel commented Aug 6, 2024

OK, I can confirm that inside the Docker image I can reproduce the issue with the fcoll/vulcan component. Using the other fcoll components (i.e. individual, dynamic_gen2) produces the correct output. So the question is why that occurs. I noticed that the precompiled Open MPI library in the Docker image is configured quite differently from what I usually use; I am wondering whether one of those options is contributing to this (e.g. --disable-builtin-atomics, --without-cma, --enable-mpi1-compatibility). I will look into this later this week.

@edgargabriel
Member

I know what is triggering the issue. I just need to decide whether an if-statement in the code is erroneous or whether I need to add some locking protection around a particular write operation. Either change makes the test pass reliably, but since the code containing the if-statement in question was written many years ago, I don't remember all the details (which would be important for deciding whether the if-statement is erroneous or not). Either way, it is a legitimate bug, not a fluke or a configure-option issue.

@tpadioleau
Author

That is good news, thank you!

@edgargabriel
Member

Luckily, the commit message from 5 years ago was helpful: the if-statement is correct, in that it does what it was supposed to do.

@tpadioleau
Author

Do you know whether the bug can affect other communication/write operations?

@edgargabriel
Member

edgargabriel commented Aug 9, 2024

Yes, it could, but how likely it is depends on the file system. I will have a fix ready either later today or tomorrow, and I will backport it to both the 5.0.x and 4.1.x series.

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Aug 12, 2024
The fs/ufs component by default disables all file locking before
read/write operations (except on NFS file systems). This was based on
the assumption that the file system itself performs the required
locking, and hence we don't have to add to it.

This assumption is incorrect when using data sieving. In data sieving,
the code 'ignores' small gaps when writing to a file and instead
performs a read-modify-write sequence itself, for performance reasons.
The problem, however, is that even within a collective operation not all
aggregators might want to use data sieving. Hence, enabling locking just
for the data-sieving routines is insufficient; all processes have to
perform the locking. Therefore, our two options are: a) disable
write data-sieving by default, or b) enable range-locking by default.

After some testing, I think enabling range-locking by default is the
safer and better approach. It doesn't seem to show any significant
performance impact on my test systems.

Note that on Lustre file systems we can keep the default of no locking,
as far as I can see, since the collective algorithm used by Lustre is
unlikely to produce this pattern. I did, however, add an MCA parameter
that allows us to control the locking algorithm used by the Lustre
component as well, in case we need to change that for a particular
use case or platform.

Fixes Issue open-mpi#12718

Signed-off-by: Edgar Gabriel <[email protected]>
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Aug 13, 2024
@edgargabriel
Member

@tpadioleau I filed a PR that fixes the issue. I spent quite some time thinking about the issue and the various options, and I am 99% sure that a real application scenario will not hit this problem. Part of the reason you saw this error is that the data volume is so small that it all ended up in a single file-system block, which caused the inconsistency. In a real-life application scenario, where the data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Aug 14, 2024
(cherry picked from commit c697f28)
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Aug 14, 2024
(cherry picked from commit c697f28)
@tpadioleau
Author

> @tpadioleau I filed a PR that fixes the issue. I spent quite some time thinking about the issue and the various options, and I am 99% sure that a real application scenario will not hit this problem. Part of the reason you saw this error is that the data volume is so small that it all ended up in a single file-system block, which caused the inconsistency. In a real-life application scenario, where the data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.

You may be right; I only noticed this issue when developing on my laptop, hence with small test cases. Thank you again for your time!
