ph5diff crashes with mpich 4.1 due to unmatched messages #3123

Closed
minrk opened this issue Jun 14, 2023 · 8 comments · Fixed by #3719
Labels
Component - Build, Component - Parallel, Component - Testing, Component - Tools, Confirmed, Priority - 0. Blocker ⛔, Type - Bug / Bugfix

minrk commented Jun 14, 2023

Describe the bug

Almost all ph5diff tests fail when building with mpich 4.1, on both macOS and Linux.

Failures all look like:

Testing ../../src/h5diff/ph5diff h5diff_basic1.h5 h5diff_basic2.h5    *FAILED*
====Expected result (expect_sorted) differs from actual result (actual_sorted)
    *** expect_sorted	2023-05-21 20:11:46.156608867 +0000
    --- actual_sorted	2023-05-21 20:11:46.156608867 +0000
    ***************
    *** 1,2 ****
    --- 1,8 ----
      5 differences found
    + Abort(810645519) on node 1 (rank 1 in comm 0): Fatal error in internal_Finalize: Other MPI error, error stack:
      dataset: </g1/dset1> and </g1/dset1>
    + internal_Finalize(50)...........: MPI_Finalize failed
    + MPII_Finalize(394)..............:
    + MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)
    + MPIR_Comm_release_always(1250)..:
    + MPIR_finalize_builtin_comms(154):
====The actual output (./testfiles/h5diff_11.out-sav)
    dataset: </g1/dset1> and </g1/dset1>
    5 differences found
    Abort(810645519) on node 1 (rank 1 in comm 0): Fatal error in internal_Finalize: Other MPI error, error stack:
    internal_Finalize(50)...........: MPI_Finalize failed
    MPII_Finalize(394)..............:
    MPIR_finalize_builtin_comms(154):
    MPIR_Comm_release_always(1250)..:
    MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)
====The actual stderr (./testfiles/h5diff_11.out.err-sav)
====End of actual stderr (./testfiles/h5diff_11.out.err-sav)

with 132 total failures, all with the same message.

I believe the change is due to pmodels/mpich#6186, introduced in mpich 4.1.
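
One way to confirm that the failures share this message is to count the matching lines in a saved test log. The following is a minimal sketch, not part of the HDF5 test scripts: the helper name `count_unmatched` and the sample log are hypothetical. Each failing test may print the message more than once (in the diff and again in the saved output), so the count is an upper bound on the number of failed tests.

```shell
# Count log lines that mention the MPICH "unmatched message(s)" error.
# count_unmatched is a hypothetical helper; pass it the path to a saved log.
count_unmatched() {
    # -F treats the pattern as a fixed string, -c prints the number of matching lines.
    grep -F -c 'unmatched message(s)' "$1"
}

# A tiny sample log standing in for real test output (contents are illustrative).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
internal_Finalize(50)...........: MPI_Finalize failed
MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)
5 differences found
MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)
EOF

count_unmatched "$sample_log"   # prints 2
rm -f "$sample_log"
```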

Expected behavior

ph5diff doesn't crash with mpich 4.1.

Platform (please complete the following information)

  • HDF5 version: 1.14.1
  • OS and version: linux (CentOS 7, conda-forge)
  • Compiler and version: gcc 12.2 (conda-forge)
  • Build system (e.g. CMake, Autotools) and version: autotools
  • Any configure options you specified
./configure --prefix="${PREFIX}" \
            --with-pic \
            --host="${HOST}" \
            --build="${BUILD}" \
            --with-zlib="${PREFIX}" \
            --with-szlib="${PREFIX}" \
            --with-pthread=yes  \
            --enable-parallel \
            --enable-direct-vfd \
            --enable-cxx \
            --enable-fortran \
            --with-default-plugindir="${PREFIX}/lib/hdf5/plugin" \
            --enable-threadsafe \
            --enable-build-mode=production \
            --enable-unsupported \
            --enable-hlgiftools=yes \
            --enable-using-memchecker \
            --enable-static=no \
            --enable-ros3-vfd

build script

  • MPI library and version (parallel HDF5): mpich 4.1.1

Additional context

Found trying to update the conda-forge hdf5 package to 1.14.1, which happens to be the first build after mpich was updated to 4.1. Same failures seen with 1.14.0.

conda environment
build:

    _libgcc_mutex:            0.1-conda_forge       conda-forge
    _openmp_mutex:            4.5-2_gnu             conda-forge
    binutils_impl_linux-64:   2.39-he00db2b_1       conda-forge
    binutils_linux-64:        2.39-h5fc0e48_13      conda-forge
    ca-certificates:          2023.5.7-hbcca054_0   conda-forge
    gcc_impl_linux-64:        12.2.0-hcc96c02_19    conda-forge
    gcc_linux-64:             12.2.0-h4798a0e_13    conda-forge
    gfortran_impl_linux-64:   12.2.0-h55be85b_19    conda-forge
    gfortran_linux-64:        12.2.0-h307d370_13    conda-forge
    gnuconfig:                2020.11.07-hd8ed1ab_0 conda-forge
    gxx_impl_linux-64:        12.2.0-hcc96c02_19    conda-forge
    gxx_linux-64:             12.2.0-hb41e900_13    conda-forge
    kernel-headers_linux-64:  2.6.32-he073ed8_15    conda-forge
    ld_impl_linux-64:         2.39-hcc3a1bd_1       conda-forge
    libgcc-devel_linux-64:    12.2.0-h3b97bd3_19    conda-forge
    libgcc-ng:                12.2.0-h65d4601_19    conda-forge
    libgfortran5:             12.2.0-h337968e_19    conda-forge
    libgomp:                  12.2.0-h65d4601_19    conda-forge
    libsanitizer:             12.2.0-h46fd767_19    conda-forge
    libstdcxx-devel_linux-64: 12.2.0-h3b97bd3_19    conda-forge
    libstdcxx-ng:             12.2.0-h46fd767_19    conda-forge
    libtool:                  2.4.7-h27087fc_0      conda-forge
    make:                     4.3-hd18ef5c_1        conda-forge
    openssl:                  3.1.0-hd590300_3      conda-forge
    sysroot_linux-64:         2.12-he073ed8_15      conda-forge

host:

    _libgcc_mutex:   0.1-conda_forge         conda-forge
    _openmp_mutex:   4.5-2_gnu               conda-forge
    c-ares:          1.19.0-hd590300_0       conda-forge
    ca-certificates: 2023.5.7-hbcca054_0     conda-forge
    keyutils:        1.6.1-h166bdaf_0        conda-forge
    krb5:            1.20.1-h81ceb04_0       conda-forge
    libaec:          1.0.6-hcb278e6_1        conda-forge
    libcurl:         8.1.0-h409715c_0        conda-forge
    libedit:         3.1.20191231-he28a2e2_2 conda-forge
    libev:           4.33-h516909a_1         conda-forge
    libgcc-ng:       12.2.0-h65d4601_19      conda-forge
    libgfortran-ng:  12.2.0-h69a702a_19      conda-forge
    libgfortran5:    12.2.0-h337968e_19      conda-forge
    libgomp:         12.2.0-h65d4601_19      conda-forge
    libnghttp2:      1.52.0-h61bc06f_0       conda-forge
    libssh2:         1.10.0-hf14f497_3       conda-forge
    libstdcxx-ng:    12.2.0-h46fd767_19      conda-forge
    libzlib:         1.2.13-h166bdaf_4       conda-forge
    mpi:             1.0-mpich               conda-forge
    mpich:           4.1.1-h846660c_100      conda-forge
    ncurses:         6.3-h27087fc_1          conda-forge
    openssl:         3.1.0-hd590300_3        conda-forge
    zlib:            1.2.13-h166bdaf_4       conda-forge
    zstd:            1.5.2-h3eb15da_6        conda-forge

full build and test logs

Fix in another project which encountered this error

@bmribler added the Component - Parallel, Component - Testing, Component - Build, UNCONFIRMED, and Branch - 1.14 labels Jun 14, 2023
@minrk changed the title from "ph5diff crashes with mpich 4.1 due to" to "ph5diff crashes with mpich 4.1 due to unmatched messages" Jun 14, 2023
Contributor

byrnHDF commented Jun 14, 2023

The issue seems to be that the lines of the actual output were not sorted.

Contributor

byrnHDF commented Jun 14, 2023

Looking at the test code, the display is incorrect.
The parallel failure block:

        else
            echo "*FAILED*"
            nerrors="`expr $nerrors + 1`"
            if test yes = "$verbose"; then
                echo "====Expected result ($expect_sorted) differs from actual result ($actual_sorted)"
                $DIFF $expect_sorted $actual_sorted |sed 's/^/    /'
                echo "====The actual output ($actual_sav)"
                sed 's/^/    /' < $actual_sav
                echo "====The actual stderr ($actual_err_sav)"
                sed 's/^/    /' < $actual_err_sav
                echo "====End of actual stderr ($actual_err_sav)"
                echo ""
            fi
        fi

The diff compares the correct files, but the display uses the wrong files.

Contributor

byrnHDF commented Jun 14, 2023

So the actual error is the line:
Abort(810645519) on node 1 (rank 1 in comm 0): Fatal error in internal_Finalize: Other MPI error, error stack:

Contributor

byrnHDF commented Jun 14, 2023

We should fix the diff output to use the correct files.
What is confusing is why the Abort error appears in the expected display.

Contributor

byrnHDF commented Jun 14, 2023

Where do the other lines come from? The expected file should only have:

    dataset: </g1/dset1> and </g1/dset1>
    5 differences found

Author

minrk commented Jun 14, 2023

I think the abort's not in the expected display; the first section of the output is a diff, where the unexpected additions have a leading + in the margin:

            echo "====Expected result ($expect_sorted) differs from actual result ($actual_sorted)"
            $DIFF $expect_sorted $actual_sorted |sed 's/^/    /'
            echo "====The actual output ($actual_sav)"
            sed 's/^/    /' < $actual_sav
            echo "====The actual stderr ($actual_err_sav)"
            sed 's/^/    /' < $actual_err_sav
            echo "====End of actual stderr ($actual_err_sav)"
            echo ""

I'm pretty sure the key error is the line:

MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)

which is a message produced by a new error check in mpich 4.1, introduced in pmodels/mpich#6186
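
The diff reading above can be checked with a small stand-alone sketch. It assumes `$DIFF` expands to `diff -c` (the context-diff markers in the reported output suggest this, but the thread does not state it). The temporary file names and their contents are illustrative stand-ins for the real test outputs.

```shell
# Reproduce the shape of the failure display with two small, sorted files.
# Assumption: the test script's $DIFF behaves like "diff -c" (context diff).
workdir=$(mktemp -d)

# Expected output: only the two lines the test anticipates.
printf '%s\n' 'dataset: </g1/dset1> and </g1/dset1>' \
              '5 differences found' | sort > "$workdir/expect_sorted"

# Actual output: the same two lines plus one unexpected MPI abort line.
printf '%s\n' 'dataset: </g1/dset1> and </g1/dset1>' \
              '5 differences found' \
              'Abort(810645519) on node 1 (rank 1 in comm 0): Fatal error in internal_Finalize' \
    | sort > "$workdir/actual_sorted"

# Lines present only in the second file are marked with a leading "+".
# diff exits non-zero when the files differ, which is expected here;
# sed indents the result by four spaces, as the test script does.
diff_out=$(diff -c "$workdir/expect_sorted" "$workdir/actual_sorted" | sed 's/^/    /')
printf '%s\n' "$diff_out"

rm -rf "$workdir"
```

In the printed diff, only the Abort line carries the `+` marker; the two expected lines appear as unmarked context.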

Contributor

qkoziol commented Jun 22, 2023

I can confirm that the messaging code, which uses tags, etc., is complicated, and I have a small patch working toward solving the issue. I'll assign myself, but it might be a few weeks before I can make progress on this, and I'm happy to pass my code to someone else.

@qkoziol self-assigned this Jun 22, 2023
@qkoziol added the Component - Tools, Type - Bug / Bugfix, and Confirmed labels and removed the UNCONFIRMED label Jun 22, 2023
@jhendersonHDF added the Priority - 0. Blocker ⛔ label Jun 30, 2023
@derobins added this to the 1.14.3 milestone Oct 9, 2023
@derobins self-assigned this Oct 13, 2023

WwkChina commented Jan 6, 2024

I have faced the same problem with 3.4.3 and 4.1.2, but 3.3.2 doesn't have this problem.
