A big refresh of the datatype engine #6695

bosilca · 2019-05-20T16:53:55Z

Faster and less error-prone. This patch is a significant redesign of the internals of the datatype engine. No API or ABI changes.

Fixes #5540 (Issue with overlapping vector datatype)

bosilca · 2019-05-22T16:13:49Z

I put together the datatype related benchmarks we exchanged on the different datatype issues and PRs in a single repo (https://github.com/bosilca/ddt_bench).

opal/datatype/opal_datatype_memcpy.h

opal/datatype/opal_datatype_module.c

opal/datatype/opal_datatype_unpack.h

opal/datatype/opal_convertor.c

hppritcha · 2019-06-17T19:53:20Z

bot:ompi:retest

opal/datatype/opal_datatype_copy.h

opal/datatype/opal_datatype_position.c

opal/datatype/opal_datatype_pack.c

jsquyres · 2019-06-25T15:27:13Z

As noted on the 2019-06-25 webex, this PR is both a performance enhancement for the DDT engine and a fix for #5540 (i.e., an issue that has been observed on the v4.0.x branch).

ggouaillardet

Just an idea: we could always define CHECKSUM to 0 or 1, then write if (CHECKSUM && convertor->flags & CONVERTOR_WITH_CHECKSUM) and let the compiler eliminate the dead code.
That would make the code more compact and easier to maintain.
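A minimal sketch of that suggestion, not the actual Open MPI code: `sketch_convertor_t`, `SKETCH_WITH_CHECKSUM`, and `sketch_copy` are hypothetical stand-ins, and the byte-sum "checksum" is a placeholder. The point is only that `CHECKSUM` is always defined, so the test is an ordinary `if` rather than an `#if` block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real Open MPI definitions. */
#ifndef CHECKSUM
#define CHECKSUM 0                 /* always defined, either 0 or 1 */
#endif
#define SKETCH_WITH_CHECKSUM 0x1u  /* hypothetical per-convertor flag */

typedef struct {
    uint32_t flags;
    uint32_t checksum;
} sketch_convertor_t;

/* Copy len bytes and, when enabled both at compile time and at run time,
 * accumulate a trivial byte-sum checksum. */
static void sketch_copy(sketch_convertor_t *conv, void *dst,
                        const void *src, size_t len)
{
    assert(conv != NULL && dst != NULL && src != NULL);
    memcpy(dst, src, len);

    /* When CHECKSUM is 0 the condition is a constant false, so an
     * optimizing compiler can drop the whole branch. */
    if (CHECKSUM && (conv->flags & SKETCH_WITH_CHECKSUM)) {
        const unsigned char *p = src;
        for (size_t i = 0; i < len; i++) {
            conv->checksum += p[i];
        }
    }
}
```

Building the same file with `-DCHECKSUM=1` enables the branch without touching the source.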

hppritcha · 2019-07-08T19:14:29Z

@derbeyn could you re-review?

hppritcha · 2019-07-09T15:09:12Z

@ggouaillardet could you review this PR when you have time?

Move toward a base type of vector (count, type, blocklen, extent, disp), with disp and extent applying to the count repetition and blocklen being a contiguous memory region of type type.

Implement 2 optimizations on this description, used during type_commit:
- collapse: successive similar datatype descriptions are collapsed together with an increased count.
- fusion: fuse successive datatype descriptions in order to minimize the number of resulting memcpy during pack/unpack.

Fixes at the OMPI datatype level, including:
- Fix the create_hindexed and vector creation.
- Fix the handling of [get|set]_elements and _count.
- Correctly compute the displacement for block indexed types.
- Support the MPI_LB and MPI_UB deprecation, aka. OMPI_ENABLE_MPI1_COMPAT.

Signed-off-by: George Bosilca <[email protected]>
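To make the (count, type, blocklen, extent, disp) description concrete, here is a rough sketch under stated assumptions. `sketch_vec_t`, `sketch_pack`, and `sketch_collapse` are hypothetical names, the base type is reduced to an element size in bytes, and the collapse rule shown (same shape, second description starting exactly where the first one's repetitions end) is one plausible reading of "similar", not necessarily the rule the engine applies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vector-like description; not the real opal_datatype layout. */
typedef struct {
    size_t    count;      /* number of repetitions */
    size_t    blocklen;   /* contiguous elements per repetition */
    size_t    elem_size;  /* size in bytes of the base type */
    ptrdiff_t extent;     /* byte stride between repetitions */
    ptrdiff_t disp;       /* byte displacement of the first repetition */
} sketch_vec_t;

/* Pack: one memcpy per repetition. Returns the number of bytes written. */
static size_t sketch_pack(const sketch_vec_t *v, const char *src, char *dst)
{
    size_t block = v->blocklen * v->elem_size;
    assert(v != NULL && src != NULL && dst != NULL);
    for (size_t i = 0; i < v->count; i++) {
        memcpy(dst + i * block, src + v->disp + (ptrdiff_t)i * v->extent, block);
    }
    return v->count * block;
}

/* Collapse b into a when both have the same shape and b starts where the
 * next repetition of a would be. Returns 1 on success, 0 otherwise. */
static int sketch_collapse(sketch_vec_t *a, const sketch_vec_t *b)
{
    if (a->blocklen != b->blocklen || a->elem_size != b->elem_size ||
        a->extent != b->extent) {
        return 0;
    }
    if (b->disp != a->disp + (ptrdiff_t)a->count * a->extent) {
        return 0;
    }
    a->count += b->count;   /* same description, larger count */
    return 1;
}
```

After a successful collapse, a single `sketch_pack` call covers what previously took two descriptions, which is the motivation for doing this work once at commit time.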

Merge contiguous iov in order to minimize the number of returned iovec. Signed-off-by: George Bosilca <[email protected]>
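As an illustration of the idea only: the helper below merges adjacent `struct iovec` entries in place whenever one entry ends exactly where the next begins. `sketch_merge_iov` is a hypothetical name, and the real convertor may perform the merge while generating the entries rather than as a separate pass.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Merge adjacent entries that are contiguous in memory.
 * Works in place and returns the new number of entries. */
static size_t sketch_merge_iov(struct iovec *iov, size_t n)
{
    assert(iov != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    size_t out = 0;   /* index of the last entry kept */
    for (size_t i = 1; i < n; i++) {
        char *end = (char *)iov[out].iov_base + iov[out].iov_len;
        if (end == (char *)iov[i].iov_base) {
            iov[out].iov_len += iov[i].iov_len;   /* extend the kept entry */
        } else {
            iov[++out] = iov[i];                  /* start a new entry */
        }
    }
    return out + 1;
}
```

Fewer entries means fewer segments for the caller to walk, which is the stated goal of the commit.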

- optimize handling of contiguous with gaps datatypes.
- fixes a performance issue for all datatypes with a count of 1.
- optimize the pack/unpack of contiguous with gaps datatype.
- optimize the case of blocklen == 1

Signed-off-by: George Bosilca <[email protected]>

Signed-off-by: George Bosilca <[email protected]>

Rework the to_self test to be able to be used as a benchmark. Signed-off-by: George Bosilca <[email protected]>

Upon detecting a datatype loop representation, skip the entire loop according to the remaining space. Signed-off-by: George Bosilca <[email protected]>

Optimize contiguous loops by collapsing them into a single element. During datatype optimization collapse similar elements into larger blocks. Signed-off-by: George Bosilca <[email protected]>

Amazing how bad instruction scheduling can have such a drastic impact on code performance. With this change, we get a boost of at least 50% in performance for data with a small blocklen and/or count. Signed-off-by: George Bosilca <[email protected]>

gpaulsen · 2019-07-10T15:18:01Z

Looks like both opal_datatype_test and ddt_test Aborted at runtime in Mellanox CI (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/10139). This may be a real error, or perhaps the test case needs to be updated.

jsquyres · 2019-07-10T15:20:12Z

Copying the failed results here, because CI logs get recycled every so often:

...
07:45:45 PASS: unpack_hetero
07:45:45 PASS: position
07:45:45 PASS: checksum
07:45:45 PASS: position_noncontig
07:45:45 PASS: unpack_ooo
07:45:45 PASS: ddt_raw2
07:45:45 ../../config/test-driver: line 107:  6960 Aborted                 "$@" > $log_file 2>&1
07:45:45 FAIL: opal_datatype_test
07:45:45 PASS: external32
07:45:45 PASS: ddt_pack
07:45:45 PASS: large_data
07:45:45 ../../config/test-driver: line 107:  6978 Aborted                 "$@" > $log_file 2>&1
07:45:45 FAIL: ddt_test
07:45:45 PASS: ddt_raw

Start optimizing the code. This commit divides the operations into 2 parts: the first, outside the critical path, deals with partial blocks of predefined elements, and the second, inside the critical path, only deals with full blocks of elements. This reduces the number of expensive operations in the critical path and results in a decent performance increase. Signed-off-by: George Bosilca <[email protected]>
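A rough sketch of that split, assuming a simple strided layout of blocks of `blk` bytes every `stride` bytes. `sketch_pack_range` and its parameters are hypothetical and much simpler than the engine's pack routines. A leading and a trailing partial block are handled outside the loop, so the loop body copies only whole blocks and needs no per-iteration length checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack up to len bytes, starting at packed offset pos, from count blocks
 * of blk bytes spaced stride bytes apart. Returns the bytes packed. */
static size_t sketch_pack_range(char *dst, const char *src,
                                size_t blk, size_t stride, size_t count,
                                size_t pos, size_t len)
{
    assert(blk > 0);
    size_t total = blk * count;
    if (pos >= total) {
        return 0;
    }
    if (len > total - pos) {
        len = total - pos;          /* clamp to the data that remains */
    }

    size_t done = 0;
    size_t idx = pos / blk;         /* block containing pos */
    size_t off = pos % blk;         /* offset inside that block */

    /* Outside the critical path: finish a partially consumed block. */
    if (off != 0) {
        size_t n = (blk - off < len) ? blk - off : len;
        memcpy(dst, src + idx * stride + off, n);
        done = n;
        idx++;
    }

    /* Critical path: full blocks only. */
    for (size_t full = (len - done) / blk; full > 0; full--, idx++) {
        memcpy(dst + done, src + idx * stride, blk);
        done += blk;
    }

    /* Outside the critical path: trailing partial block, if any. */
    if (done < len) {
        memcpy(dst + done, src + idx * stride, len - done);
        done = len;
    }
    return done;
}
```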

gpaulsen · 2019-08-05T13:12:28Z

@bosilca,

@hppritcha and I were discussing this PR. Would you mind PRing this back to v4.0.x (if you agree it's appropriate for v4.0.2) to resolve Issue #5540? That issue was marked as blocking a v4.0.2 release.

bosilca · 2019-08-05T13:45:03Z

@hppritcha and @gpaulsen here is the 4.0 PR #6863.

bosilca added the ⚠️ WIP-DNM! label May 20, 2019

bosilca self-assigned this May 20, 2019

bosilca force-pushed the fix/vector_stride_0 branch 2 times, most recently from 599f3ad to a2239d4 Compare May 20, 2019 16:58

This was referenced May 22, 2019

Force memcpy inlining to assignments during pack/unpack of some DDTs #6678

Closed

Issue with overlapping vector datatype #5540

Closed

bosilca force-pushed the fix/vector_stride_0 branch from e1e92d1 to d519e19 Compare May 29, 2019 05:06

bwbarrett added the Target: main label Jun 4, 2019

gpaulsen requested review from derbeyn and ggouaillardet June 7, 2019 22:10

derbeyn reviewed Jun 13, 2019
