1111 Fix for large message sends #1113

Merged
merged 26 commits into from
Nov 13, 2020

Conversation

lifflander
Collaborator

@lifflander lifflander commented Oct 14, 2020

Fixes #1111

  • Start with tests that fail for serialized/non-serialized messages

  • Make a major improvement to how tests are dispatched. Tests are currently being ignored because of deficiencies in the gtest CMake module. Pull in the latest gtest CMake module and fix the bugs in that version. As a result, "new" tests that were not running before now fail on this PR; I am fixing them systematically.

    • Addressed & merged in separate PR
  • Implement "multi-sends" for serialized and non-serialized messages (now passing!)

  • Add a new runtime argument --vt_max_mpi_send_size=X that defaults to 1ull << 30.

  • Use the new argument injector to set the size for testing to 16384 bytes
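
For readers unfamiliar with the idea, here is a minimal sketch of how a "multi-send" can split one large payload into pieces no larger than a configured limit. MPI count parameters are `int`, so a single byte-typed send cannot describe more than about 2 GiB. The names `Chunk` and `splitIntoChunks` are hypothetical and chosen for illustration only; this is not the code in `src/vt/messaging/active.cc`, and the actual MPI calls are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of vt: one piece of a larger send.
struct Chunk {
  std::size_t offset; // byte offset into the original buffer
  std::size_t length; // bytes in this piece, never more than max_chunk
};

// Split `total_bytes` into consecutive pieces of at most `max_chunk` bytes.
// Each piece could then be passed to its own send call.
std::vector<Chunk> splitIntoChunks(
  std::size_t total_bytes, std::size_t max_chunk
) {
  assert(max_chunk > 0 && "chunk size must be positive");
  std::vector<Chunk> chunks;
  std::size_t offset = 0;
  while (offset < total_bytes) {
    std::size_t const remaining = total_bytes - offset;
    std::size_t const length = remaining < max_chunk ? remaining : max_chunk;
    chunks.push_back(Chunk{offset, length});
    offset += length;
  }
  return chunks;
}
```

With the 16384-byte test limit mentioned above, a 40000-byte payload would be split into three pieces: two of 16384 bytes and a final one of 7232 bytes.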

@codecov

codecov bot commented Oct 14, 2020

Codecov Report

Merging #1113 (90c9305) into develop (f0c4480) will increase coverage by 0.16%.
The diff coverage is 90.81%.


@@             Coverage Diff             @@
##           develop    #1113      +/-   ##
===========================================
+ Coverage    79.28%   79.45%   +0.16%     
===========================================
  Files          716      718       +2     
  Lines        27037    27241     +204     
===========================================
+ Hits         21436    21643     +207     
+ Misses        5601     5598       -3     
Impacted Files                                          Coverage Δ
src/vt/configs/arguments/app_config.h                   100.00% <ø> (ø)
src/vt/configs/arguments/args.h                         100.00% <ø> (ø)
src/vt/messaging/irecv_holder.h                         37.50% <0.00%> (+0.35%) ⬆️
src/vt/rdma/collection/rdma_collection.cc               1.16% <0.00%> (-0.02%) ⬇️
...c/vt/serialization/messaging/serialized_data_msg.h   100.00% <ø> (ø)
src/vt/utils/memory/memory_units.h                      100.00% <ø> (ø)
src/vt/rdma/rdma.cc                                     25.04% <57.14%> (+0.08%) ⬆️
src/vt/runtime/runtime_banner.cc                        56.26% <80.00%> (+0.49%) ⬆️
src/vt/messaging/active.cc                              84.30% <91.52%> (+3.31%) ⬆️
tests/unit/active/test_active_send_large.cc             93.02% <93.02%> (ø)
... and 11 more

@lifflander lifflander requested a review from PhilMiller October 16, 2020 02:43
@lifflander
Collaborator Author

lifflander commented Oct 16, 2020

Currently failing unrelated tests:

The following tests FAILED:
        248 - vt:TestDiagnosticValue.test_diagnostic_value_2_proc_8 (Failed)
        255 - vt:TestGroup.test_group_range_construct_1_proc_2 (Failed)
        258 - vt:TestGroup.test_group_range_construct_1_proc_4 (Failed)
        282 - vt:*TestLocationRoute/*test_entity_cache_hits_proc_2 (Failed)
        291 - vt:*TestLocationRoute/*test_entity_cache_hits_proc_4 (Failed)
        292 - vt:*TestLocationRoute/*test_entity_cache_migrated_entity_proc_4 (Failed)
        300 - vt:*TestLocationRoute/*test_entity_cache_hits_proc_8 (Failed)
        301 - vt:*TestLocationRoute/*test_entity_cache_migrated_entity_proc_8 (Failed)
        434 - vt:*TestRDMAHandleCollection/*test_rdma_handle_collection_1_proc_2 (Failed)
        440 - vt:*TestRDMAHandleCollection/*test_rdma_handle_collection_1_proc_4 (Failed)
        446 - vt:*TestRDMAHandleCollection/*test_rdma_handle_collection_1_proc_8 (Failed)

Edit: should all be fixed now

@lifflander
Collaborator Author

All the test failures should be fixed now.

@lifflander lifflander marked this pull request as ready for review October 16, 2020 19:41
@lifflander lifflander force-pushed the 1111-over-2gib-send-bug branch 2 times, most recently from db4dc1f to 24d9355 Compare October 20, 2020 21:12
@lifflander lifflander force-pushed the 1111-over-2gib-send-bug branch from 0af808f to e0f9bc8 Compare October 27, 2020 22:57
auto& holder = theEvent()->getEventHolder(ret_event);
for (auto&& child_event : events) {
  holder.get_event()->addEventToList(child_event);
}
Member

This block seems a bit awkward, but it's inconsequential
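
As a rough illustration of the pattern in the snippet under review, the sketch below shows a parent event that collects child events and reports completion only once every child has finished. That completion rule is an assumption made for the example, and `ParentEvent`, `Check`, and `isComplete` are hypothetical names; only `addEventToList` is borrowed from the snippet, and vt's real event classes differ.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parent event; vt's real event classes differ.
class ParentEvent {
public:
  using Check = std::function<bool()>;

  // Register a child; the callable reports whether that child has finished.
  void addEventToList(Check child) {
    assert(child && "child check must be callable");
    children_.push_back(std::move(child));
  }

  // Assumed rule: the parent is complete only when every child is complete.
  bool isComplete() const {
    for (auto const& child : children_) {
      if (!child()) {
        return false;
      }
    }
    return true;
  }

private:
  std::vector<Check> children_;
};
```

Under this assumption, a multi-send could hand back one parent event to the caller while each underlying chunk is tracked by its own child event.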

@PhilMiller
Member

Reading on my phone, so I can't get a great overview at the moment. I think we're close on this

@@ -485,6 +485,18 @@ void ArgConfig::addConfigFileArgs(CLI::App& app) {
  a2->group(configGroup);
}

void ArgConfig::addRuntimeArgs(CLI::App& app) {
Member

Not a problem, but why is this 'RuntimeArgs' instead of 'MessengerArgs'?

Collaborator Author

Could be either. I just happened to choose that one.

@PhilMiller
Member

At least some of the failing tests are quite real: a memory leak is reported for an allocation in recvDataDirect. I don't immediately see it in the code, but it needs to be hunted down.

Collaborator Author

Codacy: Here is an overview of what got changed by this pull request:

Clones removed
==============
+ src/vt/messaging/active.cc  -2
         

See the complete overview on Codacy

@lifflander lifflander force-pushed the 1111-over-2gib-send-bug branch from aff8bfa to 90c9305 Compare November 13, 2020 18:01
Successfully merging this pull request may close these issues.

Active messenger fails when sending over 2 GiB of data due to MPI limitations
3 participants