'Pending communication' not defined in definition of MPI_Comm_disconnect #710
Comments
@hppritcha @raffenet @Wee-Free-Scot @schulzm @naughtont3 / @GuillaumeMercier / @devreal @bkmgit I tried to summarize the problems in the abstract of this issue. In my opinion, it is entirely new territory to promote dangling "request handles" that are never released. I hope you like the content of the proposal. I feel the rules must be clear for the users - that should be an important goal. Issue: #710
The changes are marked with PR823 and also in yellow on pages 475, 525, 1025, 1054 in
If we only require persistent communications to be inactive, we are forcing a high-quality (aka memory-leak-free) MPI implementation to track all persistent requests so that it can release them (on behalf of the user) during finalization. Moreover, having different requirements for nonblocking requests (which are required to be freed) and persistent requests (which are required to be inactive or freed) does not help with consistency. From an implementor's perspective it makes more sense to require that all persistent requests be inactive and freed, or simply that all requests be freed (and then the part about
For MPI finalize, no MPI library required persistent handles to be freed. We also do not require datatype handles or ... to be freed before finalize. Would it be helpful to write an AtoU (Advice to Users)?
Open MPI does. Our objects refcount each other (datatype to requests, requests to communicators, ...) such that at the end we know if there are dangling objects. In debug mode we even print a list of dangling objects. |
MPICH will also print a list of still-allocated objects during MPI_FINALIZE in a debug build, though I am not sure the user is required to do anything with that information.
A possible approach to dealing with dangling/stale handles/objects is to permit MPI_Request_free (and MPI_Type_free, ...) to be called after MPI_Finalize. I'm not saying it's a good approach, but it points out that "valgrind cleanliness" is not just the responsibility of the MPI library, but a cooperation between MPI and the user.

Open MPI keeps a list internally of active requests, so it can tell the difference between "there are no active requests right now" and "there is at least one active request right now". It can partition the list of active requests by the session they are derived from, so it can tell the difference between "there are no active requests for session X (which might be the World model)" and "there is at least one active request for session X (which might be the World model)". During MPI_Finalize/MPI_Session_finalize, even one active request in the World model/particular session means an erroneous program (go mad, do whatever you want). Likewise, during MPI_Comm_disconnect for a communicator derived from the World/Session model, even one active request in the World model/particular session means an erroneous program.

The problem highlighted is exclusively about the existence of inactive requests at the point of destroying a communicator (i.e. at the point of calling finalize or disconnect).

Axiom: the internal request structure must keep a pointer to the communicator (not strictly true, but let's pretend that is an axiom and argue about it separately).

Question: why does that pointer contribute to the ref-count on the communicator when the request is inactive? If MPI_Start (and friends) incremented the ref-count on the communicator and MPI_Wait (and friends) decremented it, then the existence of inactive requests would not contribute to the ref-count and would not prevent MPI from destroying the communicator when told to do so, e.g. by MPI_Comm_disconnect. The obvious objection is that destroying a request possibly/sometimes involves destroying the objects to which it holds a pointer (i.e. when that pointer is the last ref-counted reference to that other object).

Question: why does destroying a request cascade to potentially destroying the communicator to which it holds a pointer? The answer is MPI_Comm_free -- when MPI_Comm_free only marks the communicator for later destruction (because the ref-count is nonzero), it expects that some later action will actually destroy the communicator (when the ref-count becomes zero).

If inactive requests have a reference to the communicator (as well as a pointer to it), then the existence of even one inactive request will prevent MPI_Comm_free from destroying the communicator. In this case, destroying inactive requests to remove the reference is necessary before the communicator can be destroyed -- which means an inactive request is a pending communication in the sense of the MPI Standard wording for MPI_Comm_disconnect.

If inactive requests have a pointer to the communicator, but not a reference (i.e. it is known by design that the communicator ref-count, if it still exists, is not counting this pointer at this time), then their existence does not prevent MPI_Comm_free from destroying the communicator; only active requests have a reference and can prevent destruction during MPI_Comm_free. In this case, destroying inactive requests is not necessary before the communicator can be destroyed (because they do not contribute to the ref-count) -- which means an inactive request is NOT a pending communication in the sense of the MPI Standard wording for MPI_Comm_disconnect.

Consequences, part 1: if an inactive request does NOT have a reference to the communicator, then the communicator could be destroyed by MPI while the request is inactive, and attempting to start it again after MPI_Comm_free is then going to be a potential problem. MPI_Start might attempt to dereference a dangling communicator pointer and bad things (UB) will likely happen. OTOH, MPI_Comm_free/MPI_Comm_disconnect can safely ignore the existence of inactive requests (MPI does not even need to know whether any inactive requests exist) and destroy the communicator if/when the ref-count (which counts active references for pending communication only) is zero. Thus, if inactive requests are NOT pending communications then ...

Consequences, part 2: if an inactive request DOES have a reference to the communicator, then the communicator cannot be destroyed while an inactive request exists; attempting to start it again after MPI_Comm_free is then going to be fine (probably). MPI_Start would attempt to dereference the communicator pointer and good things will likely happen. OTOH, MPI_Comm_free/MPI_Comm_disconnect must not ignore the existence of inactive requests and must not destroy the communicator while they still exist, so they must contribute to the ref-count that triggers/prevents the communicator destruction. Thus, if inactive requests ARE pending communications then ...

Conclusion: we have a choice -- which group of customers do we disappoint? Those who want to use (start, wait, etc.) inactive requests after freeing the communicator handle, or those who want to destroy the communicator itself without freeing inactive requests beforehand? Which is all a really long way of getting to the suggestion of permitting inactive requests to be freed after the communicator is destroyed. If "the request is inactive" means "has a pointer, but not a reference, to a communicator", then MPI_Request_free must not follow that pointer; it just destroys the request itself, treating the communicator pointer as if it were dangling (even if it is not). That can happen before the communicator is destroyed (the pointer would work, if used, but no-one cares because there must be at least one other reference to the communicator, so MPI_Request_free has no work to do) or afterwards (the pointer is dangling, but no-one cares because the communicator is already gone, so MPI_Request_free has no work to do). To achieve valgrind cleanliness, the user must free all their requests and destroy all their communicators, of course, but not necessarily in that order. We now have an option that doesn't disappoint either group of customers, but involves re-work in some MPI implementations.
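To make the ref-counting argument above concrete, here is a minimal hypothetical sketch in C. All names are invented (this is not Open MPI or MPICH code): starting a persistent request takes a counted reference on the communicator, completion releases it, and an inactive request holds only an uncounted pointer that MPI_Request_free never follows.

#include <stddef.h>

typedef struct hyp_comm {
    int refcount;       /* counts active references only */
    int destroyed;      /* set once the communicator object is torn down */
} hyp_comm;

typedef struct hyp_request {
    hyp_comm *comm;     /* pointer, not a counted reference, while inactive */
    int active;
} hyp_request;

static void hyp_comm_release(hyp_comm *c) {
    if (--c->refcount == 0)
        c->destroyed = 1;            /* deferred destruction, as with MPI_Comm_free */
}

void hyp_start(hyp_request *r) {
    r->comm->refcount++;             /* MPI_Start-like call takes a reference */
    r->active = 1;
}

void hyp_complete(hyp_request *r) {
    r->active = 0;
    hyp_comm_release(r->comm);       /* MPI_Wait-like completion drops the reference */
}

void hyp_request_free(hyp_request *r) {
    /* Must not follow r->comm: it may already be dangling if the communicator
       was destroyed after this request became inactive. */
    r->comm = NULL;
}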
So my view of the options is:
Well ... "before disconnect/finalize can return". With thread-multiple support, the user can free inactive requests on a different thread to release the thread blocked in disconnect/finalize. Without thread-multiple support, then yes, exactly what Ken wrote, because my wording and Ken's are equivalent in that situation.
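A minimal sketch of that thread-multiple scenario (hypothetical helper names; comm and preq are assumed to have been set up elsewhere, with MPI initialized with MPI_THREAD_MULTIPLE):

#include <mpi.h>
#include <pthread.h>
#include <stddef.h>

static MPI_Comm comm;        /* assumed: a connected communicator set up elsewhere */
static MPI_Request preq;     /* assumed: an inactive persistent request on comm */

static void *free_request(void *arg) {
    (void)arg;
    MPI_Request_free(&preq); /* may allow the thread blocked in disconnect to return */
    return NULL;
}

void disconnect_with_helper_thread(void) {
    pthread_t t;
    pthread_create(&t, NULL, free_request, NULL);
    MPI_Comm_disconnect(&comm);   /* blocks until pending communication is complete */
    pthread_join(t, NULL);
}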
Finalize and persistent requests have existed since MPI-1, and freeing them before finalize was never required.
After the discussion, I would really prefer Option 3:
This is, for me, the best option. It is very clear and symmetric:
If there is some consensus on this 3rd option, I would change the pull request accordingly.
In addition to my comment above on the handling of inactive persistent requests, here I want to comment on nonblocking requests. The following example is correct since MPI-2.0, although it might be seen as a corner case.

int snd_buf, rcv_buf, my_rank;
MPI_Comm my_comm;
MPI_Init(NULL, NULL);
MPI_Buffer_attach(...);  // attach a buffer large enough for the Bsend below
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_dup(MPI_COMM_WORLD, &my_comm);
// same ranks on my_comm and MPI_COMM_WORLD
if (my_rank == 0) {
  snd_buf = 5;
  MPI_Bsend(&snd_buf, 1, MPI_INT, /*dest=*/1, /*tag=*/0, my_comm);
} else if (my_rank == 1) {
  MPI_Request rq;
  MPI_Irecv(&rcv_buf, 1, MPI_INT, /*source=*/0, /*tag=*/0, my_comm, &rq);
  MPI_Request_free(&rq); // rq is set to MPI_REQUEST_NULL, but the operation need not be complete:
                         // rcv_buf must not be read or reused before the operation is complete
}
MPI_Comm_disconnect(&my_comm); // collective over the whole of my_comm
// all pending communication is complete, i.e.,
//  - the Irecv has filled in rcv_buf
//  - the Bsend must have sent out its buffered data
if (my_rank == 0) {
  // the application can now reuse the attached buffer for other purposes
} else if (my_rank == 1) {
  // rcv_buf is now filled with the value 5 and can be reused for other purposes
}
MPI_Finalize();

Agreed? An even more extreme corner-case example, based on the advice to implementors for MPI_Finalize:
int snd_buf, rcv_buf, my_rank;
MPI_Init(NULL, NULL);
MPI_Buffer_attach(...);  // attach a buffer large enough for the Bsend below
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank == 1) {
  snd_buf = 5;
  MPI_Bsend(&snd_buf, 1, MPI_INT, /*dest=*/0, /*tag=*/0, MPI_COMM_WORLD);
} else if (my_rank == 0) {
  MPI_Request rq;
  MPI_Irecv(&rcv_buf, 1, MPI_INT, /*source=*/1, /*tag=*/0, MPI_COMM_WORLD, &rq);
  MPI_Request_free(&rq); // rq is set to MPI_REQUEST_NULL, but the operation need not be complete:
                         // rcv_buf must not be read or reused before the operation is complete
}
MPI_Finalize();
// all pending communication is complete, i.e.,
//  - the Irecv has filled in rcv_buf
//  - the Bsend must have sent out its buffered data
if (my_rank == 1) { // if this process still exists after MPI_Finalize
  // the application can now reuse the attached buffer for other purposes
} else if (my_rank == 0) { // rank 0 in MPI_COMM_WORLD must still exist after MPI_Finalize
  // rcv_buf is now filled with the value 5 and can be reused for other purposes
}

Agreed?
In the example codes from @RolfRabenseifner that use buffered mode send operations, there is no attempt to detach the buffer before the call to disconnect/finalize -- do we wish "communication is complete" to mean "there are no internal references to the communicator used in the buffered mode send operations that were using space in the attached buffer"? That is, disconnect/finalize must do all decoupled activities, which involves draining the attached buffer, which is all/most of the work of detaching the buffer, so the user can refuse to detach the buffer but still expect a valgrind-clean state at the end. When can the user free the heap-allocated buffer space they attached to MPI?

Aside: can I do a buffered mode send from within a delete attribute callback that is called during disconnect/finalize -- is the buffer still guaranteed to be attached at the point in execution of disconnect/finalize when those callbacks are called by MPI?

After the new proposal that permits a buffered mode buffer to be attached directly to the communicator, do we need to state that disconnect/finalize both do the work of detaching the buffer, if one is still attached? Of course, finalizing a session will also do the work of detaching the buffer attached to the session, if any, as well as doing the work of detaching all the buffers attached to all the communicators derived from that session, thereby releasing all the internal references to those communicators from the decoupled activities related to all the buffered mode send operations that used any of the communicators within that session. Do we need to state that too? Is it adequately covered by "decoupled activities"?
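For comparison, the conventional pattern that does not rely on disconnect/finalize draining the buffer is an explicit detach before the communicator goes away. A minimal sketch (buffer sizing and the destination rank are illustrative, and a matching receive is assumed):

#include <mpi.h>
#include <stdlib.h>

void bsend_then_detach(MPI_Comm comm, int dest) {
    int payload = 42;
    int size = MPI_BSEND_OVERHEAD + (int)sizeof(int);
    void *buf = malloc(size);

    MPI_Buffer_attach(buf, size);
    MPI_Bsend(&payload, 1, MPI_INT, dest, 0, comm);

    /* MPI_Buffer_detach blocks until all messages currently in the buffer have
       been transmitted, so after it returns no decoupled activity still
       references 'comm' because of this Bsend. */
    void *detached;
    int detached_size;
    MPI_Buffer_detach(&detached, &detached_size);
    free(detached);
}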
Option 3 in the summary from @raffenet gives me the heebie-jeebies. It's either the status quo or a dangerous security flaw.

MPI_Recv_init(..., comm, &req);
MPI_Comm_disconnect(&comm);
assert(req == MPI_REQUEST_NULL); // this assert should fail, even if the request has been destroyed

In the above code, let's assume we have option (3) with an implementation that chooses to avail itself of this new(?) freedom to "free inactive requests during disconnect/finalize".

I think MPI libraries are permitted this level of freedom already -- this is the status quo (except for an explicit "subsequent usage of request handles is erroneous" statement). An MPI library that chooses NOT to destroy inactive requests during disconnect/finalize, and thereby leaks MPI-allocated memory, is just not high quality. Aside: this makes
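Regardless of which option is adopted, a defensive user-side pattern avoids the dangling-handle question entirely: free the inactive persistent request before disconnecting. A minimal sketch (the source rank is a parameter, and a matching send from that rank is assumed):

#include <assert.h>
#include <mpi.h>

void free_then_disconnect(MPI_Comm *comm, int source) {
    int rcv_buf;
    MPI_Request req;
    MPI_Recv_init(&rcv_buf, 1, MPI_INT, source, 0, *comm, &req);
    MPI_Start(&req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* request is inactive again; assumes a matching send */
    MPI_Request_free(&req);              /* sets req to MPI_REQUEST_NULL */
    MPI_Comm_disconnect(comm);
    assert(req == MPI_REQUEST_NULL);     /* holds by construction */
}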
With the new issue #586, it is the work of MPI_BUFFER_FLUSH (instead of detach+attach). My comment
meant that the attached global buffer can be reused for Bsends on other communicators. My comment did not mean to say that the buffer is detached. This means that here I do not want to discuss some automatic detach of the global buffer. I would really add something like
(It is always up to the programmer to fulfill such rules.) Okay?
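A sketch of the reuse being described, assuming my_comm was obtained as in the examples above and that matching receives exist so the buffered data can drain; the buffer sizing and destination rank are illustrative:

#include <mpi.h>
#include <stdlib.h>

void reuse_attached_buffer(MPI_Comm *my_comm, int dest) {
    int size = 2 * (MPI_BSEND_OVERHEAD + (int)sizeof(int));
    void *buf = malloc(size);
    int v = 1;

    MPI_Buffer_attach(buf, size);
    MPI_Bsend(&v, 1, MPI_INT, dest, 0, *my_comm);
    MPI_Comm_disconnect(my_comm);   /* buffered data on my_comm is complete afterwards */

    /* The global buffer was never detached; it can now hold Bsends on other
       communicators. */
    MPI_Bsend(&v, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);

    void *detached;
    int detached_size;
    MPI_Buffer_detach(&detached, &detached_size);
    free(detached);
}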
Here is the protocol from the virtual MPI forum meeting, June 21, 2023:
The result from the Sessions WG meeting on June 19, 2023 and the virtual MPI forum meeting on June 21, 2023 is now added to this PR 823. Commented PDF file: mpi41-report_Issue710_PR823_update10.pdf. The changes made by this PR 823 are marked with PR823 and also in yellow on pages 100, 148, 477, 483-484, 527-528, 1027, 1056, 1059, 1091, 1094-1095, 1097-1099, with additional
It will be the input for the errata reading and voting at the MPI forum meeting on July 10-13, 2023.
As for Option C: while I understand the motivation to make a change here to fix something that we don't like in the world model, I would strongly argue against this: it would create a difference between the default "world" session and all other sessions. This will not only cause issues for tools, as discussed in the last call, but will also cause problems for any library that exhibits that behavior - such libraries could not be used safely in the session model, and this is hard to determine for the end user (and the library itself would not be aware of the fact that it is used from sessions). Also, not removing persistent requests at the end would not be an unexpected behavior for a library that reuses persistent requests across library calls.
The result from the Sessions WG meeting on June 19, 2023 and the virtual MPI forum meeting on June 21, 2023 is now added to this PR 823, plus a few additional English corrections (June 22). Commented PDF file: mpi41-report_Issue710_PR823_update11.pdf. The changes made by this PR 823 are marked with PR823 and also in yellow on pages 100, 148, 477, 483-484, 527-528, 1027, 1056, 1059, 1091, 1094-1095, 1097-1099, with additional
It will be the input for the errata reading and voting at the MPI forum meeting on July 10-13, 2023.
@schulzm makes three points against option C; with extensive editorializing, these are:
Rebuttals for these three objections might use the following arguments:
How far should we go out of our way to support bad user code?
To 2) - this is not bad code or a bad design decision: tools do subset COMM_WORLD and mask that upwards to the application, and one of the use cases for sessions was to support tools with extra processes, e.g., in overlay networks. Hence, this will not be an unusual case for tools - or rather, we are preventing legitimate tool use cases.

To 3) - even a session-ized library will have to (for quite some time) rely on other existing libraries and hence falls into this category; hence, again, not bad code. We would basically be saying: if you want to switch to sessions you cannot rely on any existing MPI library/code anymore, you have to write everything from scratch, unless you know exactly what is in the library. This will not help us to promote the use of sessions - nobody will want to give up using the existing MPI ecosystem.
Protocol from the formal reading, July 10, 2023:
Final version for the no-no vote: mpi41-report_Issue710_PR823_update14-for-voting.pdf
This passed a no-no vote.
This passed an errata vote.
I am very surprised that one of the conclusions of this issue is -- users do not need to free a persistent request that they explicitly created. This is just odd, with no precedent. I am a bit upset that this got sneaked in as an errata. It is one thing that the previous text neglected to mention the requirement, but it is another thing to issue an errata to clarify that the neglect was intentional. I guess I am getting ahead of myself -- is my interpretation of the errata correct? And what is the logic that allows explicit MPI object creation without a corresponding explicit free/destroy? PS: sorry to comment on an already closed issue.
I created a new issue to capture the ongoing discussion of this: #732
Thanks @Wee-Free-Scot for creating the new tracking issue. Separately, @raffenet and I think this is questionable under the current errata procedures. An errata, by common sense, should be a correction of non-controversial mistakes. The errata in question had significant negative votes -- 2 No and 5 Abstain -- and thus, in my opinion, should not qualify as an errata. We think passing errata should have a much higher bar than other votes. For example, I think it is reasonable to expect "no-no" for errata votes.
@wgropp What do you think about #710 (comment)?
Draft implementation of proposed semantics from mpi-forum/mpi-issues#710.
Maintain a per-comm hash for persistent requests so we can check and free them at MPI_Comm_disconnect or MPI_Finalize if the user forgets to free them. We need to apply per-VCI critical sections to prevent data races. Reference: mpi-forum/mpi-issues#710.
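A minimal sketch of the idea in that commit message, with invented names (this is not actual MPICH code): a per-communicator registry of persistent requests that disconnect/finalize can sweep. A plain linked list stands in for the hash, and the per-VCI locking is omitted.

#include <stdlib.h>

struct preq_node { struct preq_node *next; void *request; };

struct comm_registry {
    struct preq_node *persistent;   /* persistent requests created on this comm */
};

/* Called from the request-creation path (e.g. a Send_init-like function),
   inside whatever lock protects the communicator's state. */
static void registry_add(struct comm_registry *reg, void *request) {
    struct preq_node *n = malloc(sizeof *n);
    n->request = request;
    n->next = reg->persistent;
    reg->persistent = n;
}

/* Called from disconnect/finalize: free any persistent requests the user
   forgot to free (they must already be inactive at this point). */
static void registry_cleanup(struct comm_registry *reg,
                             void (*free_request)(void *)) {
    while (reg->persistent) {
        struct preq_node *n = reg->persistent;
        reg->persistent = n->next;
        free_request(n->request);
        free(n);
    }
}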
Problem
The text uses the term "pending communication", but nowhere is it defined. Here, "decoupled MPI activities" may be the correct wording; see the new MPI-4.1 Section 2.9 (Progress).
§11.10.4
would be
Another problem is that the phrase "has the same action ... except ..." is wrong, because the
which also implies a different outcome (= action) if the requirements are not fulfilled, and
see also New example showing differences between MPI_COMM_FREE and MPI_COMM_DISCONNECT #536.
Furthermore, the text in MPI-4.0
does not seem to be clear enough, as the discussion at the MPI forum meeting on May 3, 2023 has shown.
Especially in the definition of MPI_Finalize
it is clear that calls to MPI_WAIT, MPI_TEST, or MPI_REQUEST_FREE (to free request handles of nonblocking operations or to complete or free persistent requests) must be made before MPI_Finalize.
Open questions have been
and which special rules then apply to the still-existing request handle?
Consensus was that neither MPI_Finalize nor the sections about point-to-point and collective persistent operations require that an inactive persistent request handle be freed with MPI_REQUEST_FREE before calling MPI_Finalize.
See also discussions in issues #543 and #676.
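A minimal sketch of the pattern covered by this consensus (the start/wait cycle is elided and the peer rank is illustrative): the persistent request is inactive at MPI_Finalize but has never been freed by the user.

#include <mpi.h>

int main(void) {
    int buf = 0, peer = 0;
    MPI_Request preq;

    MPI_Init(NULL, NULL);
    MPI_Send_init(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &preq);
    /* ... MPI_Start(&preq); MPI_Wait(&preq, MPI_STATUS_IGNORE); ... */

    /* preq is inactive here and is never passed to MPI_REQUEST_FREE;
       per the consensus above this is not erroneous at MPI_Finalize. */
    MPI_Finalize();
    return 0;
}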
Proposal
The proposal in the pull request tries only to resolve the inconsistencies; because this is an errata, it does not include any future extensions for MPI_REQUEST_GET_STATUS.
In addition, it makes the same clarification for MPI_Finalize.
Changes to the Text
See above. This should be part of the "cleanup" for MPI 4.1.
Alternatively, the term "pending communication" could be properly defined in the terms section.
(Risky, because its meaning may be different in different contexts)
Impact on Implementations
None (unless it renders some implementations non-compliant due to confusion stemming from the usage of the term "pending").
Exception: if an MPI implementation requires that related persistent requests are freed (instead of only being inactive) before calling MPI_COMM_DISCONNECT (i.e., requiring more than for MPI_FINALIZE), then this bug should be fixed.
Impact on Users
Clear use of terms helps users better understand the text without having to reason about what a "pending operation" is.
References and Pull Requests
PR 823