Ensure processes always synchronize in t_shapesame test setup #306
Conversation
File formatting fixed in PR #303.
        mrc = MPI_Barrier(MPI_COMM_WORLD);
        VRFY((mrc == MPI_SUCCESS), "Sync after large dataset writes");
    }
    mrc = MPI_Barrier(MPI_COMM_WORLD);
I see that the code change is a simplification, but I question why one needs a Barrier in the first place. If we're using independent I/O (which would seem to be the condition for the "if (!use_collective_io)" check), then one really can't guarantee when ALL of the "other" ranks get to this point in the code. If, on the other hand, a collective I/O operation has just taken place, then we don't need yet another Barrier to synchronize processes.
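For concreteness, a minimal sketch of the pattern being discussed (this is not the actual t_shapesame code; write_phase, dset_id, mem_space, file_space, buf, and the use_collective_io flag are stand-ins for whatever the test actually sets up):

#include <mpi.h>
#include <hdf5.h>

/* Sketch only: dataset/dataspace creation is omitted. */
static void write_phase(hid_t dset_id, hid_t mem_space, hid_t file_space,
                        const int *buf, int use_collective_io)
{
    hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);

    /* Select the MPI-IO transfer mode the comment refers to. */
    H5Pset_dxpl_mpio(dxpl_id, use_collective_io ? H5FD_MPIO_COLLECTIVE
                                                : H5FD_MPIO_INDEPENDENT);

    H5Dwrite(dset_id, H5T_NATIVE_INT, mem_space, file_space, dxpl_id, buf);
    H5Pclose(dxpl_id);

    /* Independent mode gives no ordering across ranks, and even a collective
     * H5Dwrite may return on one rank before the others are done, so the
     * test synchronizes explicitly before anything reads the data back. */
    MPI_Barrier(MPI_COMM_WORLD);
}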
Sorry for the above comment; I was thinking that this was somewhere in the main code, not in a test.
Having the Barrier in the test code is perfectly acceptable.
Even though this is in test code, it seems the MPI standard doesn't necessarily force synchronization to occur for collective I/O. Is that just something that most implementations happen to do? The standard mentions that a rank may return from a collective as soon as its portion of the operation is complete, and that this applies to all of the I/O collectives as well.
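To illustrate that point with plain MPI-IO (a toy program, unrelated to the test itself; the file name is invented): MPI_File_write_at_all is collective, but a rank may return as soon as its own portion is written, so the barrier is what establishes that every rank has finished the write phase before anyone depends on it.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, val;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank;

    MPI_File_open(MPI_COMM_WORLD, "blocks.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank writes its own block, but returning from
     * the call does not imply that the other ranks have written theirs. */
    MPI_File_write_at_all(fh, (MPI_Offset)(rank * sizeof(int)), &val, 1,
                          MPI_INT, MPI_STATUS_IGNORE);

    /* The barrier guarantees all ranks have completed the write phase before
     * the program moves on (full read-after-write visibility across ranks
     * additionally needs MPI_File_sync or atomic mode per the standard's
     * consistency rules). */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}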
For test code, the change looks good to me -- however, a few comments:
The use_collective_io parameter is now used only in debugging printfs -- perhaps it should be removed.
To avoid introduction of similar errors in the future, a comment noting the change and explaining the reason why would be useful. I would put it in a change note in the header comment, but the exact location doesn't matter.
Are the barriers still necessary in the collective case if MPI file atomicity is set? (MPI_FILE_SET_ATOMICITY(fh, flag)). This point is irrelevant for the purposes of the code in question since it is just test code. However, we may encounter similar issues elsewhere -- if so, it would be good to know.
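As a reference point for that last question, here is a small, self-contained sketch (file name and access pattern invented) of what atomic mode buys: with MPI_File_set_atomicity enabled, a write by one rank followed by a barrier is visible to a read by another rank, without the sync-barrier-sync sequence the default nonatomic mode requires. The barrier itself is still needed to order the accesses.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, out, in = -1;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File_open(MPI_COMM_WORLD, "atomic.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Collective call; the flag must match on all ranks of the communicator. */
    MPI_File_set_atomicity(fh, 1);

    out = rank;
    MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), &out, 1,
                      MPI_INT, MPI_STATUS_IGNORE);

    /* The barrier orders the accesses; with atomicity on, the read below is
     * then guaranteed to see the neighbor's freshly written value. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_read_at(fh, (MPI_Offset)(((rank + 1) % nprocs) * sizeof(int)),
                     &in, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}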
The use_collective_io parameter is now used only in debugging printfs -- perhaps it should be removed.
That's a good point. I have a set of changes I've been working on to try to clean up all the warnings related to snprintf in the library. I can see about trying to integrate this change there.
To avoid introduction of similar errors in the future, a comment noting the change and explaining the reason why would be useful. I would put it in a change note in the header comment, but the exact location doesn't matter.
This PR was merged quickly since it was such a small change, but I think this is also a good idea. It would be simple enough to add that change separately, but I also think we need to start considering how we can integrate the external VOL tests repo and its changes back into the library proper; I think that would be a good time to add in a small note about the change made.
Are the barriers still necessary in the collective case if MPI file atomicity is set?
For the purpose of the DAOS VOL connector, I'd imagine that the MPI atomicity setting would just get ignored, since we don't directly use any of the MPI I/O routines. For the native connector, it seems that there is some process synchronization happening anyway (it's unclear whether it is the MPI implementation doing this, or if this is occurring elsewhere), so the atomicity setting would probably just incur unnecessary overhead.
As the MPI standard does not guarantee process synchronization in collective operations, VOL connectors with different synchronization semantics (such as the DAOS VOL) could cause t_shapesame to have data verification issues during collective I/O. When the final barrier in t_shapesame's test setup function didn't execute, some ranks could begin writing to a dataset for the next iteration while other ranks were still reading that same dataset back from the previous iteration.
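Schematically, the failure mode looks like the fragment below (the loop structure and the write_dataset/verify_dataset helpers are assumptions for illustration, not the real test code):

for (i = 0; i < n_iterations; i++) {
    write_dataset(dset_id, i);            /* all ranks write iteration i      */

    mrc = MPI_Barrier(MPI_COMM_WORLD);    /* writes finished everywhere       */
    VRFY((mrc == MPI_SUCCESS), "sync after writes");

    verify_dataset(dset_id, i);           /* all ranks read iteration i back  */

    mrc = MPI_Barrier(MPI_COMM_WORLD);    /* reads finished everywhere        */
    VRFY((mrc == MPI_SUCCESS), "sync after reads");

    /* If this last barrier is skipped, a fast rank can enter iteration i+1
     * and start overwriting the dataset while a slow rank is still reading
     * iteration i, which is the verification failure described above. */
}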