
Question about the MPI Network Layer (is message order guaranteed?) #397

Open
Steamgjk opened this issue Mar 18, 2022 · 6 comments

Comments

@Steamgjk commented Mar 18, 2022

Hi, Galois Staff.

I am now studying the network layer of Galois (mainly NetworkBuffer.cpp and NetworkIOMPI.cpp).

I notice that, in async mode, Galois calls MPI_Isend and MPI_Irecv for communication, and messages are given different tags (based on galois::runtime::evilPhase).

I understand that messages are sent in ascending order of tag (evilPhase).

But I found here (see the link below) that MPI does not guarantee that messages are received in the same order they were sent if they carry different tags:
https://stackoverflow.com/questions/20909648/is-there-a-fifo-per-process-fifo-guarantee-on-mpi-isend

That means, when MPI_Isend sends two messages with tag1 and tag2 respectively, where tag1 < tag2 (say tag1=1, tag2=2), the other node may receive the tag2 message ahead of tag1, because in

int rv = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag,

it is probing for any tag. If the tag2 message is received first, it is put into the done queue, and the workerThread copies it to recvBuffer.data:

NetworkIO::message rdata = netio->dequeue();
if (rdata.data.size()) {
  ++statRecvDequeued;
  assert(rdata.data.size() !=
         (unsigned int)std::count(rdata.data.begin(), rdata.data.end(), 0));
  galois::runtime::trace("BufferedRecieving", rdata.host, rdata.tag,
                         galois::runtime::printVec(rdata.data));
  recvData[rdata.host].add(std::move(rdata));
}
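
For concreteness, here is a simplified standalone sketch (not the actual Galois code) of the probe-and-enqueue pattern I am describing; Message, inbox, and pollOnce are illustrative stand-ins for NetworkIO::message, the internal receive queue, and the workerThread loop body. Run with 2 ranks:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <vector>

struct Message {
  int host;
  int tag;
  std::vector<uint8_t> data;
};

std::deque<Message> inbox; // FIFO of received messages, in probe order

// One iteration of the receive loop: if any message is pending, pull it in.
bool pollOnce() {
  int flag = 0;
  MPI_Status status;
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
  if (!flag)
    return false;
  int count = 0;
  MPI_Get_count(&status, MPI_BYTE, &count);
  Message m{status.MPI_SOURCE, status.MPI_TAG, std::vector<uint8_t>(count)};
  // A blocking receive is safe here because the probe already matched it.
  MPI_Recv(m.data.data(), count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  inbox.push_back(std::move(m)); // enqueued strictly in the order probed
  return true;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    std::vector<uint8_t> payload(8, 0x2A);
    MPI_Send(payload.data(), (int)payload.size(), MPI_BYTE, 1, /*tag=*/1,
             MPI_COMM_WORLD);
  } else if (rank == 1) {
    while (!pollOnce()) {
    } // spin until the single message arrives
    std::printf("dequeued: host=%d tag=%d size=%zu\n", inbox.front().host,
                inbox.front().tag, inbox.front().data.size());
  }
  MPI_Finalize();
  return 0;
}
```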

Later, syncNetRecv calls recieveTagged with tag=1:

void syncNetRecv(std::string loopName) {
  auto& net = galois::runtime::getSystemNetworkInterface();
  std::string wait_timer_str("Wait_" + get_run_identifier(loopName));
  galois::CondStatTimer<GALOIS_COMM_STATS> Twait(wait_timer_str.c_str(),
                                                 RNAME);
  if (async) {
    size_t syncTypePhase = 0;
    if (syncType == syncBroadcast)
      syncTypePhase = 1;
    decltype(net.recieveTagged(galois::runtime::evilPhase, nullptr,
                               syncTypePhase)) p;
    do {
      p = net.recieveTagged(galois::runtime::evilPhase, nullptr,
                            syncTypePhase);

But the head of the queue has tag=2:

if (rq.hasData(tag)) {

std::atomic<uint32_t> dataPresent; // tag of head of queue
In that case, we can never match the tag and get the message with tag=1 (because the data queue is a deque rather than a priority queue). If reordering really occurs and the larger tag (tag=2) is received first, it will cause head-of-line blocking. I am not sure whether my guess can actually happen, so I hope to get some explanation from the staff.
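
To make the concern concrete, here is a minimal standalone model (not Galois code; Message and hasDataFor are hypothetical names) of a lookup that only inspects the tag at the front of a FIFO deque:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

struct Message {
  uint32_t tag;
  std::vector<uint8_t> data;
};

// Only the head of the queue is examined, mirroring the behavior described
// above for the rq.hasData(tag) check.
bool hasDataFor(const std::deque<Message>& q, uint32_t tag) {
  return !q.empty() && q.front().tag == tag;
}

int main() {
  std::deque<Message> q;
  q.push_back({2, {0xAB}}); // suppose the tag=2 message was enqueued first
  q.push_back({1, {0xCD}}); // the tag=1 message sits behind it

  assert(!hasDataFor(q, 1)); // a caller asking only for tag=1 sees nothing
  assert(hasDataFor(q, 2));  // the queue can only make progress via tag=2
  return 0;
}
```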

@Steamgjk (Author)

BTW: What is the intuition behind the assert statement?
Why are you so sure that data.back().data.size() will not equal the right-hand side?

assert(data.back().data.size() !=
       (unsigned int)std::count(data.back().data.begin(),
                                data.back().data.end(), 0));

@roshandathathri (Member)

NetworkInterface does not guarantee in-order delivery (even for the same tag). Gluon(-Async) ensures that all messages are eventually received. You can read the Gluon-Async paper for more details.

The assert in line 226 is just asserting that a message with all 0s is NOT sent/received.
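
In other words, the count of zero bytes can only equal the total size when every byte is zero. A tiny standalone illustration (not Galois code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

int main() {
  std::vector<uint8_t> ok{0, 1, 0, 2}; // has at least one nonzero byte
  std::vector<uint8_t> bad(4, 0);      // an all-zero message

  auto allZero = [](const std::vector<uint8_t>& v) {
    // the count of zeros equals size() only when every byte is zero
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), 0)) ==
           v.size();
  };

  assert(!allZero(ok)); // the assert quoted above passes for a message like this
  assert(allZero(bad)); // ...and would fire for a message like this
  return 0;
}
```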

@Steamgjk (Author)

Hi, @roshandathathri
Thanks for your explanation.
I am a bit surprised that there is "no in-order delivery EVEN for the SAME tag". Does that mean the answer here is wrong?
https://stackoverflow.com/questions/20909648/is-there-a-fifo-per-process-fifo-guarantee-on-mpi-isend

"In MPI, you are guaranteed that all messages on the same communicator/tag/rank combo will be received in the same order they were sent."

@roshandathathri (Member)

No, MPI guarantees in-order delivery for the same tag. NetworkInterface doesn't because it can buffer messages and return smaller, later messages before larger, earlier messages are received.
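
For reference, a minimal standalone sketch (not from the Galois codebase) of that MPI guarantee: two sends from the same rank, on the same communicator and tag, are matched in the order they were posted. Run with 2 ranks:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int tag = 7;
  if (rank == 0) {
    int first = 1, second = 2;
    MPI_Request reqs[2];
    MPI_Isend(&first, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&second, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  } else if (rank == 1) {
    int a = 0, b = 0;
    // Same source, same communicator, same tag: MPI's non-overtaking rule
    // guarantees a == 1 and b == 2 here.
    MPI_Recv(&a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&b, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::printf("received %d then %d\n", a, b);
  }
  MPI_Finalize();
  return 0;
}
```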

@Steamgjk (Author)

> No, MPI guarantees in-order delivery for the same tag. NetworkInterface doesn't because it can buffer messages and return smaller, later messages before larger, earlier messages are received.

Cool, @roshandathathri. We are on the same page regarding "in-order delivery for the same tag".

Then it comes back to the original question I raised about the implementation.
Let me rephrase it more briefly (please correct me if I am wrong).

The receiver side actually uses one workerThread with a while loop, which keeps doing Send, Probe, and Receive with MPI.

When something is probed, it continues by calling MPI_Irecv to get the message and puts it into a FIFO queue (a std::deque). Incoming messages are then dequeued in FIFO order when other threads call recieveTagged with a tag parameter.

My concern is that the tag parameter is monotonically increasing (it tracks the evilPhase variable), so the application must receive tag=1 first and then tag=2.

But if the message with tag=2 is probed and enqueued earlier, it will sit at the front of the queue. Whenever recieveTagged (with tag=1) is called, it checks rq.hasData, finds that the front message's tag is not 1, and returns false. In that case, will it get stuck here (just like head-of-line blocking)?

@roshandathathri (Member)

Yes, the application is expected to check for both tags and NOT block/wait on one tag.
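
Something along these lines is the intended usage pattern (a hedged sketch only; tryReceiveTagged is a hypothetical stand-in, NOT the Galois recieveTagged API): keep polling every tag you may still receive on, rather than spinning on a single one.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct Message {
  uint32_t tag;
  std::vector<uint8_t> data;
};

// Hypothetical receive buffer: only hands out a message whose tag matches
// the head of the queue.
std::deque<Message> inbox;
std::optional<Message> tryReceiveTagged(uint32_t tag) {
  if (!inbox.empty() && inbox.front().tag == tag) {
    Message m = std::move(inbox.front());
    inbox.pop_front();
    return m;
  }
  return std::nullopt;
}

// Poll both phases (tags) round-robin so a message for either one can drain
// the queue; never block on a single tag while the other sits at the front.
void drainBothPhases(uint32_t tagA, uint32_t tagB, std::size_t expected) {
  std::size_t received = 0;
  while (received < expected) {
    for (uint32_t tag : {tagA, tagB}) {
      if (auto m = tryReceiveTagged(tag)) {
        ++received; // ... handle *m here ...
      }
    }
  }
}

int main() {
  inbox.push_back({2, {1}}); // a "later" tag happens to arrive first
  inbox.push_back({1, {2}});
  drainBothPhases(1, 2, 2);  // completes because both tags are polled
  return 0;
}
```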
