-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of a response that comes after MRP resends end. #29640
Conversation
PR #29640: Size comparison from 43981ac to 3b2e700 Increases (70 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, nrfconnect, psoc6, qpg, telink)
Decreases (19 builds for bl702, bl702l, cc32xx, linux, psoc6)
Full report (71 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
|
After project-chip#29173 we can get into the following situation: 1. A message is sent. 2. Before we get an ack or response, all MRP retries happen, MRP gives up, but the exchange response timer has not been hit yet. 3. We get an actual response. 4. Because our exchange is marked as having an un-acked message, but the incoming message is not treated as an ack (because the MRP state that would do that has been torn down), we do not clear our "have un-acked message" state and end up discarding the incoming message. The fix is as follows: * Rename things to make it clear that what we really have is "waiting for an ack" state, which in fact _does_ get cleared when we run out of MRP retries, not an "un-acked message" state. * Have a separate state bit for tracking that we ran out of MRP retries on a message we sent.
3b2e700
to
6ecfb46
Compare
PR #29640: Size comparison from 43981ac to 6ecfb46 Increases (70 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, nrfconnect, psoc6, qpg, telink)
Decreases (19 builds for bl702, bl702l, cc32xx, linux, psoc6)
Full report (71 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
|
PR #29640: Size comparison from 43981ac to d06fbf8 Increases above 0.2%:
Increases (26 builds for bl602, bl702, bl702l, cc32xx, linux, psoc6, telink)
Decreases (67 builds for bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
Full report (71 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
|
@@ -489,6 +490,9 @@ void ExchangeContext::NotifyResponseTimeout(bool aCloseIfNeeded) | |||
} | |||
#endif // CONFIG_DEVICE_LAYER && CHIP_CONFIG_ENABLE_ICD_SERVER | |||
|
|||
// Grab the value of WaitingForResponseOrAck() before we mess with our state. | |||
bool gotMRPAck = !WaitingForResponseOrAck(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem any difference between the flags WaitingForResponseOrAck and WaitingForAck in the way they are actually used. So, is the former required? We could use the latter for gotMRPAck instead, no?
Because once we receive the application response, we would be implicitly treating it as an Ack for the sent message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem any difference between the flags WaitingForResponseOrAck and WaitingForAck in the way they are actually used
There is absolutely a difference. WaitingForResponseOrAck
gets set to false when we get a response (which might carry an ack... but we might be able to tell that) or an ack. WaitingForAck
gets set to false when we get an ack or stop listening for acks.
Because once we receive the application response, we would be implicitly treating it as an Ack for the sent message.
The point is: we don't. And the reason we don't is that once WaitingForAck
goes false we are no longer tracking what message id has an outstanding ack (if any), and hence can't tell whether to treat the application response as that ack...
We could consider changing that, having a single WaitingForAck
bit that stays true of MRP runs out of resends, and treating an app response, no matter whether it carries an ack or not (?), as an ack if we are no longer in the MRP table. That involves some more widespread logic changes than what we did here, but would be reasonably equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I read, it gets set to false in ReliableMessageContext.cpp:108, rt? That is when MRP is processing its Ack, be it either a StandaloneAck or a piggybacked Ack. I was treating them both as the same, i.e, an acknowledgment of delivery of the sent message, hence an Ack.
Yes, once WaitingForAck goes false, we have removed the message from RetransmitTable and do not care about retransmission anymore. However, Acks can still arrive after that, as you stated(maybe in a piggybacked form). These are duplicate Acks and are generally discarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it gets set to false in ReliableMessageContext.cpp:108, rt?
And also in all the places where SetResponseExpected(false)
is called.
These are duplicate Acks and are generally discarded.
The point is: if we removed fthe message from RetransmitTable because we ran out of retransmits, they may not in fact be duplicate acks. But we just can't tell, because we removed our state that would enable us to tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were removing from the retransmit table only on all MRP retries timeout or an Ack was received. Are we evicting an outstanding retransmit to make room for a new Send? That should be a SendError because our RetransmitTable is full, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were removing from the retransmit table only on all MRP retries timeout or an Ack was received.
That is correct. The question is what happens when a message is received after the MRP retries timeout but before response timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an MRP standpoint, this message means nothing as all state for Ack receipt is gone(although this message is proof of delivery of the sent message). However, since Exchange has not timed out, the application may be waiting for the response still, and the question is should this be delivered up to the app? Or thrown away. I guess, if the ExchangeContext exists, then it MAY be sent up to the app bypassing all MRP checks.
One option would have been to notify the App with OnSendError() upon MRP timeout, and that would have allowed the App to either close the exchange or wait. But in the absence of that, the App will wait until ResponseTimeout and if the Response comes in within the timeout and the Exchange is available, then it can be sent to the app. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this message means nothing as all state for Ack receipt is gone
Yep, that is what we implement.
the question is should this be delivered up to the app
The answer to this question is "yes". ;)
Or thrown away.
It being thrown away (due to recent changes to the MRP logic which this PR undoes) is exactly what this PR is fixing. Throwing it away leads to an unacceptably high rate of commissioning failures.
WDYT?
Sounds like we are in agreement.
if (!GetReliableMessageMgr()->CheckAndRemRetransTable(this, ackMessageCounter)) | ||
if (GetReliableMessageMgr()->CheckAndRemRetransTable(this, ackMessageCounter)) | ||
{ | ||
SetWaitingForResponseOrAck(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just set SetWaitingForAck(false) here, instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because what if ackMessageCounter is an ack for a totally different message counter value than the one we are waiting for an ack for? As in, it's some stale ack that was sent for a previous message, delayed in transit, and is now getting delivered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, isn't HandleRcvdAck actually processing the Ack for the message it is expecting one for? So, setting the flag to false is essentially acknowledging an MRP Ack, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, isn't HandleRcvdAck actually processing the Ack for the message it is expecting one for?
Nope, it's processing an ack if we got an ack. It doesn't check what it's an ack for; that's checked down in CheckAndRemRetransTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. But I meant where it is being currently set inside HandleRcvdAck() in your change after checking the retransmit table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that is indeed the "We got the ack we were expecting" case. But in that case, IsWaitingForAck has already been set false, no? So I am not sure I understand the question here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, precisely. IsWaitingForAck is set to false when the RetransEntry is destroyed. So, gotMRPAck in ExchangeContext.cpp:494 should be assigned to !IsWaitingForAck() instead, isn't it? If we need separate tracking of App response and MRP Ack, we can achieve that using the existing ResponseExpected(for app response) and WaitingForAck(for MRP), no? My point was whether there is a need for ResponseOrAck flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsWaitingForAck is set to false when the RetransEntry is destroyed
Which happens if either we get the ack or we stop waiting for the ack.
In other words, IsWaitingForAck tells you nothing about whether we got an ack. At least after this PR, or before #29173
Fundamentally, we have the following states that we need to be able to tell apart in OnResponseTimeout and in "message received" handling:
- We have not yet gotten an ack, we have not timed out MRP.
- We have not yet gotten an ack, but we have timed out MRP.
- We got an ack.
OnResponseTimeout needs to handle state 3 different from states 1 and 2, right? It cannot use the "got response" bit for that, since that bit is false by assumption in OnResponseTimeout, so we need a bit to tell those cases apart.
"message received" handling needs to handle state 1 differently from state 2: In state 1 it needs to drop messages that are acking the wrong thing, while in state 2 it needs to process them because it has no way to tell that it's the wrong thing. This needs a second bit to tell the cases apart, yes?
/// (1) We sent a message that expected a response (hence | ||
/// IsResponseExpected() is true). | ||
/// (2) We have received neither a response nor an ack for that message. | ||
kFlagWaitingForResponseOrAck = (1u << 11), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this flag? Because once we receive the Message response, we would be practically done with the Exchange and not care about MRP Acks. And there is only one timeout for the Message response(compared to multiple for MRP retries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would be practically done with the Exchange
That's not at all true. Many exchanges have long trains of messages going back and forth.
For the rest, see #29640 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that we do. But we do not have multiple outstanding sent messages. So, an Exchange is a StopAndWait protocol and one and only one message is waiting for a delivery Ack at a given time.
I might have overlooked the case where there are multiple back-and-forth over a single exchange, but my point was about the single sent outstanding message. It is not a sliding window where a received response is not acking an outstanding sent message(since there is only one outstanding sent message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Again, we could treat a non-duplicate response with a piggyback ack as an implicit ack for "whatever the last thing we sent was, even if we have no way to check whether the piggyback is for the right thing", as #29640 (comment) describes. If you think that would be clearer, we can do that. We still need two flags to track the "no ack" and "we have no way to check" states, which are not identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is technically true in our stop-and-wait scenario
, that a received response is actually a true acknowledgment for the message sent earlier(even if all MRP retries have been exhausted), we can ignore the piggybacked Ack in the response(if MRP timeout has happened), but send the response up to the application if the response timeout has not happened and the exchange is still open. For the 2 states, I think WaitingForAck for MRPAck and IsResponseExpected for app response should cover the 2 scenarios, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if MRP timeout has happened
This parenthetical is really important: There is nothing tracking whether this has happened right now. We could have a bit to track this, if we want, and then that bit plus the "waiting for ack" bit plus the "response expected" bit would let us detect the states we want.
For the 2 states
There are more than 2 states. See #29640 (comment)
// message... but of course it _does_ have an un-acked message and | ||
// we have just given up on waiting for the ack. | ||
|
||
ec->GetReliableMessageContext()->SetMessageNotAcked(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind-of raises a protocol question in the sense that, by not setting MessageNotAcked to true, we are stating that we are still open to receiving an acknowledgment of delivery via the response, although all our retransmits have been exhausted. This clearly is an indicator of misconfiguration of MRP timeouts or the application response timeout, isn't it? Ideally, if we have exhausted all retries, it should go up as an OnSendFailure() callback for the application to take a necessary action(of closing the exchange, etc) even if the ResponseTimeout is outstanding, isn't it?
Ideally, the ResponseTimeout should be some function of the MRP intervals and number of retries with some additional buffer time, and not entirely decoupled from the MRP values.
MRP, at its level has a window within which it is expected to process acknowledgments to its sent messages and if that window passes, should it still process late response messages as Acks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind-of raises a protocol question in the sense that, by not setting MessageNotAcked to true
There is no more "message not acked" flag, precisely because the name did not accurately reflect what the flag did. What the flag did (and still does after this PR, under the new name) is track "We have not gotten an ack, and if we do get an ack for the relevant message counter value we will clear this flag". That last "if ..." happens to be false once we reach this code, which is why setting the flag to true here, which was added recently, was wrong.
This clearly is an indicator of misconfiguration of MRP timeouts or the application response timeout, isn't it?
Well... The application response timeout can be quite long for a long-running operation. MRP timeouts are quite short (4s for all resends) for devices that don't claim to be sleepy. Realistically, the use of sleepy params or MRP timeouts is kind of broken, but that's a brokenness required by the spec, unfortunately, which has been reaised as a spec issue.
Ideally, if we have exhausted all retries, it should go up as an OnSendFailure() callback for the application to take a necessary action(of closing the exchange, etc) even if the ResponseTimeout is outstanding, isn't it?
Honestly: that would break the world as things stand, as long as we are following the spec-required algorithm here.
Ideally, the ResponseTimeout should be some function of the MRP intervals and number of retries with some additional buffer time, and not entirely decoupled from the MRP values.
It's not, generally, decoupled. The default behavior is that the app provides how much time to allow on top of the MRP timeout. So the response timeout is nearly always the MRP timeout + some app-defined constant.
should it still process late response messages as Acks
That is the big question, yes. The current implementation does not, because it has no way to tell whether they are in a sane way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. So, app response timeout is well correlated with MRP timeout which is good.
However, OnSendFailure() is not implemented in our SDK. Curious what happens in scenarios where there is a critical issue in sending the message out(say the key is wrong and the message cannot be encrypted) and so all MRP retransmits fail. In that case, the application is forced to wait out the full ExchangeContext timeout when it could have been notified of a Send failure.
Also, regarding the ResponseTimeout, especially in Exchanges where there are multiple back-and-forth messaging, there is a distinction between the ResponseTimeout and a timeout for the entire Exchange. The former should, ideally capture the timeout for the current outstanding sent app message, whereas the latter would be a sum of all the individual back-and-forth. So, for each new send on the same exchange, the response timeout should be re-set for each new send. Not sure if we do this, or the Response timeout is essentially a timeout for all message send and receives.
That is the big question, yes. The current implementation does not, because it has no way to tell whether they are in a sane way.
I think, we can ignore the MRPAcking part of it since that timeout has already passed, but still send the response to the application if it is still waiting and the Exchange is valid. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, OnSendFailure() is not implemented in our SDK.
Correct.
Curious what happens in scenarios where there is a critical issue in sending the message out(say the key is wrong and the message cannot be encrypted)
That's a synchronous failure from SendMessage.
there is a distinction between the ResponseTimeout and a timeout for the entire Exchange
We have no concept of "timeout for the entire Exchange" right now.
for each new send on the same exchange, the response timeout should be re-set for each new send.
Yep, that's what we do. Or rather you can configure it once, and the timer is restarted for each send. Or you can change it as you go (and CASE does that, in fact).
…-chip#29640) * Fix handling of a response that comes after MRP resends end. After project-chip#29173 we can get into the following situation: 1. A message is sent. 2. Before we get an ack or response, all MRP retries happen, MRP gives up, but the exchange response timer has not been hit yet. 3. We get an actual response. 4. Because our exchange is marked as having an un-acked message, but the incoming message is not treated as an ack (because the MRP state that would do that has been torn down), we do not clear our "have un-acked message" state and end up discarding the incoming message. The fix is as follows: * Rename things to make it clear that what we really have is "waiting for an ack" state, which in fact _does_ get cleared when we run out of MRP retries, not an "un-acked message" state. * Have a separate state bit for tracking that we ran out of MRP retries on a message we sent. * Address review comment.
After #29173 we can get into the following situation:
The fix is as follows: