[portsorch] fix errors when moving port from one lag to another. #1797
Conversation
In the scenario that is executed in sonic-mgmt in test_po_update.py, a port channel member is deleted from one port channel and added to another port channel:

```
config portchannel member del PortChannel9999 Ethernet0;
config portchannel member add PortChannel0001 Ethernet0;
```

It is possible that the requests from teamsyncd arrive in a different order:

```
2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL
```

This reordering happens because teamsyncd has a single event handler/selectable, TeamSync::TeamPortSync::onChange(), per team device, so when two of them are ready it is a swss::Select implementation detail in which order they are returned.

Signed-off-by: Stepan Blyschak <[email protected]>
Small, but not a must-fix, comment.
So here the netlink events are coming in the correct order, but the teamsyncd events reach orchagent in a different order? @qiluo-msft to take a look.
orchagent/portsorch.cpp (outdated diff excerpt):
    if (port.m_lag_member_id != SAI_NULL_OBJECT_ID)
    {
        SWSS_LOG_NOTICE("Port %s is already a LAG member", port.m_alias.c_str());
IMO, please move it to INFO, as we don't want this possible retry code to continuously spew logs. Otherwise the change LGTM. @judyjoseph to sign off.
This will not print continuously in the logs; when the situation described above happens, you will get only one NOTICE message. I did not want to make it a lower level because if the user configures a LAG member that is already part of a different LAG, they would not see any notification in the log.
But you are not erasing the object, right? So there will be a retry in the next doTask? Can you please confirm?
I think Prince is referring to a case where Ethernet0 is already part of a different port channel (e.g. PortChannel0002) and the user issues only this command: `config portchannel member add PortChannel0001 Ethernet0` (no delete command issued).
Does this mean orchagent should not print anything on such an invalid configuration?
@qiluo-msft Then how do we deal with the reorder? Or is it a ConsumerStateTable bug? Although, as far as I know, ConsumerStateTable has always worked without preserving order.
@prsunny I suggest marking it for 201911 as well, as I see a possibility for this to happen there too.
I am only commenting on the "infinite retry". For the reorder case, I think the retry will eventually pass.
I don't think this is required for 201911 as it is relatively stable. Secondly, please fix the log level as discussed above. As mentioned, if it is a reorder case it will pass; if not, it will spew logs indefinitely.
@prsunny changed. Regarding 201911, I just want to make it clear that it is possible, although I didn't test it (a reproduction may require putting some artificial load on the CPU while doing member add/remove in a loop). In 201911 it is even worse than in 202012: in 201911 the call is done in sairedis async mode, which means it will crash orchagent if the call is not successful. In 202012 it will call SAI, but orchagent will retry later.
@judyjoseph can you please provide your input?
Even I would like to find out a bit more on this. Shouldn't the producer events be added in the order they are received? Otherwise there could be issues with other use cases, e.g. VLAN member addition.
orchagent/portsorch.cpp (outdated diff excerpt):
    /* Assert the port doesn't belong to any LAG already */
    assert(!port.m_lag_id && !port.m_lag_member_id);
    /* Assert the port is not a LAG */
    assert(port.m_lag_id == SAI_NULL_OBJECT_ID);
Right, are you saying this needs to be handled in this PR?
yes, please
Removed this assert and placed a check for the correct port type at the beginning instead.
Actually there are lots of asserts in portsorch; only this one is addressed.
@judyjoseph There could be other similar bugs in orchs that do not expect reordering.
The design of ProducerStateTable and ConsumerStateTable only respects the final state, not the order. Even m_toSync is designed with this philosophy. You could use ProducerTable and ConsumerTable to overcome the first shortcoming, but not the second.
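For illustration, here is a small self-contained Python toy model (not the actual swsscommon implementation) of the two delivery semantics contrasted above: a ProducerStateTable-style keyed pending set keeps only the final state per key and drains in no particular order, while a ProducerTable-style FIFO preserves every operation in order.

```python
from collections import deque

# --- ProducerStateTable-style semantics (toy model, not the real library) ---
# Pending work is tracked as a set of keys plus the final state per key, so
# intermediate operations are collapsed and the drain order is whatever the
# set yields.
pending_keys = set()
final_state = {}

def state_set(key, values):
    pending_keys.add(key)
    final_state[key] = ("SET", values)

def state_del(key):
    pending_keys.add(key)
    final_state[key] = ("DEL", {})

state_set("PortChannel0001:Ethernet0", {"status": "enabled"})
state_del("PortChannel9999:Ethernet0")

print("state-based drain (order not guaranteed):")
for key in pending_keys:          # set iteration order is unspecified
    op, values = final_state[key]
    print(" ", op, key, values)

# --- ProducerTable-style semantics (toy model) ---
# Every operation is queued, so the consumer sees them in production order.
op_queue = deque()
op_queue.append(("SET", "PortChannel0001:Ethernet0", {"status": "enabled"}))
op_queue.append(("DEL", "PortChannel9999:Ethernet0", {}))

print("queue-based drain (FIFO order preserved):")
while op_queue:
    print(" ", *op_queue.popleft())
```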
LGTM. Please check with other reviewers.
@judyjoseph and @prsunny, could you please review and approve? We need this for 202012 as well.
LGTM. Please check with other reviewers.
lgtm, minor comment. @judyjoseph to approve
orchagent/portsorch.cpp (outdated diff excerpt):
    @@ -3676,6 +3682,15 @@ void PortsOrch::doLagMemberTask(Consumer &consumer)
                 continue;
             }

             /* Fast failure if a port type is not a valid type for beeing a LAG member port.
Could you please revisit the comment? What is "fast failure"? Also, "beeing" is a typo.
Fixed
@judyjoseph kind reminder, can you please review now that the changes are integrated?
Sure, I have a minor comment, else looks good!
This commit could not be cleanly cherry-picked to 202012. Please submit another PR.
Backported to the 202012 branch in #1819.
In the scenario that is executed in sonic-mgmt in test_po_update.py, a port channel member is deleted from one port channel and added to another port channel:
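```
config portchannel member del PortChannel9999 Ethernet0;
config portchannel member add PortChannel0001 Ethernet0;
```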
It is possible that the requests from teamsyncd arrive at PortsOrch in a different order:
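```
2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL
```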
Consider the following Python-simulated flow between teamsyncd and orchagent:
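(The original simulated-flow snippet is not reproduced here. The following is a minimal sketch, assuming the swsscommon Python bindings and a locally reachable APPL_DB redis instance; the table name, keys, and field come from the log above, the rest is illustrative.)

```python
from swsscommon import swsscommon

appl_db = swsscommon.DBConnector("APPL_DB", 0)

# "teamsyncd" side: a single ProducerStateTable writes both LAG member updates.
producer = swsscommon.ProducerStateTable(appl_db, "LAG_MEMBER_TABLE")
producer.set("PortChannel0001:Ethernet0",
             swsscommon.FieldValuePairs([("status", "enabled")]))
producer._del("PortChannel9999:Ethernet0")

# "orchagent" side: ConsumerStateTable tracks pending keys as a set, so the
# SET and the DEL above may be popped in either order.
consumer = swsscommon.ConsumerStateTable(appl_db, "LAG_MEMBER_TABLE")
sel = swsscommon.Select()
sel.addSelectable(consumer)

while True:
    state, _ = sel.select(1000)          # wait up to 1s for pending updates
    if state != swsscommon.Select.OBJECT:
        break
    key, op, fvs = consumer.pop()        # pop order is an implementation detail
    print(op, key, list(fvs))
```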
This is a fundamental issue of ProducerStateTable/ConsumerStateTable, so orchagent must be aware of it, treat it as a normal situation, figure out the right order, and not crash or print errors.
Signed-off-by: Stepan Blyschak [email protected]
What I did
Check whether the port is already a LAG member beforehand.
Added a UT to cover this scenario; it verifies that the SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving Orchs creation/deletion into SetUp()/TearDown().
Why I did it
To fix errors in the log.
How I verified it
Ran test_po_update.py test.