[portsorch] fix errors when moving port from one lag to another. #1797

stepanblyschak · 2021-06-22T16:55:11Z

In a scenario executed by test_po_update.py in sonic-mgmt, a portchannel member is deleted from one portchannel and added to another:

config portchannel member del PortChannel9999 Ethernet0
config portchannel member add PortChannel0001 Ethernet0

The requests from teamsyncd may then arrive at PortsOrch in a different order:

2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL

Consider the following Python simulation of the flow between teamsyncd and orchagent:

from swsscommon import swsscommon

db = swsscommon.DBConnector('APPL_DB', 0)
table = swsscommon.ProducerStateTable(db, 'LAG_MEMBER_TABLE')
consumer = swsscommon.ConsumerStateTable(db, 'LAG_MEMBER_TABLE')

fvs = swsscommon.FieldValuePairs([('field', 'value')])

# teamsyncd deletes member from one lag and adds to another lag.
# orchagent is blocked due to other tasks (updating routes for example)
table.delete('PortChannel999:Ethernet112')
table.set('PortChannel0001:Ethernet112', fvs)

# orchagent unblocks and pops data
print(consumer.pop())
print(consumer.pop())

# orchagent will receive this:
# ['PortChannel0001:Ethernet112', 'SET', (('field', 'value'),)]
# ['PortChannel999:Ethernet112', 'DEL', ()]

This is a fundamental property of Producer/ConsumerStateTable, so orchagent must be aware of it, treat it as a normal situation, and work out the right order without crashing or printing errors.
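The reordering can be reproduced without Redis using a toy model. This is only a sketch, not the swsscommon implementation: it assumes that pending keys are tracked in an unordered set and that only the latest operation per key is kept. `StateTableModel` and `drain` are hypothetical names introduced for illustration.

```python
import random


class StateTableModel:
    """Toy model: pending keys live in an unordered set, latest op per key wins."""

    def __init__(self, seed=None):
        self._pending = set()  # keys with unread changes; no ordering is kept
        self._latest = {}      # key -> (op, fields); later writes overwrite earlier ones
        self._rng = random.Random(seed)

    def set(self, key, fields):
        self._latest[key] = ("SET", tuple(fields))
        self._pending.add(key)

    def delete(self, key):
        self._latest[key] = ("DEL", ())
        self._pending.add(key)

    def pop(self):
        """Return one pending (key, op, fields) in arbitrary order, or None."""
        if not self._pending:
            return None
        # Pick any pending key; the write order is not available here.
        key = self._rng.choice(sorted(self._pending))
        self._pending.remove(key)
        op, fields = self._latest.pop(key)
        return key, op, fields


def drain(table):
    """Pop until the table has no pending keys left."""
    items = []
    item = table.pop()
    while item is not None:
        items.append(item)
        item = table.pop()
    return items


if __name__ == "__main__":
    table = StateTableModel()
    table.delete("PortChannel999:Ethernet112")
    table.set("PortChannel0001:Ethernet112", [("status", "enabled")])
    for entry in drain(table):
        print(entry)
```

Running it several times shows both orders. The consumer always ends up with one entry per key, which is why the consumer side has to tolerate a SET arriving before the matching DEL.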

Signed-off-by: Stepan Blyschak [email protected]

What I did

Check beforehand whether the port is already a LAG member.
Added a UT to cover this scenario; it verifies that the SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving Orch creation/deletion into SetUp()/TearDown().
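The idea behind the check can be sketched in Python. This is a simplified model under stated assumptions, not the code in portsorch.cpp: `FakeSai`, `do_lag_member_tasks`, and the `membership` dictionary are hypothetical, and the real handler deals with more cases (member status, port types, and so on). A SET for a port that still belongs to another LAG is kept for a later retry and no SAI call is made.

```python
class FakeSai:
    """Hypothetical stand-in for the SAI LAG member API; it only counts calls."""

    def __init__(self):
        self.create_calls = 0
        self.remove_calls = 0

    def create_lag_member(self, lag, port):
        self.create_calls += 1

    def remove_lag_member(self, lag, port):
        self.remove_calls += 1


def do_lag_member_tasks(pending, membership, sai):
    """Process (lag, port, op) tasks once and return the tasks left for retry.

    membership maps port -> lag for ports that currently belong to a LAG.
    """
    retry = []
    for lag, port, op in pending:
        current = membership.get(port)
        if op == "SET":
            if current == lag:
                continue  # already a member of this LAG, nothing to do
            if current is not None:
                # The port still belongs to another LAG, so the DEL has not
                # been processed yet: keep the task and skip the SAI call.
                retry.append((lag, port, op))
                continue
            sai.create_lag_member(lag, port)
            membership[port] = lag
        elif op == "DEL":
            if current == lag:
                sai.remove_lag_member(lag, port)
                del membership[port]
    return retry


if __name__ == "__main__":
    membership = {"Ethernet0": "PortChannel999"}
    sai = FakeSai()
    tasks = [
        ("PortChannel0001", "Ethernet0", "SET"),  # arrives first (reordered)
        ("PortChannel999", "Ethernet0", "DEL"),
    ]
    tasks = do_lag_member_tasks(tasks, membership, sai)  # SET deferred, DEL applied
    tasks = do_lag_member_tasks(tasks, membership, sai)  # SET applied on retry
    print(membership, tasks)
```

In the reordered case the deferred SET succeeds on the next pass. If no DEL ever arrives, the SET stays pending, which is the repeated-retry case raised in the review.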

Why I did it

To fix errors in log.

How I verified it

Ran test_po_update.py test.

Details if related

In a scenario executed by test_po_update.py in sonic-mgmt, a portchannel member is deleted from one portchannel and added to another:

```
config portchannel member del PortChannel9999 Ethernet0;
config portchannel member add PortChannel0001 Ethernet0;
```

The requests from teamsyncd may arrive in a different order:

```
2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL
```

This reordering happens because teamsyncd has a single event handler/selectable, TeamSync::TeamPortSync::onChange(), per team device, so when two of them are ready, the order in which they are returned is an implementation detail of swss::Select.

Signed-off-by: Stepan Blyschak <[email protected]>


tests/mock_tests/portsorch_ut.cpp

liat-grozovik

small but not must comment

prsunny · 2021-06-25T15:36:37Z

So here the netlink events are arriving correctly, but the teamsyncd events reach orchagent in a different order? @qiluo-msft to take a look

prsunny · 2021-06-25T15:39:45Z

orchagent/portsorch.cpp

+
+                if (port.m_lag_member_id != SAI_NULL_OBJECT_ID)
+                {
+                    SWSS_LOG_NOTICE("Port %s is already a LAG member", port.m_alias.c_str());


IMO, please move it to INFO, as we don't want this possible retry code to continuously spew logs. Otherwise the change LGTM. @judyjoseph to sign off.

This will not print continuously in the logs; when the situation described above happens, you'll get only one NOTICE message. I did not want to lower the level because if a user configures a LAG member that is already part of a different LAG, they will not see any notification in the log.

But you are not erasing the object, right? So there will be a retry in the next doTask? Can you please confirm?

I think Prince is referring to a case where Ethernet0 is already part of a different port channel (e.g. PortChannel0002) and the user issues only this command: config portchannel member add PortChannel0001 Ethernet0 (no delete command issued).

Does this mean orchagent should not print anything for such an invalid configuration?

@qiluo-msft Then how should we deal with the reorder? Or is it a ConsumerStateTable bug? As far as I know, ConsumerStateTable has always worked without preserving order.

@prsunny I suggest marking it for 201911 as well, since I see a possibility of this happening there too.

I am only commenting on the "infinite retry". For the reorder case, I think the retry will eventually pass.

I don't think this is required for 201911, as it is relatively stable. Secondly, please fix the log level as discussed above. As mentioned, if it is a reorder case, it will pass; if not, it will spew logs indefinitely.

@prsunny Changed. Regarding 201911, I just want to make it clear that it is possible, although I haven't tested it (a reproduction may require putting some artificial load on the CPU while adding/removing members in a loop). In 201911 it is even worse than in 202012: the call is made in sairedis async mode, which means orchagent will crash if the call is not successful. In 202012 it will call SAI, but orchagent will retry later.

liat-grozovik · 2021-06-28T13:36:11Z

@judyjoseph can you please provide your input?

judyjoseph · 2021-06-29T17:11:24Z

So here the netlink events are arriving correctly, but the teamsyncd events reach orchagent in a different order? @qiluo-msft to take a look

I would also like to find out a bit more about this. Shouldn't the producer events be added in the order they are received? Otherwise there could be issues with other use cases, e.g. VLAN member addition.

qiluo-msft · 2021-06-30T11:39:28Z

orchagent/portsorch.cpp

-                /* Assert the port doesn't belong to any LAG already */
-                assert(!port.m_lag_id && !port.m_lag_member_id);
+                /* Assert the port is not a LAG */
+                assert(port.m_lag_id == SAI_NULL_OBJECT_ID);


assert

assert will be a no-op in a release build. This is a runtime error; we need to check for it and treat it as an error. #Closed

Right, are you saying this needs to be handled in this PR?

yes, please

Removed this assert and placed a check for the correct port type at the beginning instead.

Actually, there are lots of asserts in portsorch; only this one is addressed.


stepanblyschak · 2021-07-02T11:04:03Z

@judyjoseph There could be other similar bugs in orchs that do not expect reorder.
@qiluo-msft Could you please comment on this?

qiluo-msft · 2021-07-02T13:12:45Z

The design of ProducerStateTable and ConsumerStateTable respects only the final state, not the order. Even m_toSync is designed with this philosophy.

You could use ProducerTable and ConsumerTable to overcome the first shortcoming, but not the second.
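A small sketch of why an ordered channel alone would not be enough. It assumes, as my reading rather than something stated in this thread, that orchagent buffers delivered entries in a map keyed by entry key and walks that map in key order. `deliver_fifo` and `process_via_keyed_map` are hypothetical names.

```python
from collections import deque


def deliver_fifo(ops):
    """Ordered channel: operations come out in the order they were written."""
    queue = deque(ops)
    while queue:
        yield queue.popleft()


def process_via_keyed_map(delivered):
    """Buffer delivered ops in a map keyed by entry key, then process in key order.

    A later op on the same key overwrites the earlier one, so only the final
    state per key survives.
    """
    to_sync = {}
    for key, op in delivered:
        to_sync[key] = op
    return [(key, to_sync[key]) for key in sorted(to_sync)]


if __name__ == "__main__":
    written = [
        ("PortChannel999:Ethernet0", "DEL"),
        ("PortChannel0001:Ethernet0", "SET"),
    ]
    delivered = list(deliver_fifo(written))
    print(delivered == written)              # the channel preserved the write order
    print(process_via_keyed_map(delivered))  # processing order follows the keys
```

Under that assumption the SET for PortChannel0001 is still handled before the DEL for PortChannel999, because the keys sort that way, so the handler needs the same tolerance either way.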

qiluo-msft

LGTM. Please check with other reviewers.


liat-grozovik · 2021-07-04T06:25:13Z

@judyjoseph and @prsunny could you please review and approve? we need this for 202012 as well.

qiluo-msft

LGTM. Please check with other reviewers.

prsunny

lgtm, minor comment. @judyjoseph to approve

prsunny · 2021-07-07T01:50:42Z

orchagent/portsorch.cpp

@@ -3676,6 +3682,15 @@ void PortsOrch::doLagMemberTask(Consumer &consumer)
            continue;
        }

+        /* Fast failure if a port type is not a valid type for beeing a LAG member port.


Could you please revisit the comment? What is fast failure? Also beeing -> typo


liat-grozovik · 2021-07-07T17:34:21Z

@judyjoseph kind reminder: can you please review now that the changes have been integrated?

orchagent/portsorch.cpp

judyjoseph · 2021-07-07T19:53:57Z

@judyjoseph kind reminder: can you please review now that the changes have been integrated?

Sure, I have a minor comment; otherwise it looks good!

qiluo-msft · 2021-07-14T06:44:27Z

This commit could not be cleanly cherry-picked to 202012. Please submit another PR.


stepanblyschak and others added 2 commits June 17, 2021 15:56

add ut (89cbcbd)

stepanblyschak requested a review from prsunny as a code owner June 22, 2021 16:55

stepanblyschak changed the title ~~Lag mem add err~~ [portsorch] fix errors when moving port from one lag to another. Jun 22, 2021

Merge branch 'master' of github.com:azure/sonic-swss into lag-mem-add-err (9ffcc88)

stepanblyschak force-pushed the lag-mem-add-err branch from ddc6758 to 9ffcc88 June 22, 2021 17:19

liat-grozovik reviewed Jun 23, 2021

tests/mock_tests/portsorch_ut.cpp

liat-grozovik previously approved these changes Jun 23, 2021

liat-grozovik requested a review from judyjoseph June 23, 2021 09:57

liat-grozovik added the Request for 202012 Branch label Jun 25, 2021

prsunny requested a review from qiluo-msft June 25, 2021 15:36

prsunny reviewed Jun 25, 2021

qiluo-msft reviewed Jun 30, 2021

stepanblyschak added 4 commits June 30, 2021 12:31

Merge branch 'master' of https://github.com/Azure/sonic-swss into HEAD (2b14813)

remove assert (a084a19)

Merge branch 'master' of https://github.com/Azure/sonic-swss into HEAD (480b576)

add test description (820714f)

stepanblyschak dismissed liat-grozovik’s stale review via 820714f July 2, 2021 10:57

qiluo-msft previously approved these changes Jul 2, 2021

change log levl to INFO (a7e44f0)

stepanblyschak dismissed qiluo-msft’s stale review via a7e44f0 July 2, 2021 16:22

qiluo-msft previously approved these changes Jul 4, 2021

prsunny reviewed Jul 7, 2021

fix comment (460d4b0)

stepanblyschak dismissed qiluo-msft’s stale review via 460d4b0 July 7, 2021 12:43

stepanblyschak added 2 commits July 7, 2021 15:46

Merge branch 'master' of github.com:azure/sonic-swss into lag-mem-add-err (681c822)

remove changes added during merge resolving (736f739)

prsunny approved these changes Jul 7, 2021

judyjoseph reviewed Jul 7, 2021

orchagent/portsorch.cpp

judyjoseph approved these changes Jul 7, 2021

liat-grozovik merged commit 4f1d726 into sonic-net:master Jul 8, 2021

stepanblyschak mentioned this pull request Jul 19, 2021

[202012][portsorch] fix errors when moving port from one lag to another. #1819 (Merged)

qiluo-msft removed the Request for 202012 Branch label Jul 20, 2021
