Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[portsorch] fix errors when moving port from one lag to another. #1797

Merged
merged 11 commits into from
Jul 8, 2021

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Jun 22, 2021

In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel:

config portchannel member del PortChannel9999 Ethernet0; config
portchannel member add PortChannel0001 Ethernet0;

It is possible that requests from teamsynd will arrive to PortsOrch different order:

2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL

Consider following python simulated flow between teamsyncd and orchangent:

from swsscommon import swsscommon

db = swsscommon.DBConnector('APPL_DB', 0)
table = swsscommon.ProducerStateTable(db, 'LAG_MEMBER_TABLE')
consumer = swsscommon.ConsumerStateTable(db, 'LAG_MEMBER_TABLE')

fvs = swsscommon.FieldValuePairs([('field', 'value')])

# teamsyncd deletes member from one lag and adds to another lag.
# orchagent is blocked due to other tasks (updating routes for example)
table.delete('PortChannel999:Ethernet112')
table.set('PortChannel0001:Ethernet112', fvs)

# orchagent unblocks and pops data
print(consumer.pop())
print(consumer.pop())

# orchagent will receive this:
# ['PortChannel0001:Ethernet112', 'SET', (('field', 'value'),)]
# ['PortChannel999:Ethernet112', 'DEL', ()]

This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

Signed-off-by: Stepan Blyschak [email protected]

What I did

Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

Why I did it

To fix errors in log.

How I verified it

Ran test_po_update.py test.

Details if related

stepanblyschak and others added 2 commits June 17, 2021 15:56
In schenario that is executed in sonic-mgmt in test_po_update.py a
portchannel member is deleted from one portchannel and added to another
portchannel:

```
config portchannel member del PortChannel9999 Ethernet0; config
portchannel member add PortChannel0001 Ethernet0;
```

It is possible that requests from teamsynd will arrive in
different order:

```
2021-06-17.14:01:19.405245|LAG_MEMBER_TABLE:PortChannel0001:Ethernet0|SET|status:enabled
2021-06-17.14:01:19.405318|LAG_MEMBER_TABLE:PortChannel999:Ethernet0|DEL
```

This reordering happenes because teamsyncd has single event
handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them
are ready it is swss::Select implementation detail in which order they are going to be returned.

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: stepanb <[email protected]>
@stepanblyschak stepanblyschak requested a review from prsunny as a code owner June 22, 2021 16:55
@stepanblyschak stepanblyschak changed the title Lag mem add err [portsorch] fix errors when moving port from one lag to another. Jun 22, 2021
liat-grozovik
liat-grozovik previously approved these changes Jun 23, 2021
Copy link
Collaborator

@liat-grozovik liat-grozovik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small but not must comment

@prsunny
Copy link
Collaborator

prsunny commented Jun 25, 2021

So here the netlink events is coming correctly but teamsyncd events are in different order to orchagent? @qiluo-msft to take a look

@prsunny prsunny requested a review from qiluo-msft June 25, 2021 15:36

if (port.m_lag_member_id != SAI_NULL_OBJECT_ID)
{
SWSS_LOG_NOTICE("Port %s is already a LAG member", port.m_alias.c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, please move it to INFO as we dont want this possible retry code to continuously spew logs. Otherwise the change lgtm. @judyjoseph to sign-off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not continuously print in logs, when the situation described above happens, you'll get only 1 notice message. I did not want to make it lower level because if user configured lag member which is already a part of different lag he will not see any notification in the log.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you are not erasing the object, right? So there will be a retry in the next doTask? Can you please confirm?

Copy link
Contributor

@judyjoseph judyjoseph Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Prince is referring to a case where Ethernet0 is already part of a different Port Channel (eg: PortChannel0002) & user issues this command only --> config portchannel member add PortChannel0001 Ethernet0; (no delete command issued )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean orchagent should not print anything on such invalid configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft Then how to deal with reorder? or it is a ConsumerStateTable bug? although ConsumerStateTable always worked without preserving order as far as I know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prsunny I suggest mark it for 201911 as well as I see a possibility for this to happen there as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only comment on "infinite retry". For reorder case, I think retry will pass finally.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this is required for 201911 as it is relatively stable. Secondly, please fix the log level as discussed above. As mentioned, if it is a reorder case, it will pass.. if not it will infinitely spew log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prsunny changed. Regarding 201911 I just want to make it clear that it is possible althoug I didn't tested it (a reporduction may require to put some artificial load on CPU while doing member add/remove in the loop). In 201911 it is even worth then in 202012. In 201911 the call is done in sairedis async mode which means it will crash orchagent if the call is not successful. In 202012 it will call SAI but orchagent will retry later.

@liat-grozovik
Copy link
Collaborator

@judyjoseph can you please provide your input?

@judyjoseph
Copy link
Contributor

So here the netlink events is coming correctly but teamsyncd events are in different order to orchagent? @qiluo-msft to take a look

Even I would like to find a bit more on this ..shouldn't the producer events be added in the order it is received -- otherwise there could be issues with other use cases eg: vlan member addition

/* Assert the port doesn't belong to any LAG already */
assert(!port.m_lag_id && !port.m_lag_member_id);
/* Assert the port is not a LAG */
assert(port.m_lag_id == SAI_NULL_OBJECT_ID);
Copy link
Contributor

@qiluo-msft qiluo-msft Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert

assert will be nop in release build. This is a runtime eror, we need check and treat it as an error. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, are you saying this needs to be handled in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this assert, placed a check for correct port type instead at the begginning.

Copy link
Contributor Author

@stepanblyschak stepanblyschak Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there are lots of asserts in portsorch, only this is addressed

@stepanblyschak
Copy link
Contributor Author

@judyjoseph There could be other similar bugs in orchs that do not expect reorder.
@qiluo-msft Could you please comment on this?

@qiluo-msft
Copy link
Contributor

qiluo-msft commented Jul 2, 2021

The design of ProducerStateTable and ConsumerStateTable only respect the final state, not the order. Even m_toSync also design with this philosophy.

You could use ProducerTable and ConsumerTable to conquer first shortage, but not the second.

qiluo-msft
qiluo-msft previously approved these changes Jul 2, 2021
Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please check with other reviewers.

Signed-off-by: Stepan Blyschak <[email protected]>
@liat-grozovik
Copy link
Collaborator

@judyjoseph and @prsunny could you please review and approve? we need this for 202012 as well.

qiluo-msft
qiluo-msft previously approved these changes Jul 4, 2021
Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please check with other reviewers.

Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, minor comment. @judyjoseph to approve

@@ -3676,6 +3682,15 @@ void PortsOrch::doLagMemberTask(Consumer &consumer)
continue;
}

/* Fast failure if a port type is not a valid type for beeing a LAG member port.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please revisit the comment? What is fast failure? Also beeing -> typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Signed-off-by: Stepan Blyschak <[email protected]>
@liat-grozovik
Copy link
Collaborator

@judyjoseph kindly reminder, can you please review following the changes integrated?

@judyjoseph
Copy link
Contributor

@judyjoseph kindly reminder, can you please review following the changes integrated?

Sure, have a minor comment - else looks good !

@liat-grozovik liat-grozovik merged commit 4f1d726 into sonic-net:master Jul 8, 2021
@qiluo-msft
Copy link
Contributor

This commit could not be cleanly cherry-pick to 202012. Please submit another PR.

stepanblyschak added a commit to stepanblyschak/sonic-swss that referenced this pull request Jul 14, 2021
…ic-net#1797)

In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel.
It is possible that requests from teamsynd will arrive in different order
This reordering happens because teamsyncd has single event handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them are ready it is swss::Select implementation detail in which order they are going to be returned.
This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

- What I did
Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

- Why I did it
To fix errors in log.

- How I verified it
Ran test_po_update.py test.

Signed-off-by: Stepan Blyschak [email protected]
stepanblyschak added a commit to stepanblyschak/sonic-swss that referenced this pull request Jul 14, 2021
…ic-net#1797)

In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel.
It is possible that requests from teamsynd will arrive in different order
This reordering happens because teamsyncd has single event handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them are ready it is swss::Select implementation detail in which order they are going to be returned.
This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

- What I did
Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

- Why I did it
To fix errors in log.

- How I verified it
Ran test_po_update.py test.

Signed-off-by: Stepan Blyschak [email protected]
qiluo-msft pushed a commit that referenced this pull request Jul 20, 2021
…er. (#1819)

In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel.
It is possible that requests from teamsynd will arrive in different order
This reordering happens because teamsyncd has single event handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them are ready it is swss::Select implementation detail in which order they are going to be returned.
This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

Original PR #1797

**What I did**
Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

**Why I did it**
To fix errors in log.

**How I verified it**
Ran test_po_update.py test.
judyjoseph pushed a commit that referenced this pull request Aug 10, 2021
In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel.
It is possible that requests from teamsynd will arrive in different order
This reordering happens because teamsyncd has single event handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them are ready it is swss::Select implementation detail in which order they are going to be returned.
This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

- What I did
Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

- Why I did it
To fix errors in log.

- How I verified it
Ran test_po_update.py test.

Signed-off-by: Stepan Blyschak [email protected]
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
…ic-net#1797)

In scenario that is executed in sonic-mgmt in test_po_update.py a portchannel member is deleted from one portchannel and added to another portchannel.
It is possible that requests from teamsynd will arrive in different order
This reordering happens because teamsyncd has single event handler/selectable TeamSync::TeamPortSync::onChange() per team device so when two of them are ready it is swss::Select implementation detail in which order they are going to be returned.
This is a fundamental issue of Producer/ConsumerStateTable, thus orchagent must be aware of this and treat it as normal situation and figure out the right order and not crash or print an errors.

- What I did
Check if port is already a lag member beforehand.
Added an UT to cover this scenario, this UT verifies that SAI API is not called in this case.
Refactored portsorch_ut.cpp by moving out Orchs creation/deletion into SetUp()/TearDown()

- Why I did it
To fix errors in log.

- How I verified it
Ran test_po_update.py test.

Signed-off-by: Stepan Blyschak [email protected]
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
…faces by default (sonic-net#1797)

* Modifed the 'show ipv6 link-local-mode' command to display all interfaces by default
Signed-off-by: Akhilesh Samineni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants