Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[portsorch] fix PortsOrch::allPortsReady() returns true when it should not #1103

Merged
merged 11 commits into from
Nov 1, 2019

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Oct 23, 2019

Warm start flow before the change:

1st iteration:

- BufferOrch::doTask(): returns since PortInitDone hasn't arrived yet
- PortsOrch::doTask():  processes all PORT_TABLE untill PortInitDone flag
                        m_pendingPortSet is empty yet and m_portInitDone is true
                        so allPortsReady() will return true
- AnyOrch::doTask():    check g_portsOrch->allPortsReady()

2nd iteration:

- BufferOrch::doTask(): now buffers are applied

This causes BufferOrch override PfcWdOrch's zero-buffer profile.

The change swaps BufferOrch and PortsOrch in m_orchList, because 1st
BufferOrch iteration will always skip processing and eliminates possibility
of having m_pendingPortSet not filled with ports after m_initDone is set to true.

Signed-off-by: Stepan Blyschak [email protected]

What I did

Why I did it

How I verified it

Details if related

orchagent/portsorch.cpp Outdated Show resolved Hide resolved
@stepanblyschak stepanblyschak marked this pull request as ready for review October 24, 2019 16:24
@wendani wendani changed the title [portsorch] fix PortsOrch::allPortsReady() returns true when it shoul… [portsorch] fix PortsOrch::allPortsReady() returns true when it should not Oct 25, 2019
@@ -193,7 +193,7 @@ bool OrchDaemon::init()
* when iterating ConsumerMap.
* That is ensured implicitly by the order of map key, "LAG_TABLE" is smaller than "VLAN_TABLE" in lexicographic order.
*/
m_orchList = { gSwitchOrch, gCrmOrch, gBufferOrch, gPortsOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
m_orchList = { gSwitchOrch, gCrmOrch, gPortsOrch, gBufferOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
Copy link
Contributor

@wendani wendani Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a better order as BufferOrch::doTask() needs to wait for PortsOrch::isInitDone() to be true to proceed

Copy link
Contributor

@wendani wendani Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting that BufferOrch was placed behind PortsPorch earlier as of this PR #515

orchagent/portsorch.cpp Outdated Show resolved Hide resolved
@stepanblyschak
Copy link
Contributor Author

retest this please

@stepanblyschak
Copy link
Contributor Author

Retest this please

@wendani wendani requested review from qiluo-msft and prsunny October 26, 2019 00:29
@@ -193,7 +193,7 @@ bool OrchDaemon::init()
* when iterating ConsumerMap.
* That is ensured implicitly by the order of map key, "LAG_TABLE" is smaller than "VLAN_TABLE" in lexicographic order.
*/
m_orchList = { gSwitchOrch, gCrmOrch, gBufferOrch, gPortsOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
m_orchList = { gSwitchOrch, gCrmOrch, gPortsOrch, gBufferOrch, gIntfsOrch, gNeighOrch, gRouteOrch, copp_orch, tunnel_decap_orch, qos_orch, wm_orch, policer_orch };
Copy link
Contributor

@qiluo-msft qiluo-msft Oct 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gPortsOrch [](start = 42, length = 10)

Do not reorder if you don't have a strong reason. Any order should work for warm-reboot if we iterate enough rounds. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Qi

Copy link
Contributor Author

@stepanblyschak stepanblyschak Oct 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft @prsunny I agree that it is just a matter of enough iterations, however this order makes more sense at least to me, so it's more a cosmetic change.
Besides, now I think with this order 3 instead of 4 iterations are needed in warm boot, let me check.
Do you have concern that this may break something? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you have proof that 4 iterations -> 3. Otherwise I don't see any improvement.


In reply to: 339371158 [](ancestors = 339371158)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check one more commit to have 3 iterations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, let me separate this change in another PR

@qiluo-msft
Copy link
Contributor

qiluo-msft commented Oct 26, 2019

return true;

Can we initialize m_pendingPortSet here to fix everything you are trying to solve? #Closed


Refers to: orchagent/portsorch.cpp:1469 in c6b9c08. [](commit_id = c6b9c08, deletion_comment = False)

@qiluo-msft
Copy link
Contributor

qiluo-msft commented Oct 26, 2019

This PR will be an import bug fix. To make lives easier, could you please provide a vs test case, which failed currently version but fixed by this PR? #Closed

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented

@wendani
Copy link
Contributor

wendani commented Oct 26, 2019

Stepan, as you analyze in sonic-net/sonic-mgmt#834 (comment), design of m_pendingPortSet has an issue. If we can populate all physical ports to m_pendingPortSet in the PortsOrch::bake() phase, we can drop the change to move the section of code down pasted below:

1681             if (!m_portConfigDone)
1682             {
1683                 // Not yet receive PortConfigDone. Save it for future retry
1684                 it++;
1685                 continue;
1686             }

#1103 (comment)

@stepanblyschak
Copy link
Contributor Author

@qiluo-msft @wendani Not everything in this PR can be fixed by initializing m_pendingPortSet in bake().
Originaly I assumed that under some circumstances in cold boot there could be 'PortConfigDone' and 'PortInitDone' at the same time in m_toSync. Now, it looks like it is imposible to achive in cold boot.
So, yes, to make the fix more explicit I did a change to populate m_pendingPortSet in PortsOrch::bake().

Another issue is with Pfc wd start action is not protected by allPortsReady() causing errors in logs in warm boot and it is fixed as part of this PR, not related to m_pendingPortSet.

@stepanblyschak
Copy link
Contributor Author

@qiluo-msft
Regarding VS test case, usually for such change it is hard to create VS test that checks the initialization order.
I could create a VS for pfcwd zero-buffer approach and, before warm restart, simulate storm using redis channel, but I do not see any pfcwd related tests in swss and don't think it will somehow work on VS.
If you have ideas I can try.

@stepanblyschak
Copy link
Contributor Author

retest this please

@qiluo-msft
Copy link
Contributor

Regarding test, I don't mean an end-to-end simulation of a warm-reboot plus PFC storm. Maybe an google test (unit test) could help here. You could just prepare some mock data in Redis or mock redis, and let orchagent consume them.

This test case should fail old code, but pass your new code.


In reply to: 546554272 [](ancestors = 546554272)

@stepanblyschak
Copy link
Contributor Author

stepanblyschak commented Oct 28, 2019

@qiluo-msft google test (unit test) would be simpler to create than VS, but it fails to compile tests in tests/mock_tests on recent master and I don't see it is run on PR. Are these tests expected to work? #Resolved

@qiluo-msft
Copy link
Contributor

qiluo-msft commented Oct 28, 2019

I recently fixed the google test. If you check vs test result you will find it included.

PASS: tests
============================================================================
Testsuite summary for sonic-swss 1.0
============================================================================
# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

tests/mock_tests are not currently included


In reply to: 547004643 [](ancestors = 547004643)

Stepan Blyschak added 3 commits October 31, 2019 12:18
…d not

Warm start flow before the change:
1st iteration:
    - BufferOrch::doTask(): returns since PortInitDone hasn't arived yet
    - PortsOrch::doTask():  processes all PORT_TABLE untill PortInitDone flag
                            m_pendingPortSet is empty yet and m_portInitDone is true
                            so allPortsReady() will return true
    - AnyOrch::doTask():    check g_portsOrch->allPortsRead()

2nd iteration:
    - BufferOrch::doTask(): now buffers are applied

This causes BufferOrch override PfcWdOrch's zero-buffer profile.

The change swaps BufferOrch and PortsOrch in m_orchList, because 1st
BufferOrch iteration will always skip processing and eliminates possibility
of having m_pendingPortSet not filled with ports after m_initDone is set to true.

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
… started in warm boot

It appeared that pfc watchdog relied on a buggy behaviour of PortsOrch::allPortsReady().
In fixed PortsOrch::allPortsReady() you'll see that watchdog action is trying to start
before watchdog was started, because allPortsReady() in PfcWdOrch::doTask() returned false.
Before the fix watchdog was started before, because allPortsReady() lied that ports are ready
when they were not.

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
@stepanblyschak
Copy link
Contributor Author

Running main() from gtest_main.cc
[==========] Running 10 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from PortsOrchTest
[ RUN      ] PortsOrchTest.PortReadinessColdBoot
[       OK ] PortsOrchTest.PortReadinessColdBoot (26 ms)
[ RUN      ] PortsOrchTest.PortReadinessWarmBoot
[       OK ] PortsOrchTest.PortReadinessWarmBoot (16 ms)
[----------] 2 tests from PortsOrchTest (42 ms total)

[----------] 1 test from AclTest
[ RUN      ] AclTest.Create_L3_Acl_Table
[       OK ] AclTest.Create_L3_Acl_Table (1 ms)
[----------] 1 test from AclTest (1 ms total)

[----------] 3 tests from AclOrchTest
[ RUN      ] AclOrchTest.ACL_Creation_and_Destorying
[       OK ] AclOrchTest.ACL_Creation_and_Destorying (1000 ms)
[ RUN      ] AclOrchTest.L3Acl_Matches_Actions
[       OK ] AclOrchTest.L3Acl_Matches_Actions (1001 ms)
[ RUN      ] AclOrchTest.L3V6Acl_Matches_Actions
[       OK ] AclOrchTest.L3V6Acl_Matches_Actions (1000 ms)
[----------] 3 tests from AclOrchTest (3001 ms total)

[----------] 4 tests from SaiSpy
[ RUN      ] SaiSpy.CURD
[       OK ] SaiSpy.CURD (0 ms)
[ RUN      ] SaiSpy.Same_Function_Signature_In_Same_API_Table
[       OK ] SaiSpy.Same_Function_Signature_In_Same_API_Table (1 ms)
[ RUN      ] SaiSpy.Same_Function_Signature_In_Different_API_Table
[       OK ] SaiSpy.Same_Function_Signature_In_Different_API_Table (0 ms)
[ RUN      ] SaiSpy.create_switch_and_acl_table
[       OK ] SaiSpy.create_switch_and_acl_table (0 ms)
[----------] 4 tests from SaiSpy (1 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 4 test cases ran. (4004 ms total)
[  PASSED  ] 10 tests.

Without fix:

Running main() from gtest_main.cc
[==========] Running 10 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from PortsOrchTest
[ RUN      ] PortsOrchTest.PortReadinessColdBoot
[       OK ] PortsOrchTest.PortReadinessColdBoot (28 ms)
[ RUN      ] PortsOrchTest.PortReadinessWarmBoot
portsorch_ut.cpp:265: Failure
Value of: gPortsOrch->allPortsReady()
  Actual: true
Expected: false
[  FAILED  ] PortsOrchTest.PortReadinessWarmBoot (14 ms)
[----------] 2 tests from PortsOrchTest (42 ms total)

[----------] 1 test from AclTest
[ RUN      ] AclTest.Create_L3_Acl_Table
[       OK ] AclTest.Create_L3_Acl_Table (1 ms)
[----------] 1 test from AclTest (1 ms total)

[----------] 3 tests from AclOrchTest
[ RUN      ] AclOrchTest.ACL_Creation_and_Destorying
[       OK ] AclOrchTest.ACL_Creation_and_Destorying (1000 ms)
[ RUN      ] AclOrchTest.L3Acl_Matches_Actions
[       OK ] AclOrchTest.L3Acl_Matches_Actions (1001 ms)
[ RUN      ] AclOrchTest.L3V6Acl_Matches_Actions
[       OK ] AclOrchTest.L3V6Acl_Matches_Actions (1000 ms)
[----------] 3 tests from AclOrchTest (3002 ms total)

[----------] 4 tests from SaiSpy
[ RUN      ] SaiSpy.CURD
[       OK ] SaiSpy.CURD (0 ms)
[ RUN      ] SaiSpy.Same_Function_Signature_In_Same_API_Table
[       OK ] SaiSpy.Same_Function_Signature_In_Same_API_Table (0 ms)
[ RUN      ] SaiSpy.Same_Function_Signature_In_Different_API_Table
[       OK ] SaiSpy.Same_Function_Signature_In_Different_API_Table (0 ms)
[ RUN      ] SaiSpy.create_switch_and_acl_table
[       OK ] SaiSpy.create_switch_and_acl_table (0 ms)
[----------] 4 tests from SaiSpy (0 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 4 test cases ran. (4004 ms total)
[  PASSED  ] 9 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] PortsOrchTest.PortReadinessWarmBoot

 1 FAILED TEST

@stepanblyschak
Copy link
Contributor Author

retest this please

@qiluo-msft qiluo-msft merged commit bab7b93 into sonic-net:master Nov 1, 2019
@yxieca
Copy link
Contributor

yxieca commented Nov 1, 2019

@stepanblyschak This change cannot be cherry-picked cleanly into 201811 branch. Please create an PR for 201811 branch. Thanks!

Comment on lines +28 to +57
$(top_srcdir)/orchagent/orchdaemon.cpp \
$(top_srcdir)/orchagent/orch.cpp \
$(top_srcdir)/orchagent/notifications.cpp \
$(top_srcdir)/orchagent/routeorch.cpp \
$(top_srcdir)/orchagent/neighorch.cpp \
$(top_srcdir)/orchagent/intfsorch.cpp \
$(top_srcdir)/orchagent/portsorch.cpp \
$(top_srcdir)/orchagent/copporch.cpp \
$(top_srcdir)/orchagent/tunneldecaporch.cpp \
$(top_srcdir)/orchagent/qosorch.cpp \
$(top_srcdir)/orchagent/bufferorch.cpp \
$(top_srcdir)/orchagent/mirrororch.cpp \
$(top_srcdir)/orchagent/fdborch.cpp \
$(top_srcdir)/orchagent/aclorch.cpp \
$(top_srcdir)/orchagent/saihelper.cpp \
$(top_srcdir)/orchagent/switchorch.cpp \
$(top_srcdir)/orchagent/pfcwdorch.cpp \
$(top_srcdir)/orchagent/pfcactionhandler.cpp \
$(top_srcdir)/orchagent/policerorch.cpp \
$(top_srcdir)/orchagent/crmorch.cpp \
$(top_srcdir)/orchagent/request_parser.cpp \
$(top_srcdir)/orchagent/vrforch.cpp \
$(top_srcdir)/orchagent/countercheckorch.cpp \
$(top_srcdir)/orchagent/vxlanorch.cpp \
$(top_srcdir)/orchagent/vnetorch.cpp \
$(top_srcdir)/orchagent/dtelorch.cpp \
$(top_srcdir)/orchagent/flexcounterorch.cpp \
$(top_srcdir)/orchagent/watermarkorch.cpp \
$(top_srcdir)/orchagent/chassisorch.cpp \
$(top_srcdir)/orchagent/sfloworch.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stepanblyschak @Pterosaur @nazariig @lguohan please make orchagent as library *.la, it will be one file, orchagent is already compiled, and here in tests you are compiling it again, which is twice the same compilation, it extend compilation time twice !, now it takes 19 minutes to compile and using it as a lib it could take 10 min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants