Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload BCM SDK kmods on syncd start to handle syncd restart issues #12804

Merged
merged 1 commit into from
Nov 30, 2022

Conversation

michaelli10
Copy link
Contributor

Signed-off-by: Michael Li [email protected]

Why I did it

There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it

If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it

Steps to reproduce:

  1. In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
  2. Run 'docker stop syncd'
  3. Wait ~1 minute.
  4. Run 'docker ps' to see that syncd is missing.
  5. Check logs to see messages similar to the above.

Which release branch to backport (provide reason below if selected)

202205
202205 is the community branch used for PikeZ platform

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • [x ] 202205

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@Blueve Blueve merged commit f725b83 into sonic-net:master Nov 30, 2022
yxieca pushed a commit that referenced this pull request Dec 1, 2022
…12804)

Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it
If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it
Steps to reproduce:

In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.

Signed-off-by: Michael Li <[email protected]>
@lguohan
Copy link
Collaborator

lguohan commented Dec 1, 2022

this change seems very generic? is this change only affecting td3.x2 or it affect other sku as well?

@@ -28,6 +28,19 @@ function startplatform() {
debug "Firmware update procedure ended"
fi

if [[ x"$sonic_asic_platform" == x"broadcom" ]]; then
if [[ x"$WARM_BOOT" != x"true" ]]; then
is_bcm0=$(ls /sys/class/net | grep bcm0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is is_bcm0, this is unclear to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_bcm0 indicates the presence of the base netdevice created by the BRCM SDK kernel module. If the bcm0 netdevice is present, the BRCM SDK kernel modules are loaded. We only take action to remove and reload the kmods if they are already loaded.

yxieca added a commit that referenced this pull request Dec 1, 2022
Blueve pushed a commit that referenced this pull request Dec 7, 2022
Why I did it
Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved.

How I did it
Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform.

Signed-off-by: Michael Li <[email protected]>
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this pull request Dec 8, 2022
…onic-net#12804)

Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it
If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it
Steps to reproduce:

In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.

Signed-off-by: Michael Li <[email protected]>
yxieca added a commit that referenced this pull request Dec 8, 2022
yxieca pushed a commit that referenced this pull request Dec 8, 2022
Why I did it
Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved.

How I did it
Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform.

Signed-off-by: Michael Li <[email protected]>
StormLiangMS pushed a commit that referenced this pull request Dec 10, 2022
…12804)

Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it
If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it
Steps to reproduce:

In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.

Signed-off-by: Michael Li <[email protected]>
@mssonicbld
Copy link
Collaborator

@michaelli10 PR conflicts with 202211 branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants