Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xcvrd crash and restart should not cause link flap on platforms needing custom NPU SI settings #541

Merged

Conversation

mihirpat1
Copy link
Contributor

@mihirpat1 mihirpat1 commented Sep 26, 2024

Description

Ensure Xcvrd crash and restart should be handled gracefully on platforms that need custom SI settings.
The goal is to avoid link flap on platforms that need custom SI settings during xcvrd restart.

Motivation and Context

Changeset 1
The changeset is partial implementation of the changes required for sonic-net/SONiC#1432.
Specifically, the current change adds NPU_SI_SETTINGS_SYNC_STATUS key to the PORT_TABLE of STATE_DB. Following is the usage of NPU_SI_SETTINGS_SYNC_STATUS key

Value Modifier thread and event Consumer thread and purpose
NPU_SI_SETTINGS_DEFAULT 1. XCVRD main thread during cold start of XCVRD 2. SfpStateUpdateTask during transceiver removal XCVRD main thread during boot-up for deciding to notify NPU SI settings
NPU_SI_SETTINGS_NOTIFIED 1. SfpStateUpdateTask while updating and notifying the NPU SI settings Not being used currently

Please note that the above table needs to be further modified while implementing complete OA crash HLD. The table assumes NPU SI settings are applied from SfpStateUpdateTask thread always (unlike the HLD wherein CMIS transceivers have the NPU SI settings being sent from CmisManagerTask).

Changeset 2
In addition to the above change, TRANSCEIVER_INFO table will not be deleted as part of xcvrd deinit.
On some platforms, TRANSCEIVER_INFO table is used to detect transceiver presence to handle host_tx_ready behavior and hence, deleting TRANSCEIVR_INFO table will cause host_tx_ready to change to false during xcvrd shutdown.

Impact due to skipping TRANSCEIVER_INFO table deletion

Scenario
In case if xcvrd crashes permanently after initializing 6 out of 32 ports (i.e. xcvrd never respawns), then "show int status" CLI will show 6 transceiver present and other ports would not show xcvr present unless we use sfputil. 

Changeset 3
PHYSICAL_PORT_NOT_EXIST has been defined in media_settings_parser.py to prevent crash due to undefined value.
When the media_settings_parser.py file was created, the below line was moved from xcvrd.py. However, the definition of
PHYSICAL_PORT_NOT_EXIST was never ported. Hence, adding the definition now.

How Has This Been Tested?

On platforms which do not restart xcvrd upon swss/syncd crash/restart, following behavior is observed
The first xcvrd crash after swss/syncd restarts will cause a link flap for ports which require NPU SI settings.
This is expected since as part of swss crash handling, the entire PORT_TABLE of STATE_DB is deleted.
Since xcvrd is not restarted as part of swss/syncd crash handling, the NPU_SI_SETTINGS_SYNC_STATUS field is never populated. Hence, when xcvrd crashes/restarts first time after swss/syncd crash is triggered, xcvrd will update the NPU SI settings in the APPL_DB which in turn forces the port to go through a link flap while OA configures the NPU SI settings

Test summary

  1. Platforms with xcvrd not restarting during swss/syncd crash
Event Xcvrd restarted NPU SI settings renotify NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings) Link flap
Config shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Config no shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Xcvrd restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Xcvrd kill Yes No NPU_SI_SETTINGS_NOTIFIED No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS Yes Yes NPU_SI_SETTINGS_NOTIFIED N/A
Pmon restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Swss restart No No Key is deleted Yes
Syncd restart No No Key is deleted Yes
config reload Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Cold reboot N/A No Inline with expected value N/A
  1. Platforms with xcvrd restarting during swss/syncd crash
Event Xcvrd restarted NPU SI settings renotify NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings) Link flap
Config shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Config no shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Xcvrd restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Xcvrd kill Yes No NPU_SI_SETTINGS_NOTIFIED No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS Yes Yes NPU_SI_SETTINGS_NOTIFIED N/A
Pmon restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Swss restart Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Syncd restart Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
config reload Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Cold reboot N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Warm reboot N/A No NPU_SI_SETTINGS_NOTIFIED No
  1. Verified transceiver OIR

Additional Information (Optional)

MSFT ADO - 29278409

@@ -2437,6 +2439,30 @@ def wait_for_port_config_done(self, namespace):
if key in ["PortConfigDone", "PortInitDone"]:
break

"""
Initialize NPU_SI_SETTINGS_SYNC_STATUS_KEY field in STATE_DB PORT_TABLE|<lport>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihirpat1 where do we delete the key when optics is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor We don't delete the key when optics is removed. We instead mark it as NPU_SI_SETTINGS_DEFAULT.
This change is at line 2163-2164 in xcvrd.py.

@mihirpat1
Copy link
Contributor Author

@bingwang-ms Can you please help to cherry-pick this PR to 202405 branch?
MSFT ADO - 29278409

mssonicbld pushed a commit to mssonicbld/sonic-platform-daemons that referenced this pull request Oct 8, 2024
…ng custom NPU SI settings (sonic-net#541)

* Xcvrd crash and restart should not cause link flap on platforms needing custom SI settings

Signed-off-by: Mihir Patel <[email protected]>

* Improved code coverage

---------

Signed-off-by: Mihir Patel <[email protected]>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202405: #547

mssonicbld pushed a commit that referenced this pull request Oct 8, 2024
…ng custom NPU SI settings (#541)

* Xcvrd crash and restart should not cause link flap on platforms needing custom SI settings

Signed-off-by: Mihir Patel <[email protected]>

* Improved code coverage

---------

Signed-off-by: Mihir Patel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants