Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[202012] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type #13543

Merged
merged 2 commits into from
Feb 19, 2023

Conversation

Junchao-Mellanox
Copy link
Collaborator

@Junchao-Mellanox Junchao-Mellanox commented Jan 30, 2023

Why I did it

There are 3 tasks in xcvrd:

  1. main task, run a loop to recover missing SFP static information to DB every 1 minute
  2. SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
  3. SFP DOM update task, a thread which handles cable DOM information update every 1 minute

Let assume user replaces QSFP with QSFP-DD. There are two issues:

  1. Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
  2. There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.

The PR is to fix these two issues.

There is no such issue on 202205 and above because there is a refactor for xcvrd:

  1. SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
  2. Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.

How I did it

It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.

How to verify it

  1. Manual test
  2. Regression

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liat-grozovik
Copy link
Collaborator

@Junchao-Mellanox please provide PR description for the change

@liat-grozovik liat-grozovik requested a review from keboliu January 30, 2023 18:07
@liat-grozovik
Copy link
Collaborator

@Junchao-Mellanox pls confirm this is needed only for 202012 and not needed for 202205 and above.
is it handled with the new sfp refactor and is just missing in 202012?
looks like it is Mellanox specific, is it? should label be added?

@Junchao-Mellanox Junchao-Mellanox marked this pull request as draft January 31, 2023 01:19
@liat-grozovik
Copy link
Collaborator

@keboliu @prgeor please review/approve.

@Junchao-Mellanox Junchao-Mellanox changed the title Fix issue: SFP eeprom corrupted after replacing cable with different sfp type [202012 Only] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type Feb 7, 2023
@Junchao-Mellanox Junchao-Mellanox marked this pull request as ready for review February 8, 2023 03:31
@Junchao-Mellanox Junchao-Mellanox changed the title [202012 Only] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type [202012] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type Feb 8, 2023
@liat-grozovik
Copy link
Collaborator

@prgeor kindly reminder to review this PR

@prgeor
Copy link
Contributor

prgeor commented Feb 17, 2023

@Junchao-Mellanox pls confirm this is needed only for 202012 and not needed for 202205 and above. is it handled with the new sfp refactor and is just missing in 202012? looks like it is Mellanox specific, is it? should label be added?

@liat-grozovik I don't think its Mellanox specific. But worth checking if this issue exist in 202205 and above

@Junchao-Mellanox
Copy link
Collaborator Author

There is no such issue on 202205 and above because there is a refactor for xcvrd:

SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.

@liat-grozovik liat-grozovik merged commit 7543993 into sonic-net:202012 Feb 19, 2023
@Junchao-Mellanox Junchao-Mellanox deleted the fix-eeprom-202012 branch June 9, 2023 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants