-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix xcvrd to support 400G ZR optic #293
Conversation
@abohanyang can you please check if on_port_update_event() is triggered even after this PR #285. If so please share the syslog file. |
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -923,6 +923,7 @@ class CmisManagerTask: | |||
|
|||
CMIS_MAX_RETRIES = 3 | |||
CMIS_DEF_EXPIRED = 60 # seconds, default expiration time | |||
CMIS_DEF_EXPIRED_ZR = 200 # seconds, expiration time for ZR module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we read the module advertised datapath init time and use it so that we have it working for both ZR and non-ZR without hardcoding this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To read CMIS dp state duration, I create another PR in sonic-platform-common (sonic-net/sonic-platform-common#312).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abohanyang - How is #312 (above) different from #290 ?
Looks same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shyam77git , it's to address the same issue. but some of my code changes (sonic-net/sonic-platform-common#312 ) are in a different repo. I think we need a separate PR for that part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @abohanyang
Is #312 sufficient to take care of what's mentioned in #290 or anything more required?
Note that #294 to take care for xcvrd side changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. 312 contains the implementation of API, get_datapath_init_duration. I just commit more changes in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. 312 contains the implementation of API, get_datapath_init_duration. I just commit more changes in this PR.
Thanks for the clarification.
This PR now is going to take care of git issue #293 - I'm reviewing this further
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -1516,7 +1539,7 @@ def task_worker(self): | |||
# D.1.3 Software Configuration and Initialization | |||
api.set_datapath_init(host_lanes) | |||
# TODO: Use fine grained timeout when the CMIS memory map is available | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED) | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_expired(api)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abohanyang : get_cmis_expired() - is this a temporary function defined to validate this fix - #293?
practically, get_datapath_init_duration() and get_datapath_deinit_duration() to be invoked instead.
Would you be taking care of this as part of this PR? or shall we take care of this in xcvrd.py via git issue #294 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I just commit more changes in this PR to call get_datapath_init_duration and get the expiration time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I just commit more changes in this PR to call get_datapath_init_duration and get the expiration time.
OK.
@abohanyang : can you please attach UT cases, logs for this case (400G ZR optics FSM working via xcrd.py as orchestrator with these timeouts all the way to interface in 'operational up' state)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Clean up comments. 2. Revert port breakout fix and corresponding test change. 3. Check deinit duration for deinit case 4. Check datapath init pending
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
def check_datapath_init_pending(self, api, channel): | ||
""" | ||
Check if the CMIS datapath init is pending | ||
|
||
Args: | ||
api: | ||
XcvrApi object | ||
channel: | ||
Integer, a bitmask of the lanes on the host side | ||
e.g. 0x5 for lane 0 and lane 2. | ||
|
||
Returns: | ||
Boolean, true if any lanes are pending datapath init, otherwise false | ||
""" | ||
pending = False | ||
dpinit_pending_dict = api.get_dpinit_pending() | ||
for lane in range(self.CMIS_NUM_CHANNELS): | ||
if ((1 << lane) & channel) == 0: | ||
continue | ||
key = "DPInitPending{}".format(lane + 1) | ||
if dpinit_pending_dict[key]: | ||
pending = True | ||
break | ||
|
||
return pending | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought and checking , please look further into this.
should this be interpreted as: as long as one of the lanes is found in DPInitPending state, outcome is pending (i.e. true). Caller of this function on getting 'true', would perform force_cmis_reinit() instead of setting state to CMIS_STATE_DP_INIT (& proceeding with normal workflow)
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -1494,8 +1542,11 @@ def task_worker(self): | |||
self.force_cmis_reinit(lport, retries + 1) | |||
continue | |||
|
|||
# TODO: Use fine grained time when the CMIS memory map is available | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED) | |||
if self.check_datapath_init_pending(api, host_lanes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if any of the lanes has DPInitPending=0 i.e if this fn call return False then we should retry re-init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code based on our discussion.
2. check DpInitPending on module which supports CMIS 5.0 3. Specify sec in return value
@abohanyang could you update on UT/IT? all done or anything still pending etc.? If so, whats the ETA. FYI: For 400G ZR optics, you may get in touch with 'Mihir Patel (MSFT)' as he validated 400G ZR optics with this changeset on his setup |
Hi @shyam77git - The code is pending review. I am waiting for @prgeor comments/approval on my latest commit. |
@abohanyang can you check the description. I think you are addressing only #1 |
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -1458,7 +1490,7 @@ def task_worker(self): | |||
# TODO: Make sure this doesn't impact other datapaths | |||
api.set_lpmode(False) | |||
self.port_dict[lport]['cmis_state'] = self.CMIS_STATE_AP_CONF | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED) | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_dp_deinit_duration_secs(api)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you log the deinit duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I just update the code and the summary of this PR.
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -1515,8 +1553,7 @@ def task_worker(self): | |||
|
|||
# D.1.3 Software Configuration and Initialization | |||
api.set_datapath_init(host_lanes) | |||
# TODO: Use fine grained timeout when the CMIS memory map is available | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED) | |||
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_dp_init_duration_secs(api)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please log the dp_init pending duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oct 19 19:10:24.470226 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=INSERTED, retries=0
Oct 19 19:10:24.584837 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: force Datapath reinit
Oct 19 19:10:25.589603 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_DEINIT, retries=0
Oct 19 19:10:26.621621 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256 DpDeinit duration 1.0 secs
Oct 19 19:10:27.627613 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=AP_CONFIGURED, retries=0
Oct 19 19:10:28.711402 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_INIT, retries=0
Oct 19 19:10:28.729570 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256 DpInit duration 300.0 secs
Oct 19 19:10:29.735655 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_TXON, retries=0
Oct 19 19:11:26.497042 cmp206-4 NOTICE pmon#xcvrd[38]: message repeated 56 times: [ CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_TXON, retries=0]
Oct 19 19:11:26.497149 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: Turning ON tx power
Oct 19 19:11:27.502474 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_ACTIVATION, retries=0
Oct 19 19:11:48.910446 cmp206-4 NOTICE pmon#xcvrd[38]: message repeated 21 times: [ CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_ACTIVATION, retries=0]
Oct 19 19:11:48.910446 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: READY
@yxieca can you help cherry-pick to 202205? |
* Fix xcvrd to support 400G ZR/DR optics * Fix xcvrd to support 400G ZR/DR optics * Revert changes in DomInfoUpdateTask::on_port_config_change * Fix xcvrd test after modifying on_port_config_change * Call get_datapath_init_duration to get the init expiration time. * Address comments 1. Clean up comments. 2. Revert port breakout fix and corresponding test change. 3. Check deinit duration for deinit case 4. Check datapath init pending * 1. Revert changes in on_port_update_event 2. check DpInitPending on module which supports CMIS 5.0 3. Specify sec in return value * Add code coverage * Add log for DpInit/DpDeinit duration
Fix Xcvrd code to support 400G ZR optic
Description
While bringing the 400G ZR module, I found an issue where some 400G ZR optics may take more than 60 secs to finish up the Datapath initialization. This is a problem because the xcvrd uses a hardcoded 60 secs Datapath initialization timeout for all CMIS modules and apparently, this 60 secs timeout is not sufficient for the 400G ZR module.
The CMIS spec has Datapath init/de-init timeout values defined in Byte 144. We can use these values from the EEPROM instead of presenting any hardcoded values.
Motivation and Context
This is needed to bring up 400G ZR module.
How Has This Been Tested?
These changes are tested against arista_7800r3a_36d2_lc and arista_7280cr3_32p4 platforms.