Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xcvrd] Enhance xcvrd to handle new system level event/error #39

Merged
merged 14 commits into from
Aug 10, 2019
Merged

[xcvrd] Enhance xcvrd to handle new system level event/error #39

merged 14 commits into from
Aug 10, 2019

Conversation

keboliu
Copy link
Collaborator

@keboliu keboliu commented Aug 1, 2019

This PR has a dependency on PRs:
sonic-net/sonic-platform-common#50
sonic-net/sonic-buildimage#3261

In the above PRs introduced the new system-level event, this PR to enhance xcvrd to be more robust by handling the more new system-level event:

  1. wait for a certain period on system_not_ready event to overcome race condition when xcvrd started but system not ready.

  2. start to routing work on system_become_ready event, this event indicates that the system is ready for SFP change event report.

  3. exit on system_fail event

for key, value in port_dict.iteritems():
if key == EVENT_ON_ALL_SFP:
logger.log_error("Receive event with unexpected content, key: {}, value: {}".format(key, value))
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Receive [](start = 42, length = 7)

Past tense is better. As original 'Got'. Many time below. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

for key, value in port_dict.iteritems():
if key == EVENT_ON_ALL_SFP:
if value == SYSTEM_NOT_READY:
# system not ready, wait and retry
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 32, length = 1)

wrong indentation. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

if value == SYSTEM_NOT_READY:
# system not ready, wait and retry
if self._retry_for_system_ready():
continue
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 33, length = 3)

inconsistent indentation #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

while retry < RETRY_TIMES_FOR_SYSTEM_READY:
#time.sleep(RETRY_PERIOD_FOR_SYSTEM_READY_SECS)
time_start = time.time()
rc, event_dict = platform_sfputil.get_transceiver_change_event(RETRY_PERIOD_FOR_SYSTEM_READY_MSECS)
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

platform_sfputil.get_transceiver_change_event [](start = 29, length = 45)

You can call platform_sfputil.get_transceiver_change_event function only once, and implement the logic as a normal state machine.

I don't mean your implementation is wrong, state machine may be much more readable and future proof. #Closed

Copy link
Collaborator

@stephenxs stephenxs Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Qi,
I admit that rewriting it in a state machine approach can make the code a little shorter by eliminating some duplicate code.
But I'm afraid it can also introduce negative effects: We have the logic to control retry timespan and times and introduce some local variables to control that. Rewriting it as a state machine will combine the retry control logic into the main procedure of xcvrd, which make the main function complicated. #Closed

Copy link
Contributor

@qiluo-msft qiluo-msft Aug 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit for state machine pattern is not shorter code, but a correct design and implementation.

For the design part, you need to consider any sequence of change events returned by vendor plugin code. For example, SYSTEM_NOT_READY followed by an inserted event. You need to draw the state machine on paper first to consider all the events at each state. You need to document the state machine diagram first to consider all the events at each state.

For the implementation part, you could design you state transition code in standalone function, and achieve 'control retry timespan and times' or 'local variables'.


In reply to: 310075723 [](ancestors = 310075723)

Copy link
Collaborator

@stephenxs stephenxs Aug 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Qi,
I understand the advantage of state machine approach. I just feel that the logic of xcvrd daemon is relatively simple and it seems that using state machine is less necessary.
IMO, state machine is adopted in event handling systems which have a lot of state and event, like network protocols. In those systems using state machine makes the flow clear and relatively easy to understand and also benefits the formal verification of that kind of system.
But for systems with less states and events the state machine approach doesn't have overwhelming advantage, since the current logic is straightforward. I have considered the state machine approach and found that only 2 states and 4 events. In this sense it's less necessary to go in this way.
meanwhile, I've created a PR in which I've updated the handling logic in a state-machine-like way, I think it is better than the current implementation. but it's not a strictly-speaking state machine. the pr is sonic-platform-daemons PR#5. You can check it.

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comments
Also blocked on sonic-net/sonic-platform-common#50

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed again. Pending on #39 (comment)

Stephen Sun and others added 4 commits August 6, 2019 08:50
update the logic of sfp_state_update_task in a state machine approach
the definition of state machine is in front of sfp_state_update_task.task_worker
@stephenxs
Copy link
Collaborator

stephenxs commented Aug 8, 2019

As comments
Also blocked on Azure/sonic-platform-common#50

code updated with state machine approach
states definition

  • Initial state: INIT, before received system ready or a normal event
  • Final state: EXIT
  • other state: NORMAL, after has received system-ready or a normal event

events definition

  • SYSTEM_NOT_READY
  • SYSTEM_BECOME_READY
  • NORMAL_EVENT
    • sfp insertion/removal
    • timeout returned by sfputil.get_change_event with status = true
  • SYSTEM_FAIL

State transmit:

  1. SYSTEM_NOT_READY
    • INIT
      • retry < RETRY_TIMES_FOR_SYSTEM_READY
        retry ++
      • else
        max retry reached, treat as fatal, exit
    • NORMAL
      Treat as a fatal error, exit
  2. SYSTEM_BECOME_READY
    • INIT
      transmit to NORMAL
    • NORMAL
      log the event
      nop
  3. NORMAL_EVENT
    • INIT (for the vendors who don't implement SYSTEM_BECOME_READY)
      transmit to NORMAL
      handle the event normally
    • NORMAL
      handle the event normally
  4. SYSTEM_FAIL
    treat as a fatal error

State event next state
INIT SYSTEM NOT READY INIT / EXIT
INIT SYSTEM BECOME READY NORMAL
NORMAL SYSTEM BECOME READY NORMAL
INIT/NORMAL SYSTEM FAIL EXIT
INIT/NORMAL NORMAL EVENT NORMAL
NORMAL SYSTEM NOT READY EXIT
EXIT -

status, port_dict = platform_sfputil.get_transceiver_change_event(timeout)
logger.log_debug("Got event {} {} in state {}".format(status, port_dict, state))
event = self._mapping_event_from_change_event(status, port_dict)
for key, value in port_dict.iteritems():
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for key, value in port_dict.iteritems(): [](start = 12, length = 40)

Why you put the for-loop outside the event/state branches? Can you move it inside a smaller branch? #Closed

Copy link
Collaborator

@stephenxs stephenxs Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are correct.
Only for NORMAL_EVENT is this for loop required.
The "break" statements which originally break the inner "for" loop are also removed.

SYSTEM_FAIL = 'system_fail'
NORMAL_EVENT = 'normal'
# states definition
STATE_INIT = 'INIT STATE'
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'INIT STATE' [](start = 13, length = 12)

You can use integer as value, don't rely on python to optimize the string comparison. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to use string is better since it makes logging information more human-readable.
Even though a dictionary can be adopted to map from integer value to string, the case that the state is not in the key set of the dictionary has to be handled. That's too complicated.
BTW, I'm not sure whether to use string address to compare the string is part of python's language specification or just optimization that depends on python interpreter's implementation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

states have been updated to integer value.
The events, like SYSTEM_FAIL, are already used by the plugins. In this sence, we'd better not to change them. It seems that to use string has already been a common practice in plugins, like the normal plugin/out events which are represented by '1', '0' respectively.

# If get_transceiver_change_event() return error, will clean up the DB and then exit
# TODO: next step need to define more error types to handle accordingly.
logger.log_error("Failed to get transceiver change event. Exiting...")
next_state = None
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next_state [](start = 12, length = 10)

You may initialize it with state, next_state will never be None. #Closed


if unexpected_state:
next_state = STATE_EXIT
logger.log_error("Facing unexpected state {}, exiting...".format(state))
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove unexpected_state. This is redundant logic. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for log.
By setting unexpected_state to true we can know why the state machine reach STATE_EXIT.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for log.
By setting unexpected_state to true we can know why the state machine reach STATE_EXIT.

I decide to remove this. It is very unlikely to enter an unexpected state since all states are handled for each event.

else:
unexpected_state = True
next_state = STATE_EXIT
break
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeated code, you can extract function. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now, only next_state is set when facing unexpected state. in this sense, no need to extract function.

os.kill(os.getppid(), signal.SIGTERM)
break
elif next_state == STATE_NORMAL:
timeout = 0
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move this part into event handling branches. #Closed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to set it here since it can ensure that wherever the state reaches STATE_NORMAL the timeout can be properly set.

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my new comment, and also please resolve merge conflict for this PR.

logger.log_error("Failed to get transceiver change event. Exiting...")
logger.log_warning("Got unknown event {} on state {}.".format(event, state))

if not next_state == state:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not next_state == state: [](start = 12, length = 27)

For readability, use if next_state != state

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@qiluo-msft qiluo-msft merged commit 88d6d5a into sonic-net:master Aug 10, 2019
@keboliu keboliu deleted the retry-on-fail branch August 13, 2019 03:05
jleveque pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Aug 14, 2019
[sonic-platform-common]

[sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net/sonic-platform-common#51)
add more error code to get_transceiver_change_event ((sonic-net/sonic-platform-common#50)
[sonic_platform_base] support new-platform-api-based daemons ((sonic-net/sonic-platform-common#48)
sync change to sonic_platform_base/sonic_sfp and create symbol link ((sonic-net/sonic-platform-common#49)
Add parser support for Tx_RxLos,TxFault, PowerControl, ResetStatus in sff8436.py ((sonic-net/sonic-platform-common#45)
readd type_abbrv_name in sonic_sfp/sff8436.py ((sonic-net/sonic-platform-common#44)
[psu_base] get_status_led() returns current state of the status LED ((sonic-net/sonic-platform-common#39)
Fix abbrv name for OSFP ((sonic-net/sonic-platform-common#36)
[sff8436] support "Control Bytes" and "Options" ((sonic-net/sonic-platform-common#38)
sonic_sfp: avoid possible key error in get_physical_to_logical() ((sonic-net/sonic-platform-common#37)

[sonic-platform-daemons]

[xcvrd] Enhance xcvrd to handle new system level event/error (sonic-net/sonic-platform-daemons#39)
[xcvrd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#38)
[psud] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#37)
[syseepromd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#36)
Add missing import statemet (sonic-net/sonic-platform-daemons#32)
sonic_xcvrd: Support for DOM Threshold values for EEPROM dump (sonic-net/sonic-platform-daemons#29)
yxieca added a commit that referenced this pull request Aug 16, 2019
[xcvrd] backport PR(#39) "Enhance xcvrd to handle new system level event/error" to 201811
wangshengjun pushed a commit to wangshengjun/sonic-buildimage that referenced this pull request Nov 16, 2020
…ic-net#3333)

[sonic-platform-common]

[sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net/sonic-platform-common#51)
add more error code to get_transceiver_change_event ((sonic-net/sonic-platform-common#50)
[sonic_platform_base] support new-platform-api-based daemons ((sonic-net/sonic-platform-common#48)
sync change to sonic_platform_base/sonic_sfp and create symbol link ((sonic-net/sonic-platform-common#49)
Add parser support for Tx_RxLos,TxFault, PowerControl, ResetStatus in sff8436.py ((sonic-net/sonic-platform-common#45)
readd type_abbrv_name in sonic_sfp/sff8436.py ((sonic-net/sonic-platform-common#44)
[psu_base] get_status_led() returns current state of the status LED ((sonic-net/sonic-platform-common#39)
Fix abbrv name for OSFP ((sonic-net/sonic-platform-common#36)
[sff8436] support "Control Bytes" and "Options" ((sonic-net/sonic-platform-common#38)
sonic_sfp: avoid possible key error in get_physical_to_logical() ((sonic-net/sonic-platform-common#37)

[sonic-platform-daemons]

[xcvrd] Enhance xcvrd to handle new system level event/error (sonic-net/sonic-platform-daemons#39)
[xcvrd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#38)
[psud] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#37)
[syseepromd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#36)
Add missing import statemet (sonic-net/sonic-platform-daemons#32)
sonic_xcvrd: Support for DOM Threshold values for EEPROM dump (sonic-net/sonic-platform-daemons#29)
vdahiya12 pushed a commit to vdahiya12/sonic-platform-daemons that referenced this pull request Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants