HWMON_DIR is not found for Dell platform causes crash in determine-reboot-cause #6082

Closed
vaibhavhd opened this issue Dec 2, 2020 · 4 comments · Fixed by #6322

Comments

@vaibhavhd
Contributor

Description

During reboot, HWMON_DIR is not found on the Dell platform, which causes the determine-reboot-cause script to crash.

Failing line of code:
https://github.com/Azure/sonic-buildimage/blob/master/platform/broadcom/sonic-platform-modules-dell/s6100/sonic_platform/fan.py#L26

No such file or directory: '/sys/devices/platform/SMF.512/hwmon/'

However, the DIR is present:

root@str-s6100-acs-4:~# ls /sys/devices/platform/SMF.512/hwmon/
hwmon1
root@str-s6100-acs-4:~#

Steps to reproduce the issue:

  1. Install the latest SONiC master image on a Dell device. I hit the issue in 508.
  2. Check the output of `show reboot-cause`. The reboot cause is either not detected or is a stale one from an earlier reboot.
  3. The issue appears in syslog early in boot, while the DUT is booting into the new image.

Describe the results you received:

Dec  2 00:06:23.343347 str-s6100-acs-4 INFO determine-reboot-cause: /proc/cmdline indicates reboot type: warm-reboot
Dec  2 00:06:23.343497 str-s6100-acs-4 INFO ntpd[643]: Listen normally on 0 lo 127.0.0.1:123
Dec  2 00:06:23.343652 str-s6100-acs-4 INFO ntpd[643]: Listen normally on 1 lo [::1]:123
Dec  2 00:06:23.343826 str-s6100-acs-4 INFO ntpd[643]: Listening on routing socket on fd #18 for interface updates
Dec  2 00:06:23.343987 str-s6100-acs-4 INFO ntpd[643]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec  2 00:06:23.344134 str-s6100-acs-4 INFO ntpd[643]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec  2 00:06:23.345629 str-s6100-acs-4 INFO determine-reboot-cause[641]: Traceback (most recent call last):
Dec  2 00:06:23.345767 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Dec  2 00:06:23.345888 str-s6100-acs-4 INFO determine-reboot-cause[641]:     main()
Dec  2 00:06:23.346007 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/bin/determine-reboot-cause", line 185, in main
Dec  2 00:06:23.346129 str-s6100-acs-4 INFO determine-reboot-cause[641]:     (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Dec  2 00:06:23.346249 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Dec  2 00:06:23.346375 str-s6100-acs-4 INFO determine-reboot-cause[641]:     hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Dec  2 00:06:23.346506 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/bin/determine-reboot-cause", line 106, in get_reboot_cause_from_platform
Dec  2 00:06:23.346626 str-s6100-acs-4 INFO determine-reboot-cause[641]:     import sonic_platform
Dec  2 00:06:23.346748 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/__init__.py", line 2, in <module>
Dec  2 00:06:23.346869 str-s6100-acs-4 INFO determine-reboot-cause[641]:     from sonic_platform import *
Dec  2 00:06:23.346990 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/platform.py", line 13, in <module>
Dec  2 00:06:23.347120 str-s6100-acs-4 INFO determine-reboot-cause[641]:     from sonic_platform.chassis import Chassis
Dec  2 00:06:23.347242 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 15, in <module>
Dec  2 00:06:23.347368 str-s6100-acs-4 INFO determine-reboot-cause[641]:     from sonic_platform.psu import Psu
Dec  2 00:06:23.347490 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/psu.py", line 15, in <module>
Dec  2 00:06:23.347611 str-s6100-acs-4 INFO determine-reboot-cause[641]:     from sonic_platform.fan import Fan
Dec  2 00:06:23.347810 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/fan.py", line 22, in <module>
Dec  2 00:06:23.347946 str-s6100-acs-4 INFO determine-reboot-cause[641]:     class Fan(FanBase):
Dec  2 00:06:23.348071 str-s6100-acs-4 INFO determine-reboot-cause[641]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/fan.py", line 26, in Fan
Dec  2 00:06:23.348192 str-s6100-acs-4 INFO determine-reboot-cause[641]:     HWMON_NODE = os.listdir(HWMON_DIR)[0]
Dec  2 00:06:23.348316 str-s6100-acs-4 INFO determine-reboot-cause[641]: FileNotFoundError: [Errno 2] No such file or directory: '/sys/devices/platform/SMF.512/hwmon/'
Dec  2 00:06:23.375962 str-s6100-acs-4 NOTICE systemd[1]: determine-reboot-cause.service: Main process exited, code=exited, status=1/FAILURE
Dec  2 00:06:23.376179 str-s6100-acs-4 WARNING systemd[1]: determine-reboot-cause.service: Failed with result 'exit-code'.
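
The traceback ends inside the body of `class Fan`, which means `os.listdir(HWMON_DIR)` runs as soon as `sonic_platform` is imported, before any fan object is created. A minimal sketch of one way to avoid that, deferring the lookup until first use, is below. `LazyFan` and `hwmon_node()` are hypothetical names for illustration; this is not the actual platform code, and it is not claimed to be the change made in #6322.

```python
import os


class LazyFan:
    """Hypothetical stand-in for the platform Fan class (not the real code)."""

    def __init__(self, hwmon_dir="/sys/devices/platform/SMF.512/hwmon/"):
        self._hwmon_dir = hwmon_dir
        self._hwmon_node = None  # resolved on first use, not at import time

    def hwmon_node(self):
        """Return the first hwmon entry, or None if the directory is not there yet."""
        if self._hwmon_node is None:
            try:
                entries = sorted(os.listdir(self._hwmon_dir))
            except FileNotFoundError:
                # The driver has not created the sysfs directory yet.
                return None
            if not entries:
                return None
            self._hwmon_node = entries[0]
        return self._hwmon_node
```

With this shape, importing the module never touches sysfs, so a caller that starts early gets `None` from `hwmon_node()` instead of an import-time `FileNotFoundError`.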

Describe the results you expected:

Reboot cause should be determined and processed.

@vaibhavhd
Contributor Author

@ArunSaravananBalachandran, tagging you as this seems to be a Dell platform library issue.

@ArunSaravananBalachandran
Contributor

This error occurs because the platform APIs are called before platform initialization is complete. For process-reboot-cause there is a systemd rule to prevent this, but that rule does not cover the new determine-reboot-cause service, so the issue shows up there.
We will work on a fix.
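
To illustrate the ordering problem described above: a caller that may start before the platform drivers are loaded could wait for the hwmon directory to be populated rather than assume it exists. The helper below is a hypothetical sketch (`wait_for_dir_entries` is an invented name, and the timeout values are arbitrary). It is not the systemd-based approach mentioned in this comment and not the fix from #6322.

```python
import os
import time


def wait_for_dir_entries(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists and is non-empty; return its sorted entries.

    Returns an empty list if nothing shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            entries = sorted(os.listdir(path))
        except FileNotFoundError:
            entries = []  # directory not created yet
        if entries:
            return entries
        if time.monotonic() >= deadline:
            return []
        time.sleep(interval)


# Example use (the path is the one from the error message in this issue):
# nodes = wait_for_dir_entries("/sys/devices/platform/SMF.512/hwmon/")
# if not nodes:
#     ...  # fall back to an "unknown" reboot cause instead of crashing
```

Polling trades a bounded delay for robustness; ordering the service after platform initialization in systemd avoids the wait entirely, which is likely why the comment points in that direction.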

@vaibhavhd
Contributor Author

Hi @ArunSaravananBalachandran, do you have an ETA for this fix? Thanks.

@ArunSaravananBalachandran
Contributor

@vaibhavhd, we will raise a PR by the end of next week.
