Incorrect status when server is powered off (Dell poweredge R740) #110

Closed
weeboo opened this issue Mar 27, 2023 · 24 comments
Labels
awaiting reply waiting for a reply, close issue after 14 days of no response bug Something isn't working
Milestone

Comments

@weeboo

weeboo commented Mar 27, 2023

Hello,
My server is a Dell PowerEdge R740.
When the system is powered off, the reported status does not reflect reality:

.\CSH-SYS-x86_check_redfish.py -H x.x.x.x -f xxxxxxxxx.txt --info --detailed
[CRITICAL]: INFO: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 128GB) - BIOS: 2.17.1 - Serial: Cxxxxxxxxxx0 - ServiceTag: J8P6SK3 - Power: Off - Name: NOT SET
[CRITICAL]: Sensor "CPU1 FIVR PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM012 VDDQ PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM012 VPP PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM012 VTT PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM345 VDDQ PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM345 VPP PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 MEM345 VTT PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 VCCIO PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 VCORE PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "CPU1 VSA PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board 1.8V SW PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board 2.5V SW PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board 3.3V B PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board 5V SW PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board BP0 PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board BP1 PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board BP2 PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board NDC PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board PS1 PG FAIL": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board PS2 PG FAIL": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board PVNN SW PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board VSB11 SW PG": Unknown (Enabled/Unknown)
[CRITICAL]: Sensor "System Board VSBM SW PG": Unknown (Enabled/Unknown)
[OK]: Sensor "System Board CMOS Battery": OK (Enabled/Good)
[OK]: Sensor "System Board DIMM PG": OK (Enabled/Good)
[OK]: Sensor "System Board Intrusion": OK (Enabled/No Breach)

When a sensor's status is Unknown, why is it mapped to the CRITICAL state and not to the UNKNOWN state?
Also, is it possible to exclude Unknown sensors when the server is powered off?
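
A minimal sketch of the mapping asked about here, assuming the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `sensor_state_to_plugin_status` and the `ignore_unknown_when_off` switch are hypothetical and are not taken from check_redfish itself:

```python
# Standard Nagios/Icinga plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sensor_state_to_plugin_status(health, power_state, ignore_unknown_when_off=True):
    """Map a sensor health string to a plugin status (hypothetical helper)."""
    health = (health or "Unknown").strip().lower()
    powered_off = (power_state or "").strip().lower() == "off"

    if health in ("ok", "good"):
        return OK
    if health == "warning":
        return WARNING
    if health == "critical":
        return CRITICAL
    # Anything else (Unknown, None, ...) carries no health information.
    if powered_off and ignore_unknown_when_off:
        return OK
    return UNKNOWN
```

With this mapping, an `Unknown` sensor on a powered-off host would not raise an alert, while the same sensor on a running host would surface as UNKNOWN rather than CRITICAL.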

@bb-Ricardo
Owner

Hi,

good point, I will have a look at it.

@lgmu

lgmu commented Apr 4, 2023

Hi,
I have a similar problem on an HPE ProLiant BL460c Gen10.

When the server is powered off, the fan and memory checks go critical:

[OK]: INFO: HPE ProLiant BL460c Gen10 (CPU: 2, MEM: 64GB) - BIOS: I41 v1.46 (10/02/2018) - Serial: *** - Power: Off - Name: NOT SET
[CRITICAL]: Chassi 1 : Fan '1' (0%) status is: UnavailableOffline
[CRITICAL]: Chassi enclosurechassis : Fan '1' (0%) status is: UnavailableOffline
[CRITICAL]: Memory module PROC 1 DIMM 1 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 2 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 3 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 4 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 5 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 6 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 7 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 8 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 1 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 2 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 3 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 4 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 5 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 6 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 7 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 8 (0.0GB) status is: None
[OK]: BMC: iLO 5 (Firmware: iLO 5 v2.72) and all nics are in 'OK' state.
[OK]: All network adapter (1) and ports (0) are in good condition
[OK]: Chassi 1 : No power supplies detected
[OK]: Chassi enclosurechassis : No power supplies detected
[OK]: All processors (2) are in good condition
[OK]: Chassi 1 : All temp sensors (0) are in good condition
[OK]: Chassi enclosurechassis : All temp sensors (0) are in good condition

@bb-Ricardo bb-Ricardo added this to the 1.5.1 milestone Apr 5, 2023
@bb-Ricardo
Owner

Hey, I just pushed a change to the next-release branch. Can you check it out and test if it works now?

thank you.

@lgmu

lgmu commented Apr 6, 2023

Hi, thanks! I tried the latest changes on the next-release branch:

On the same server it works great now:

[OK]: BMC: iLO 5 (Firmware: iLO 5 v2.72) and all nics are in 'OK' state.
[OK]: Chassi 1 : All fans (1) are in good condition
[OK]: Chassi enclosurechassis : All fans (1) are in good condition
[OK]: All 16 memory modules (Total 64.0GB) are in good condition
[OK]: All network adapter (1) and ports (0) are in good condition
[OK]: Chassi 1 : No power supplies detected
[OK]: Chassi enclosurechassis : No power supplies detected
[OK]: All processors (2) are in good condition
[OK]: INFO: HPE ProLiant BL460c Gen10 (CPU: 2, MEM: 64GB) - BIOS: I41 v1.46 (10/02/2018) - Serial: *** - Power: Off - Name: NOT SET
[OK]: Chassi 1 : All temp sensors (0) are in good condition
[OK]: Chassi enclosurechassis : All temp sensors (0) are in good condition

I've found some other servers that still have problems, though:

[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz) status is: None
[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz) status is: None
[OK]: All fans (8) are in good condition
[OK]: All 16 memory modules (Total 1024.0GB) are in good condition
[OK]: All network adapter (3) and ports (5) are in good condition
[OK]: All power supplies (2) are in good condition and 1 Voltages are OK
[OK]: One or more storage components report an issue
[OK]: INFO: Dell Inc. PowerEdge C6420 (CPU: 2, MEM: 1024GB) - BIOS: 2.11.2 - Serial: *** - ServiceTag: *** - Power: Off - Name: NOT SET - 32 health sensors are in 'OK' state
[OK]: All temp sensors (1) are in good condition
|'ps_1'=266 'ps_2'=27 'temp_Inlet_Temp'=21.0;43;47 'Fan_1A'=-2147483648;; 'Fan_1B'=-2147483648;; 'Fan_2A'=-2147483648;; 'Fan_2B'=-2147483648;; 'Fan_3A'=-2147483648;; 'Fan_3B'=-2147483648;; 'Fan_4A'=-2147483648;; 'Fan_4B'=-2147483648;; 

For the CPUs it's not working yet, and I also receive a large negative value (-2147483648) for the fans.

And on another server I have problems with the temp sensors when turned off:

[CRITICAL]: Temp sensor 01-Inlet Ambient status is: Offline (0.0 °C) (max: 42.0 °C)
[CRITICAL]: Temp sensor 02-CPU 1 status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 03-CPU 2 status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 04-P1 DIMM 1-6 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 05-P1 DIMM 7-12 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 06-P2 DIMM 1-6 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 07-P2 DIMM 7-12 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 08-HD Max status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 09-Exp Bay Drive status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 10-Chipset status is: Offline (0.0 °C) (max: 105.0 °C)
[CRITICAL]: Temp sensor 11-PS 1 Inlet status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 12-PS 2 Inlet status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 13-VR P1 status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 14-VR P2 status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 15-VR P1 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 16-VR P1 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 17-VR P2 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 18-VR P2 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 19-PS 1 Internal status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 20-PS 2 Internal status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 21-PCI 1 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 22-PCI 2 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 23-PCI 3 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 24-PCI 4 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 25-PCI 5 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 26-PCI 6 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 27-HD Controller status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 28-LOM Card status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 29-LOM status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 30-Front Ambient status is: Offline (0.0 °C) (max: 65.0 °C)
[CRITICAL]: Temp sensor 31-PCI 1 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 32-PCI 2 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 33-PCI 3 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 34-PCI 4 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 35-PCI 5 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 36-PCI 6 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 37-HD Cntlr Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 38-I/O Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 39-P/S 2 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 40-Battery Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 41-iLO Zone status is: Offline (0.0 °C) (max: 90.0 °C)
[CRITICAL]: Temp sensor 42-Rear HD Max status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 43-Storage Batt status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 44-Fuse status is: Offline (0.0 °C) (max: 100.0 °C)
[OK]: BMC: iLO 4 (Firmware: iLO 4 v2.81) and all nics are in 'OK' state.
[OK]: All fans (6) are in good condition
[OK]: All 5 memory modules (Total 48.0GB) are in good condition
[OK]: All network adapter (1) and ports (4) are in good condition
[OK]: All power supplies (0) are in good condition
[OK]: All processors (1) are in good condition
[OK]: INFO: HPE ProLiant DL380 Gen9 (CPU: 1, MEM: 48GB) - BIOS: P89 v2.64 (10/17/2018) - Serial: *** - Power: Off - Name: ***

@lgmu

lgmu commented Apr 6, 2023

Here I additionally receive a CRITICAL because of an Unknown status for the battery on the RAID controller:

[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.4 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.3 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Battery on RAID Controller in Slot 1 Status: Unknown
[OK]: BMC: iDRAC 9 (Firmware: 5.10.30.00) and all nics are in 'OK' state.
[OK]: Chassi has no fans installed/reported
[OK]: All 32 memory modules (Total 2048.0GB) are in good condition
[OK]: All network adapter (5) and ports (12) are in good condition
[OK]: All power supplies (2) are in good condition and Power redundancy 1 status is: Disabled and 1 Voltages are OK
[OK]: INFO: Dell Inc. PowerEdge R840 (CPU: 4, MEM: 2048GB) - BIOS: 2.14.2 - Serial: *** - ServiceTag: *** - Power: Off - Name: NOT SET - 57 health sensors are in 'OK' state
[OK]: All temp sensors (1) are in good condition

@bb-Ricardo
Owner

Uiui, this needs a more general approach then.

Can you check whether the negative fan values are being sent directly by the iDRAC?
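
One way to check this outside the plugin is to read the Thermal resource directly. This is only a sketch: it assumes the BMC accepts HTTP Basic authentication, the helper names `fetch_redfish` and `fan_readings` are made up for illustration, and the credentials are placeholders. The path `/redfish/v1/Chassis/System.Embedded.1/Thermal` matches the `@odata.id` values in the reply below and will differ on other vendors.

```python
import base64
import json
import ssl
import urllib.request


def fetch_redfish(host, path, username, password, verify_tls=True):
    """GET a Redfish resource and return the decoded JSON body."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # BMCs often ship self-signed certificates; only do this on a trusted network.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def fan_readings(thermal):
    """Return (name, reading, units) for each entry in a Thermal resource's Fans list."""
    return [
        (fan.get("Name") or fan.get("FanName"), fan.get("Reading"), fan.get("ReadingUnits"))
        for fan in thermal.get("Fans", [])
    ]


# Example (needs network access to the BMC, so it is left commented out):
# thermal = fetch_redfish("x.x.x.x", "/redfish/v1/Chassis/System.Embedded.1/Thermal",
#                         "user", "secret", verify_tls=False)
# for name, reading, units in fan_readings(thermal):
#     print(name, reading, units)
```

If the raw `Reading` values are already negative here, the plugin is only passing on what the BMC reports.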

@lgmu

lgmu commented Apr 6, 2023

Yes, they come directly from the iDRAC. Sorry, I didn't check that before:

'Fans': [{'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Thermal#/Fans/0',
           '@odata.type': '#Thermal.v1_7_1.Fan',
           'Assembly': {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Assembly'},
           'FanName': 'FAN1A',
           'HotPluggable': False,
           'LowerThresholdCritical': None,
           'LowerThresholdFatal': None,
           'LowerThresholdNonCritical': None,
           'MaxReadingRange': None,
           'MemberId': '0',
           'MinReadingRange': None,
           'Name': 'FAN1A',
           'PhysicalContext': 'Fan',
           'Reading': -2147483648,
           'ReadingUnits': 'RPM',
           'Redundancy': [],
           'Redundancy@odata.count': 0,
           'RelatedItem': [{'@odata.id': '/redfish/v1/Chassis/System.Embedded.1'}],
           'RelatedItem@odata.count': 1,
           'SensorNumber': 56,
           'Status': {'Health': None, 'State': None},
           'UpperThresholdCritical': None,
           'UpperThresholdFatal': None,
           'UpperThresholdNonCritical': None},
          {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Thermal#/Fans/1',
           '@odata.type': '#Thermal.v1_7_1.Fan',
           'Assembly': {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Assembly'},
           'FanName': 'FAN1B',
           'HotPluggable': False,
           'LowerThresholdCritical': None,
           'LowerThresholdFatal': None,
           'LowerThresholdNonCritical': None,
           'MaxReadingRange': None,
           'MemberId': '1',
           'MinReadingRange': None,
           'Name': 'FAN1B',
           'PhysicalContext': 'Fan',
           'Reading': -2147483648,
           'ReadingUnits': 'RPM',
           ...

@bb-Ricardo
Owner

😄 Well, this is quite something. I should add some sanity checks for the returned values: if a value is out of range, it should default to 0.

What do you think?
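
A rough sketch of such a sanity check, assuming fan readings should never be negative. `sanitize_reading` and its parameters are hypothetical names, not the plugin's actual code:

```python
def sanitize_reading(reading, minimum=0, maximum=None, default=0):
    """Return `reading` if it is a plausible number, else `default` (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        return default
    if reading != reading:  # NaN never equals itself
        return default
    if reading < minimum:
        return default
    if maximum is not None and reading > maximum:
        return default
    return reading


print(sanitize_reading(-2147483648))  # 0
print(sanitize_reading(4800))         # 4800
print(sanitize_reading(None))         # 0
```

Applied to the payload above, `'Reading': -2147483648` would end up as `0` in the performance data instead of the raw value.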

@lgmu

lgmu commented Apr 6, 2023

Sounds good!

@weeboo
Author

weeboo commented Apr 6, 2023

This is better, but I have the same problem with the --proc check when the system is powered off.
My server is a Dell PowerEdge R640:

[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz) status is: None
[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz) status is: None

@bb-Ricardo
Owner

I will fix it in the next version, but not until after the Easter holiday break.

bb-Ricardo added a commit that referenced this issue Apr 20, 2023
@bb-Ricardo
Owner

bb-Ricardo commented Apr 20, 2023

Hey @weeboo, @lgmu,

I just pushed another commit to next-release.

Would you mind testing it?

@bb-Ricardo bb-Ricardo added the bug Something isn't working label Apr 20, 2023
@lgmu

lgmu commented Apr 21, 2023

Hey,
works great now. Thanks!

One thing I've noticed (on Hosts that are Power: on):

Sometimes I randomly get
[CRITICAL]: Processor Proc 1 (Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz) status is: None
but on the next reschedule it's OK again. When I check the logs, I don't see that this has happened before. I'll keep an eye on it and give you feedback.
I don't see any change in the code that would have caused this behaviour, though.

@bb-Ricardo
Owner

Good to hear. But this behavior only occurs with the latest change? And not with older versions?

@lgmu

lgmu commented Apr 21, 2023

I don't know yet, I couldn't reproduce it on the command line - I'll check after the weekend

@weeboo
Author

weeboo commented Apr 24, 2023

Now it's OK for --proc, but the problem now appears with --power:
[CRITICAL]: Power supply 1 (PWR SPLY,1100W,RDNT,LTON) status is: None
[CRITICAL]: Power supply 2 (PWR SPLY,1100W,RDNT,LTON) status is: None

@bb-Ricardo
Owner

Hi,

I intentionally left out the power supply section. Power supplies should be monitored correctly by the BMC even if the server is switched off. I assume it is important to know if a power supply fails while the server is in standby.

What do you think?

@weeboo
Author

weeboo commented Apr 24, 2023

The BMC says the status is None, but you map it to the CRITICAL state.
I think that is not consistent. In my opinion, None could be mapped to the UNKNOWN state.
In this case the system is powered off, so all None statuses could be ignored.
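
One possible way to reconcile both views, sketched under assumptions: explicit failures are always reported, a missing health value is tolerated on a powered-off host, and components the BMC keeps monitoring on standby power report UNKNOWN instead of CRITICAL when no value is available. The names `ALWAYS_MONITORED` and `evaluate_component` are hypothetical, and this is not necessarily how the plugin handles it:

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Component types assumed to stay monitored by the BMC on standby power.
ALWAYS_MONITORED = {"power_supply", "bmc"}


def evaluate_component(component_type, health, system_power_state):
    """Return a plugin status for one component (hypothetical policy sketch)."""
    powered_off = (system_power_state or "").lower() == "off"
    normalized = (health or "").lower()

    if normalized == "ok":
        return OK
    if normalized == "warning":
        return WARNING
    if normalized == "critical":
        # An explicit failure is reported regardless of the power state.
        return CRITICAL
    # No usable health value (None, "", "Unknown", ...).
    if powered_off and component_type not in ALWAYS_MONITORED:
        return OK       # expected while the host is off
    return UNKNOWN      # missing data, but not proof of a failure
```

Whether power supplies belong in `ALWAYS_MONITORED` is exactly the open question in this thread; moving them out of the set would ignore their None status on a powered-off host as well.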

@lgmu

lgmu commented Apr 24, 2023

> Good to hear. But this behavior only occurs with the latest change? And not with older versions?

Seems to be fine, didn't see any more Criticals

@bb-Ricardo
Owner

Hi,

I just pushed another commit regarding the power supply status when the server is switched off. Can you try again, please?

@bb-Ricardo
Owner

@weeboo, @lgmu: any chance of testing this commit?

@bb-Ricardo bb-Ricardo added the awaiting reply waiting for a reply, close issue after 14 days of no response label May 25, 2023
@weeboo
Author

weeboo commented May 25, 2023

I will try today

@weeboo
Author

weeboo commented May 25, 2023

All seems to be fine now
thanks !!

@bb-Ricardo
Owner

Thank you for testing, then I will close this issue.


3 participants