Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support redfish #734

Merged
merged 14 commits into from
Jul 18, 2023
Merged

Support redfish #734

merged 14 commits into from
Jul 18, 2023

Conversation

rootfs
Copy link
Contributor

@rootfs rootfs commented Jun 13, 2023

Get power info from BMCs that support Redfish. Kepler only needs to access chassis and system to get the power info. Currently the existing redfish golang bindings based on OpenAPI client or gofish are overkill. This PR uses simple access models.

#644

Tasks:

  • Redfish access credential management
  • Create Redfish API model
  • Use Redfish API model to retrieve power metrics
  • Kepler collector reads Redfish power metrics
  • Kepler emits Redfish power metrics
  • Test on real servers (Dell/HPE)
  • Document BMC power metrics
  • (stretch) Support temperature, fan, humidity metrics

@rootfs rootfs force-pushed the redfish branch 4 times, most recently from 1a22c96 to 7b004e2 Compare June 13, 2023 17:23
@rootfs rootfs force-pushed the redfish branch 4 times, most recently from 02cb879 to 2bf807a Compare June 15, 2023 18:46
@rootfs rootfs changed the title [WIP] Support redfish Support redfish Jun 20, 2023
@rootfs rootfs force-pushed the redfish branch 2 times, most recently from dc7d486 to be078c5 Compare June 20, 2023 22:07
@@ -61,6 +62,7 @@ type NodeMetrics struct {
TotalEnergyInGPU *types.UInt64StatCollection
TotalEnergyInOther *types.UInt64StatCollection
TotalEnergyInPlatform *types.UInt64StatCollection
TotalEnergyInRedfish *types.UInt64StatCollection
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, if redfish, is a kind of platform energy api, then we may don't need a new var as TotalEnergyInRedfish but reuse TotalEnergyInPlatform?
as redfish api is a type of Platform API?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Also, if redfish is enabled, we don't need to collect the platform power from other sources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, working on the refactoring

return true
}

func getNodeName() string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this function?

func getNodeName() string {
?

Copy link
Collaborator

@marceloamaral marceloamaral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use redfish to update the Platform Energy.
So, we should verify in the function updatePlatformEnergy if redfish is enabled, otherwise we use the ACPI or LPAR power sources.

@marceloamaral
Copy link
Collaborator

The PR #748 refactor the code and make it easier to include the redfish support.


func getRedfishPower(access RedfishAccessInfo, system string) (*RedfishPowerModel, error) {
var power RedfishPowerModel
err := getRedfishModel(access, "/redfish/v1/Chassis/"+system+"/Power/PowerControl", &power)
Copy link
Contributor

@tiwatsuka tiwatsuka Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/Power/PowerControl is not specified as a resource in Redfish Resource and Schema Guide and HPE iLo Redfish API does not provide it.
cf. https://hewlettpackard.github.io/ilo-rest-api-docs/ilo5/#resource-map
Instead, all of /Power/#PowerControl, /Power#PowerControl and /Power#/PowerControl are available on my machine. You can explore HPE environment at https://ilorestfulapiexplorer.ext.hpe.com/ .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @tiwatsuka for the info! I am going to update the API model to work with your suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiwatsuka can you get the output of Power#/PowerControl on your HPE? The simulator doesn't have it, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't access on my machine right now, but it was just same as output of Power. On second thoughts, it is just interpreted as a URL fragment that the part after #.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
  "@odata.context": "/redfish/v1/$metadata#Power.Power",
  "@odata.etag": "W/\"2613CD55\"",
  "@odata.id": "/redfish/v1/Chassis/1/Power",
  "@odata.type": "#Power.v1_3_0.Power",
  "Id": "Power",
  "Name": "PowerMetrics",
  "Oem": {
    "Hpe": {
      "@odata.context": "/redfish/v1/$metadata#HpePowerMetricsExt.HpePowerMetricsExt",
      "@odata.type": "#HpePowerMetricsExt.v2_3_0.HpePowerMetricsExt",
      "BrownoutRecoveryEnabled": true,
      "HasCpuPowerMetering": false,
      "HasDimmPowerMetering": false,
      "HasGpuPowerMetering": false,
      "HasPowerMetering": true,
      "HighEfficiencyMode": "Balanced",
      "Links": {
        "PowerMeter": {
          "@odata.id": "/redfish/v1/Chassis/1/Power/PowerMeter"
        },
        "FastPowerMeter": {
          "@odata.id": "/redfish/v1/Chassis/1/Power/FastPowerMeter"
        },
        "SlowPowerMeter": {
          "@odata.id": "/redfish/v1/Chassis/1/Power/SlowPowerMeter"
        },
        "FederatedGroupCapping": {
          "@odata.id": "/redfish/v1/Chassis/1/Power/FederatedGroupCapping"
        }
      },
      "MinimumSafelyAchievableCap": null,
      "MinimumSafelyAchievableCapValid": false,
      "SNMPPowerThresholdAlert": {
        "DurationInMin": 0,
        "ThresholdWatts": 0,
        "Trigger": "Disabled"
      }
    }
  },
  "PowerControl": [
    {
      "@odata.id": "/redfish/v1/Chassis/1/Power#PowerControl/0",
      "MemberId": "0",
      "PowerCapacityWatts": 500,
      "PowerConsumedWatts": 44,
      "PowerMetrics": {
        "AverageConsumedWatts": 40,
        "IntervalInMin": 20,
        "MaxConsumedWatts": 78,
        "MinConsumedWatts": 38
      }
    }
  ],
  "PowerSupplies": [
    {
      "@odata.id": "/redfish/v1/Chassis/1/Power#PowerSupplies/0",
      "FirmwareVersion": "1.00",
      "LastPowerOutputWatts": 44,
      "LineInputVoltage": 105,
      "LineInputVoltageType": "ACLowLine",
      "Manufacturer": "LTEON",
      "MemberId": "0",
      "Model": "865408-B21",
      "Name": "HpeServerPowerSupply",
      "Oem": {
        "Hpe": {
          "@odata.context": "/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply",
          "@odata.type": "#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply",
          "AveragePowerOutputWatts": 44,
          "BayNumber": 1,
          "HotplugCapable": true,
          "MaxPowerOutputWatts": 58,
          "Mismatched": false,
          "PowerSupplyStatus": {
            "State": "Ok"
          },
          "iPDUCapable": false
        }
      },
      "PowerCapacityWatts": 500,
      "PowerSupplyType": "AC",
      "SerialNumber": "5WBXK0FLLG25CW",
      "SparePartNumber": "866729-001",
      "Status": {
        "Health": "OK",
        "State": "Enabled"
      }
    },
    {
      "@odata.id": "/redfish/v1/Chassis/1/Power#PowerSupplies/1",
      "MemberId": "1",
      "Oem": {
        "Hpe": {
          "@odata.context": "/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply",
          "@odata.type": "#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply",
          "BayNumber": 2
        }
      },
      "Status": {
        "Health": "Warning",
        "State": "Absent"
      }
    }
  ]
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @tiwatsuka this looks consistent with the new model definition here

I'll finish it next week and hopefully get into the next release in mid July.

// calculate the elapsed time since the last power query in seconds
elapsed := now.Sub(system.timestamp).Seconds()
klog.V(5).Infof("power info: %+v\n", system)
power[system.system] = float64(system.consumedWatts * 1000 * int(elapsed)) // convert to mW
Copy link
Contributor

@YaSuenag YaSuenag Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems to calculate value in Joule (consumedWatts x elapsed), so is it better to name to GetEnergyFromRedfish rather than GetPower in func name? (like platform/source/acpi.go)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, @rootfs can you rename this function in the Class level?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never-mind, the Class already defines the right function name: GetEnergyFromPlatform()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed function name

Copy link
Collaborator

@marceloamaral marceloamaral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rootfs can you change the logic to get the redfish power as the Platform power?
You can follow the logic structure in hmc and the logic in acpi.

Also when we are using the redfish to collect the platform power (i.e. the total node power), we don't need to use the ACPI or HMC. So the platform package should verify if redfish is available and use this instead of ACPI or HMC.

This should be a similar logic as we collect the component energy that can come from different power sources.

}

// set a timer to check the power info every intervalInMin minutes
if intervalInMin > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntervalInMin in Redfish is irrelevant to the interval of updating the value of PowerConsumedWatts.
I believe that PowerConsumedWatts is updated every one or few seconds. So it would rather be collected more frequently than using IntervalInMin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

rf.mutex.Lock()
now := time.Now()
// calculate the elapsed time since the last power query in seconds
elapsed := now.Sub(system.timestamp).Seconds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timestamp is based on the ticker defined at IsPowerSupported(), but GetPower() is called periodically based on another ticker. Is the energy value calculated correctly? I suspect that some energy is added multiple times or lost due to asynchronous two tickers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiwatsuka you are right, the GetPower() happens asynchronously. Since the HTTP request could take time, it is not a good idea to put it in the critical path. So I let the redfish call happen in the background and Kepler collector just reads the existing reading without having to issue HTTP calls. The accuracy and synchronization can be further improved when using a smaller probe interval window or post processing calibration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timestamp should be refreshed when the energy is read. If not, it counts same period multiple times.
Current implementation seems like the image shown below. (I hope it makes sense to you. )
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiwatsuka thanks for the review. I have pushed the code. Will test it and ask for more reviews once I get a new system to test it again.

@rootfs
Copy link
Contributor Author

rootfs commented Jul 10, 2023

@marceloamaral redfish and ACPI are now in the same platform pkg. If redfish is available, then platform uses redfish reading, otherwise use ACPI.

Signed-off-by: Huamin Chen <[email protected]>
Copy link
Contributor

@tiwatsuka tiwatsuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some suggestions to get energy correctly.

pkg/power/platform/source/redfish.go Outdated Show resolved Hide resolved
pkg/power/platform/source/redfish.go Outdated Show resolved Hide resolved
pkg/power/platform/source/redfish.go Outdated Show resolved Hide resolved
pkg/power/platform/source/redfish.go Outdated Show resolved Hide resolved

func getRedfishPower(access RedfishAccessInfo, system string) (*RedfishPowerModel, error) {
var power RedfishPowerModel
err := getRedfishModel(access, "/redfish/v1/Chassis/"+system+"/Power%23/PowerControl", &power)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is URL encoding required? It's not work in my environment and using '#' is working fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit test needs %23 but I haven't tested my setup yet.

host := access.Host

// Create a HTTP client and set up the basic authentication header
client := &http.Client{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it required to set TLSClientConfig to &tls.Config{InsecureSkipVerify: true}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My old Dell setup uses http, but I haven't tried https yet. Please add it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to set InsecureSkipVerify as an option (e.g. environment variable) because it is important to secure access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiwatsuka @YaSuenag sounds good. can you suggest a change and I'll merge it here. thanks

rootfs and others added 4 commits July 13, 2023 07:25
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
@rootfs
Copy link
Contributor Author

rootfs commented Jul 13, 2023

@tiwatsuka @YaSuenag thank you for reviewing. My setup will be ready later this week and run the tests on it. Will keep you updated.

Copy link
Contributor

@tiwatsuka tiwatsuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add config to skip SSL verification.
@YaSuenag Does this satisfy your request?

pkg/config/config.go Show resolved Hide resolved
pkg/config/config.go Show resolved Hide resolved
pkg/power/platform/source/redfish_util.go Show resolved Hide resolved
pkg/power/platform/source/redfish_util.go Show resolved Hide resolved
pkg/power/platform/source/redfish_util.go Outdated Show resolved Hide resolved
rootfs and others added 5 commits July 14, 2023 07:27
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: Takuya Iwatsuka <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
@YaSuenag
Copy link
Contributor

Thanks @tiwatsuka to create the change, and @rootfs for merging! I completely agree with your change.

BTW new parameters / environment variables will be documented? Will it happen on https://github.com/sustainable-computing-io/kepler-doc ?

@rootfs
Copy link
Contributor Author

rootfs commented Jul 15, 2023

@YaSuenag yes, kepler-doc will have the redfish document. If you or your team can come up with more tests and instructions, please share with the community over there. Thanks!

if decErr != nil {
if strings.HasPrefix(decErr.Error(), "json: unknown field ") {
// ignore unknown field error
fieldName := strings.TrimPrefix(decErr.Error(), "json: unknown field ")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dell returns some fields that are not defined in the power model, ignore them here

@rootfs
Copy link
Contributor Author

rootfs commented Jul 17, 2023

@marceloamaral redfish is now platform energy. PTAL

Here is the output:

image

Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
@rootfs
Copy link
Contributor Author

rootfs commented Jul 17, 2023

@YaSuenag @tiwatsuka the kepler-doc update will be at this PR sustainable-computing-io/kepler-doc#74

@rootfs rootfs dismissed marceloamaral’s stale review July 18, 2023 13:19

redfish is now part of the platform energy

@rootfs rootfs removed the request for review from jiangphcn July 18, 2023 13:19
Copy link
Collaborator

@husky-parul husky-parul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants