Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redfish plugin for hardware monitoring #7082

Merged
merged 12 commits into from
Jun 18, 2020
Merged

Redfish plugin for hardware monitoring #7082

merged 12 commits into from
Jun 18, 2020

Conversation

sarvanikonda
Copy link
Contributor

@sarvanikonda sarvanikonda commented Feb 25, 2020

This PR consists of redfish APIs for hardware monitoring of DELL and HP servers.

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

admin account and others added 2 commits February 25, 2020 08:07
Signed-off-by: admin account <vzwadmin@tpa_poc_influxdb_1-tampa.vici.verizon.com>
Signed-off-by: admin account <vzwadmin@tpa_poc_influxdb_1-tampa.vici.verizon.com>
Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
I've included a list of recommendations, as well we may want to remove the .dell_power.json.swp file as it doesn't seem to be relevant to the tests.

plugins/inputs/redfish/README.md Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
func getThermal(host, username, password, id string) error {
r := fmt.Sprint(host, "/redfish/v1/Chassis/", id, "/Thermal")
client := &http.Client{}
http.DefaultTransport.(*http.Transport).TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple issues here:

  • client should be set globally and reused; it's thread safe (goroutine safe) by default.
  • We probably don't want to use InsecureSkipVerify by default, as it doesn't validate anything about where it's connecting to. If possible we want to default to something a little more secure and give users the option to set this, with similar scary naming.
  • Rather than modifying the TLS config on the default transport, you want to set TLS config on the client's transport that you're creating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably made this a little less secure by turning TLS off by default. Do we know if Redfish typically has https available by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs I read and by some practical knowledge I understood that redfish only supports https by default and even python redfish library only supports https protocol, I have changed the code accordingly.

plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
@sarvanikonda sarvanikonda requested a review from ssoroka May 11, 2020 14:46
Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The produced data model doesn't align very closely with what I see in the Redfish specification, I'd like to mirror the types as I think this will make it easier to extend to new types and understand for users familiar with Redfish. The most straightforward way to me seems to be to one metric for each supported resource type. For example we could have one type for Thermal, one for Power.

plugins/inputs/redfish/README.md Outdated Show resolved Hide resolved
plugins/inputs/redfish/README.md Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
plugins/inputs/redfish/redfish_test.go Show resolved Hide resolved
plugins/inputs/redfish/README.md Outdated Show resolved Hide resolved
plugins/inputs/redfish/README.md Outdated Show resolved Hide resolved

### Metrics for Dell Servers

- cpu_temperatures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W should name these redfish_thermal, redfish_power which will fit in well with the output of other plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The produced data model doesn't align very closely with what I see in the Redfish specification, I'd like to mirror the types as I think this will make it easier to extend to new types and understand for users familiar with Redfish. The most straightforward way to me seems to be to one metric for each supported resource type. For example we could have one type for Thermal, one for Power.

We have two type data coming from each resource(thermal,power) , if we send all the data from each resource to a single measurement it would be a problem as in the thermal resource the number of fans and CPUs temperature varies, for example there might be 50 CPUs but only 7 Fans, Similar for Power resource as well. It would be appropriate to store them in different measurements like redfish_thermal_temperatures, redfish_thermal_fans,redfish_power_powersupplies, redfish_power_voltages. What would you think of it?

plugins/inputs/redfish/redfish.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're pretty much there. A few small comments.

State string
Health string
}
type speed struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct casing isn't consistent. Use Capital first letter for public structs, lowercase for private.

}

} else {
return fmt.Errorf("received status code %d (%s), expected 200",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stops processing as soon as it finds a url that doesn't return a 200. This is probably okay, but you might want to return whatever data you can manage, instead of no data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be appropriate to stop the process when it does not return 200, in some scenarios for example user might give wrong credentials or ID, the process will stop at 1st API call and also if there is a proxy or any connectivity problem the process will be stopped at 1st API call itself. there might be rare situations where we receive data for one API and not for other APIs.

Status CpuStatus `json:"Status"`
}
type Payload struct {
Temperatures []*Cpu `json:",omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to use slices of structs instead of slices of pointers to structs. This way you don't need to worry about the nil pointer case

}
if (len(r.ClientConfig.TLSCA) == 0) || (len(r.ClientConfig.TLSCert) == 0 && len(r.ClientConfig.TLSKey) == 0) {
var insecuretls tls.ClientConfig
insecuretls.InsecureSkipVerify = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the tls.ClientConfig type to provide the insecure option and feed whatever configuration it provides into the http.Transport. Change the host option to be a base url so that it will include the scheme:

address = "https://127.0.0.1"

Copy link
Contributor Author

@sarvanikonda sarvanikonda Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it as suggested by you but by default redfish only supports HTTPS protocol, if we just feed tls.ClientConfig config to http.Transport , users might not provide TLS configuration in conf file and errors might occur (errors like cannot validate certificate, etc might occur.)

Comment on lines 164 to 169
if len(r.Host) == 0 || len(r.BasicAuthUsername) == 0 || len(r.BasicAuthPassword) == 0 {
return fmt.Errorf("Did not provide IP or username and password")
}
if len(r.Id) == 0 {
return fmt.Errorf("Did not provide all the ID of the resource")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for Host and ID in the Init() function. Don't check for the username/password, but set it on the request only if one of the options is non-empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to check Host, username/password and ID in the Init() function(included username/password check as well because for redfish APIs the authorisation is mandatory)

Comment on lines 179 to 183
resp, err := r.client.Do(req)
if err != nil {
return err
}
if resp.StatusCode == 200 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to close the response body anytime Do returns without error. This will ensure that connections are properly reused and not "leaked". The easiest way to do this is to use defer, but to do so you will need to split the request code into it's own function, since defer code runs when the function ends.

if resp.StatusCode == 200 {
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("%v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err

tags["room"] = payload.Location.PostalAddress.Room
tags["rack"] = payload.Location.Placement.Rack
tags["row"] = payload.Location.Placement.Row
tags["severity"] = Severity(j.UpperThresholdCritical, j.UpperThresholdFatal, j.ReadingCelsius)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the severity tag, we avoid adding computed values since they are best calculated at query time or optionally in a processor. Especially since it seems that this is a new concept, not part of the redfish specification.


### Example Output For Dell
```
redfish_thermal_temperatures,source=test-hostname,name=CPU1\ Temp,source_ip=http://190.0.0.1,host=test-telegraf,datacenter="Tampa",health="OK",rack="12",room="tbc",row="3",state="Enabled",severity="OK" temperature=41,upper_threshold_critical=59,upper_threshold_fatal=64 1582114112000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we talked about this earlier, but I'm having second thoughts about the source_ip tag. It seems that we don't necessarily have an IP Address, it could just as easily be a hostname. The source_ip also doesn't necessarily correspond to the source tag, so I feel the name doesn't fit well.

How about we call it address instead?

Let's also make sure it is only the hostname, not a URL.

@@ -0,0 +1,329 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename dell_computer_system.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed it as dell_systems.json as it sounds more appropriate.

@@ -0,0 +1,187 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename dell_chassis

@@ -0,0 +1,319 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename hp_computer_system.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed it as hp_systems.json as it sounds more appropriate.


func (r *Redfish) Gather(acc telegraf.Accumulator) error {
var url []string
var payload Payload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing all the data into a single Payload type makes it more difficult to determine where the various bits of data are sourced from. It also increases the chances of value conflicts across resources and will make it difficult to support multiple Power or Thermal resources per ComputerSystem. Instead of using this shared type we should parse each resource into its own dedicated type.

I suggest we relay out of the Gather flow as follows:

  • Create a func get(url, type struct) error that will do requests and parse the json into the provided struct. Having this function will help with the response body closing mentioned in the review.

  • In Gather Get the ComputerSystem, Chassis resources. In the future it would be nice to follow the @odata.i links in case a system has multiple Thermal or Power resources, or the ID is not shared for all resources, but it is not required now.

  • Next get the Thermal resource, then I suggest another function that creates the tags and fields using it and the Chassis and ComputerSystem types, and call AddFields.

  • Finally get the Power resource, also in another function create the tags and fields using it and the Chassis and ComputerSystem types, and call AddFields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About parsing data into a single Payload was suggested by Steven the other reviewer, previously the json data was parsed into different structs, What shall we do now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, don't worry about doing this refactor.

password = "test"

## Resource Id for redfish APIs
id="System.Embedded.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ComputerSystem naming is taken from Redfish Resource Guid, do you have a link to where the ServerSystem is described? Also, don't forget to update this README.

}
tlsCfg, err := r.ClientConfig.TLSConfig()
if err != nil {
return fmt.Errorf("%v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err, use this pattern throughout the PR


func (r *Redfish) Init() error {
if len(r.Address) == 0 || len(r.BasicAuthUsername) == 0 || len(r.BasicAuthPassword) == 0 {
return fmt.Errorf("Did not provide IP or username and password")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### Example Output
```
redfish_thermal_temperatures,source=test-hostname,name=CPU1\ Temp,address=http://190.0.0.1,host=test-telegraf,datacenter="Tampa",health="OK",rack="12",room="tbc",row="3",state="Enabled" reading_celsius=41,upper_threshold_critical=59,upper_threshold_fatal=64 1582114112000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to update this section

@danielnelson danielnelson added this to the 1.15.0 milestone Jun 18, 2020
@danielnelson danielnelson merged commit b0cd913 into influxdata:master Jun 18, 2020
@danielnelson
Copy link
Contributor

Merged for 1.15.0, excellent work @sarvanikonda

@danielnelson
Copy link
Contributor

Hey @sarvanikonda, I'm trying to get the plugin working with this simulator, and I noticed that the HP test examples don't have a file for the Chassis resource. Do these exist, there is a link listed in the System resource to /redfish/v1/Chassis/1/.

@sarvanikonda
Copy link
Contributor Author

Hey @danielnelson from the Chassis resource I am fetching Location details, for HP the /redfish/v1/Chassis/1/ exists but the location details in the resource are not provided by HP redfish, I used the existing the hp_power.json resource file in testdata as input to the /redfish/v1/Chassis/1/ endpoint in redfish_test.go file as the Location data is anyway not available if this is going to confuse people I will add the Chassis resource file in test data and change the redfish_test.go file.

@danielnelson
Copy link
Contributor

Got it, yeah let's add this file for completeness when you have some free time. What I found with this simulator is that the ID is not necessarily the same between the System and the Chassis endpoints.

I am making a small change to the plugin to start with the System and use it's Chassis link. This requires a Chassis response but for now I can just toss in a placeholder.

@sarvanikonda
Copy link
Contributor Author

Hi @danielnelson I was also planning to add LowerThresholdCritical and LowerThresholdFatal values for temperatures,fans and voltages, I see that a PR has been raised by you #7722, would you like to add these values in 7722 PR ? or shall I wait to 7722 PR be merged to master and then add the values and raise a PR.

@danielnelson
Copy link
Contributor

I added these values to #7722 and also adjusted some field types match what is listed in the schema. This will probably require you to drop some measurements from your database but better now than after the release.

@sarvanikonda
Copy link
Contributor Author

Hey @danielnelson could you be more specific about dropping some measurements? and thank you very much for adding the lower threshold values, but in the code these values are also added in structs but not added in gather function?

@danielnelson
Copy link
Contributor

danielnelson commented Jun 24, 2020

On the fields that I switched to be a float, you will get "type conflict error" when you insert into InfluxDB. To clear this up you have a few options but the most straightforward way is to delete the old metrics in the influx cli:

drop series from redfish_thermal_temperatures
drop series from redfish_thermal_fans

but in the code these values are also added in structs but not added in gather function?

Uh oh, me trying to move to fast and failing badly.

@sarvanikonda
Copy link
Contributor Author

oh, got it now, when you mentioned to drop measurements I thought you were thinking to delete some measurements in the code.
Thanks again for updating the code. I will test it once the lower threshold values are added in gather function.

rhajek pushed a commit to bonitoo-io/telegraf that referenced this pull request Jul 13, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants