Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client driver health checks for Docker #3856

Merged
merged 24 commits into from
Mar 21, 2018
Merged

Conversation

chelseakomlo
Copy link
Contributor

@chelseakomlo chelseakomlo commented Feb 8, 2018

No description provided.

@chelseakomlo chelseakomlo changed the title Client add driverhealth checks Client add driver health checks Feb 8, 2018
@chelseakomlo chelseakomlo changed the title Client add driver health checks [WIP] Client add driver health checks Feb 9, 2018
@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch 2 times, most recently from ee0f21f to 9e06dd9 Compare February 13, 2018 18:02
@chelseakomlo chelseakomlo changed the title [WIP] Client add driver health checks Client driver health checks for Docker Feb 13, 2018
client/client.go Outdated
return c.config.Node
}

func (c *Client) updateNodeFromHealthCheck(response *cstructs.HealthCheckResponse) *structs.Node {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What synchronizes this with the servers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to add this in a separate PR that will address synchronization and scheduling.

@@ -1139,6 +1148,9 @@ type Node struct {
// Raft Indexes
CreateIndex uint64
ModifyIndex uint64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Raft indexes are kept at the bottom of the structs so put this above them. I would put it above attributes.

@@ -1062,6 +1062,15 @@ func ValidNodeStatus(status string) bool {
}
}

// DriverInfo is the current state of a single driver. This is updated
// regularly as driver health changes on the node.
type DriverInfo struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have to make a duplicate of this in the api/node.go and update the node

// it should be run periodically. The return value is a boolean indicating
// whether it should be done periodically, and the time interval at which
// this check should happen.
CheckHealthPeriodic() (bool, time.Duration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be request response as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of either name.

Check -> HealthCheck
CheckHealthPeriodic -> GetHealthCheckInterval

type HealthCheckRequest struct{}

type HealthCheckResponse struct {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove blank line


fm.nodeLock.Lock()
if node := fm.updateHealthCheck(&response); node != nil {
fm.node = node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock only needs to be held in this if

HealthDescription: "Docker driver is available but unresponsive",
UpdateTime: time.Now(),
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use d.dockerClients()

return err
}

d.logger.Printf("[DEBUG] driver.docker: docker driver is available and is responsive to `docker ps`")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TRACE

if err := hc.Check(request, &response); err != nil {
return err
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the value of the health check hasn't changed avoid calling updateHealthCheck. Otherwise every minute we will be causing a the client to push changes to the server and cause unnecessary Raft transactions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about handling that in updateHealthCheck itself, as #3873 moves this logic to updateNodeFromFingeprint

client/client.go Outdated
for name, val := range response.Drivers {
c.config.Node.Drivers[name] = val
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The driver health seems like a nice metric to emit. Healthy and unhealthy with driver as a label.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. This seemed like the best option but let me know if you had something else in mind. https://github.com/armon/go-metrics/blob/2e4f2be0fe4f6b7096471aa85f2c342bff3b8f4f/metrics.go#L105

@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch 4 times, most recently from 2831093 to 29c49cf Compare February 23, 2018 19:15
@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch 2 times, most recently from b8695fd to 4f1ba6b Compare March 1, 2018 21:05
return false
}

if strings.Compare(di.HealthDescription, other.HealthDescription) != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use ==

option.ID, driverStr, value)
return false
}
// TODO this is a compatibility mechanism- as of Nomad 0.8, nodes have a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use this syntax for compatibility code:

// COMPAT: Remove in 0.X: As of Nomad 0.8....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets us search for compatibility code

client/client.go Outdated
oldVal := c.config.Node.Drivers[name]
if oldVal == nil && newVal != nil {
c.config.Node.Drivers[name] = newVal
} else if newVal != nil && newVal.Detected != oldVal.Detected {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the attributes change? This will skip merging

@@ -552,6 +551,48 @@ func (d *DockerDriver) Fingerprint(req *cstructs.FingerprintRequest, resp *cstru
return nil
}

// HealthChck implements the interface for the HealthCheck interface. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling


client, _, err := d.dockerClients()
if err != nil {
d.logger.Printf("[WARN] driver.docker: docker driver is available but is unresponsive to `docker ps`")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always log the error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you see this in logs how will you know which line caused the log? Add distinguishing information in the log lines so they are useful for diagnosing. An example:

d.logger.Printf("[WARN] driver.docker: failed to retrieve Docker client docker health check: %v", err)

case <-time.After(period):
err := fm.healthCheck(name, hc)
if err != nil {
fm.logger.Printf("[DEBUG] client.fingerprint_manager: health checking for %v failed: %+v", name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use %v for errors

logger *log.Logger
updateNodeAttributes func(*cstructs.FingerprintResponse) *structs.Node

// UpdateHealthCheck is a callback to the client to update the state of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower case

// away from annotating a node's attributes to demonstrate support for a
// particular driver. Takes the FingerprintResponse and converts it to the
// proper DriverInfo update and then sets the prefix attributes as well
func (fm *FingerprintManager) fingerprintDriver(name string, f fingerprint.Fingerprint) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only being called the first time. Periodic, health checking drivers should always call this. Where "this" is the merging logic into the driverinfo

go fm.run(d, period, name)
go fm.runFingerprint(d, period, name)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what this should do is not use runFingerprint or runHealthCheck but instead gather the fact that the Driver is health checking or periodic and pass all that to a new method that runs the drivers fingerprint/healthcheck periodically and then take the returned info and properly construct the DriverInfo object (merged health check and attributes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky as the fingerprint/healthcheck methods run on different intervals, so if we merge them in lockstep then we only update the node as often has the slowest interval. Thinking through ways to do this more cleanly.

// in a backwards compatible way, where node attributes for drivers will
// eventually be phased out.
di := &structs.DriverInfo{
Attributes: response.Attributes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripping:

DriverInfo.Attributes: map[string]string{
  "version": "1.12.3",
}

Node.Attributes: {
  "driver.docker.version": "1.12.3"
}

client/client.go Outdated
// update the node with the latest driver health information
for name, newVal := range response.Drivers {
if newVal == nil {
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about when newVal is nil but oldVal is not here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a safety precaution, every driver should have a corresponding driver info value for a node.

}
checker := NewDriverChecker(ctx, drivers)
if act := checker.Feasible(c.Node); act != c.Result {
t.Fatalf("case(%d) failed: got %v; want %v", i, act, c.Result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rmaybe use require rather than t.Fatalf here?

return false
}

if !enabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplify by changing this to return enabled

// If the Drivers field has not yet been initialized, it does so here.
func (h *HealthCheckResponse) AddDriverInfo(name string, driverInfo *structs.DriverInfo) {
// initialize Drivers if it has not been already
if h.Drivers == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make a constructor for HealthCheckResponse and create an empty map there? Doing a lazy init is a bit more efficient because it avoids creating the map till you actually need it, but it trades off some readability and I am not sure the small efficiency gains are worth it.

Copy link
Contributor

@preetapan preetapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAde some comments

@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch 5 times, most recently from bd262da to 244389c Compare March 15, 2018 22:20
client/client.go Outdated

if fingerprint != nil {
if c.config.Node.Drivers[name] == nil {
// if the driver info has not yet been set, do that here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client/client.go Outdated
events := []*structs.NodeEvent{
{
Subsystem: "Driver",
Message: fmt.Sprintf("Driver status changed: %s", health.HealthDescription),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Driver status changed:" can probably be removed.


_, err = client.ListContainers(docker.ListContainersOptions{All: false})
if err != nil {
d.logger.Printf("[WARN] driver.docker: failed to list Docker containers in the process of a docker health check: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize Docker

option.ID, driverStr, value)
return false
}
// COMPAT: Remove in 0.X: As of Nomad 0.8, nodes have a DriverInfo that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.X -> 0.10

logger *log.Logger
updateNodeAttributes func(*cstructs.FingerprintResponse) *structs.Node

updateNodeFromDriver func(string, *structs.DriverInfo, *structs.DriverInfo) *structs.Node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment

case <-t2.C:
if isHealthCheck {
t2.Reset(healthCheckPeriod)
err := fm.runHealthCheck(name, hc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := fm.runHealthCheck(name, hc); err != nil {

@@ -118,6 +234,21 @@ func (fm *FingerprintManager) fingerprint(name string, f fingerprint.Fingerprint
return response.Detected, nil
}

// healthcheck checks the health of the specified resource.
func (fm *FingerprintManager) runHealthCheck(name string, hc fingerprint.HealthCheck) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to runDriverHealthCheck and fix comment

@@ -2,6 +2,7 @@ package client

import (
"log"
"strings"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to arrange the methods such that drivers methods are colocated and fingerprint methods are too. I think it would be simpler to follow if it went:

  1. Struct
  2. NewFingerprintManager
  3. Run.
  4. SetupFingerprinter
  5. SetupDriver
    6-X: Fingerprinter stuff
    7-X: Driver stuff

fm.logger.Printf("[DEBUG] client.fingerprint_manager: periodic fingerprinting for driver %v failed: %+v", name, err)
}
case <-t2.C:
if isHealthCheck {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we shouldn't run the health check if detected isn't true. Otherwise you will log errors continuously, right?

I think you also skip the initial health check here and just rely on this loop: https://github.com/hashicorp/nomad/pull/3856/files#diff-8cf4b6aaa496508270af1cd14fdc1527R92

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this, but the second point would entail that all drivers that allow health checking will also be periodic. If that is always true, then we can do as you said and skip the initial check and only update on a periodic interval.


client, _, err := d.dockerClients()
if err != nil {
d.logger.Printf("[WARN] driver.docker: failed to retrieve Docker client in the process of a docker health check: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the reason I feel like we skip the health check unless it is detected, otherwise you will mark it as unhealthy and log even if docker just isn't there.

@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch 4 times, most recently from 7ee98af to 2adea66 Compare March 21, 2018 17:14
@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch from 2adea66 to e17343d Compare March 21, 2018 17:32
@chelseakomlo chelseakomlo force-pushed the f-client-add-health-checks branch from e17343d to 89ffc96 Compare March 21, 2018 19:57
@chelseakomlo chelseakomlo merged commit 48c5f0e into master Mar 21, 2018
@chelseakomlo chelseakomlo deleted the f-client-add-health-checks branch March 21, 2018 22:05
@github-actions
Copy link

github-actions bot commented Mar 9, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants