
etcd (un)availability #373

Closed
3 of 4 tasks
bcwaldon opened this issue May 1, 2014 · 6 comments

Comments

@bcwaldon
Contributor

bcwaldon commented May 1, 2014

fleet does not handle etcd unavailability correctly. This can cause the cluster to act in odd ways, even unscheduling all jobs in the cluster.

  • build a mechanism that determines when etcd is "unavailable" (see the sketch after this list)
  • engines and agents must halt operation once etcd is considered "unavailable"
  • engine must attempt to rectify the cluster based on the current state of etcd before subscribing to events
  • agent must react to its own state expiration/deletion by leaving the cluster
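
For the first two items, a minimal Go sketch of what the "unavailable" determination could look like (names here are made up for illustration, not fleet's actual code): poll etcd with a cheap check and flip state after a run of consecutive failures.

```go
package monitor

import (
	"log"
	"time"
)

// registryCheck is a stand-in for whatever performs a cheap etcd request
// (for example, a GET of a well-known key). Any error means "etcd unreachable".
type registryCheck func() error

// monitorRegistry reports availability transitions on the returned channel:
// false after `threshold` consecutive failed checks, true again on the first
// successful check after that. Hypothetical sketch, not fleet's real monitor.
func monitorRegistry(check registryCheck, interval time.Duration, threshold int) <-chan bool {
	transitions := make(chan bool, 1)
	go func() {
		failures := 0
		for range time.Tick(interval) {
			if err := check(); err != nil {
				failures++
				if failures == threshold {
					log.Printf("etcd considered unavailable after %d failed checks: %v", failures, err)
					transitions <- false
				}
				continue
			}
			if failures >= threshold {
				log.Print("etcd reachable again; safe to resume")
				transitions <- true
			}
			failures = 0
		}
	}()
	return transitions
}
```

An Agent or Engine would then select on that channel, pausing its event/reconcile loop while etcd is considered unavailable and, per the third item, re-reconciling against the current state of etcd once availability returns.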
@bcwaldon bcwaldon added this to the vNext milestone May 1, 2014
@bcwaldon
Contributor Author

bcwaldon commented May 1, 2014

/cc @jonboulle @unihorn

@jonboulle
Contributor

For posterity - I propose that every Job have an "SLA": when contemplating leaving the cluster, an Agent will consider a job terminable only once its SLA has expired; this allows a (configurable) recovery window for transient network hiccups.

For example, if a Job has an SLA of 10 minutes and an Agent is partitioned from the Registry, the Agent would wait 10 minutes for connectivity to the Registry to recover before terminating the job. Similarly, on the other side, an Engine would probably wait up to the SLA before attempting to reschedule the Job (the exact semantics here would depend on how agent/job health heartbeats work).

Agents and Engines would have a default SLA applied to all Jobs; any SLA configured in the Jobs themselves (e.g. XJobSLA) would override this.
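
A rough Go sketch of the agent-side half of this (the Job type, awaitRecovery, and the X-JobSLA option name are all illustrative assumptions, not existing fleet APIs):

```go
package agent

import "time"

// Job is a minimal stand-in; fleet's real Job type carries much more.
type Job struct {
	Name string
	SLA  time.Duration // per-job recovery window, zero if the unit did not set one
}

// effectiveSLA falls back to the agent/engine-wide default when the unit
// did not set its own SLA (e.g. via a hypothetical X-JobSLA option).
func effectiveSLA(j Job, defaultSLA time.Duration) time.Duration {
	if j.SLA > 0 {
		return j.SLA
	}
	return defaultSLA
}

// awaitRecovery blocks until either registry connectivity returns (recovered
// is signalled) or the job's SLA expires, and reports whether the Agent may
// now terminate the job.
func awaitRecovery(j Job, defaultSLA time.Duration, recovered <-chan struct{}) bool {
	timer := time.NewTimer(effectiveSLA(j, defaultSLA))
	defer timer.Stop()
	select {
	case <-recovered:
		return false // registry came back within the SLA; keep the job running
	case <-timer.C:
		return true // SLA expired without recovery; the job is now terminable
	}
}
```

The Engine would mirror the same window before rescheduling: with the 10-minute example above, it would treat the Job as lost only once the Agent's heartbeat had been stale for 10 minutes.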

@zxvdr

zxvdr commented May 1, 2014

+1 for a config/SLA-based approach. Some jobs have a long warm-up time (think memcache), where it's preferable to wait a certain amount of time for network connectivity to be restored instead of immediately rescheduling replacement jobs. An SLA would allow network maintenance to occur (which disconnects Agents for a short period of time - within the SLA) without adversely impacting those jobs by killing them and starting replacements. This is essentially saying "this job can disappear for X minutes, then start a replacement".

I don't think the Agent should terminate jobs when leaving the cluster (when etcd is unavailable), even for an extended period of time. The action an Agent takes upon rejoining the cluster should be configurable on a per-job basis (defined in the Job's config/SLA). Running jobs could either be killed if a replacement has been started elsewhere, or could continue running and the replacement killed if it exists. This is essentially saying "when jobs disappear start replacements and kill any old jobs that reappear" or "when jobs disappear start replacements but kill them if the old jobs reappear".
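
A per-job rejoin policy could capture those two behaviours; everything named below (RejoinPolicy, onRejoin) is made up for illustration and is not something fleet defines:

```go
package agent

// RejoinPolicy says what an Agent should do with a still-running local job
// when it rejoins the cluster and finds a replacement was scheduled elsewhere.
type RejoinPolicy int

const (
	// KillLocal: prefer the replacement; stop the reappearing local job.
	KillLocal RejoinPolicy = iota
	// KillReplacement: prefer the original; ask the Engine to stop the replacement.
	KillReplacement
)

// onRejoin decides which copy survives. replacementExists would come from
// inspecting current etcd state after reconnecting. Hypothetical sketch.
func onRejoin(policy RejoinPolicy, replacementExists bool) (stopLocal, stopReplacement bool) {
	if !replacementExists {
		return false, false // nothing to reconcile; keep running the local job
	}
	switch policy {
	case KillReplacement:
		return false, true
	default: // KillLocal
		return true, false
	}
}
```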

@bcwaldon
Contributor Author

bcwaldon commented May 1, 2014

This was the 80% solution that lets us kill the etcd leader in a cluster again: #377

@bcwaldon bcwaldon removed this from the vNext milestone May 5, 2014
@bcwaldon
Contributor Author

bcwaldon commented Jul 8, 2014

Another big step: #611

@jonboulle
Contributor

superseded by #708
