
etcd (un)availability #373

Closed
3 of 4 tasks
bcwaldon opened this issue May 1, 2014 · 6 comments

Comments

@bcwaldon
Contributor

bcwaldon commented May 1, 2014

fleet does not handle etcd unavailability correctly. This can cause the cluster to act in odd ways, even unscheduling all jobs in the cluster.

  • build a mechanism that determines when etcd is "unavailable" (see the sketch after this list)
  • engines and agents must halt operation once etcd is considered "unavailable"
  • engine must attempt to rectify the cluster based on the current state of etcd before subscribing to events
  • agent must react to its own state expiration/deletion by leaving the cluster
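
For the first two items, a minimal Go sketch of what the "unavailable" determination could look like (names here are made up for illustration, not fleet's actual code): poll etcd with a cheap check and flip state after a run of consecutive failures.

```go
package monitor

import (
	"log"
	"time"
)

// registryCheck is a stand-in for whatever performs a cheap etcd request
// (for example, a GET of a well-known key). Any error means "etcd unreachable".
type registryCheck func() error

// monitorRegistry reports availability transitions on the returned channel:
// false after `threshold` consecutive failed checks, true again on the first
// successful check after that. Hypothetical sketch, not fleet's real monitor.
func monitorRegistry(check registryCheck, interval time.Duration, threshold int) <-chan bool {
	transitions := make(chan bool, 1)
	go func() {
		failures := 0
		for range time.Tick(interval) {
			if err := check(); err != nil {
				failures++
				if failures == threshold {
					log.Printf("etcd considered unavailable after %d failed checks: %v", failures, err)
					transitions <- false
				}
				continue
			}
			if failures >= threshold {
				log.Print("etcd reachable again; safe to resume")
				transitions <- true
			}
			failures = 0
		}
	}()
	return transitions
}
```

An Agent or Engine would then select on that channel, pausing its event/reconcile loop while etcd is considered unavailable and, per the third item, re-reconciling against the current state of etcd once availability returns.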
@bcwaldon bcwaldon added this to the vNext milestone May 1, 2014
@bcwaldon
Contributor Author

bcwaldon commented May 1, 2014

/cc @jonboulle @unihorn

@jonboulle
Contributor

For posterity - I propose that every Job have an "SLA": when contemplating leaving the cluster, an Agent will consider a job terminable only once its SLA has expired; this allows a (configurable) recovery window for transient network hiccups.

For example, if a Job has an SLA of 10 minutes and an Agent is partitioned from the Registry, the Agent would wait 10 minutes for connectivity to the Registry to recover before terminating the job. Similarly, on the other side, an Engine would probably wait up to the SLA before attempting to reschedule the Job (the exact semantics here would depend on how agent/job health heartbeats work).

Agents and Engines would have a default SLA applied to all Jobs; any SLA configured in the Jobs themselves (e.g. XJobSLA) would override this.
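
A rough Go sketch of the agent-side half of this (the Job type, awaitRecovery, and the X-JobSLA option name are all illustrative assumptions, not existing fleet APIs):

```go
package agent

import "time"

// Job is a minimal stand-in; fleet's real Job type carries much more.
type Job struct {
	Name string
	SLA  time.Duration // per-job recovery window, zero if the unit did not set one
}

// effectiveSLA falls back to the agent/engine-wide default when the unit
// did not set its own SLA (e.g. via a hypothetical X-JobSLA option).
func effectiveSLA(j Job, defaultSLA time.Duration) time.Duration {
	if j.SLA > 0 {
		return j.SLA
	}
	return defaultSLA
}

// awaitRecovery blocks until either registry connectivity returns (recovered
// is signalled) or the job's SLA expires, and reports whether the Agent may
// now terminate the job.
func awaitRecovery(j Job, defaultSLA time.Duration, recovered <-chan struct{}) bool {
	timer := time.NewTimer(effectiveSLA(j, defaultSLA))
	defer timer.Stop()
	select {
	case <-recovered:
		return false // registry came back within the SLA; keep the job running
	case <-timer.C:
		return true // SLA expired without recovery; the job is now terminable
	}
}
```

The Engine would mirror the same window before rescheduling: with the 10-minute example above, it would treat the Job as lost only once the Agent's heartbeat had been stale for 10 minutes.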

@zxvdr

zxvdr commented May 1, 2014

+1 for a config/SLA-based approach. Some jobs have a long warm-up time (think memcache), where it's preferable to wait a certain amount of time for network connectivity to be restored instead of immediately rescheduling replacement jobs. An SLA would allow network maintenance to occur (which disconnects Agents for a short period of time - within the SLA) without adversely impacting those jobs by killing them and starting replacements. This is essentially saying "this job can disappear for X minutes, then start a replacement".

I don't think the Agent should terminate jobs when leaving the cluster (when etcd is unavailable), even for an extended period of time. The action an Agent takes upon rejoining the cluster should be configurable on a per-job basis (defined in the Job's config/SLA). Running jobs could either be killed if a replacement has been started elsewhere, or could continue running and the replacement killed if it exists. This is essentially saying "when jobs disappear start replacements and kill any old jobs that reappear" or "when jobs disappear start replacements but kill them if the old jobs reappear".
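
A per-job rejoin policy could capture those two behaviours; everything named below (RejoinPolicy, onRejoin) is made up for illustration and is not something fleet defines:

```go
package agent

// RejoinPolicy says what an Agent should do with a still-running local job
// when it rejoins the cluster and finds a replacement was scheduled elsewhere.
type RejoinPolicy int

const (
	// KillLocal: prefer the replacement; stop the reappearing local job.
	KillLocal RejoinPolicy = iota
	// KillReplacement: prefer the original; ask the Engine to stop the replacement.
	KillReplacement
)

// onRejoin decides which copy survives. replacementExists would come from
// inspecting current etcd state after reconnecting. Hypothetical sketch.
func onRejoin(policy RejoinPolicy, replacementExists bool) (stopLocal, stopReplacement bool) {
	if !replacementExists {
		return false, false // nothing to reconcile; keep running the local job
	}
	switch policy {
	case KillReplacement:
		return false, true
	default: // KillLocal
		return true, false
	}
}
```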

@bcwaldon
Contributor Author

bcwaldon commented May 1, 2014

This was the 80% solution that lets us kill the etcd leader in a cluster again: #377

@bcwaldon bcwaldon removed this from the vNext milestone May 5, 2014
@bcwaldon
Contributor Author

bcwaldon commented Jul 8, 2014

Another big step: #611

@jonboulle
Contributor

superseded by #708
