Consider a way to automatically re-enable black hole hosts #58
Comments
@cmilloy You are well situated to think of how this could work in a way that would be most helpful. Let's use this comment stream as a sounding board for proposals/ideas. @wkf I know you suggested that blackhole exile could simply be temporary; after a configurable period of time, we start sending a black hole host tasks again as a trial run.
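For illustration only, here is a minimal sketch of the temporary-exile idea described above. It is not Satellite code: the class name `TemporaryExile`, the `retry_after_s` setting, and the rule that a failed trial restarts the timer are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TemporaryExile:
    """Tracks black hole hosts and releases them for a trial run after a delay.

    Hypothetical names throughout; this is not Satellite's API.
    """
    retry_after_s: float                         # configurable exile period (assumed setting)
    clock: Callable[[], float] = time.monotonic  # injectable so the logic can be tested
    _exiled_at: Dict[str, float] = field(default_factory=dict)

    def exile(self, host: str) -> None:
        # Record (or refresh) the time the host was flagged as a black hole.
        self._exiled_at[host] = self.clock()

    def accepts_tasks(self, host: str) -> bool:
        # Hosts that were never exiled are always eligible.
        exiled_at = self._exiled_at.get(host)
        if exiled_at is None:
            return True
        # Once the exile period has elapsed, the host may get tasks again as a trial.
        return self.clock() - exiled_at >= self.retry_after_s

    def report_trial(self, host: str, succeeded: bool) -> None:
        # Assumption: a successful trial clears the exile, a failed one restarts the timer.
        if succeeded:
            self._exiled_at.pop(host, None)
        else:
            self.exile(host)
```

A scheduler would call `accepts_tasks(host)` before placing a task and `report_trial(host, succeeded)` once the trial task finishes. Whether a single successful task is enough to lift the exile is a policy question this sketch does not settle.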
I'm not sold on this. I think if a host is blackholed, it should be [...]
On Thu, Feb 25, 2016 at 2:26 PM Matthew Forsyth
After socializing this internally we have come up with a few ideas for implementation to start:
@tnn1t1s We can certainly discuss further. This primarily came from the prediction that some issues which cause Mesos task failures will not originate inside the cluster (such as those caused by infrastructure). They will occur and be fixed independently of the support team(s) operating the cluster. The concern is that, without automatic re-whitelisting, such outages will cause unnecessary work and create a dependency on the support team(s) operating the cluster to resume service to users. I think it makes sense for Satellite to have a facility for automatic re-whitelisting which is configurable enough to apply to multiple use cases. If we decide not to use it, that's OK too.
@corey - I think black hole host detection only occurs on 'task-lost', [...] That said, I'd like to keep this very simple and not overload Satellite [...] As an alternative, I can imagine a configurable callback that is triggered [...] Maybe a process outside of Satellite can monitor the blacklist and try [...]
On Fri, Feb 26, 2016 at 4:40 PM, cmilloy wrote:
@tnn1t1s The black hole detector does actually care about failed tasks (not lost tasks). I really like the idea of, as a first step, just having the black hole detector alert admins, rather than removing the host from the whitelist. That has some advantages:
This seems like the way to go. We can alert and collect the data and build [...]
On Sat, Feb 27, 2016 at 6:54 AM Matthew Forsyth
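As a rough sketch of the alert-first approach discussed in the comments above: the detector counts failed tasks per host and notifies admins without touching the whitelist. The class name `AlertOnlyDetector`, the fixed `failure_threshold`, and the `alert` callback are assumptions made for the example. Satellite's actual detector may use a different criterion, such as a failure rate.

```python
from collections import Counter
from typing import Callable


class AlertOnlyDetector:
    """Counts failed tasks per host and alerts instead of editing the whitelist.

    Hypothetical names and threshold; not Satellite's actual detector.
    """

    def __init__(self, failure_threshold: int, alert: Callable[[str, int], None]):
        self.failure_threshold = failure_threshold  # assumed, simplistic criterion
        self.alert = alert                          # e.g. sends an email or page to admins
        self.failures: Counter = Counter()          # failed-task count per host

    def record_failure(self, host: str) -> None:
        # Count failed tasks (not lost tasks) against the host that ran them.
        self.failures[host] += 1
        # Alert once, at the moment the host first reaches the threshold.
        if self.failures[host] == self.failure_threshold:
            self.alert(host, self.failures[host])
```

Because the host stays on the whitelist, a false positive only costs an alert, and the collected counts can later inform any automatic removal or re-enable policy.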
It would be nice if, somehow, manual intervention weren't always required to bring a host back out of its black hole status and get it re-added to the whitelist.
Need to think about a specific strategy for this.