
Simulation of Failures #65

Closed
henricasanova opened this issue Aug 15, 2018 · 8 comments
@henricasanova
Contributor

We should support failure traces to simulate host failures.
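
SimGrid's s4u interface already lets a simulated process turn hosts off and back on, which is one way such a failure trace could be replayed. The sketch below only illustrates that idea and is not the eventual WRENCH mechanism; the host names, event times, and the platform file passed on the command line are all assumptions.

```cpp
// Minimal sketch: an auxiliary SimGrid actor replays a (made-up) failure trace
// by turning a host off and back on at the recorded times.
#include <simgrid/s4u.hpp>
#include <utility>
#include <vector>

static void failure_trace_replayer() {
  // Hypothetical trace for one host: (time in seconds, is_up)
  std::vector<std::pair<double, bool>> trace = {{100.0, false}, {250.0, true}};
  auto *host = simgrid::s4u::Host::by_name("ComputeHost1");  // assumed to exist in the platform
  for (const auto &[time, up] : trace) {
    simgrid::s4u::this_actor::sleep_until(time);
    if (up)
      host->turn_on();
    else
      host->turn_off();
  }
}

int main(int argc, char **argv) {
  simgrid::s4u::Engine engine(&argc, argv);
  engine.load_platform(argv[1]);  // platform XML defining ComputeHost1 and ControllerHost (assumed)
  simgrid::s4u::Actor::create("failure_replayer", simgrid::s4u::Host::by_name("ControllerHost"),
                              failure_trace_replayer);
  engine.run();
  return 0;
}
```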

@henricasanova
Contributor Author

henricasanova commented Sep 15, 2018

This issue is being explored using the very latest, constantly updated SimGrid (a git clone of the master branch). WRENCH development is thus being done in the henri_failures branch.

  • Implement a "service failure detector" service, so that a service can detect crashes of its sub-services (see the sketch after this list)
  • Implement a "host state change detector", so that a service can be made aware that some of its hosts have become (re)usable
  • Augment Services so that they can be started with "auto restart"
  • Augment Services so that the default behavior on a host failure is to say "I am not fault-tolerant"
  • Augment the BareMetalComputeService so that it is resilient to failures of its compute nodes
    • Implement the needed features
    • Wait for SimGrid Issue 325 to be resolved
    • Implement tests
  • Augment the StandardJobExecutor so that it is resilient to failures of its compute nodes
    • Implement the needed features
    • Implement tests
  • Augment the NetworkProximityService so that it is resilient to failures of its daemon hosts
    • Implement the needed features
    • Implement tests
  • Augment the Cloud/VirtualizedCluster service so that it is resilient to failures of its compute nodes
    • Modify the API to expose BareMetalServers
    • Implement tests
  • Augment the StorageService so that it is resilient to failures
    • Implement tests
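
As a rough illustration of the "service failure detector" bullet above: the pattern is a small monitor that watches a sub-service and notifies its creator once the sub-service is observed to have crashed. The sketch below uses plain C++ threads and hypothetical names purely as an analogy; the actual WRENCH detector is a simulated service exchanging simulation messages, so this is not the real implementation.

```cpp
// Analogy only (hypothetical names, real threads instead of simulated actors):
// poll a sub-service's liveness and fire a callback exactly once when it dies.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

class ServiceFailureDetectorSketch {
public:
  ServiceFailureDetectorSketch(std::function<bool()> is_alive,
                               std::function<void()> on_failure,
                               std::chrono::milliseconds period)
      : is_alive_(std::move(is_alive)), on_failure_(std::move(on_failure)), period_(period) {}

  void start() {
    watcher_ = std::thread([this] {
      while (!stop_requested_) {
        if (!is_alive_()) {   // sub-service has crashed
          on_failure_();      // notify the creator exactly once
          return;
        }
        std::this_thread::sleep_for(period_);
      }
    });
  }

  void stop() {
    stop_requested_ = true;
    if (watcher_.joinable()) watcher_.join();
  }

  ~ServiceFailureDetectorSketch() { stop(); }

private:
  std::function<bool()> is_alive_;
  std::function<void()> on_failure_;
  std::chrono::milliseconds period_;
  std::atomic<bool> stop_requested_{false};
  std::thread watcher_;
};
```

A creator would register an on_failure callback that, for example, resubmits affected work or simply records that the sub-service's host is down; the "host state change detector" bullet is the complementary notification for hosts coming back up.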

@henricasanova reopened this on Sep 15, 2018
@henricasanova changed the title from "Failure Traces" to "Simulation of Failures" on Feb 19, 2019
henricasanova added a commit that referenced this issue on Feb 19, 2019:
Added a ServiceFailureDetector helper, along with tests
Some code clean-up here and there
@rafaelfsilva added this to the 1.4 milestone on Feb 21, 2019
@henricasanova
Contributor Author

We need to discuss the failure semantics of the CloudService:

The above requires that the cloud service be re-engineered to keep track of jobs (as opposed to just submitting jobs to BareMetalServices and giving up responsibility for them).

@rafaelfsilva
Member

Here are my thoughts on the options:

  • Option 1: This could be a nice feature to provide; however, I would not enable it by default. To better support this option, we could leverage the VM migration functionality.
  • Option 2: I agree this should be the default behavior; however, the hold status should be accessible to the submitter (in case the user wants to cancel the job and perform another action). Also, a timeout for the hold status should be provided.
  • Option 3: I think the job would only fail if hold is not allowed, or if the timeout == 0.

@henricasanova
Contributor Author

Here is an even more basic question: A user creates two 1-core VMs on a CloudService that only has two 1-core physical hosts. One host is turned off. What happens? Do we implement a mechanism for the user to learn that one of the two VMs is down (as it may never come back)? After all, a job could have a service-specific argument requesting that it run on that particular VM, which is now down. Or perhaps we provide nothing, but if the host comes back on, then that VM is restarted?

Also, I almost think that a VM shouldn't run a BareMetalService, but instead run a StandardJobExecutor for each incoming job... A lot to discuss in a conference call, I think.

@henricasanova
Contributor Author

I have done a re-design/re-implementation of the CloudService based on last week's conference call. An issue has arisen for the HTCondor implementation. In the HTCondor Central Manager Service, the current procedure is to create a bunch of 1-core VMs, each of them requiring ComputeService::ALL_RAM. This creates a bunch of 1-core BareMetalComputeServices running on VMs.

Given the current redesign (in which the submitter has access to BareMetalServices running on VMs and the CloudService no longer accepts jobs), there are a few problems with this approach:

  • The above creates, on the same physical host, multiple VMs that each request ComputeService::ALL_RAM, which exceeds the host's capacity. (Besides, allowing things like ComputeService::ALL_CORES/ALL_RAM on a Cloud service in which physical resources are hidden is odd, because the hosts can be heterogeneous.)
  • With a CloudService the WMS does not pick physical hosts. Therefore, when I say "create a VM with x cores and y bytes", it is not clear where it will be started. If the hosts are heterogeneous, this is not great, as one has to ask for the minimum RAM (i.e., min(RAM/#cores) over all resources; see the sketch below).
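
To make the heterogeneity point concrete, here is a tiny sketch of the "safe" per-core RAM request mentioned in the last bullet, using a made-up list of physical hosts; a 1-core VM requesting at most this much RAM can be placed on any of them.

```cpp
// Sketch of the min(RAM / #cores) computation over a hypothetical set of hosts.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <utility>
#include <vector>

int main() {
  // Hypothetical heterogeneous hosts: {num_cores, ram_bytes}
  std::vector<std::pair<unsigned long, double>> hosts = {{8, 32.0e9}, {16, 48.0e9}, {4, 8.0e9}};

  double safe_ram_per_core = std::numeric_limits<double>::max();
  for (const auto &[cores, ram] : hosts)
    safe_ram_per_core = std::min(safe_ram_per_core, ram / static_cast<double>(cores));

  std::printf("safe RAM request for a 1-core VM: %.2e bytes\n", safe_ram_per_core);
  return 0;
}
```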

Here are two easy options:

  1. Require that physical resources managed by a CloudService be homogeneous (just like a batch service).
  2. Require that an HTCondor service only use a VirtualizedCluster service (because then HTCondor can start VMs on specific hosts), and throw an error if it is given a CloudService.

I think I prefer option 2, and it's not a big deal for the user at all (just replace the word "Cloud" with "VirtualizedCluster").

Question: Why have HTCondor create a bunch of 1-core VMs? What about one multi-core VM per host (which would then allow running multi-threaded tasks)? If this is OK, then the above issues are not a problem: one just creates one VM for each host, done!
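
For what it is worth, below is a sketch of the "one multi-core VM per host" idea. The class and method names are hypothetical stand-ins rather than the actual WRENCH API; the point is only that sizing each VM to its physical host sidesteps the ALL_RAM and heterogeneity issues above.

```cpp
// Illustrative only (hypothetical API, not WRENCH): create one VM per physical
// host, sized to that host, so multi-threaded tasks can use all of its cores.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct HostSpec { unsigned long num_cores; double ram_bytes; };
struct BareMetalHandle {};  // stands in for the BareMetal-like service exposed by a started VM

class VirtualizedClusterLike {
public:
  // Hypothetical call: create and start a VM of the given size on a given physical host
  std::shared_ptr<BareMetalHandle> createAndStartVM(const std::string & /*physical_host*/,
                                                    unsigned long /*num_cores*/,
                                                    double /*ram_bytes*/) {
    return std::make_shared<BareMetalHandle>();
  }
};

int main() {
  // Hypothetical physical hosts managed by the service
  std::map<std::string, HostSpec> hosts = {{"node0", {8, 32.0e9}}, {"node1", {16, 64.0e9}}};
  VirtualizedClusterLike service;

  // One VM per host, sized to the whole host
  std::vector<std::shared_ptr<BareMetalHandle>> per_host_services;
  for (const auto &[name, spec] : hosts)
    per_host_services.push_back(service.createAndStartVM(name, spec.num_cores, spec.ram_bytes));

  // Jobs (including multi-threaded tasks) would then be submitted to each per-host service.
  return 0;
}
```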

@rafaelfsilva
Member

Option 2 is definitely the right option. Option 1 would break the heterogeneity support provided by the cloud. Also, I think it's fine to create a single multi-core VM per host in this case.

@henricasanova
Contributor Author

Thanks! I just implemented Option #2 (in a branch).

@henricasanova
Contributor Author

henricasanova commented Apr 19, 2019

At this point, I believe this issue is ready to be closed. More tests can of course be added, but that's true in general; the fault-tolerance tests are pretty good as is. The henri_failures branch has been merged into master.
