
Simulation of Failures #65

Closed
henricasanova opened this issue Aug 15, 2018 · 8 comments
@henricasanova
Contributor

We should support failure traces to simulate host failures.
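
SimGrid's s4u interface already lets a simulated process turn hosts off and back on, which is one way such a failure trace could be replayed. The sketch below only illustrates that idea and is not the eventual WRENCH mechanism; the host names, event times, and the platform file passed on the command line are all assumptions.

```cpp
// Minimal sketch: an auxiliary SimGrid actor replays a (made-up) failure trace
// by turning a host off and back on at the recorded times.
#include <simgrid/s4u.hpp>
#include <utility>
#include <vector>

static void failure_trace_replayer() {
  // Hypothetical trace for one host: (time in seconds, is_up)
  std::vector<std::pair<double, bool>> trace = {{100.0, false}, {250.0, true}};
  auto *host = simgrid::s4u::Host::by_name("ComputeHost1");  // assumed to exist in the platform
  for (const auto &[time, up] : trace) {
    simgrid::s4u::this_actor::sleep_until(time);
    if (up)
      host->turn_on();
    else
      host->turn_off();
  }
}

int main(int argc, char **argv) {
  simgrid::s4u::Engine engine(&argc, argv);
  engine.load_platform(argv[1]);  // platform XML defining ComputeHost1 and ControllerHost (assumed)
  simgrid::s4u::Actor::create("failure_replayer", simgrid::s4u::Host::by_name("ControllerHost"),
                              failure_trace_replayer);
  engine.run();
  return 0;
}
```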

@henricasanova
Contributor Author

henricasanova commented Sep 15, 2018

This issue is being explored using the very latest, constantly updated SimGrid (a git clone of the master branch). WRENCH development is thus being done in the henri_failures branch.

  • Implement a "service failure detector" service, so that a service can detect crashes of its sub-services (see the sketch after this list)
  • Implement a "host state change detector", so that a service can be made aware that some of its hosts have become (re)usable
  • Augment Services so that they can be started with "auto restart"
  • Augment Services so that the default behavior on a host failure is to say "I am not fault-tolerant"
  • Augment the BareMetalComputeService so that it is resilient to failures of its compute nodes
    • Implement the needed features
    • Wait for SimGrid Issue 325 to be resolved
    • Implement tests
  • Augment the StandardJobExecutor so that it is resilient to failures of its compute nodes
    • Implement the needed features
    • Implement tests
  • Augment the NetworkProximityService so that it is resilient to failures of its daemon hosts
    • Implement the needed features
    • Implement tests
  • Augment the Cloud/VirtualizedCluster service so that it is resilient to failures of its compute nodes
    • Modify the API to expose BareMetalServers
    • Implement tests
  • Augment the StorageService so that it is resilient to failures
    • Implement tests
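
As a rough illustration of the "service failure detector" bullet above: the pattern is a small monitor that watches a sub-service and notifies its creator once the sub-service is observed to have crashed. The sketch below uses plain C++ threads and hypothetical names purely as an analogy; the actual WRENCH detector is a simulated service exchanging simulation messages, so this is not the real implementation.

```cpp
// Analogy only (hypothetical names, real threads instead of simulated actors):
// poll a sub-service's liveness and fire a callback exactly once when it dies.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

class ServiceFailureDetectorSketch {
public:
  ServiceFailureDetectorSketch(std::function<bool()> is_alive,
                               std::function<void()> on_failure,
                               std::chrono::milliseconds period)
      : is_alive_(std::move(is_alive)), on_failure_(std::move(on_failure)), period_(period) {}

  void start() {
    watcher_ = std::thread([this] {
      while (!stop_requested_) {
        if (!is_alive_()) {   // sub-service has crashed
          on_failure_();      // notify the creator exactly once
          return;
        }
        std::this_thread::sleep_for(period_);
      }
    });
  }

  void stop() {
    stop_requested_ = true;
    if (watcher_.joinable()) watcher_.join();
  }

  ~ServiceFailureDetectorSketch() { stop(); }

private:
  std::function<bool()> is_alive_;
  std::function<void()> on_failure_;
  std::chrono::milliseconds period_;
  std::atomic<bool> stop_requested_{false};
  std::thread watcher_;
};
```

A creator would register an on_failure callback that, for example, resubmits affected work or simply records that the sub-service's host is down; the "host state change detector" bullet is the complementary notification for hosts coming back up.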

@henricasanova reopened this on Sep 15, 2018
@henricasanova changed the title from "Failure Traces" to "Simulation of Failures" on Feb 19, 2019
henricasanova added a commit that referenced this issue on Feb 19, 2019:
Added a ServiceFailureDetector helper, along with tests
Some code clean-up here and there
@rafaelfsilva added this to the 1.4 milestone on Feb 21, 2019
@henricasanova
Contributor Author

We need to discuss the failure semantics of the CloudService:

The above requires that the cloud service be re-engineered to keep track of jobs (as opposed to just submitting jobs to BareMetalServices and giving up responsibility for them).

@rafaelfsilva
Member

Here are my thoughts on the options:

  • Option 1: This could be a nice feature to provide; however, I would not enable it by default. To better support this option, we could leverage the VM migration functionality.
  • Option 2: I agree this should be the default behavior; however, the hold status should be accessible to the submitter (in case the user wants to cancel the job and perform another action). Also, a timeout for the hold status should be provided.
  • Option 3: I think the job would only fail if hold is not allowed, or if the timeout == 0.

@henricasanova
Contributor Author

Here is an even more basic question: A user creates two 1-core VMs on a CloudService that only has two 1-core physical hosts. One host is turned off. What happens? Do we implement a mechanism for the user to learn that one of the two VMs is down (as it may never come back)? After all, a job could have a service-specific argument requesting that it run on that particular VM, which is now down. Or perhaps we provide nothing, but if the host comes back on, then that VM is restarted?

Also, I almost think that a VM shouldn't run a BareMetalService, but instead run a StandardJobExecutor for each incoming job... A lot to discuss in a conference call, I think.

@henricasanova
Contributor Author

I have done a re-design/re-implementation of the CloudService based on last week's conference call. An issue has arisen for the HTCondor implementation. In the HTCondor Central Manager Service, the current procedure is to create a bunch of 1-core VMs, each of them requiring ComputeService::ALL_RAM. This creates a bunch of 1-core BareMetalComputeServices running on VMs.

Given the current redesign (in which the submitter has access to BareMetalServices running on VMs and the CloudService no longer accepts jobs), there are a few problems with this approach:

  • The above creates, on the same physical host, multiple VMs that each request ComputeService::ALL_RAM, which exceeds the host's capacity. (Besides, allowing things like ComputeService::ALL_CORES/ALL_RAM on a Cloud service in which physical resources are hidden is odd, because the hosts can be heterogeneous.)
  • With a CloudService the WMS does not pick physical hosts. Therefore, when I say "create a VM with x cores and y bytes", it is not clear where it will be started. If the hosts are heterogeneous, this is not great, as one has to ask for the minimum RAM (i.e., min(RAM/#cores) over all resources; see the sketch below).
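
To make the heterogeneity point concrete, here is a tiny sketch of the "safe" per-core RAM request mentioned in the last bullet, using a made-up list of physical hosts; a 1-core VM requesting at most this much RAM can be placed on any of them.

```cpp
// Sketch of the min(RAM / #cores) computation over a hypothetical set of hosts.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <utility>
#include <vector>

int main() {
  // Hypothetical heterogeneous hosts: {num_cores, ram_bytes}
  std::vector<std::pair<unsigned long, double>> hosts = {{8, 32.0e9}, {16, 48.0e9}, {4, 8.0e9}};

  double safe_ram_per_core = std::numeric_limits<double>::max();
  for (const auto &[cores, ram] : hosts)
    safe_ram_per_core = std::min(safe_ram_per_core, ram / static_cast<double>(cores));

  std::printf("safe RAM request for a 1-core VM: %.2e bytes\n", safe_ram_per_core);
  return 0;
}
```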

Here are two easy options:

  1. Require that physical resources managed by a CloudService be homogeneous (just like a batch service).
  2. Require that an HTCondor service only use a VirtualizedCluster service (because then HTCondor can start VMs on specific hosts), and throw an error if it is given a CloudService.

I think I prefer option 2, and it's not a big deal for the user at all (just replace the word "Cloud" with "VirtualizedCluster").

Question: Why have HTCondor create a bunch of 1-core VMs? What about one multi-core VM per host (which would then allow running multi-threaded tasks)? If this is OK, then the above issues are not a problem: one just creates one VM for each host, done!
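
For what it is worth, below is a sketch of the "one multi-core VM per host" idea. The class and method names are hypothetical stand-ins rather than the actual WRENCH API; the point is only that sizing each VM to its physical host sidesteps the ALL_RAM and heterogeneity issues above.

```cpp
// Illustrative only (hypothetical API, not WRENCH): create one VM per physical
// host, sized to that host, so multi-threaded tasks can use all of its cores.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct HostSpec { unsigned long num_cores; double ram_bytes; };
struct BareMetalHandle {};  // stands in for the BareMetal-like service exposed by a started VM

class VirtualizedClusterLike {
public:
  // Hypothetical call: create and start a VM of the given size on a given physical host
  std::shared_ptr<BareMetalHandle> createAndStartVM(const std::string & /*physical_host*/,
                                                    unsigned long /*num_cores*/,
                                                    double /*ram_bytes*/) {
    return std::make_shared<BareMetalHandle>();
  }
};

int main() {
  // Hypothetical physical hosts managed by the service
  std::map<std::string, HostSpec> hosts = {{"node0", {8, 32.0e9}}, {"node1", {16, 64.0e9}}};
  VirtualizedClusterLike service;

  // One VM per host, sized to the whole host
  std::vector<std::shared_ptr<BareMetalHandle>> per_host_services;
  for (const auto &[name, spec] : hosts)
    per_host_services.push_back(service.createAndStartVM(name, spec.num_cores, spec.ram_bytes));

  // Jobs (including multi-threaded tasks) would then be submitted to each per-host service.
  return 0;
}
```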

@rafaelfsilva
Member

Option 2 is definitely the right option. Option 1 would break the heterogeneity support provided by the cloud. Also, I think it's fine to create a single multi-core VM per host in this case.

@henricasanova
Contributor Author

Thanks! I just implemented Option #2 (in a branch).

@henricasanova
Contributor Author

henricasanova commented Apr 19, 2019

At this point, I believe this issue is ready to be closed. More tests can of course be added, but that's true in general; the fault-tolerance tests are pretty good as is. The henri_failures branch has been merged into master.
