Simulation of Failures #65
Comments
This issue is being explored using the very latest, and constantly updated, SimGrid (git clone master branch). WRENCH development is thus being done in the henri_failures branch.
We need to discuss the failure semantics of the CloudService:
The above requires that the cloud service be re-engineered to keep track of jobs (as opposed to just "submit jobs to BareMetalServices and give up responsibility").
Here are my thoughts on the options:
Here is an even more basic question: A user creates two 1-core VMs on a CloudService that only has two 1-core physical hosts. One host is turned off. What happens? Do we implement a mechanism for the user to learn that one of the two VMs is off (as it may never come back)? After all, a job could have a service-specific argument for running on the particular VM that is off. Or perhaps we don't provide anything, but if the host comes back on, then that VM is restarted? Also, I almost think that a VM shouldn't run a BareMetalService, but should instead run a StandardJobExecutor for each incoming job. A lot to discuss in a conference call, I think.
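To make the scenario above concrete, here is a minimal toy model in Python. It is only a sketch for discussion: `ToyCloud`, `Host`, `VM`, and all method names are hypothetical and are not the WRENCH CloudService API. It shows the two behaviors under discussion: letting the user query which VMs are down, and optionally restarting VMs when their host comes back on.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cores: int
    is_on: bool = True


@dataclass
class VM:
    name: str
    cores: int
    host: Host
    running: bool = True


@dataclass
class ToyCloud:
    """Hypothetical stand-in for a cloud service (not the WRENCH API)."""
    hosts: dict = field(default_factory=dict)
    vms: dict = field(default_factory=dict)

    def add_host(self, name, cores):
        self.hosts[name] = Host(name, cores)

    def create_vm(self, name, cores, host_name):
        host = self.hosts[host_name]
        used = sum(vm.cores for vm in self.vms.values() if vm.host is host)
        if not host.is_on or used + cores > host.cores:
            raise RuntimeError(f"cannot place {name} on {host_name}")
        self.vms[name] = VM(name, cores, host)

    def turn_off_host(self, host_name):
        # Every VM pinned to this host goes down with it.
        host = self.hosts[host_name]
        host.is_on = False
        for vm in self.vms.values():
            if vm.host is host:
                vm.running = False

    def turn_on_host(self, host_name, restart_vms):
        # restart_vms selects between the two policies discussed above.
        host = self.hosts[host_name]
        host.is_on = True
        if restart_vms:
            for vm in self.vms.values():
                if vm.host is host:
                    vm.running = True

    def down_vms(self):
        # The "mechanism for the user to learn that a VM is off".
        return sorted(n for n, vm in self.vms.items() if not vm.running)


cloud = ToyCloud()
cloud.add_host("host1", 1)
cloud.add_host("host2", 1)
cloud.create_vm("vm1", 1, "host1")
cloud.create_vm("vm2", 1, "host2")
cloud.turn_off_host("host1")
print(cloud.down_vms())
```

Whether `restart_vms` should default to true or false is exactly the open design question; the model does not settle it.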
I have done a re-design/re-implementation of the CloudService based on last week's conference call. An issue has arisen for the HTCondor implementation. In the HTCondor Central Manager Service, the current procedure is to create a bunch of 1-core VMs, each of them requiring ComputeService::ALL_RAM. This creates a bunch of 1-core BareMetalComputeServices running on VMs. Given the current redesign (in which the submitter has access to BareMetalServices running on VMs and the CloudService no longer accepts jobs), there are a few problems with this approach:
Here are two easy options:
I think I prefer option 2), and it's not a big deal to the user at all (just replace the word "Cloud" with "VirtualizedCluster"). Question: Why have HTCondor create a bunch of 1-core VMs? What about one multi-core VM per host? (That would then allow running multi-threaded tasks.) If this is OK, then the above issues are not a problem: one just creates one VM for each host, done!
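The difference between the two VM layouts can be illustrated with a small Python sketch. The helper names (`plan_vms`, `can_run`) and the host sizes are made up for illustration and do not correspond to anything in WRENCH or HTCondor.

```python
def plan_vms(host_cores, one_vm_per_host):
    """Return a list of (host, vm_cores) pairs for the chosen layout."""
    if one_vm_per_host:
        # One multi-core VM spanning each host.
        return [(host, cores) for host, cores in host_cores.items()]
    # One 1-core VM per physical core.
    return [(host, 1) for host, cores in host_cores.items() for _ in range(cores)]


def can_run(vms, num_threads):
    """A multi-threaded task needs a single VM with enough cores."""
    return any(cores >= num_threads for _, cores in vms)


hosts = {"host1": 4, "host2": 8}  # hypothetical platform
per_core = plan_vms(hosts, one_vm_per_host=False)
per_host = plan_vms(hosts, one_vm_per_host=True)
print(len(per_core), can_run(per_core, 4))
print(len(per_host), can_run(per_host, 4))
```

With 1-core VMs no task needing more than one thread can be placed, whereas the per-host layout can place any task that fits on the largest host.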
Option 2 is definitely the right option. Option 1 would break the heterogeneity support provided by the cloud. Also, I think it's fine to create a single multi-core VM per host in this case.
Thanks! I just implemented (in a branch) Option #2.
At this point, I believe this issue is ready to be closed. More tests can of course be added, but that's true in general. The fault-tolerance tests are pretty good as they are. The henri_failures branch has been merged into master.
We should support failure traces to simulate host failures.
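As a rough illustration of what a failure trace could look like, the Python sketch below represents a trace as a list of `(time, host, state)` events and replays it to answer whether a host is up at a given time. The event format, the `"ON"`/`"OFF"` labels, and the function name are assumptions made for this example; they are not the SimGrid or WRENCH trace format.

```python
def host_is_up(trace, host, t):
    """Replay (time, host, state) events up to time t; hosts start up."""
    up = True
    for time, name, state in sorted(trace, key=lambda event: event[0]):
        if time > t:
            break
        if name == host:
            up = (state == "ON")
    return up


# Hypothetical trace: host1 fails at t=10 and recovers at t=25;
# host2 fails at t=40 and never comes back.
trace = [
    (10.0, "host1", "OFF"),
    (25.0, "host1", "ON"),
    (40.0, "host2", "OFF"),
]

for t in (5.0, 15.0, 30.0, 50.0):
    print(t, host_is_up(trace, "host1", t), host_is_up(trace, "host2", t))
```

An event that occurs exactly at time `t` is treated as already applied, which is one possible convention among several.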