Test pilot resubmission #255
Comments
Moving this to the next milestone as it requires more discussion.
Write test case.
@iparask this needs to be picked up again in the context of the fault-tolerance/resilience development roadmap.
Yes, it is scheduled for December.
I realized that this capability is not offered anymore. @mturilli, do you think we should bring it back?
Yes, we need to discuss fault tolerance in EnTK as a main item of the development roadmap. Within that discussion, we need to think about pilot resubmission in case of RP failure. Part of that discussion is what we have already discussed about task resubmission in case of failure (we mentioned callbacks and expanding the current API to express resubmission limits and fall-back options). Should we close this ticket and add all this to the EnTK development roadmap at https://github.com/radical-cybertools/radical.entk/wiki/Development ?
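The idea of expanding the API with resubmission limits, fall-back options and callbacks could be sketched roughly as follows. This is a minimal, hypothetical illustration only: `ResubmissionPolicy`, `submit_with_resubmission`, `SubmissionFailed` and the `submit` callable are made-up names, not part of EnTK or RADICAL-Pilot, and the sketch does not model pilots, UIDs or MongoDB state at all.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ResubmissionPolicy:
    """Hypothetical policy object; none of these names exist in EnTK."""
    max_retries: int = 2                                # attempts per resource
    fallbacks: List[str] = field(default_factory=list)  # resources tried next, in order


class SubmissionFailed(Exception):
    """Raised by the (hypothetical) submit callable when a pilot fails."""


def submit_with_resubmission(
    submit: Callable[[str], str],
    resource: str,
    policy: ResubmissionPolicy,
    on_failure: Optional[Callable[[str, int, Exception], None]] = None,
) -> str:
    """Try `resource`, then each fallback, up to `max_retries` times each.

    `submit` stands in for whatever actually launches a pilot: it returns
    a pilot id on success and raises SubmissionFailed otherwise.
    """
    failures = 0
    for target in [resource, *policy.fallbacks]:
        for attempt in range(1, policy.max_retries + 1):
            try:
                return submit(target)
            except SubmissionFailed as exc:
                failures += 1
                if on_failure is not None:
                    on_failure(target, attempt, exc)  # user-supplied callback
    raise SubmissionFailed(f"all {failures} attempts failed")


# Usage: the primary resource always fails, the fallback succeeds.
def fake_submit(target: str) -> str:
    if target == "primary":
        raise SubmissionFailed(f"{target} down")
    return f"pilot.{target}"


log = []
pid = submit_with_resubmission(
    fake_submit,
    "primary",
    ResubmissionPolicy(max_retries=2, fallbacks=["backup"]),
    on_failure=lambda t, a, e: log.append((t, a)),
)
print(pid)  # pilot.backup
print(log)  # [('primary', 1), ('primary', 2)]
```

Keeping the limits and fallbacks in a separate policy object, with failures reported through a callback, is just one possible design; the actual EnTK API shape was left open in this discussion.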
I created the following script to test part of what Vivek wrote in the first comment. It is:
The second unit manager seems to get a different uid:
So I think that, at least for the Unit Manager, this is not an issue anymore. I will develop tests for pilot failures to test the same messages Vivek reported.
I would think that getting a new UID is an issue, because it won't be able to reconnect to its pilots and will not be able to collect the units it was managing; it is basically a new, virgin UMGR then. Or is that what you expect?
This can be closed as well. See PR #560.
- The corresponding methods of TMGR are already tested (methods `start_manager` and `terminate_manager` handle the attribute `_tmgr_process`).
Pilot resubmission / umgr recreation uses the same uid and causes a mongo error