Test pilot resubmission #255

Closed
vivek-bala opened this issue Jul 31, 2018 · 9 comments

Comments

@vivek-bala
Contributor

Pilot resubmission / umgr recreation uses the same uid and causes a mongo error

@vivek-bala
Contributor Author

Moving this to the next milestone as it requires more discussion

@andre-merzky
Member

write test case

@andre-merzky andre-merzky removed their assignment Jan 27, 2020
@lee212 lee212 added this to the Jan 2021 Release milestone Jul 7, 2020
@mturilli
Contributor

@iparask this needs to be picked up again in the context of the fault-tolerance/resilience development roadmap.

@iparask
Contributor

iparask commented Nov 23, 2020

Yes, it is scheduled for December.

@iparask
Contributor

iparask commented Dec 2, 2020

I realized that this capability is not offered anymore. @mturilli, do you think we should bring it back?

@mturilli
Contributor

mturilli commented Dec 3, 2020

Yes, we need to discuss fault-tolerance in EnTK as a main item of the development roadmap. Within that discussion, we need to think about pilot resubmission in case of RP failure. That also ties in with what we have already discussed about task resubmission in case of failure (we mentioned callbacks and expanding the current API to express resubmission limits and fall-back options). Should we close this ticket and add all this to the EnTK development roadmap at https://github.com/radical-cybertools/radical.entk/wiki/Development ?
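To make the idea of resubmission limits, callbacks and fall-back options a bit more concrete, here is a minimal, library-independent sketch. It is not the EnTK or RP API: `TaskFailed`, `run_with_resubmission`, `make_flaky` and all parameter names are hypothetical and only illustrate one way such a policy could be expressed.

```python
class TaskFailed(Exception):
    """Hypothetical error raised when one attempt of a task fails."""


def run_with_resubmission(task, max_resubmissions, on_failure=None, fallback=None):
    """Run `task`, resubmitting it at most `max_resubmissions` times.

    `on_failure(attempt, exc)` is called after every failed attempt.
    If all attempts fail, `fallback()` is used when given; otherwise
    the last failure is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except TaskFailed as exc:
            if on_failure is not None:
                on_failure(attempts, exc)
            # The first run is not a resubmission, so allow one extra attempt.
            if attempts > max_resubmissions:
                if fallback is not None:
                    return fallback()
                raise


def make_flaky(failures):
    """Build a task that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise TaskFailed("simulated failure")
        return "done"

    return task


if __name__ == "__main__":
    # Fails twice, then succeeds within the limit of three resubmissions.
    print(run_with_resubmission(make_flaky(2), max_resubmissions=3))
    # Never succeeds within the limit, so the fall-back option is used.
    print(run_with_resubmission(make_flaky(5), max_resubmissions=1,
                                fallback=lambda: "fallback"))
```

Keeping the limit, the callback and the fall-back as separate arguments is only one possible design; they could equally be attributes of a task description.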

@iparask
Contributor

iparask commented Dec 3, 2020

I created the following script to test part of what Vivek described in the first comment:

import multiprocessing as mp
import os
import signal
import time

import radical.pilot as rp


def create_umgrs(session):
    # Create a unit manager in this process, print its uid, then close it.
    umgr = rp.UnitManager(session=session)
    print(umgr.uid)
    time.sleep(30)
    umgr.close()


if __name__ == "__main__":
    session = rp.Session()

    # Start the first unit manager and kill it before it can close cleanly.
    tmgr_process = mp.Process(target=create_umgrs, name='task-manager', args=(session,))
    tmgr_process.start()
    time.sleep(10)
    os.kill(tmgr_process.pid, signal.SIGKILL)

    # Start a second unit manager on the same session and let it finish.
    tmgr_process2 = mp.Process(target=create_umgrs, name='task-manager', args=(session,))
    tmgr_process2.start()
    time.sleep(120)
    #tmgr_process.join()
    tmgr_process2.join()
    session.close()

The second unit manager seems to get a different uid:

new session: [rp.session.js-17-185.jetstream-cloud.org.iparask.018599.0001]    \
database   : [mongodb://iparask:****@129.114.17.185:27017/iparask/]           ok
create unit managercreate unit manager/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/pymongo/topology.py:162: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
                                                           ok
umgr.0001
close unit manager                                                            ok
closing session rp.session.js-17-185.jetstream-cloud.org.iparask.018599.0001   \
session lifetime: 142.6s

So I think that, at least for the Unit Manager, this is no longer an issue. I will develop tests for pilot failures to check whether they trigger the same error messages Vivek reported.
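As a side note on the `MongoClient opened before fork` warning in the output above: PyMongo asks for the client to be created only after forking. The sketch below illustrates that pattern with a stand-in class instead of a real database client; `FakeClient` and `worker` are hypothetical names, and nothing here reflects how RP manages its connections internally.

```python
import multiprocessing as mp
import os


class FakeClient:
    """Stand-in for a client that should not be shared across a fork."""

    def __init__(self):
        # Remember which process created this client.
        self.owner_pid = os.getpid()

    def used_in_owner_process(self):
        # True only when called from the process that created the client.
        return self.owner_pid == os.getpid()


def worker(queue):
    # The client is created inside the child, i.e. after the fork.
    client = FakeClient()
    queue.put(client.used_in_owner_process())


if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=worker, args=(queue,))
    proc.start()
    # Reports whether the child used a client that it created itself.
    print(queue.get(timeout=30))
    proc.join()
```

Passing only plain data (for example a connection URL) to the child and building the client there is the usual way to follow this advice.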

@andre-merzky
Member

> The second unit manager seems to get a different uid
> [...]
> for the Unit Manager that is not an issue anymore

I would think that getting a new UID is an issue? Because it won't be able to reconnect to its pilots, and will not be able to collect the units it was managing. It is basically a new and virgin UMGR then... Or is that what you expect?
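A toy illustration of this concern, using an in-memory dictionary in place of the session database. `FakeStore`, its methods and the uid strings are hypothetical and do not describe how RP actually records units; the sketch only shows why state keyed by the manager uid is invisible to a manager that comes back under a different uid.

```python
class FakeStore:
    """In-memory stand-in for a database that keys units by manager uid."""

    def __init__(self):
        self.units_by_manager = {}

    def add_unit(self, manager_uid, unit_uid):
        # Record that `unit_uid` is managed by `manager_uid`.
        self.units_by_manager.setdefault(manager_uid, []).append(unit_uid)

    def units_of(self, manager_uid):
        # Return the units recorded for this manager uid (empty if unknown).
        return list(self.units_by_manager.get(manager_uid, []))


store = FakeStore()

# The original manager registers two units before it is killed.
store.add_unit("umgr.0000", "unit.000000")
store.add_unit("umgr.0000", "unit.000001")

# A replacement that reuses the old uid can find the units again ...
recovered_same_uid = store.units_of("umgr.0000")

# ... while a replacement with a fresh uid starts with nothing.
recovered_new_uid = store.units_of("umgr.0001")

print(recovered_same_uid)
print(recovered_new_uid)
```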

@iparask iparask removed this from the Dec 2020 release milestone Dec 7, 2020
@iparask
Contributor

iparask commented Apr 9, 2021

This can be closed as well. See PR #560

@iparask iparask closed this as completed Apr 9, 2021
mtitov added a commit that referenced this issue Jul 27, 2022
- corresponding methods of TMGR are already tested (methods `start_manager` and `terminate_manager` handle attribute `_tmgr_process`)