Pilot execution fails with job_name set #2248

Closed
iparask opened this issue Nov 3, 2020 · 3 comments · Fixed by #2249

iparask commented Nov 3, 2020

When using the devel branch with the job_name pilot attribute set, the pilot fails while the job remains in the queue. The job name is set correctly, but I think PMGR is checking the wrong pilot id. I see the following error in the PMGR log:

1604425514.076 : pmgr.0000            : 16745 : 140557133264640 : DEBUG    : state pulled: pilot.0000: PMGR_LAUNCHING
1604425516.913 : pmgr.0000            : 16745 : 140556688680704 : DEBUG    : state event: {'cmd': 'update', 'arg': [{'session': 'rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006', 'pmgr': 'pmgr.0000', 'uid': 'pilot.0000', 'type': 'pilot', 'state': 'FAILED', 'log': None, 'stdout': None, 'stderr': None, 'resource': 'xsede.bridges', 'resource_sandbox': 'gsisftp://bridges.psc.xsede.org:2222/pylon5/mc3bggp/paraskev/radical.pilot.sandbox', 'session_sandbox': 'gsisftp://bridges.psc.xsede.org:2222/pylon5/mc3bggp/paraskev/radical.pilot.sandbox/rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006', 'pilot_sandbox': 'gsisftp://bridges.psc.xsede.org:2222/pylon5/mc3bggp/paraskev/radical.pilot.sandbox/rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006/pilot.0000/', 'client_sandbox': 'None', 'js_url': 'slurm+gsissh://bridges.psc.xsede.org:2222/', 'js_hop': 'gsissh://bridges.psc.xsede.org:2222/', 'description': {'resource': 'xsede.bridges', 'access_schema': 'gsissh', 'runtime': 30, 'app_comm': [], 'sandbox': None, 'cores': 24, 'gpus': 0, 'memory': 0, 'queue': 'RM', 'job_name': 'test_job', 'project': 'mc3bggp', 'cleanup': False, 'candidate_hosts': [], 'exit_on_error': True, 'input_staging': [], 'output_staging': [], 'layout': 'default'}, 'resource_details': None, '_id': 'pilot.0000', 'control': 'pmgr', 'states': ['NEW'], 'cmd': [], 'cfg': {'bridges': {'agent_staging_input_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 0}, 'agent_scheduling_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 0}, 'agent_executing_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 0}, 'agent_staging_output_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 0}, 'funcs_req_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 1}, 'funcs_res_queue': {'kind': 'queue', 'log_level': 'error', 'stall_hwm': 0, 'bulk_size': 1}, 'agent_unschedule_pubsub': 
{'kind': 'pubsub', 'log_level': 'error'}, 'agent_schedule_pubsub': {'kind': 'pubsub', 'log_level': 'error'}, 'control_pubsub': {'kind': 'pubsub', 'log_level': 'error'}, 'state_pubsub': {'kind': 'pubsub', 'log_level': 'error'}}, 'bulk_collection_size': 1024, 'bulk_collection_time': 1.0, 'bulk_size': 1024, 'bulk_time': 1.0, 'components': {'update': {'count': 1}, 'agent_staging_input': {'count': 1}, 'agent_scheduling': {'count': 1}, 'agent_executing': {'count': 1}, 'agent_staging_output': {'count': 1}}, 'db_poll_sleeptime': 2.0, 'heartbeat': {'interval': 1.0, 'timeout': 60.0}, 'mode': 'shared', 'target': 'local', 'owner': 'agent.0', 'cores': 24, 'gpus': 0, 'spawner': 'POPEN', 'scheduler': 'CONTINUOUS', 'runtime': 30, 'app_comm': [], 'dburl': 'mongodb://iparask:[email protected]:27017/iparask/', 'sid': 'rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006', 'pid': 'pilot.0000', 'pmgr': 'pmgr.0000', 'logdir': '.', 'pilot_sandbox': '/pylon5/mc3bggp/paraskev/radical.pilot.sandbox/rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006/pilot.0000/', 'session_sandbox': '/pylon5/mc3bggp/paraskev/radical.pilot.sandbox/rp.session.js-17-185.jetstream-cloud.org.iparask.018569.0006', 'resource_sandbox': '/pylon5/mc3bggp/paraskev/radical.pilot.sandbox', 'resource_manager': 'SLURM', 'agent_launch_method': 'SSH', 'task_launch_method': 'SSH', 'mpi_launch_method': 'MPIRUN', 'cores_per_node': 0, 'gpus_per_node': 2, 'lfs_path_per_node': '${LOCAL}', 'lfs_size_per_node': 3713368, 'cu_tmp': None, 'export_to_cu': [], 'cu_pre_exec': [], 'cu_post_exec': None, 'resource_cfg': {'description': "The XSEDE 'Bridges' cluster at PSC (https://portal.xsede.org/psc-bridges/).", 'notes': 'Always set the ``project`` attribute in the ComputePilotDescription.', 'schemas': ['gsissh', 'ssh', 'go'], 'gsissh': {'job_manager_endpoint': 'slurm+gsissh://bridges.psc.xsede.org:2222/', 'filesystem_endpoint': 'gsisftp://bridges.psc.xsede.org:2222/'}, 'ssh': {'job_manager_endpoint': 
'slurm+ssh://bridges.psc.xsede.org/', 'filesystem_endpoint': 'sftp://bridges.psc.xsede.org/'}, 'go': {'job_manager_endpoint': 'slurm+ssh://bridges.psc.xsede.org/', 'filesystem_endpoint': 'go://xsede#bridges/'}, 'default_queue': 'RM', 'resource_manager': 'SLURM', 'lfs_path_per_node': '${LOCAL}', 'lfs_size_per_node': 3713368, 'agent_scheduler': 'CONTINUOUS', 'agent_spawner': 'POPEN', 'agent_launch_method': 'SSH', 'task_launch_method': 'SSH', 'mpi_launch_method': 'MPIRUN', 'pre_bootstrap_0': ['module reset', 'module load gcc', 'module load mpi/gcc_openmpi', 'module load slurm', 'module load anaconda3'], 'default_remote_workdir': '$SCRATCH', 'valid_roots': ['/home', '/pylon1', '/pylon5'], 'rp_version': 'local', 'virtenv_mode': 'create', 'python_dist': 'anaconda', 'export_to_cu': [], 'cu_pre_exec': [], 'gpus_per_node': 2, 'system_architecture': {'gpu': 'p100'}, 'job_manager_endpoint': 'slurm+gsissh://bridges.psc.xsede.org:2222/', 'filesystem_endpoint': 'gsisftp://bridges.psc.xsede.org:2222/'}, 'debug': 10}}]}
1604425516.913 : pmgr.0000            : 16745 : 140556688680704 : DEBUG    : state push: pilot.0000: FAILED
1604425516.914 : pmgr.0000            : 16745 : 140556688680704 : DEBUG    : update pilot.0000
1604425516.914 : pmgr.0000            : 16745 : 140556688680704 : DEBUG    : call <bound method ComputePilot._default_state_cb of ['pilot.0000', 'xsede.bridges', 'FAILED']>
1604425516.914 : pmgr.0000            : 16745 : 140556688680704 : DEBUG    : pilot.0000 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0000', 'xsede.bridges', 'FAILED']>
1604425516.914 : pmgr.0000            : 16745 : 140556688680704 : INFO     : [Callback]: pilot pilot.0000 state: FAILED.
1604425516.915 : pmgr.0000            : 16745 : 140556688680704 : ERROR    : [Callback]: pilot 'pilot.0000' failed (exit)
1604425516.915 : pmgr.0000            : 16745 : 140556688680704 : ERROR    : listener died
Traceback (most recent call last):
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/utils/zmq/pubsub.py", line 314, in _listener
    cb(t, m)
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 320, in _state_sub_cb
    if not self._update_pilot(thing, publish=False):
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 361, in _update_pilot
    self._pilots[pid]._update(pilot_dict)
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 206, in _update
    else      : cb([self])
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 152, in _default_state_cb
    ru.cancel_main_thread('int')
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/utils/threads.py", line 164, in cancel_main_thread
    sys.exit()
SystemExit
1604425517.159 : pmgr.0000            : 16745 : 140558571181824 : DEBUG    : pilot(s).need(s) cancellation ['pilot.0000']
1604425517.163 : pmgr.0000            : 16745 : 140557720459008 : DEBUG    : command incoming: {'cmd': 'cancel_pilots', 'arg': {'pmgr': 'pmgr.0000', 'uids': ['pilot.0000']}}
1604425517.163 : pmgr.0000            : 16745 : 140557720459008 : DEBUG    : command ignored: cancel_pilots
1604425518.112 : pmgr.0000            : 16745 : 140558571181824 : INFO     : Closed PilotManager pmgr.0000.

and in the pmgr_launching log:

1604425515.109 : pmgr_launching.0000  : 16881 : 140112872584960 : DEBUG    : hb pmgr_launching.0000 beat [cmgr.0001]
1604425516.111 : pmgr_launching.0000  : 16881 : 140112872584960 : DEBUG    : hb pmgr_launching.0000 beat [cmgr.0001]
1604425516.910 : pmgr_launching.0000  : 16881 : 140113422976768 : ERROR    : bulk launch failed
Traceback (most recent call last):
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 562, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/iparask/miniconda3/envs/rct_devel/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 844, in _start_pilot_bulk
    assert(pilot)
AssertionError
1604425516.910 : pmgr_launching.0000  : 16881 : 140113422976768 : DEBUG    : advance bulk: 1 [False, True]
1604425516.913 : pmgr_launching.0000  : 16881 : 140113132627712 : DEBUG    : bulk states: []
1604425517.111 : pmgr_launching.0000  : 16881 : 140112872584960 : DEBUG    : hb pmgr_launching.0000 beat [cmgr.0001]
1604425517.164 : pmgr_launching.0000  : 16881 : 140113157805824 : DEBUG    : command incoming: {'cmd': 'cancel_pilots', 'arg': {'pmgr': 'pmgr.0000', 'uids': ['pilot.0000']}}

iparask commented Nov 4, 2020

I found where the error is generated: it is in the for loop at line 839 of the PMGR launcher (pmgr/launching/default.py). That loop compares pid, which is equal to the SAGA job name, against the uid of each pilot. I added a debug message, and here is what I got:

 839             for p in pilots:
 840                 self._log.debug(' check: %s %s', p['uid'], pid)
 841                 if p['uid'] == pid:
 842                     pilot = p
 843                     break
 844
 845             assert(pilot)
(rct_devel) iparask@js-17-185:~/Git/RADICAL/radical.pilot/examples/rp.session.js-17-185.jetstream-cloud.org.iparask.018570.0009$ grep 'check:' *.log
pmgr_launching.0000.log:1604512377.180 : pmgr_launching.0000  : 28106 : 139787327387392 : DEBUG    :  check: pilot.0000 test_name
(rct_devel) iparask@js-17-185:~/Git/RADICAL/radical.pilot/examples/rp.session.js-17-185.jetstream-cloud.org.iparask.018570.0009$
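
The mismatch can be reproduced outside RP with plain dictionaries. This is only a minimal sketch, not the RP code: `find_pilot_by_uid` is a hypothetical helper that mirrors the loop above, the pilot dict is reduced to the fields seen in the log, and it assumes that the SAGA job name equals the pilot uid when no job_name is set.

```python
def find_pilot_by_uid(pilots, pid):
    """Mirror of the loop above: return the pilot whose uid equals pid."""
    for p in pilots:
        if p['uid'] == pid:
            return p
    return None


# reduced stand-in for the pilot dict seen in the PMGR log
pilots = [{'uid': 'pilot.0000', 'description': {'job_name': 'test_name'}}]

# assumption: without job_name the job is named after the pilot uid,
# so the lookup succeeds
found_by_uid = find_pilot_by_uid(pilots, 'pilot.0000')

# with job_name set, pid carries the job name: no uid matches, which is
# the case where assert(pilot) fails in the launcher
found_by_job_name = find_pilot_by_uid(pilots, 'test_name')
```

The second lookup returns `None`, which matches the `check: pilot.0000 test_name` debug line above.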

How should we proceed on this?

andre-merzky commented

Good catch! I think the fix might be something like this:

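            pilot = None
            for p in pilots:
                # match on the name from the description, else on the uid
                p_name = p['description'].get('name', p['uid'])
                if p_name == jd.name:
                    pilot = p
                    break

            assert(pilot)
            # pid is always the uid of the matched pilot
            pid = pilot['uid']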

or something along those lines. Basically: use the name set in the description when it exists, otherwise use the uid, and always set pid to the uid. What do you think?
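
A self-contained sketch of that idea, written against plain dictionaries rather than the RP classes, might look as follows. `match_pilot` is a hypothetical helper, and the `job_name` key is an assumption taken from the pilot description printed in the PMGR log (the snippet above uses `name`). The sketch also treats a `job_name` of `None` like a missing one, which a plain `.get(key, default)` would not do.

```python
def match_pilot(pilots, job_name):
    """Return the pilot whose job name (or uid, if no name is set) matches."""
    for p in pilots:
        # `or` also covers a job_name that is present but None or empty
        p_name = p['description'].get('job_name') or p['uid']
        if p_name == job_name:
            return p
    return None


# reduced stand-ins for pilot dicts; only the fields used here are kept
pilots = [
    {'uid': 'pilot.0000', 'description': {'job_name': 'test_job'}},
    {'uid': 'pilot.0001', 'description': {'job_name': None}},
]

named   = match_pilot(pilots, 'test_job')    # matched via job_name
unnamed = match_pilot(pilots, 'pilot.0001')  # falls back to the uid

# pid is always the pilot uid, independent of the job name
pid = named['uid']
```

One caveat of matching on names: two pilots submitted with the same job_name cannot be told apart this way, so in this sketch the first match wins.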

iparask commented Nov 5, 2020

Let me give it a try and I will let you know either with a PR or with a new comment.
