Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support job name #2249

Merged
merged 5 commits into from
Nov 10, 2020
Merged

Support job name #2249

merged 5 commits into from
Nov 10, 2020

Conversation

iparask
Copy link
Contributor

@iparask iparask commented Nov 5, 2020

This PR makes sure that the pilot manager gets the correct information when launching pilots that have the job_name attribute set. It also takes into account that multiple pilots have the same job_name.

It closes #2248.

@codecov
Copy link

codecov bot commented Nov 5, 2020

Codecov Report

Merging #2249 into devel will decrease coverage by 0.30%.
The diff coverage is 36.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##            devel    #2249      +/-   ##
==========================================
- Coverage   31.67%   31.37%   -0.31%     
==========================================
  Files          74       81       +7     
  Lines        7308     8640    +1332     
==========================================
+ Hits         2315     2711     +396     
- Misses       4993     5929     +936     
Impacted Files Coverage Δ
src/radical/pilot/agent/executing/sleep.py 25.00% <0.00%> (ø)
src/radical/pilot/db/database.py 15.00% <0.00%> (-0.53%) ⬇️
src/radical/pilot/pilot_manager.py 11.64% <0.00%> (ø)
src/radical/pilot/session.py 14.46% <0.00%> (ø)
src/radical/pilot/utils/component.py 16.25% <0.00%> (+5.99%) ⬆️
src/radical/pilot/task_overlay/master.py 13.93% <2.32%> (-0.35%) ⬇️
src/radical/pilot/task_overlay/worker.py 11.20% <8.33%> (-0.20%) ⬇️
src/radical/pilot/agent/agent_0.py 13.82% <9.09%> (-0.41%) ⬇️
src/radical/pilot/agent/agent_n.py 31.81% <16.66%> (-51.52%) ⬇️
src/radical/pilot/agent/executing/shell_fs.py 8.36% <20.00%> (ø)
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30514bc...63be5c0. Read the comment docs.

Copy link
Member

@andre-merzky andre-merzky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

if p['uid'] == pid:
# we do not force unique job_names and multiple pilots may have
# the same job_name. By checking if p['uid'] is in PMGR pilots
# we ensure that each pilot is checked only once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should actually require unique names. I mean, otherwise a cancel() call can well end up on the wrong pilot if we did not get the order right in this piece of code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with multiple pilots as well. It requires the name only during starting the pilots. After that it uses the job id that the scheduler provides to identify different jobs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure - but I am not sure the mapping from pilot UID to pilot job is unique then. For uniform pilots is should not matter, and we don't regularly run non-uniform pilots in the same session, so this is really an edge case. And I don't have suggestion... :-P So, ultimately this looks good to me - thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW just to confirm, isn't the order of jobs, jobs descriptions and pilots is consistent? (and kinda it should be forced if not). Thus the comparison of pilot and job names shouldn't be matter (and that extra loops could be skipped)

for j, jd, pilot in zip(jc.get_tasks(), jd_list, pilots):
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtitov I am not sure I understand. Can you expand it a bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iparask yeah, that my proposal will allow to remove L835-853: list jd_list is formed from list pilots (link), thus jobs descriptions order corresponds to the order of pilots, and we don't need to look for which pilot is related to which job description based on their names/uids, but rather use same index in their lists. And this line could be converted into for j, jd, pilot in zip(jc.get_tasks(), jd_list, pilots). Hope I didn't bring more confusion

Copy link
Contributor Author

@iparask iparask Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This makes the assumption that the pilots are in the same order as their corresponding SAGA jobs. @andre-merzky is this true? If it is I think implementing @mtitov's suggestion makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we didn't resolve this? btw, there already was an assumption about SAGA jobs and job descriptions

        # we assume here that the tasks arrive in the same order as the job
        # descriptions.  For uniform sets of pilots the order does not matter
        # much though.  Either way, this needs confirming on SAGA level
        # FIXME

and list jd_list was created from list pilots, which assumes that the order is preserved. Just not dealing with pilot/job names (at least here) simplifies this part.

Copy link
Member

@andre-merzky andre-merzky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See lint-related comment - thanks

src/radical/pilot/pmgr/launching/default.py Outdated Show resolved Hide resolved
Copy link
Member

@andre-merzky andre-merzky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@iparask iparask merged commit 727f3ef into devel Nov 10, 2020
@iparask iparask deleted the fix/issue_2248 branch November 10, 2020 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pilot execution fails with job_name set
3 participants