rc1.d/01-enclosing-instance: write URIs as annotations instead of into kvs? #3103

Closed
grondo opened this issue Aug 1, 2020 · 6 comments

@grondo
Contributor

grondo commented Aug 1, 2020

On Slack, a question was asked about the easiest way to run flux jobs against a subinstance.

We already have 01-enclosing-instance which writes a local_uri and remote_uri to the parent's kvs namespace for the job. However, these values are not accessible from flux jobs, so instead they must be retrieved via flux job info.

Another idea would be to record these values as job annotations, now that we have this feature. That would make the URIs immediately accessible via flux jobs. Here's an example:

#!/bin/bash

# Inform the enclosing instance (if any) of the URIs for this instance

level=$(flux getattr instance-level)
if test "$level" -gt 0; then
    local_uri=${FLUX_URI}
    # Strip the scheme from the local URI and prefix it with ssh://HOST/
    remote_uri="ssh://$(hostname)/$(echo "$local_uri" | sed 's,^.*://,,')"
    flux --parent job annotate "${FLUX_JOB_ID}" remote_uri "${remote_uri}"
    flux --parent job annotate "${FLUX_JOB_ID}" local_uri "${local_uri}"
fi
ƒ(s=4,d=0,builddir) grondo@asp:~$ flux mini batch -n2 --wrap 'flux mini submit sleep 120; flux mini submit sleep 120; flux mini submit sleep 100; flux queue drain'
50380905906176
ƒ(s=4,d=0,builddir) grondo@asp:~$ flux jobs -o '{id.f58:>12} {user.local_uri:<32} {user.remote_uri:<32}'
       JOBID USER.LOCAL_URI                   USER.REMOTE_URI                 
   ƒPpRR8kWB local:///tmp/flux-GTL4N2/0/local ssh://asp//tmp/flux-GTL4N2/0/local
ƒ(s=4,d=0,builddir) grondo@asp:~$ FLUX_URI=$(flux jobs -no {user.local_uri} ƒPpRR8kWB) flux jobs
       JOBID USER     NAME       ST NTASKS NNODES  RUNTIME RANKS
     ƒXPkH51 grondo   sleep      PD      1      -        - -
     ƒRSKAQf grondo   sleep       R      1      1   48.81s 0
     ƒKNx6cw grondo   sleep       R      1      1   49.05s 0
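
The remote URI construction in the script above can be isolated into a small function and checked against the output shown. This is only a sketch: `remote_uri_for` is a hypothetical name that does not exist in flux-core, and it uses shell parameter expansion where the script uses `sed`.

```shell
#!/bin/bash
# Hypothetical helper (not part of flux-core): build the ssh:// form of a
# local:// URI the same way the rc1 script above does.
remote_uri_for() {
    local local_uri=$1 host=$2
    # Remove everything through the last "://", mirroring sed 's,^.*://,,'
    printf 'ssh://%s/%s\n' "$host" "${local_uri##*://}"
}

remote_uri_for "local:///tmp/flux-GTL4N2/0/local" "asp"
# prints ssh://asp//tmp/flux-GTL4N2/0/local
```

The doubled slash after the hostname comes from joining `ssh://HOST/` with a path that is already absolute, which matches the USER.REMOTE_URI column above.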

@dongahn
Member

dongahn commented Aug 1, 2020

@grondo:

Neat idea!

From today's mini hackathon with the COVID user, I felt it would go a long way if we had a high-level command that can walk the entire instance hierarchy and print out job information hierarchically.

Maybe this trick can be extended to make such hierarchical walk automated...

@grondo
Contributor Author

grondo commented Aug 2, 2020

I felt it would go a long way if we had a high-level command that can walk the entire instance hierarchy and print out job information hierarchically.

I wonder if there should be a different command that lists job information hierarchically? If an instance hierarchy is required, then presumably there are too many jobs for a single instance, so recursive job listing should not be encouraged. If there is a small enough number of jobs that recursive listing will be scalable (and listing all jobs is a requirement), then perhaps a hierarchy of instances wasn't required.

For the cases where a hierarchy is required to handle a large number of jobs, perhaps a more appropriate tool would display aggregate job information instead of listing jobs recursively (e.g. counts of running, pending, completed, and failed jobs)?
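
As a rough illustration of the aggregate idea, the sketch below counts jobs per state from `flux jobs` output read on stdin. The function name `count_states` is hypothetical. It assumes the default column layout shown earlier in this issue, where the state (ST) is the fourth whitespace-separated field, and it assumes job names contain no spaces.

```shell
#!/bin/bash
# Hypothetical helper: summarize `flux jobs` output read from stdin.
# Assumes the default column layout, with the job state (ST) in field 4.
count_states() {
    # Skip the header line, tally field 4, and print "STATE COUNT" pairs
    awk 'NR > 1 { n[$4]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (inside an instance): flux jobs | count_states
```

For the three-job listing shown earlier, this would report one PD job and two R jobs.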

@dongahn
Member

dongahn commented Aug 2, 2020

I wonder if there should be a different command that lists job information hierarchically?

Yes, for large-scale cases, we want this in a different command.

If there is a small enough number of jobs that recursive listing will be scalable (and listing all jobs is a requirement), then perhaps a hierarchy of instances wasn't required.

Yes. But I was thinking of this mostly as a usability concern. Currently flux mini batch creates a Flux instance (for good reasons), but this is implicit and users would not know it. This can be confusing, although one can argue users can be trained.

From my recent mini hackathon, it was clear that the following was what the user wanted to see right after we did the first-level flux jobs, but we found there wasn't an easy way. We ended up having our script write FLUX_URI to a file and used flux proxy against it.

ƒ(s=4,d=0,builddir) grondo@asp:~$ flux mini batch -n2 --wrap 'flux mini submit sleep 120; flux mini submit sleep 120; flux mini submit sleep 100; flux queue drain'
ƒ(s=4,d=0,builddir) grondo@asp:~$ FLUX_URI=$(flux jobs -no {user.local_uri} ƒPpRR8kWB) flux jobs
      JOBID USER     NAME       ST NTASKS NNODES  RUNTIME RANKS
    ƒXPkH51 grondo   sleep      PD      1      -        - -
    ƒRSKAQf grondo   sleep       R      1      1   48.81s 0
    ƒKNx6cw grondo   sleep       R      1      1   49.05s 0

The new syntax to get this is definitely far better than what we did in our recent mini hackathon! From the perspective of regular batch users with no CS background, though, I was thinking something along the lines of flux proxy <JOBID> flux jobs would be simpler. This can be a basis for a user script to customize what they want to see hierarchically.

Of course, this won't work if the JOBID isn't a flux instance. So flux proxy should be able to handle this gracefully.

Related: Do we want to mark a Flux instance job as a special case for flux listing tools? There seem to be certain additional things, like recursive queries, that you want to be able to do on such jobs. One can argue users can easily find that by querying the name field, though. Maybe a wrapper command that takes the JOBID and tells whether it is a Flux instance or not can make scripting easier.

@grondo
Contributor Author

grondo commented Aug 2, 2020

I was thinking something along the lines of flux proxy <JOBID> flux jobs would be simpler.

This is not a bad idea! In combination with a solution to #2298, flux proxy could use the "guest exec" support to launch its shell instead of ssh, which would drop the need for passwordless ssh/rsh support to nodes, the requirement for a PAM plugin for access (#2533), and even the need for notifying the parent of the "remote" URI (since only the local uri would be required in this scenario).

flux proxy does require a URI as its argument though, so the usage might have to be flux proxy jobid://<jobid>, which is a bit unfortunate. However, perhaps we could add a porcelain command to wrap flux proxy in this case to hide that from users.

Note also that flux proxy is going to be pretty heavyweight for running single commands. I wonder whether, if we had the flux exec --jobid support described in #2298, the shell exec plugin could optionally set the correct FLUX_URI when it has spawned a child instance of Flux (the shell plugin could "watch" for the child instance to register its local_uri in the parent kvs namespace). Then flux exec --jobid=JOBID flux jobs would work as expected. (Again, perhaps a job-specific porcelain command would be warranted here.)

@grondo
Contributor Author

grondo commented Aug 2, 2020

Do we want to mark a Flux instance job as a special case for flux listing tools? There seem to be certain additional things, like recursive queries, that you want to be able to do on such jobs. One can argue users can easily find that by querying the name field, though. Maybe a wrapper command that takes the JOBID and tells whether it is a Flux instance or not can make scripting easier.

This can be done currently by checking flux job info JOBID guest.flux.local_uri. Alternatively, if we changed the local_uri from a kvs value to an annotation, then any job that has the uri annotation is a child instance. (It would actually be nice to use an ephemeral annotation, if that were supported, because there is no use in storing the local and remote URIs in the kvs after the job has exited.)
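
A minimal sketch of such a wrapper, built on the kvs check above. The function name `is_flux_instance` is hypothetical, and the sketch assumes that `flux job info` exits nonzero when the requested key does not exist for the job.

```shell
#!/bin/bash
# Hypothetical wrapper: succeed if JOBID has registered a local_uri in its
# guest kvs namespace, i.e. it appears to be a child Flux instance.
# Assumes `flux job info` exits nonzero when the key does not exist.
is_flux_instance() {
    flux job info "$1" guest.flux.local_uri >/dev/null 2>&1
}

# Usage: is_flux_instance JOBID && echo "JOBID is a Flux instance"
```

Note that the URI is written by the child instance's rc1 script, so a job that has only just started may not have registered it yet.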

@grondo
Contributor Author

grondo commented Dec 14, 2021

I believe this issue has been resolved or superseded by #3986, #4004 and #3999.

The broker now notifies the parent of instance URIs via the uri job memo, which can be found in the {user.uri} attribute from flux-jobs or the Python JobInfo class, or indirectly by using flux uri with the implicit jobid class.

I was thinking something along the lines of flux proxy <JOBID> flux jobs would be simpler. This can be a basis for a user script to customize what they want to see hierarchically.

As of #4004, this now works:

$ flux proxy JOBID flux jobs

and since the jobid URI resolver works hierarchically, you can do this for child jobs, e.g.

$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES  RUNTIME NODELIST
  ƒ2VMtahtjZ grondo   flux        R      4      1   15.05s asp
$ flux proxy ƒ2VMtahtjZ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES  RUNTIME NODELIST
     ƒre6xGf grondo   flux        R      1      1   27.12s asp
     ƒrccxzL grondo   flux        R      1      1   27.13s asp
     ƒrccxzK grondo   flux        R      1      1   27.13s asp
$ flux proxy ƒ2VMtahtjZ/ƒrccxzL flux jobs
       JOBID USER     NAME       ST NTASKS NNODES  RUNTIME NODELIST
    ƒ29yGZLU grondo   sleep      PD      1      -        - -
    ƒ29zkYco grondo   sleep      PD      1      -        - -
    ƒ29zkYcp grondo   sleep      PD      1      -        - -

We can use this groundwork to build a recursive job query tool, but that is a separate issue, so closing this one.
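
As a rough sketch of what such a tool could look like on top of the hierarchical jobid resolver, the function below lists jobs and descends into any job named "flux". Several things here are assumptions: `walk_jobs` is a hypothetical name, the `{name}` output field is assumed to be available to `flux jobs -o`, and treating the job name "flux" as the marker of a child instance is only a heuristic (a batch job may carry a different name).

```shell
#!/bin/bash
# Hypothetical recursive lister. $1 is the jobid path of the instance to
# list (empty for the current instance), $2 is the indentation prefix.
walk_jobs() {
    local path=$1 indent=$2 id name list
    if [ -z "$path" ]; then
        list=$(flux jobs -no '{id.f58} {name}' </dev/null)
    else
        # Paths such as PARENT/CHILD are resolved hierarchically
        list=$(flux proxy "$path" flux jobs -no '{id.f58} {name}' </dev/null)
    fi
    while read -r id name; do
        [ -n "$id" ] || continue
        printf '%s%s %s\n' "$indent" "$id" "$name"
        # Heuristic: a job named "flux" is assumed to be a child instance
        if [ "$name" = flux ]; then
            walk_jobs "${path:+$path/}$id" "$indent  "
        fi
    done <<EOF
$list
EOF
}

# Usage: walk_jobs "" ""
```

Each level costs one `flux proxy` invocation per child instance, so this is only reasonable for small hierarchies.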

@grondo grondo closed this as completed Dec 14, 2021