-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add FluxURIResolver Python class and flux-uri command for job URI discovery #3999
Conversation
BTW, $ flux uri --help
usage: flux-uri [-h] [--remote] [--local] TARGET
Resolve TARGET to a Flux URI
positional arguments:
TARGET A Flux jobid or URI in scheme:argument form (e.g. jobid:f1234)
optional arguments:
-h, --help show this help message and exit
--remote convert a local URI to remote
--local convert a remote URI to local
Supported resolver schemes:
jobid Get URI for a given Flux JOBID
pid Get FLUX_URI for a given local PID
slurm Get URI for a Flux instance launched under Slurm
it can also be used as a trivial way to convert local URIs to remote and vice versa: $ flux uri --remote $FLUX_URI
ssh://asp/tmp/flux-TBU7RM/local-0
$ flux uri --local ssh://asp/tmp/flux-TBU7RM/local-0
local:///tmp/flux-TBU7RM/local-0 |
This pull request introduces 6 alerts when merging c67c60d into 87c800b - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging 0ac1b64 into 87c800b - view on LGTM.com new alerts:
|
This is a nice addition to our tool set! Couple of quick comments: It might be good to add examples of input accepted by the built-in methods to the man page. I would expect this to be commonly used, so could we add the magic Should this parse?
(I expected the In another PR we could drop the jobid resolving code from |
No, |
Sure, though I was thinking that those need a holistic redo since we still reference lesser used commands like |
Would it be better to encourage In any event, in the next PR I'll add a helper function to call out to |
Ooh, great idea! |
Eh, I'd prefer to call your helper :-)
True. I'll open an issue on that. |
The problem with encouraging What about adding a |
Ah, that should have been obvious, sorry. I was thinking of the case of FLUX_URI set in a shell session with many commands.
I'm not sure how exactly that would work, but might be more user friendly to selectively extend commands that can take a URI to allow resolver URIs instead of just Flux job URIs. (I could just be missing the use case for However, what if |
I was just thinking The tmux idea is great! Now I think I've forgotten how |
I've pushed an update here that adds support for common If the final id in a The manpage was also expanded with a description of all 3 included resolver plugins, and examples of each. Finally, a few more tests were added to cover the new code for handing query parameters. |
I actually shouldn't have mentioned tmux, as I was thinking more along the lines of how ssh control sockets work. When you ssh to host, if control sockets are enabled ssh first looks for a socket in a well known location based on the the target host and options (like username, port, etc). Could the |
I'd actually like to clean up the terminology a bit here and could use some help. I've been calling the URI schemes into which resolver URIs are resolved (i.e. It gets a little confusing talking about URIs that resolve to other URIs. I wish we had a better name for the URI argument to |
I guess the distinction is that the former can be passed to Probably not great, but could we call the new one a URN and the original a URL (both URIs)? Another thought, is there really any reason why we couldn't allow the new URIs to be passed to |
Well, there's a I'm feeling a bit ambivalent about all this so I'd be will to try whatever you suggest. |
I think this PR could be independent of adding name resolution to One quick thought: do you think renaming the comand to On terminology, maybe "high level" or "unresolved" Flux URIs get resolved to "native" Flux URIs? |
👍
Yeah, good thought. I had originally called it
Yeah, I like that and it will do for now. |
After discussion it seems like this would be more of a plumbing script, so I'll leave it out of the automated |
Ok, I've pushed some fixup commits that rename |
I am fine both ways too. If you feel that shorter is better here feel free to revert that fixup. |
7be7cba
to
f55d562
Compare
I actually ended up keeping the shorter name for now (easy to change if we decide against it). Even though Another idea would be to add a The intro was written a little hastily and still needs improvement, but at least it is somewhere. |
Oh, I also got annoyed at how awkward it is to reference a manpage from in the reStructuredText. I found a random "extension" to easily add new domain refs to sphinx, and added special |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some notes on the man page. I'll make a pass through the code and then finish my review but wanted to get you these first.
DESCRIPTION | ||
=========== | ||
|
||
Connections to Flux are established via a Uniform Resource Indicator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Indicator/Identifier/
doc/man1/flux-uri.rst
Outdated
Processes running within a Flux instance will have the ``FLUX_URI`` | ||
environment variable set to a native URI which :man3:`flux_open` will | ||
use by default. Therefore, there is usually no need to specify a URI when | ||
connecting to the enclosing instance. However, connecting to a _different_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"different" is using markdown emphasis not rst emphasis.
Might want to add to the first sentence in this paragraph something like
, with fallback to a compiled-in native URI for the a system instance of Flux.
doc/man1/flux-uri.rst
Outdated
|
||
As a convenience, if *TARGET* is specified with no scheme, then the scheme | ||
is assumed to be ``jobid``. This allows ``flux uri`` to be used to look | ||
up the URI for a Flux instance running as a job with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't make the sentence too awkward, maybe use:
This allows
flux uri
to be used to look up the URI for a Flux instance running as a job in the current enclosing instance with:
doc/man1/flux-uri.rst
Outdated
**-remote** | ||
Return the _remote_ (``ssh://``) equivalent of the resolved URI. | ||
|
||
**--local** | ||
Return the _local_ (``local://``) equivalent of the resulved URI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use rst emphasis not markdown for "local" and "remote"
doc/man1/flux-uri.rst
Outdated
This scheme attempts to read the ``FLUX_URI`` value from the process id | ||
*PID* using ``/proc/PID/environ``. If *PID* refers to a ``flux-broker``, | ||
then a child of the broker is used instead in order to obtain the | ||
URI for that broker (instead of its parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of "then a child of the broker is used instead" say "then the scheme reads FLUX_URI from the broker's initial program or another child process since FLUX_URI in the broker's environment would refer to its parent.
I'm not sure if that makes it any clearer. Feel free to ignore.
Missing period at end of sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added "(or may not be set at all in the case of an instance started with flux start --test-size=N
)"
doc/man1/flux-uri.rst
Outdated
.. note:: | ||
With the ``jobid`` resolver, ``?local`` only needs to be placed on | ||
the last component of the jobid "path" or hierarchy. This will resolve | ||
each URI in turn as a local URI, so it is only useful if all jobs | ||
are actually running on the local system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the last job is running on the local system, then all enclosing instances have to be running on the local system, so maybe the "so..." explanation could just be dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds better, though more of what I was going for is that in resolving each job along the way the jobid resolver plugin has to connect to the parent job to resolve its child, and when ?local
is placed on the last jobid all connections will use the local
connector. This may not work if for example the first jobid is on another host (the first flux_open()
will fail).
Also, jobid URIs are resolved to the URI of rank 0. If not all enclosing instances have rank 0 running locally, then your statement, while true, doesn't apply to resolution of URIs.
However, perhaps that is too much detail for the manpage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you, but if you leave it in you might want to include the rank 0 requirement.
If not all enclosing instances have rank 0 running locally
The top enclosing instance rank 0 isn't required to be local, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I just left it out.
The top enclosing instance rank 0 isn't required to be local, correct?
no I meant the first jobid in the path component, it will be forced to a local://
uri if ?local
is used. Probably the only use case for ?local
is developers working with test instances, so it is probably better to just sweep this under the rug for now.
doc/man1/flux-uri.rst
Outdated
|
||
:: | ||
|
||
$ flux uri pid:$(pidof flux-broker | cut -d' ' -f1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use pidof -s
instead of the cut?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, didn't know about pidof -s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few superficial comments - I didn't spot anything in the code to comment on.
Great work to improve our usability here.
@@ -9,6 +9,7 @@ nobase_fluxpy_PYTHON = \ | |||
future.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit message: s/validatory.py/validator.py/
@@ -40,6 +40,8 @@ nobase_fluxpy_PYTHON = \ | |||
hostlist.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit message s/hiearchical/hierarchical/
if a URI is resolved with no scheme, it is assumed to be a Flux jobid
Did you mean "if a URI is parsed with no scheme..."? And is the default scheme an optimization or a convenience?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And is the default scheme an optimization or a convenience?
Yeah, a convenience. I'll just drop the As an optimization,
-- no need for filler phrases in the commit message
@@ -46,9 +46,29 @@ | |||
|
|||
extensions = [ | |||
'sphinx.ext.intersphinx', | |||
'sphinx.ext.napoleon' | |||
'sphinx.ext.napoleon', | |||
'domainrefs' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome!
@@ -44,6 +44,7 @@ nobase_fluxpy_PYTHON = \ | |||
uri/__init__.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit message: s/direclty/directly/
Problem: Utility functions to aid in loading plugins from Python namespace packages are currently in flux/validator.py and therefore no other modules can use them. Move these functions to a new flux/importer.py module and use them from the job validator code.
Problem: Flux does not have a common place to collect the methods for discovery of URIs for Flux jobs. This lack causes an excess of ad-hoc solutions across the Flux user-base and reduces the overall user experience when working with hierarchical instances of Flux. Introduce a new Python FluxURIResolver class which is meant to act as an extensible repository of methods for discovery of URIs from other sources. On initialization, the class discovers a set of URI resolver "plugins" which can translate a simple URI given in the plugin's "scheme" to a job URI. If a URI does not include a scheme, it is assumed to be a Flux jobid (e.g. "f1234" is resolved as "jobid:f1234") As a convenience, a resolver URI may have an optional query component of "remote" or "local" which will force the result into a local or remote job URI. For intance jobid:f1234?local would rewrite the discovered URI for job f1234 as a local:// URI. Query parameters are _also_ passed into resolver plugins, and the use of `local` or `remote` (or other plugin-supported query parameters) may influence the plugin's URI resolution as well.
Problem: There is no way to easily resolve a Flux instance running as a job to its URI. Add a "jobid" resolver plugin for the FluxURIResolver class. This plugin resolves a Flux jobid or hierarchy of Flux jobids to a URI by querying the user.uri job memo. A hierarchy of jobs is specified by use of a forward-slash, e.g. "jobid:f1234/f3456" would resolve the URI for job f3456 running in the job f1234.
Problem: It would useful to fetch the FLUX_URI from a local process, but currently the schemes to do this are ad-hoc. Add a "pid" resolver plugin to the FluxURIResolver class. This plugin attempts to read the target process id's FLUX_URI from /proc/pid/environ, allowing the user to connect to the same Flux instance in which a process is running. As a convenience, if the target PID is a running flux-broker, get FLUX_URI from one of its children, allowing the "pid" resolver scheme to fetch the FLUX_URI for a broker running on the current system. This will be useful when discovering URIs when Flux is running under a foreign resource manager.
Problem: A command line interface for resolving Flux jobids and uri-resolver URIs to job URIs is not currently available. Add flux-uri.py for this purpose. The command can take a jobid or URI in any supported resolver scheme and will return a FLUX_URI for the target instance. For convenience, --local and --remote options are provided to attempt to convert a resolved job URI to its local or remote form.
Problem: It is cumbersome to cross reference manpages with sphinx, when it should be simple. Add the `domainrefs.py` sphinx plugin from https://github.com/mitogen-hq/mitogen/blob/master/docs/domainrefs.py which allows simple addition of multiple domain cross-references to the sphinx conf.py. Now referencing another manpage is as simple as :man1:`flux` for example. Extend PYTHONPATH in sphinx commands so that sphinx can find extension in srcdir.
Problem: No documentation exists for the `flux uri` command. Add a short manpage for flux-uri.
Problem: When running a Flux under Slurm, there is no convenient method to get the URI for instances running as Slurm jobs. Add an experimental "slurm" resolver plugin for the FluxURIResolver class. The slurm plugin works using the following method: * run `scontrol listpids` on the first node of the Slurm job via `srun --overlap --jobid=JOBID` * Try resolving each job PID using the "pid" resolver and return the first URI on success This plugin is best effort and can probably be easily fooled, for example if `flux start` or `flux broker` isn't run directly as a Slurm job.
Problem: No tests exist for the Python FluxURIResolver class and its front-end tool flux-uri. Add a small set of tests for the base URI and FluxJobURI classes to the python tests as t/python/t0025-uri.py, and a larger set of functionality tests in t2802-uri-cmd.t which use the flux-uri(1) command.
Codecov Report
@@ Coverage Diff @@
## master #3999 +/- ##
==========================================
+ Coverage 83.42% 83.47% +0.05%
==========================================
Files 364 371 +7
Lines 53418 53636 +218
==========================================
+ Hits 44565 44774 +209
- Misses 8853 8862 +9
|
Seeing some unexpected errors in the centos8 builder, only in
Which concerns me that we're executing the wrong version of Flux with (suspiciously, centos7 seems to have timed out, but got a similar error from the t3000-mpi-basic.t test) |
Ok, rerunning the CI fixed the errors noted above, still a little strange, but I'll set MWP now. |
This PR introduces a
FluxURIResolver
Python class which can resolve target URIs to Flux job URIs. Support for resolving URIs in different contexts is provided by resolver "plugins" which are loaded based on the target URI scheme, though if no scheme is provided, then a scheme ofjobid
is assumed.A new
flux-uri(1)
utility is provided to access this Python class from the commandline.The PR includes three resolver plugins:
jobid
: resolve job URIs for a Flux jobid in the current instance. This resolver works recursively if a "path" of jobids is provided, e.g.jobid:f1234/f3456
will attempt to resolve the URI for jobidf3456
running as child job off1234
. Sincejobid
is the default URI resolver scheme,f1234/f3456
would also work. e.g.pid:
resolve a URI for a local process id. This plugin attempts to read theFLUX_URI
value from/proc/PID/environ
. As a convenience, if the process appears to be aflux-broker
, then theFLUX_URI
of one of its children is used. This allows determination of the URI for a running flux-broker by providing its PID (useful in implementation of other resolvers)slurm:
an experimental resolver for Flux instances run as Slurm jobs. Theslurm
plugin works by usingsrun
to runscontrol listpids
on the first node of a Slurm job. For each PID that is also a direct child of slurmstepd (i.e. localid == 0) thepid
resolver is used to resolve the FLUX_URI locally.URI resolver plugins are loaded from the
flux.uri.resolvers
namespace, so it is trivial to add third party or override existing resolvers on systems where needed. E.g. I assume someone could provide anlsf
plugin at some point.The next step is to extend
flux proxy
to call out toflux uri
if its argument is not already alocal://
orssh://
URI. I actually have this working, but figured that would be better for a separate PR, release notes wise.Here's a preview: