Replace `needs` with `inputs` #319
Addendum: I've added a third benefit above (simplified execution model). I have a nasty suspicion that @bloodearnest suggested exactly this earlier and I patiently explained to him why it wouldn't work. But in fact I'm now convinced that it would work, and would be really great.

You were extremely patient. Saint-like, almost 🤣
The consequences of liberal output patterns have cropped up several times recently. When investigating #298, I found several files that were released to L4 accidentally; not because the cohort-extractor action was liberal, but because a downstream action was liberal. See #393.

Closing in favour of this option in our pipeline.
We've gestured at this proposal before but I wanted to get it written up properly while we move on to other things so it doesn't get forgotten.
Proposal
The idea is that rather than specifying a `needs` key with a list of action IDs, we instead specify a list of files under an `inputs` key. The values can be either a specific filename (e.g. `outputs/input.csv`) or a glob pattern (e.g. `outputs/reports/report_*.csv`) using the existing glob syntax.

This has three advantages:
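As a hedged sketch of the two styles (action names, commands, and file paths here are illustrative, not taken from a real project, and the exact `project.yaml` schema is assumed rather than quoted):

```yaml
# Current style: depend on a whole action via `needs`.
# Every output of generate_cohort becomes an input to analyse.
analyse:
  run: python:latest python analysis/model.py
  needs: [generate_cohort]
  outputs:
    moderately_sensitive:
      model: outputs/model.csv

# Proposed style: depend on specific files via `inputs`.
# Only files matching the listed names/patterns are copied in.
analyse:
  run: python:latest python analysis/model.py
  inputs:
    - outputs/input.csv
    - outputs/reports/report_*.csv
  outputs:
    moderately_sensitive:
      model: outputs/model.csv
```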
1. It avoids including unnecessary input files. Under the existing `needs` system all outputs of the upstream action become inputs to the downstream one. This wastes time transferring files in, and wastes disk space for the duration of the action run. As the files can be quite large, and as some runtime environments have limited disk space, this is a real practical issue.

2. It is (we hypothesise) conceptually simpler and less error-prone. We can describe an action as a thing which takes some input files, runs some code, and produces some output files. The input files are specified by `inputs`, the code is specified by `run`, and the outputs are specified by `outputs`. This makes explicit the important thing ("what input files does this action take?") and makes implicit the secondary thing ("what other actions does this action depend on?"). The current implementation does the reverse: it makes the secondary thing explicit while the primary thing must be inferred.

   Implementing this proposal will also require us to be stricter about requiring disjoint output patterns (see below). Anecdotally, some users have defaulted to being quite liberal with output patterns (e.g. `output/*.csv`), which leads to nasty problems further down the line. Being clear about the intended purpose of output patterns, and enforcing this while providing helpful error messages, should be an overall gain for researchers. (Of course, we should talk this through with some researchers first to check our understanding.)

3. It allows us to significantly simplify the job execution model. A side-effect of the requirement for disjoint output patterns is that we no longer need to track explicitly which files in a workspace were produced by which action, because there can never be ambiguity as to which action a file with a given name belongs to. All we need to track to correctly execute jobs is which actions have been run previously and whether they succeeded.
Implementation
In order to guarantee that we can identify which action produces a given file we'll need to enforce that the set of output patterns in a pipeline is pairwise disjoint in the sense that no string can match more than one pattern. We'll also need to enforce that each input pattern intersects with (i.e. has strings in common with) exactly one output pattern.
Determining whether two arbitrary regular expressions intersect is a solved problem (convert the regexes to DFAs, compute the intersection DFA, and determine whether the resulting language is non-empty), but there don't seem to be any production-ready, high-performance libraries to do this for us, and anyway it smells too much like Actual Computer Science for our tastes.
Fortunately, for the very limited glob syntax we support there's a hilariously simple trick: two patterns intersect if and only if one of them matches the other. (Though the margins of this GitHub issue are too small to contain the proof.)
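A minimal sketch of that trick, using the stdlib `fnmatch` module as a stand-in for the pipeline's real glob matcher (the iff claim holds only for the restricted glob syntax the issue describes, not for arbitrary `fnmatch` patterns):

```python
import fnmatch
from itertools import combinations

def patterns_intersect(a: str, b: str) -> bool:
    """Two patterns (in the restricted glob syntax) intersect iff one
    of them matches the other, treating that other pattern's text as
    though it were a filename."""
    return fnmatch.fnmatch(a, b) or fnmatch.fnmatch(b, a)

def check_outputs_disjoint(output_patterns: list[str]) -> list[tuple[str, str]]:
    """Return every pair of output patterns that overlap.

    For a valid pipeline this list must be empty: no filename may
    match more than one output pattern."""
    return [
        (a, b)
        for a, b in combinations(output_patterns, 2)
        if patterns_intersect(a, b)
    ]
```

For example, `check_outputs_disjoint(["output/*.csv", "output/input.csv"])` reports the overlapping pair, while fully disjoint patterns yield an empty list.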
So once we've determined that these conditions hold, we can calculate the inferred dependencies for each action based on which output patterns its input patterns intersect with. From there things proceed basically as they do already, with the only difference being that we don't copy all output files from the dependencies, only those which match the supplied patterns.
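The inference step might be sketched like this (the dict shape is a simplified stand-in for the parsed `project.yaml`, and `fnmatch` again approximates the real glob matcher):

```python
import fnmatch

def patterns_intersect(a: str, b: str) -> bool:
    # The trick above: in the restricted glob syntax, two patterns
    # intersect iff one matches the other.
    return fnmatch.fnmatch(a, b) or fnmatch.fnmatch(b, a)

def infer_dependencies(actions: dict) -> dict:
    """Map each action id to the set of action ids it depends on.

    `actions` maps action id -> {"inputs": [...], "outputs": [...]}.
    Each input pattern must intersect exactly one action's output
    patterns; otherwise the pipeline is invalid."""
    deps: dict[str, set[str]] = {}
    for name, action in actions.items():
        deps[name] = set()
        for input_pattern in action.get("inputs", []):
            producers = [
                other
                for other, other_action in actions.items()
                if any(
                    patterns_intersect(input_pattern, out)
                    for out in other_action.get("outputs", [])
                )
            ]
            if len(producers) != 1:
                raise ValueError(
                    f"{input_pattern!r} must intersect exactly one "
                    f"action's output patterns, found {len(producers)}"
                )
            deps[name].add(producers[0])
    return deps
```

Because output patterns are pairwise disjoint, the "exactly one producer" check can only fail when an input pattern matches nothing, or straddles two actions' outputs.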
Migration plan
It's reasonably easy to add a backwards-compatibility layer here. `project.yaml` files with a version number of N get the new behaviour, so we enforce disjointness and infer the dependency graph.

For `project.yaml` files with a version number less than N we don't enforce the disjointness property, and we build the same dependency graph directly from the specified dependencies. For each dependency we assume the set of files we want is `*` (i.e. all of them). Then everything proceeds as normal.

The soft side of the migration is the trickier one. We should probably release this as a feature but not, at first, update the default version in the template repo. Once we've got some friendly researchers to try it out and have ironed out the wrinkles, we can update the documentation and make this the default style.
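The compatibility layer could dispatch on the version number roughly like this (the version constant, dict shape, and function names are hypothetical, not the real schema):

```python
SCHEMA_VERSION_N = 4  # hypothetical: first version with `inputs` semantics

def build_dependency_graph(project: dict) -> dict:
    """Return action id -> set of (dependency id, file pattern) pairs."""
    actions = project["actions"]
    if project.get("version", 1) >= SCHEMA_VERSION_N:
        # New behaviour: enforce the disjointness property and infer
        # the graph from input/output pattern intersections.
        raise NotImplementedError("new-style inference goes here")
    # Old behaviour: take `needs` at face value, and assume we want
    # every file ("*") from each dependency.
    return {
        name: {(dep, "*") for dep in action.get("needs", [])}
        for name, action in actions.items()
    }
```

Old-style projects thus behave exactly as today, just expressed in the new "which files from which action" vocabulary.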
Note also that we'll only be able to take advantage of the simplified execution model (benefit number 3 above) once we've completely moved away from the old style of project and can enforce the disjointness property on all active projects.