Replace `needs` with `inputs` #319

evansd · 2021-11-12T16:06:58Z

We've gestured at this proposal before but I wanted to get it written up properly while we move on to other things so it doesn't get forgotten.

Proposal

The idea is that rather than specifying a needs key with a list of action IDs like so:

generate_report:
  run: cohort-report:v2.1.0 output/input_delivery.csv.gz
  needs: [generate_delivery_cohort]
  outputs:
    ...

We instead specify a list of files under an inputs key:

generate_report:
  run: cohort-report:v2.1.0 output/input_delivery.csv.gz
  inputs:
    - output/input_delivery.csv.gz
  outputs:
    ...

The values can be either a specific filename (e.g. outputs/input.csv) or a glob pattern (e.g. outputs/reports/report_*.csv) using the existing glob syntax.

This has three advantages:

1. It avoids including unnecessary input files. Under the existing needs system all outputs of the upstream action become inputs to the downstream one. This wastes time transferring files in and wastes disk space for the duration of the action run. As the files can be quite large and some runtime environments have limited disk space, this is a real practical issue.

2. It is (we hypothesise) conceptually simpler and less error-prone. We can describe an action as a thing which takes some input files, runs some code, and produces some output files. The input files are specified by inputs, the code is specified by run, and the outputs are specified by outputs. This makes explicit the important thing ("what input files does this action take?") and makes implicit the secondary thing ("what other actions does this action depend on?"). The current implementation does the reverse: it makes the secondary thing explicit while the primary thing must be inferred.

   Implementing this proposal will also require us to be stricter about requiring disjoint output patterns (see below). Anecdotally, some users have defaulted to being quite liberal with output patterns (e.g. output/*.csv), which leads to nasty problems further down the line. Being clear about the intended purpose of output patterns and enforcing this (while providing helpful error messages) should be an overall gain for researchers.

   (Of course, we should talk this through with some researchers first to check our understanding.)

3. It allows us to significantly simplify the job execution model. A side effect of the requirement for disjoint output patterns is that we no longer need to track explicitly which files in a workspace were produced by which action, because there can never be ambiguity as to which action a file with a given name belongs to. All we need to track to correctly execute jobs is which actions have been run previously and whether they succeeded or not.

Implementation

In order to guarantee that we can identify which action produces a given file we'll need to enforce that the set of output patterns in a pipeline is pairwise disjoint in the sense that no string can match more than one pattern. We'll also need to enforce that each input pattern intersects with (i.e. has strings in common with) exactly one output pattern.

Determining if two arbitrary regular expressions intersect is a solved problem (convert the regexes to DFAs, compute the intersection DFA, determine if the resulting language is non-empty) but there don't seem to be any production-ready, high-performance libraries for doing this for us, and anyway this smells too much like Actual Computer Science for our tastes.

Fortunately, for the very limited glob syntax we support there's a hilariously simple trick: two patterns intersect if and only if one of them matches the other. (Though the margins of this GitHub issue are too small to contain the proof.)
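A minimal sketch of the rule as stated, plus the pairwise-disjointness check over output patterns. It assumes Python's `fnmatch.fnmatchcase` as a stand-in for the project's glob matcher (its syntax is broader than the limited one described here), and the function names and the flat `{action_id: [patterns]}` shape are hypothetical.

```python
from fnmatch import fnmatchcase
from itertools import combinations


def patterns_intersect(a, b):
    """Apply the stated rule: two patterns intersect if one matches the other.

    Each pattern is treated in turn as a literal string and tested against
    the other pattern. Assumption: fnmatchcase approximates the real matcher.
    """
    return fnmatchcase(a, b) or fnmatchcase(b, a)


def overlapping_outputs(outputs):
    """Return every pair of output patterns that the rule reports as intersecting.

    `outputs` is a hypothetical mapping of action ID to a list of patterns.
    An empty result means the patterns are pairwise disjoint under the rule.
    """
    all_patterns = [p for patterns in outputs.values() for p in patterns]
    return [
        (a, b)
        for a, b in combinations(all_patterns, 2)
        if patterns_intersect(a, b)
    ]
```

For example, `output/*.csv` and `output/report_*.csv` are reported as overlapping, while `output/a/*.csv` and `output/b/*.csv` are not. One thing worth checking before relying on the rule: under `fnmatch` semantics `a*` and `*b` both match `ab`, yet neither pattern matches the other, so the rule presumably leans on restrictions in the supported syntax that this sketch does not model.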

So once we've determined that these conditions hold we can calculate the inferred dependencies for each action based on which output patterns its input patterns intersect with. From there things proceed basically as they do already with the only difference being that we don't copy all output files from the dependencies, only those which match the supplied patterns.
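The inference step might look roughly like the sketch below. Everything in it is an assumption made for illustration: the flat `{"inputs": [...], "outputs": [...]}` shape for an action, the function names, and the use of `fnmatch.fnmatchcase` in place of the project's own glob matcher.

```python
from fnmatch import fnmatchcase


def patterns_intersect(a, b):
    # The "one pattern matches the other" rule, with fnmatchcase as a stand-in.
    return fnmatchcase(a, b) or fnmatchcase(b, a)


def infer_dependencies(actions):
    """Map each action ID to the sorted IDs of the actions it depends on.

    `actions` is a hypothetical in-memory form of the pipeline:
    {action_id: {"inputs": [patterns], "outputs": [patterns]}}.
    """
    dependencies = {}
    for action_id, spec in actions.items():
        upstream = set()
        for input_pattern in spec.get("inputs", []):
            # Collect every (action, output pattern) this input intersects.
            matches = [
                (other_id, output_pattern)
                for other_id, other in actions.items()
                for output_pattern in other.get("outputs", [])
                if patterns_intersect(input_pattern, output_pattern)
            ]
            # Each input pattern must intersect exactly one output pattern.
            if len(matches) != 1:
                raise ValueError(
                    f"{action_id}: input {input_pattern!r} intersects "
                    f"{len(matches)} output patterns, expected exactly 1"
                )
            upstream.add(matches[0][0])
        dependencies[action_id] = sorted(upstream)
    return dependencies


def files_to_copy(input_patterns, available_files):
    """Select only the upstream files that match the supplied input patterns."""
    return sorted(
        f for f in available_files
        if any(fnmatchcase(f, p) for p in input_patterns)
    )
```

With the example from the proposal, `generate_report` would be inferred to depend on `generate_delivery_cohort`, and only `output/input_delivery.csv.gz` would be copied in, not the upstream action's other outputs.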

Migration plan

It's reasonably easy to add a backwards compatibility layer in here. project.yaml files with a version number of N get the new behaviour so we enforce disjointness and infer the dependency graph.

For project.yaml files with a version number less than N we don't enforce the disjointness property, and we build the same dependency graph directly from the specified dependencies. For each dependency we assume the set of files we want is * (i.e. all of them). Then everything proceeds as normal.
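As a rough illustration of the compatibility layer for old-style projects, the sketch below turns each `needs` entry into a dependency with a `*` file selection. The function name, the dict shape of the parsed project, and the assumption that versions of N or greater count as new-style are all hypothetical; the value of N is not specified here, so it is passed in as a parameter.

```python
def legacy_dependencies(project, new_style_version):
    """Build {action_id: {dependency_id: [file patterns]}} from `needs` keys.

    `project` is a hypothetical parsed project.yaml with "version" and
    "actions" keys. Every dependency gets the pattern "*", meaning all of
    its output files are wanted, which mirrors the existing behaviour.
    """
    version = float(project["version"])
    if version >= new_style_version:
        # Assumption: new-style projects infer dependencies from `inputs`.
        raise ValueError("new-style project: infer dependencies from inputs")
    return {
        action_id: {dep: ["*"] for dep in spec.get("needs", [])}
        for action_id, spec in project["actions"].items()
    }
```

The result has the same shape whichever style produced it, so the rest of the execution path would not need to know which kind of project it is running.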

The soft side of the migration is the trickier one. We should probably release this as a feature but hold off, at first, on updating the default version in the template repo. Once we've got some friendly researchers to try it out and ironed out the wrinkles, we can update the documentation and make this the default style.

Note also that we'll only be able to take advantage of the simplified execution model (benefit number 3 above) once we've completely moved away from the old style of project and can enforce the disjointness property on all active projects.


evansd · 2021-11-29T14:19:59Z

Addendum: I've added a third benefit above (simplified execution model). I have a nasty suspicion that @bloodearnest suggested exactly this earlier and I patiently explained to him why it wouldn't work. But in fact I'm now convinced that it would work, and would be really great.

bloodearnest · 2021-11-29T16:49:03Z

> Addendum: I've added a third benefit above (simplified execution model). I have a nasty suspicion that @bloodearnest suggested exactly this earlier and I patiently explained to him why it wouldn't work. But in fact I'm now convinced that it would work, and would be really great.

You were extremely patient. Saint-like, almost 🤣

iaindillingham · 2022-05-09T17:53:16Z

> Anecdotally, some users have defaulted to being quite liberal with output patterns.

The consequences of liberal output patterns have cropped up several times recently. When investigating #298, I found several files that were released to L4 accidentally; not because the cohort-extractor action was liberal, but because a downstream action was liberal. See #393.

benbc · 2022-06-23T16:05:47Z

Closing in favour of this option in our pipeline.

evansd mentioned this issue Nov 29, 2021

Encode privacy levels via reserved directory names #320

Closed

bloodearnest mentioned this issue Dec 15, 2021

Stop running out of diskspace. #328

Closed

4 tasks

wjchulme mentioned this issue Mar 14, 2022

Running selected actions using globs opensafely-core/job-server#1652

Open

sebbacon mentioned this issue Apr 4, 2022

Review and prioritise output file options listed in this issue #261

Closed

iaindillingham changed the title ~~Proposal: replace needs with inputs~~ Replace needs with inputs Apr 19, 2022

iaindillingham mentioned this issue May 9, 2022

Limit medium_privacy files sizes #298

Closed

sebbacon mentioned this issue May 16, 2022

[meta] Pipeline/dependency issues #406

Closed

5 tasks

benbc closed this as completed Jun 23, 2022

benbc closed this as not planned Jul 1, 2022
