Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace needs with inputs #319

Closed
Tracked by #406
evansd opened this issue Nov 12, 2021 · 4 comments
Closed
Tracked by #406

Replace needs with inputs #319

evansd opened this issue Nov 12, 2021 · 4 comments

Comments

@evansd
Copy link
Contributor

evansd commented Nov 12, 2021

We've gestured at this proposal before but I wanted to get it written up properly while we move on to other things so it doesn't get forgotten.


Proposal

The idea is that rather than specifying a needs key with a list of action IDs like so:

generate_report:
  run: cohort-report:v2.1.0 output/input_delivery.csv.gz
    needs: [generate_delivery_cohort]
    outputs:
      ...

We instead specify a list of files under an inputs key:

generate_report:
  run: cohort-report:v2.1.0 output/input_delivery.csv.gz
    inputs:
     - output/input_delivery.csv.gz
    outputs:
      ...

The values can be either a specific filename (e.g. outputs/input.csv) or a glob pattern (e.g. outputs/reports/report_*.csv) using the existing glob syntax.

This has three advantages:

  1. It avoids including unnecessary input files. Under the existing needs system all outputs of the upstream action become inputs to the downstream one. This wastes time transferring files in and wastes disk space for the duration of the action run. As the files can be quite large and as some runtime environments have limited disk space this is a real practical issue.

  2. It is (we hypothesise) conceptually simpler and less error prone. We can describe an action as a thing which takes some input files, runs some code, and produces some output files. The input files are specified by inputs, the code is specified by run, the outputs are specified by outputs. This makes explicit the important thing ("what input files does this action take?") and make implicit the secondary thing ("what other actions does this action depend on?"). The current implementation does the reverse and makes the secondary thing explicit while the primary thing must be inferred.

    Implementing this proposal will also require us to be more strict about requiring disjoint output patterns (see below). Anecdotally, some users have defaulted to being quite liberal with output patterns (e..g output/*.csv) which leads to nasty problems further down the line. Being clear about the intended purpose of output patterns and enforcing this (while providing helpful error messages) should be an overall gain for researchers.

    (Of course, we should talk this through with some researchers first to check our understanding.)

  3. It allows us significantly to simplify the job execution model. A side-effect of the requirement for disjoint output patterns is we no longer need to track explicitly which files in a workspace were produced by which action because there can never be ambiguity as to which action a file with a given name belongs. All we need to track to correctly execute jobs is which actions have been run previously and whether they succeeded or not.

Implementation

In order to guarantee that we can identify which action produces a given file we'll need to enforce that the set of output patterns in a pipeline is pairwise disjoint in the sense that no string can match more than one pattern. We'll also need to enforce that each input pattern intersects with (i.e. has strings in common with) exactly one output pattern.

Determining if two arbitrary regular expressions intersect is a solved problem (convert the regexes to a DFAs, compute the intersection DFA, determine if the resulting language is non-empty) but there don't seem to be any production-ready, high performance libraries for doing this for us and anyway this smells too much like Actual Computer Science for our tastes.

Fortunately, for the very limited glob syntax we support there's an hilariously simple trick: two patterns intersect if and only if one of them matches the other. (Though the margins of this Github issue are too small to contain the proof.)

So once we've determined that these conditions hold we can calculate the inferred dependencies for each action based on which output patterns its input patterns intersect with. From there things proceed basically as they do already with the only difference being that we don't copy all output files from the dependencies, only those which match the supplied patterns.

Migration plan

It's reasonably easy to add a backwards compatibility layer in here. project.yaml files with a version number of N get the new behaviour so we enforce disjointness and infer the dependency graph.

For project.yaml files with a version number less than N we don't enforce the disjointness property, and we build the same dependency graph directly from the specified dependencies. For each dependency we assume the set of files we want is * (i.e. all of them). Then everything proceeds as normal.

The soft side of the migration is the more tricky one. We should probably release this as a feature but not at first update the default version in the template repo. Once we've got some friendly researchers to try it out and ironed out the wrinkles we can update the documentation and make this the default style.

Note also that we'll only be able to take advantage of the simplified execution model (benefit number 3 above) once we've completely moved away from the old style of project and can enforce the disjointness property on all active projects.

@evansd
Copy link
Contributor Author

evansd commented Nov 29, 2021

Addendum: I've added a third benefit above (simplified execution model). I have a nasty suspicion that @bloodearnest suggested exactly this earlier and I patiently explained to him why it wouldn't work. But in fact I'm now convinced that it would work, and would be really great.

@bloodearnest
Copy link
Member

Addendum: I've added a third benefit above (simplified execution model). I have a nasty suspicion that @bloodearnest suggested exactly this earlier and I patiently explained to him why it wouldn't work. But in fact I'm now convinced that it would work, and would be really great.

You were extremely patient. Saint-like, almost 🤣

@iaindillingham
Copy link
Member

Anecdotally, some users have defaulted to being quite liberal with output patterns.

The consequences of liberal output patterns have cropped up several times recently. When investigating #298, I found several files that were released to L4 accidentally; not because the cohort-extractor action was liberal, but because a downstream action was liberal. See #393.

@benbc
Copy link
Contributor

benbc commented Jun 23, 2022

Closing in favour of this option in our pipeline.

@benbc benbc closed this as completed Jun 23, 2022
@benbc benbc closed this as not planned Won't fix, can't repro, duplicate, stale Jul 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants