Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't require globally unique output patterns #180

Closed
evansd opened this issue Mar 11, 2021 · 6 comments
Closed

Don't require globally unique output patterns #180

evansd opened this issue Mar 11, 2021 · 6 comments

Comments

@evansd
Copy link
Contributor

evansd commented Mar 11, 2021

This causes much confusion.

Excerpt from Slack:

OK Louis I've finally got to the bottom of this. It's not your fault at all; it's a problem with the way project.yaml works which we don't yet have a good solution for.
When a job has finished running, all the files which exist at that point and which match the output patterns get copied out and marked as being outputs from that job.
The standardise_and_plot_demographics action captures as outputs all files matching output/rate.csv
But that ends up including some of it's input files, which also match that pattern.
This means those files get marked as being outputs from the standardise_and_plot_demographics, rather than the calculate_measures_demographics action.
And this means that next time an action runs which needs inputs from calculate_measures_demographics it ends up not finding anything because all the files have be "reassigned".
The short term fix is to try fiddle with filenames and patterns to try to avoid accidentally matching input files.
But this isn't a long term solution as it's incredibly easy to get tripped up by this, and very hard to diagnose what's going on.

Here's the project YAML in question: https://github.com/opensafely/hospital-disruption-research/blob/bd04ed2eca4f1beceaaf73fcd6778492d1e10c70/project.yaml

Here's the job succeeding the first time it's run:
http://jobs.opensafely.org/jobs/6ozte5mnve4pxxk7/

And failing the second:
http://jobs.opensafely.org/jobs/obn7q4gpxt7srdzb/

@sebbacon
Copy link
Contributor

This would be a hilariously devious one to assign to a new starter as their first bug

@evansd
Copy link
Contributor Author

evansd commented Mar 11, 2021

I've realised it's actually relatively straightforward to determine if any of a job's input files end up as part of its outputs and to make that an error, so we should be able to prevent this specific problem in future.

I think we should abandon the uniqueness requirement for output patterns: it doesn't actually solve the problem and it leads to silly workarounds like having output/inp*.csv and outputs/inpu*.csv.

It would be nice to make overwriting the outputs of one action with those of another action an error, but we can't quite do that because when developing a pipeline it's common to rename actions, or split them up and move them around and want to support this kind of iterative workflow both locally and on the server.

But maybe we could make it error to overwrite the outputs of another action with the same commit sha. This would catch problems in test and when doing a full run_all on the server.

@sebbacon
Copy link
Contributor

I think we should abandon the uniqueness requirement for output patterns

yes

maybe we could make it error to overwrite the outputs of another action with the same commit sha. This would catch problems in test and when doing a full run_all on the server

Could work... and it perhaps ties in with @bloodearnest's observations about the persistence of old output files and the need to rejig the base directories of respective output runs.

@wjchulme
Copy link

wjchulme commented Jun 4, 2021

related issue: #155

@sebbacon
Copy link
Contributor

OK, so this task becomes:

  • Remove global uniqueness requirement on output names
  • Replace this with two things:
    • Raise an error if any input files become part of an action's output files
    • Raise an error if an output already exists for the same commit sha, but originating in another action
  • Check the pipline docs still make sense

@iaindillingham iaindillingham changed the title Input files can get accidentally captured as outputs Don't require globally unique output patterns Apr 19, 2022
@benbc
Copy link
Contributor

benbc commented Jun 23, 2022

Closing in favour of this option in our pipeline.

@benbc benbc closed this as completed Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants