Refine checks on output of CI #19

Open
kevinrue opened this issue Nov 30, 2021 · 1 comment


  • Download the output of a successful workflow run
  • Compute checksums for key output files that are not expected to change between workflow runs
  • Identify key output files that do change between workflow runs (e.g. files that include timestamps)
    • Design bespoke validity checking rules for those files
  • Apply validity checking rules to subsequent workflow runs for
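The checksum step in the list above could be sketched as follows. This is only a minimal illustration: the helper names `sha256sum` and `find_mismatches`, the choice of SHA-256, and the idea of storing expected digests in a dictionary keyed by relative path are all assumptions, not anything defined by this project.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(output_dir, expected):
    """Return the relative paths that are missing or whose digest differs.

    `expected` is a hypothetical mapping of relative path -> hex digest,
    recorded from a previous successful workflow run.
    """
    mismatches = []
    for relative_path, expected_digest in expected.items():
        path = Path(output_dir) / relative_path
        if not path.is_file() or sha256sum(path) != expected_digest:
            mismatches.append(relative_path)
    return mismatches
```

A CI step could then fail whenever `find_mismatches` returns a non-empty list for the files that are expected to be stable.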

kevinrue commented Nov 30, 2021

A few issues before moving forward with this, if moving forward at all:

  • Several software packages embed their version number or a timestamp in output files. This breaks checksums, or forces us to trim out those parts and run the checksum on the stable part of the file only. Not ideal.
  • Updates to individual software dependencies might alter the data contents of the files (e.g., number of mapped reads) in a way that is not indicative of any failure on the part of the pipeline itself. To accommodate this, one option would be to maintain different checks for different combinations of versions of the pipeline's software dependencies. Whether we consider checksums, the number of lines in a file, or even the names of files produced by software, all of those checks are subject to change beyond our control (aside from the inputs and outputs of tasks explicitly tracked by Ruffus). Not ideal.
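To make the first point concrete, here is a rough sketch of what "running the checksum on the stable part of the file" might look like. The `VOLATILE` pattern and the `stable_sha256` name are hypothetical; a real rule would have to be written per tool and per file format, which is exactly the maintenance burden described above.

```python
import hashlib
import re

# Hypothetical rule: skip any line mentioning "version" or an ISO-style date.
# This is an assumption for illustration; it would also drop genuine data
# lines that happen to match, so each file type needs its own pattern.
VOLATILE = re.compile(rb"version|\d{4}-\d{2}-\d{2}", re.IGNORECASE)


def stable_sha256(path, volatile=VOLATILE):
    """Hash only the lines of a file that do not match the volatile pattern."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for line in handle:
            if not volatile.search(line):
                digest.update(line)
    return digest.hexdigest()
```

Two files that differ only in a version or date line would then hash identically, while a change in the data lines would still be detected.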

I argue that seeing the pipeline `make full` complete without error during the GitHub Actions workflow is already, in itself, confirmation that all the key intermediate files tracked by Ruffus were successfully generated and did not cause any error in downstream steps.
At that point, a reasonable CI strategy could include:

  • checking that the very last files in the pipeline were indeed generated (does Ruffus check at all whether the stated outputs of a task are present at the end of the task execution? or does it only check those at the start of the pipeline, to determine the tasks that need to be run?)
  • ... maybe also checking that those very last files are not empty, or contain a minimum of information, as described by small bits of code equivalent to unit tests.
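As a sketch of what such "small bits of code" might look like, the function below checks that each final file exists and has at least a minimum number of lines. The name `check_final_outputs` and the line-count threshold are assumptions for illustration; which files count as "the very last files" would have to come from the pipeline itself.

```python
from pathlib import Path


def check_final_outputs(paths, min_lines=1):
    """Return a list of human-readable problems; an empty list means all good.

    `paths` is a hypothetical list of the pipeline's final output files.
    """
    problems = []
    for path in map(Path, paths):
        if not path.is_file():
            problems.append(f"{path}: missing")
            continue
        # Count lines without loading the whole file into memory.
        with open(path, "rb") as handle:
            n_lines = sum(1 for _ in handle)
        if n_lines < min_lines:
            problems.append(
                f"{path}: {n_lines} line(s), expected at least {min_lines}"
            )
    return problems
```

A test run after the workflow could simply assert that `check_final_outputs(...)` returns an empty list, without tying the check to any exact file contents.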

After all, CI checks are meant to ensure that software is functional, not that it is overfitted to produce one particular output.
(Yes, unit tests often check that code produces a particular output, but that applies when the software is designed to return 4 when you ask for 2+2. My point is that a pipeline is not designed to produce files with absolutely stable contents, or to map reads in exactly the same way for every version of the mapping software. What matters is that the pipeline still runs to completion with the latest version of its dependencies; the rest is benchmarking and optimisation.)
