Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic task cleanup #3849

Draft
wants to merge 40 commits into
base: master
Choose a base branch
from
Draft

Automatic task cleanup #3849

wants to merge 40 commits into from

Conversation

bentsherman
Copy link
Member

@bentsherman bentsherman commented Apr 10, 2023

Alternative to #3818

Instead of adding a temporary option to output paths, this PR facilitates the automatic cleanup through the cleanup config option. By setting cleanup = 'eager', Nextflow will automatically delete task directories during the workflow run. Caveats are documented in the PR.

TODO:

  • wait for outputs to be published
  • warn about incompatible publish modes
  • resumability
  • refactor based on workflow output DSL
  • remove lazy, eager strategies, make aggressive strategy resumable

@bentsherman
Copy link
Member Author

As I mentioned in the other PR, this eager cleanup currently won't work correctly with file publishing because the publishing is asynchronous. So for each task we need to wait for any files to be published first...

@bentsherman bentsherman changed the base branch from ben-task-graph to master April 28, 2023 18:43
@netlify
Copy link

netlify bot commented Jul 7, 2023

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 9ce10dc
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/65c1015b846889000810a7cc

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 81f7cb7 to 8a43489 Compare August 20, 2023 20:13
@JohnHadish
Copy link

JohnHadish commented Mar 21, 2024

If I understand correctly, your current method of implementing this is to wait for the output of each process to be output then perform cleanup. It may be better to wait until the end of all processes of a single run and then delete. This would probably make it easier to run since you do not need to worry about what files need to be stored. This would still offer a significant amount of space saving.

Exmaple:
I have 10000 files to process. I have a machine where I can have 100 jobs at a time. My nextflow workflow has 5 processes, each which produces files I do not care about, but need for the next step. The run begins and the first 100 files are selected to process. File 1 goes through all 5 processes. File 1 is now done, and all of its temporary files can be deleted. Something interupts the run. File 2 had just completed process 3 during the interuption. The user restarts, File 2 continues where it was, as it has all intermediate files still.

@bentsherman
Copy link
Member Author

The cleanup strategy needs to be reworked due to the upcoming workflow output DSL #4670.

If the publish definition is moved to the workflow level, the task has no idea which of its outputs will be published, and the cleanup observer can't delete a task until it knows that all of the outputs that are going to be published have been published.

A simple solution would be to mark an output for deletion when it is published (and downstream tasks are done with it, etc). The downside is that outputs that are not published are not deleted.

Thinking further, the current POC of the output DSL just appends some "publish" operators to the DAG, so I might be able to trace each process output through the DAG to see if it's connected to a publish op. That way we know if an output can "not" be published and delete it sooner. It still misses files that "could" be published but aren't at runtime, e.g. because they get filtered out by a filter op, but I suspect this is an edge case that can be avoided with good pipeline design.

Finally, we can always fall back to the existing strategy of "delete whatever is left at the end". As long as the eager cleanup can delete enough files early on, it should be enough to move many pipelines from un-runnable to runnable.

@pbieberstein
Copy link

Many thanks for working on this, I know many people will already benefit from this even if resumeability doesn't work but I wanted to share our usecase for why resumeability is important since I haven't seen anyone mention it yet :)

Our pipeline will be the backbone of a platform that will continously accept new samples and archive all the results. The first part of the pipeline is QC, Filtering, genome assembly, etc. And then comes various genome annotation methods.

Since annotation methods tend to evolve over time, we want to be able to quickly re-run all samples we have already analyzed. That's why we want to keep the work directory longterm (but it needs to be as small as possible) to enable skipping the first parts of the pipeline until the annotation steps. So this will require some flexibility in deciding which files should persist the cleanup (we want to keep finished assemblies since these files aren't so large and the updated annotation methods usually start from there + raw reads). Occasionally, an earlier step will also be updated and that's why it'd be great to have a smart resumeability method which can figure out what truly needs to re-run. But I think a user needs to be able to define which files should persist the cleanup because automatic purging would also delete the finished assemblies and then the pipeline has to compute everything from scratch even if only the annotation method changes. That would be the same as simply deleting the entire work directory.

Hoping that this is a good usecase to have in mind while developing the resuming functionality. For now we will try the GEMmaker method.

Thank you!!

@bentsherman
Copy link
Member Author

bentsherman commented Jul 18, 2024

One way to handle that would be to publish files that you don't want to be lost. They will be deleted from the work directory, but when the automatic cleanup recovers a deleted task on resume, it will also verify that the published files are up to date (the file checksums will be stored in the .nextflow cache), so it could use the published files for downstream tasks. But I guess it would need to re-download the published file if it's in a remote location

@pbieberstein
Copy link

Nice, true that would make sense that it can reference published files, is that logic already implemented in the nf-boost plugin?

@bentsherman
Copy link
Member Author

No, nf-boost doesn't do anything with resume

Comment on lines +911 to +915
// -- recursively check for cached outputs
List<HashCode> queue = [ seedHash ]
Map<HashCode,TaskHandler> handlers = [:]

while( !queue.isEmpty() ) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are recursively searching for downstream tasks that are still whole (or the end of the pipeline).

Note that we are skipping the dataflow logic between these processes. Essentially we are assuming that we only need to check the consumer tasks.

However, a user could add a new process between runs that depends on this deleted task. In this case we need to re-execute the deleted task. So this approach of only checking the consumer tasks will not work.

Instead, we can optimistically finalize the deleted task and emit "fake" output files (e.g. DeletedPath) that point to the deleted task. A deleted file can provide its hash and maybe some other metadata, but if downstream code tries to access the file contents, it causes the deleted task to be re-executed.

But how to manage multiple file objects pointing to the same deleted task? They will need to be synchronized so that the deleted task is re-executed only once (this is already done in FilePorter). Then the fake file object needs to be "re-hydrated" with the re-computed file and delegate to it.

I don't think the re-computed task will be able to simply emit its outputs. I think it will need to communicate the re-computed files to the fake files that are already downstream. So the deleted file might need to also know something about where it sits in the task outputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants