Add support for temporary output paths #3818
Signed-off-by: Ben Sherman <[email protected]>
Some additional thoughts:
|
…emote paths) Signed-off-by: Ben Sherman <[email protected]>
This is awesome! 👏🏻 😎 Is there any need to mark outputs as temporary? If we're using Following the |
@bentsherman I have a couple of questions, how would this act on those files that are generated by a process but are not defined as I understand that all those intermediate files which are not used downstream can be safely removed upon the process completion, without affecting the resume functionality. |
@ewels That's a fair point, it would be nice to simply say

Hmm, it seems that while the task outputs are published before the "on task complete" / "on process terminate" events are sent (which the temp file cleaner uses to trigger cleanup), publishing is asynchronous, so we would also need an event for when publishing is complete for a given task / process.

In any case, I think I will get rid of the "empty file" trick and just delete the file, much easier to support directories and object storage that way. I don't think it's strictly needed for resumability.

But I would like to know how important resumability is to the community. Paolo says he really wants it, but I suspect that in production, where the automatic cleanup is most useful, being able to resume is not as important because you aren't fixing bugs, etc.

If the task cache has all the necessary information, and if we can distinguish between a task that was deleted vs a task that was modified, then the resume should be able to skip tasks that were deleted (but not otherwise modified) as long as downstream tasks are also cached. But I think we could go ahead and ship a basic automatic cleanup feature, then try to add resumability in a separate PR. |
Yup, agree on all points 👍🏻 |
@mbosio85 If an output file isn't captured by an |
I figured out how to track the downstream tasks of a temporary output -- when a process "closes" (i.e. all tasks have been created), we can inspect which tasks use a temporary file and be certain that we found all of them. So now each temp file can be deleted much sooner.

On top of that, we can save the list of downstream tasks for each task and use it during the resume. To do this, we have to compute separately the task "inputs" hash and task "outputs" hash. The inputs/script/config of a task must be cached no matter what, but if any outputs are missing, we can traverse the list of downstream tasks and see if they are cached. We can traverse the entire task dependency graph, as long as we end up at leaf nodes that are cached. Basically, we want the

These ideas should all apply to the global cleanup option as well, including resumability. But with the global cleanup we also have to wait for files to be published. So I'm going to explore the resumability in this PR for now, and then I will translate it to the "eager" cleanup PR. The end goal is to have the global cleanup with resumability, then I think we'll be good to go. |
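For readers following along, the resume rule described in this comment can be sketched roughly as below. This is only an illustration in Python, not the PR's Groovy implementation: `TaskRecord`, its fields, and `can_skip` are hypothetical names, and the two booleans stand in for the separate "inputs" and "outputs" hashes.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Hypothetical cache entry; the field names are illustrative only."""
    inputs_cached: bool    # inputs/script/config hash matched the cache
    outputs_present: bool  # every declared output still exists
    downstream: list = field(default_factory=list)  # names of consuming tasks


def can_skip(name, cache, memo=None):
    """Return True if task `name` can be skipped on resume.

    A task whose outputs were deleted is still skippable when every
    downstream task is skippable, i.e. each path through the graph
    ends at a cached task whose outputs are present.
    """
    memo = {} if memo is None else memo
    if name in memo:
        return memo[name]
    task = cache.get(name)
    if task is None or not task.inputs_cached:
        # Inputs/script/config must be cached no matter what.
        result = False
    elif task.outputs_present:
        result = True
    else:
        # Outputs are missing: a leaf must re-run, otherwise defer
        # the decision to the downstream tasks.
        result = bool(task.downstream) and all(
            can_skip(d, cache, memo) for d in task.downstream
        )
    memo[name] = result
    return result


# Toy chain foo -> bar -> baz where only baz kept its outputs.
cache = {
    "foo": TaskRecord(True, False, ["bar"]),
    "bar": TaskRecord(True, False, ["baz"]),
    "baz": TaskRecord(True, True),
}
```

In this toy chain `can_skip("foo", cache)` holds because the traversal ends at the cached leaf `baz`. If `baz` were modified so that its inputs no longer matched the cache, the whole chain would have to re-run.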
cf5f8df to 9760f4f
9760f4f to 9637e34
This feature will be awesome! |
The "eager" cleanup PR now has the same capabilities as this one. In particular, it can eagerly delete individual output files in addition to task directories. This piece was important because output files can often be deleted sooner than task directories. I was going to implement resumability here first and then port it to the other PR, but now we can cut out the middleman 😄 On to resumability... Closing in favor of #3849. |
Closes #452

Adds a temporary option to path outputs. See the docs and e2e test for details.

Notes:
It is currently a coarse-grained approach. A temp file is deleted when all consuming processes of the file's originating process are finished. As a result, temp files may not be deleted as soon as possible. I will investigate more fine-grained approaches, like tracking consumers at the level of channels or tasks, but they will be more complex. Update: temp file lifetimes are now determined by downstream tasks.

Directories and remote paths (e.g. S3) aren't supported yet. I'm working on it! Update: all paths are supported now.

Here is how you can test this feature using the e2e test:
After the first run, we inspect the work directory and find that all of the output files that were declared with temporary: true in the pipeline are now empty. On a resumed run, everything is cached. In this case, a run can be safely resumed as long as all of the baz tasks can be cached. If you modify the baz process or delete/modify any of the c.txt files, resuming the run will produce incorrect output.
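As a rough illustration of the note above that temp file lifetimes are determined by downstream tasks, here is a minimal Python sketch of the bookkeeping involved. It is not the PR's code: `TempFileTracker` and its methods are hypothetical, and `close` assumes the runtime can signal that no further consumers of a file will be created (for example, once the consuming processes have created all of their tasks).

```python
from collections import defaultdict


class TempFileTracker:
    """Toy model: a temp file is deleted once all of its consumers are done."""

    def __init__(self):
        self.consumers = defaultdict(set)  # temp path -> pending consumer tasks
        self.closed = set()                # paths whose consumer list is final
        self.deleted = []                  # paths "deleted" so far, in order

    def add_consumer(self, path, task):
        """Record that `task` takes `path` as an input."""
        self.consumers[path].add(task)

    def close(self, path):
        """Mark that no further consumers of `path` can appear."""
        self.closed.add(path)
        self._maybe_delete(path)

    def task_complete(self, task):
        """Drop `task` from every file it consumes and clean up if possible."""
        for path in list(self.consumers):
            self.consumers[path].discard(task)
            self._maybe_delete(path)

    def _maybe_delete(self, path):
        done = path in self.closed and not self.consumers[path]
        if done and path not in self.deleted:
            # A real implementation would remove the file here.
            self.deleted.append(path)
```

The point of waiting for `close` is that a file with no pending consumers is only safe to delete once the set of consumers is known to be complete; deleting earlier could remove an input that a not-yet-created task still needs.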