
Preserve timestamps during caching #8602

Closed
johnyaku opened this issue Nov 22, 2022 · 14 comments
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove) · awaiting response (we are waiting for your reply, please respond! :)) · feature request (Requesting a new feature)

Comments

@johnyaku

Background

DVC pipelines make decisions about whether to execute stages based on the content (checksum) of the dependencies. This is awesome and it is one of the reasons why we are planning to use DVC for top-level pipeline orchestration.

Unfortunately, DVC pipelines lack features found in other workflow managers, such as parallelization and environment switching. This is both a blessing and a curse -- a blessing because it means that DVC pipelines are simple and easy to learn, but a curse because features such as parallelization are central to our existing workflows.

So we are working on using DVC pipelines to coordinate Snakemake workflows. DVC takes care of data integrity, while Snakemake iterates over samples and orchestrates parallel processing, etc.

So far this is going well, at least at the DVC level. But Snakemake makes its decisions about what to execute based on timestamps.

However, when a file is added to a DVC project via dvc add or dvc repro, both the symlink AND the cached data get a new timestamp corresponding to the time of the DVC operation.

As a result, if we tinker with the content of a stage (a Snakemake workflow), we have to re-run the entire stage (workflow) and not just the new bits, unless we fuss around touching timestamps. This is tedious and error prone, and "rewrites history" by assigning false timestamps.

(Of course, if neither the workflow (stage) nor its dependencies have changed, then the entire workflow (stage) is skipped, which is great.)

We prefer DVC's checksum-based execution decisions, but we would like to make them compatible with the timestamp-based decisions in Snakemake workflows.

Feature request:

Add an option to dvc add and dvc repro to preserve timestamps.

Specifically, when this option is specified, each file or directory added to a DVC project should keep its original timestamp: both the symlink in the workspace and the actual data in the cache should have a timestamp matching that of the original data that was added.

If identical data is added later (identical in content, that is), then the timestamps can be updated to match that of the later file.

In addition, add an option to dvc checkout so that the timestamps of the symlinks created in the workspace match those of the target data in the cache.

Together, these two changes should allow DVC and Snakemake to play nicely together :)

Who knows, it might even make sense to make these the default options ...
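
For illustration, a minimal sketch of the requested behaviour as an external wrapper around dvc add (the function name and example path are mine, not DVC's; restoring the cached copy's timestamp is left out):

```python
import os
import subprocess

def add_preserving_timestamps(path: str) -> None:
    """Snapshot the original mtime, run `dvc add`, then re-apply the
    timestamp to the workspace entry that DVC leaves behind."""
    st = os.stat(path)
    times = (st.st_atime, st.st_mtime)

    subprocess.run(["dvc", "add", path], check=True)

    # If DVC replaced the file with a symlink, follow_symlinks=False
    # updates the link itself rather than the cache entry it points to.
    os.utime(path, times, follow_symlinks=False)
    # Restoring the cached copy's timestamp would additionally require
    # resolving its path under .dvc/cache (omitted in this sketch).

add_preserving_timestamps("data/sample.fastq.gz")
```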

@dlroden

@daavoo daavoo added the A: data-management and feature request labels Nov 22, 2022
@daavoo daavoo added this to DVC Nov 22, 2022
@daavoo daavoo moved this to Backlog in DVC Nov 22, 2022
@dberenbaum
Collaborator

Hi @johnyaku! Thanks for the request.

We discussed this as a team, and it sounds like we could make some improvements around preserving timestamps, but ensuring DVC preserves the same timestamp for any matching content is not a simple fix and not something we are likely to add. I think it could be unexpected to create a new file and see an old timestamp because it matches existing content from the cache. If you want, we can keep this issue open to reduce changing timestamps where possible.

You may already know that we have a longstanding open issue for parallelizing pipelines in #755, and addressing that is the more likely long-term solution here. The same is true for environment handling, although I hope that one is easier to work around by activating the environments as part of your stage commands.

@dberenbaum dberenbaum added the awaiting response label Nov 22, 2022
@johnyaku
Author

Thanks @dberenbaum for the quick reply. This responsiveness has been a key factor in our decision to put faith in DVC.

On reflection, checking out files with old timestamps might not be appropriate as default behaviour, but it would be very helpful to have this option for use with our Snakemake modules.

FWIW, I think checksum-based execution decisions are far superior to timestamp-based decisions, and it is interesting to note that there is an open issue for Snakemake to implement checksums!

On the other hand, Snakemake's parallelisation, sample iteration and environment-switching features are very mature and we have several legacy workflows that we'd like to leverage. Snakemake workflows port flexibly across K8s and multiple different HPC vendors, as well as stand-alone PCs, and we feel that this platform-independence is a key component of scientific reproducibility. Snakemake also provides helpful reports for optimising resource allocations.

So although we will be watching #755 and the evolution of DVC pipelines with interest, I expect it will be some time before DVC can replicate the full functionality of more mature workflow managers. And one of the things I like best about DVC pipelines is their simplicity, so I question whether it is really necessary for DVC to duplicate all of these features, especially if it can "hand off" complex tasks like parallelization to Snakemake or Nextflow, etc.

We have given quite a lot of thought to Snakemake-DVC integration, and are happy to share what we have come up with. But at the moment the key obstacle is timestamp management. The behaviour requested in the feature request might not suit all users in all situations, but it would be great to have as an option, perhaps even something that could be specified in .dvc/config so as to apply to all add, repro, pull and checkout operations for a whole project.

@johnyaku
Author

johnyaku commented Nov 23, 2022

Outside of Snakemake, we've noticed that some other tools also use timestamp-based sanity checks.

For example, .bam files store compressed genomic data, and are often accompanied by a .bai index file to enable random access. The index is created after the compression is complete, and so downstream tools use timestamps as a sanity check, since an index that is older than its target is likely to be out of date.

Because dvc add etc. don't preserve timestamps, we sometimes have to deal with a LOT of warnings/errors. We can deal with this via strategic touching or unprotecting, but it would be so much cleaner if timestamps could simply be preserved.
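
As a concrete example of that "strategic touching", a rough sketch (directory and file names are illustrative; files may need dvc unprotect first if the cache uses read-only links) that makes each index newer than its .bam after a checkout:

```python
import os
from pathlib import Path

def fix_index_timestamps(workspace: str) -> None:
    """After `dvc checkout`, make every .bai index newer than its .bam so
    downstream tools stop warning that the index looks stale."""
    for bam in Path(workspace).rglob("*.bam"):
        bai = Path(f"{bam}.bai")
        if bai.exists() and bai.stat().st_mtime <= bam.stat().st_mtime:
            newer = bam.stat().st_mtime + 1  # one second newer is enough
            os.utime(bai, (newer, newer))    # may need `dvc unprotect` first

fix_index_timestamps("results/alignments")
```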

Perhaps timestamps (creation and/or modification times) could be captured in .dvc files as part of dvc add and dvc repro? Then users could have the option to apply this metadata to the symlinks in the workspace and/or the data in the cache.

Or it might be more straightforward to note existing timestamps and then apply them as soon as the links/cached files are created (if requested via an additional option or specified via config).

@efiop
Contributor

efiop commented Nov 24, 2022

@johnyaku Saving timestamps as metadata in dvcfiles is indeed reasonable and would be a generally useful thing to have. Due to some other limitations, right now this can only be implemented for standalone files but not for files inside of dvc-tracked directories (the legacy .dir object format doesn't support that, and we have newer mechanisms that are not yet enabled by default).

Regarding dvc setting the mtime back, this can be done, but it is more involved and conflicts with symlinks and hardlinks, since they share the same inode with the cache, and a cache entry can be linked in multiple places with different desired timestamps (though this should be doable with copies and maybe reflinks). There are also limitations like different mtime resolution on different filesystems (e.g. APFS is notorious for having a 1 sec resolution). Overall, with many caveats, this can be done (somewhat related to how we handle isexec), but it requires working towards a specific scenario (e.g. snakemake, which we are not using). I'm not sure, though, that all the caveats will make it worthwhile to be accepted in upstream dvc, especially with us having our own pipeline management.

@johnyaku
Author

johnyaku commented Nov 24, 2022

Thanks @efiop! I'm very grateful for the consideration you're giving this.

Re file system timestamp resolution: I think that if the timestamp gap is small (on the order of a second or two) then the computational cost of re-running that processing will also be small, so I think we can live with it. Also, to clarify our intended use case: in situations where we use a Snakemake workflow to implement a DVC stage, I expect that the entire workflow will run to completion before the outs specified in the DVC stage are cached. That is, DVC caching (and any associated timestamp hacking) will happen after Snakemake has finished running its workflow, and so Snakemake will no longer be relying on these timestamps. As a result, timestamp manipulations during caching are unlikely to affect Snakemake during that particular execution. Hopefully it will be possible to snapshot the timestamp of each file and directory prior (*) to caching, and then apply this timestamp to both the link and the cache after (*) caching.

If this were possible, the advantage for us would only be apparent on subsequent reproductions (**). Specifically, if we modify one of the rules in the Snakemake workflow then the workflow will need to run again, since it is itself one of the deps of the parent DVC stage. However, most of the rules within the workflow will probably still be the same, and their intermediate files may not need to be regenerated. If timestamps can be preserved, Snakemake will be able to decide intelligently what needs to be re-run, but currently timestamp re-writing forces the entire workflow to be re-executed, which can sometimes take a couple of days even on high performance hardware with extensive parallelisation.

You make a good point about potential conflicts arising from inconsistent timestamps amongst multiple links to the same cached file. This makes me think that perhaps timestamp preservation should be an "all or nothing" option, specified in the config rather than via options to add, repro, etc. At the risk of introducing further complications, timestamp preservation may need to be extended to remotes as well, in order to ensure consistency between instances on different platforms.

(*) Timing may be critical here in at least two aspects: 1) handover between DVC stages, 2) initiation of subsequent DVC stages that may themselves also be Snakemake workflows, and which include deps generated by an earlier stage. I think everything should be OK so long as the initiation of both 1) and 2) takes place after timestamp restoration, rather than immediately after the earlier stage finishes executing its cmd.

(**) "Subsequent reproductions" includes reproductions in other instances (clones) of the DVC project. A colleague may wish to checkout a project with the express purpose of tweaking one of the workflow-stages, perhaps something as simple as tweaking the formatting of the summary report for that workflow-stage. Ideally they should be able to reproduce the pipeline -- including re-running the tweaked stage but in such a way that only the bare minimum is actually re-executed (regenerating the report in this example).

There is vigorous debate within our group as to whether we should use Snakemake to coordinate multiple workflow modules (while asking Snakemake to dvc add the results as we go) or whether we should use DVC to coordinate multiple workflows (including, occasionally, Nextflow etc). I am strongly advocating for the latter, because I believe that checksum decisions are superior to timestamp decisions, and because dvc.lock ties everything together so beautifully, but timestamp rewriting is proving challenging. I appreciate that DVC has its own ambitions to become a fully mature pipeline manager, but I would like to draw your attention to the fact that most mature workflow managers include "handover" features for integration with other workflow managers. In order to fit into this ecosystem DVC may need to preserve timestamps, or at least offer an option to do so.

@efiop
Contributor

efiop commented Nov 24, 2022

@johnyaku Btw, have you tried --no-commit? Maybe it could be a local workaround for you. It will not touch files until you tell it to with dvc commit.
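
For anyone landing here later, that workaround might be scripted roughly like this (the target paths are illustrative):

```python
import subprocess

# Hash and start tracking the outputs without touching the files themselves;
# the originals (and their timestamps) stay in place in the workspace.
subprocess.run(["dvc", "add", "--no-commit", "results/"], check=True)

# ... let Snakemake keep making its timestamp-based decisions ...

# Once nothing depends on the timestamps any more, move the data into the cache.
subprocess.run(["dvc", "commit", "results.dvc"], check=True)
```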

@johnyaku
Author

Thanks @efiop. I hadn't appreciated --no-commit. It looks like it will do what we need within a particular instance (clone) of a project, which is the most common scenario.

There might also be a way to hack --meta (for add) and meta: (in dvc.yaml for repro) to capture timestamp information as we go.

@zmbc

zmbc commented Oct 28, 2024

I am curious whether anyone has considered integrating DVC and Snakemake more deeply. I strongly agree that Snakemake is much stronger for workflow orchestration. The extent of the integration would be using DVC to determine whether to run a specific pipeline step, while still using Snakemake to run it.

I think this integration would solve major problems with both tools: snakemake/snakemake#2977, #755.

If this sounds feasible and of interest, I might play around with a proof-of-concept.

@shcheklein
Member

@zmbc sounds good to explore. Please share your thoughts as you go, we can help you review ideas. It would be great to detail the proposal a bit more - e.g. some sample project with both tools so that we can "feel" the benefits / try ideas / see limitations / agree on the design. WDYT?

@johnyaku
Author

We've been doing this for a year or two now and it works well for our use cases.

Rather than build huge monolithic Snakemake workflows that do everything, we have been creating smaller workflows with simpler rule graphs. We call these "workflow modules", and we include them in our super project as git submodules.

We then combine these workflow modules into a "super pipeline" which is controlled by DVC. Each workflow module is a "stage" in dvc.yaml but only runs if there have been changes to the dependencies. We like the fact that these "changes" are detected based on content (checksums) rather than timestamps. This greatly enhances portability.

Once a stage has been triggered, the fine details of iterating over samples etc are handled by Snakemake. Here decisions about which files need to be regenerated are made by Snakemake, and so are based on timestamps (*).

For full reproducibility, we let DVC destroy the outs for any stage that needs to be run (the default behaviour). During development, we often tag outs with preserve: True so that only missing files are generated. This option can also be helpful when adding more samples but otherwise leaving the workflow unchanged.

When using preserve, timestamps become important, which is problematic when working with multiple instances (clones) of the repo across multiple platforms. However, rather than mess around with timestamps we now use snakemake --touch to mark all existing files as "keepers" when checking out a fresh clone of the repo. If a stage (= workflow module) then fails for any reason during a re-run (whether triggered by new data or changes to the code for the module) the files in the workspace are not added to the cache, and so timestamps are preserved until successful completion of the stage, after which we no longer care.
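
Roughly, that fresh-clone routine looks like this, sketched as a script (stage layout, paths and options are illustrative; in practice snakemake --touch is run per workflow module):

```python
import subprocess

# Fresh clone: materialise the cached outputs into the workspace.
subprocess.run(["dvc", "pull"], check=True)

# Inside a workflow module: mark every existing output as up to date so a
# later re-run only regenerates what actually changed, regardless of the
# timestamps DVC wrote during checkout.
subprocess.run(["snakemake", "--touch", "--cores", "1"],
               cwd="modules/align", check=True)

# Re-run the super pipeline; only stages whose deps changed will execute,
# and within a triggered stage Snakemake re-runs only the stale rules.
subprocess.run(["dvc", "repro"], check=True)
```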

From this perspective, this issue can be closed.

We also frequently "freeze" stages with big or numerous dependencies to avoid waiting for hours only to be told that nothing has changed, although recent versions seem to be better at avoiding futile hashing, particularly with imports. Again, full reproducibility requires running everything from the top without frozen or preserve.

An ongoing problem is duplication of dependency specifications between workflow modules, and between modules and DVC. Here the difficulty is ensuring alignment.

One neat feature of Snakemake that DVC might consider emulating is the idea of an "input function" whereby inputs (="deps" in DVC parlance) can be determined programmatically. This would enable the inputs/dependencies of a given workflow/stage to be specified in one place, and loaded by each tool as needed.
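
For readers coming from the DVC side, a Snakemake input function is just a Python callable that returns a rule's inputs; a hypothetical shared samples.yaml (names and paths are mine) could then feed both the Snakefile and whatever generates the deps: section of dvc.yaml:

```python
import yaml  # PyYAML, assuming the sample sheet is YAML

# One place that lists this module's samples; the same file could be read by
# a script that generates the deps: section of dvc.yaml for the parent stage.
with open("config/samples.yaml") as fh:
    SAMPLES = yaml.safe_load(fh)["samples"]

def alignment_inputs(wildcards):
    """Snakemake input function: referenced in a Snakefile rule as
    `input: alignment_inputs`, it resolves the rule's inputs at DAG-build
    time (wildcards.run is a hypothetical wildcard)."""
    return [f"data/{sample}/{wildcards.run}.fastq.gz" for sample in SAMPLES]
```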

(*) From v8.0, Snakemake now allows for third-party plugins. I think it is only a matter of time before there is a plugin for content-aware (checksum-based) run triggers.

@zmbc

zmbc commented Oct 29, 2024

@shcheklein Thanks! The main benefits would be:

  • Parallel execution across cores on a single machine, or across a number of distributed backends (e.g. Slurm) that are supported by Snakemake
  • Much more powerful workflow construction features, including checkpoints, service rules, input functions as mentioned by @johnyaku, etc

some sample project with both tools

I agree this could be a good way to demonstrate this, but realistically I would only be able to do this for an actual project I work on. I am thinking this one: https://github.com/ihmeuw/person_linkage_case_study

It would be great to detail the proposal a bit more

The most likely design would be either a Snakemake plugin, or a fork of Snakemake if the plugin interface is not powerful enough, that does the following:

  • Still uses all the Snakemake workflow construction and DAG logic
  • At the point of determining which jobs need to run, instead of using Snakemake "mtime" rerun-trigger, checks whether the DVC-tracked input files to a job are the same as any of the sets of input files for that job in the DVC cache (I'm a bit fuzzy on how DVC identifies that a job is "the same," this might be partially duplicative of Snakemake's other rerun-triggers but that isn't a big problem)
  • Still uses Snakemake to execute the resulting plan of jobs to run
  • Adds resulting files to the DVC cache indexed appropriately by their DVC-tracked input files

Or, since the resulting cache would be Snakemake-specific anyway, we could do it a bit more abstractly:

  • Make "mtime" no longer a default Snakemake rerun trigger, and add a new rerun trigger for DVC hashes of input files and make it on by default
  • Take whatever rerun triggers were requested, and use (a hash of) them as the keys in the DVC cache

I'm not sure whether the second one would require some code modifications on the DVC side to accommodate such general caching.
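
To make the first variant slightly more concrete, a very rough sketch of the kind of cache lookup such a plugin might perform, keyed on content hashes of a job's inputs (none of this reflects an actual DVC or Snakemake API; the cache directory name is invented):

```python
import hashlib
import json
from pathlib import Path

def file_md5(path: str) -> str:
    """Content hash of one input file (DVC also uses md5, though its
    on-disk cache layout is different)."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def job_key(rule_name: str, inputs: list[str]) -> str:
    """Key a job by its rule name plus the hashes of all its inputs."""
    payload = {"rule": rule_name,
               "inputs": {p: file_md5(p) for p in sorted(inputs)}}
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def needs_run(rule_name: str, inputs: list[str],
              cache_dir: str = ".dvc/snakemake-cache") -> bool:
    """The plugin would skip a job whenever a cached result exists for its key."""
    return not (Path(cache_dir) / job_key(rule_name, inputs)).exists()
```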

@zmbc

zmbc commented Nov 4, 2024

@shcheklein Any reactions to the design above? Do you think it would make sense to start prototyping?

@johnyaku
Author

johnyaku commented Nov 6, 2024

Sounds more like a design spec for a Snakemake plugin than a DVC feature request.

Adds resulting files to the DVC cache indexed appropriately by their DVC-tracked input files

This bit needs some thinking through. DVC pipelines define inputs ('deps') and outputs ('outs') for each stage in dvc.yaml. If you want Snakemake to run the pipeline then the plugin would probably have to interact with dvc.yaml and dvc.lock somehow. I don't imagine it would be too hard to translate Snakemake rules into DVC pipeline stages, so long as all rules use script directives rather than run or shell blocks. The trouble with these last two is that it would be hard to capture the software dependencies, even if you could capture the data dependencies. Anyway, if you can solve that then the plugin could run dvc commit at the end of the workflow. This would not be as neat as dvc repro, which updates dvc.lock at the end of each successful stage, but the trade-off would be better parallelization, etc.

As for checkpoints, I hesitate to call them a "feature" of Snakemake as they are hard to debug and prevent you from doing a proper "dry run" all the way to the end. In practice, I find myself refactoring Snakemake workflows containing checkpoints into two or more sub-workflows, so that each sub-workflow becomes a "stage" in the DVC "super pipeline".

@zmbc

zmbc commented Nov 6, 2024

Sounds more like a design spec for a Snakemake plugin than a DVC feature request.

I think it is perfectly in-between, actually: it is probably a project that requires changes to one or both of the existing codebases to get them to play nice with each other. I haven't gotten any traction reaching out on the Snakemake repository, but my goal was to run the idea by contributors on both projects, and see whether either side would be interested in "owning" this if and when I can prove it works.

This bit needs some thinking through. DVC pipelines define inputs ('deps') and outputs ('outs') for each stage in dvc.yaml. If you want to Snakemake to run the pipeline then the plugin would probably have to interact with dvc.yaml

I am currently assuming we would bypass dvc.yaml and interact with the DVC cache directly.
