Preserve timestamps during caching #8602
Hi @johnyaku! Thanks for the request. We discussed as a team, and it sounds like we could make some improvements on preserving timestamps, but ensuring DVC preserves the same timestamp for any matching content is not a simple fix and not something we are likely to add. I think it could be unexpected to create a new file and see an old timestamp because it matches existing content from the cache. If you want, we can keep this issue open to reduce changing timestamps where possible. You may already know that we have a longstanding open issue for parallelizing pipelines in #755, and addressing this is the more likely long-term solution here. The same is true for environment handling, although I hope this one's easier to work around by activating the environments as part of your stage commands.
Thanks @dberenbaum for the quick reply. This responsiveness has been a key factor in our decision to put faith in DVC.

On reflection, checking out files with old timestamps might not be appropriate as default behaviour, but it would be very helpful to have this option for use with our Snakemake modules. FWIW, I think checksum-based execution decisions are far superior to timestamp-based decisions, and it is interesting to note that there is an open issue for Snakemake to implement checksums! On the other hand, Snakemake's parallelisation, sample iteration and environment-switching features are very mature, and we have several legacy workflows that we'd like to leverage. Snakemake workflows port flexibly across K8s and multiple different HPC vendors, as well as stand-alone PCs, and we feel that this platform-independence is a key component of scientific reproducibility. Snakemake also provides helpful reports for optimising resource allocations.

So although we will be watching #755 and the evolution of DVC pipelines with interest, I expect it will be some time before it can replicate the full functionality of more mature workflow managers. And one of the things I like best about DVC pipelines is the simplicity, so I question whether it is really necessary for DVC to duplicate all of these features, especially if it can "hand off" complex tasks like parallelization to Snakemake or Nextflow, etc. We have given quite a lot of thought to Snakemake-DVC integration, and are happy to share what we have come up with. But at the moment the key obstacle is timestamp management. The behaviour requested in the feature request might not suit all users in all situations, but it would be great to have as an option, perhaps even something that could be specified in …
Outside of Snakemake, we've also noticed that some other tools use timestamp-based sanity checks, for example … Perhaps timestamps (creation and/or modification times) could be captured in `.dvc` files. Or it might be more straightforward to note existing timestamps and then apply them as soon as the links/cached files are created (if requested via an additional option or specified via config).
@johnyaku Saving timestamps as metadata in dvcfiles is indeed reasonable and would be a generally useful thing to have. Due to some other limitations, right now this can only be implemented for standalone files but not for files inside of dvc-tracked directories (the legacy …).

Regarding dvc setting the mtime back: this can be done, but it is more involved and conflicts with symlinks and hardlinks, since they share the same inode with the cache, and that inode can be used in multiple places with different desired timestamps (though this should be doable with copies and maybe reflinks). Also there are limitations like different mtime resolution on different filesystems (e.g. APFS is notorious for having a 1 sec resolution). Overall, with many caveats, this can be done (somewhat related to how we handle isexec), but it requires working towards a specific scenario (e.g. snakemake, which we are not using). I'm not sure though that all the caveats will make it worthwhile to be accepted in upstream dvc, especially with us having our own pipeline management.
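To make the "note timestamps, then re-apply them" idea concrete, here is a minimal Python sketch for the copy link type only. The helper name is hypothetical and this deliberately glosses over the hard parts discussed above (hardlinks/symlinks sharing an inode with the cache, directory entries, coarse mtime resolution on some filesystems):

```python
import os
import shutil
import tempfile

def copy_preserving_mtime(src, dst):
    """Copy src to dst, then re-apply src's original timestamps.

    Hypothetical helper, NOT part of DVC: it only illustrates the
    mechanism for plain copies. Hardlink/symlink cache link types
    cannot be handled this way because they share the cache inode.
    """
    st = os.stat(src)
    shutil.copy2(src, dst)  # copy2 already tries to keep metadata...
    os.utime(dst, ns=(st.st_atime_ns, st.st_mtime_ns))  # ...re-apply explicitly
    return dst

# Demo: the copy keeps the source's (deliberately old) mtime.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.txt")
    dst = os.path.join(tmp, "cache_copy")
    with open(src, "w") as f:
        f.write("payload")
    os.utime(src, (1_000_000_000, 1_000_000_000))  # pretend it's from 2001
    copy_preserving_mtime(src, dst)
    assert os.stat(dst).st_mtime == os.stat(src).st_mtime
```

Using whole-second timestamps sidesteps the resolution caveat in the demo; real data with nanosecond mtimes would need a tolerance when comparing across filesystems.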
Thanks @efiop! I'm very grateful for the consideration you're giving this.

Re file system latency, I think that if the timestamp gap is small (on the order of a second or two) then the computational cost of re-running that processing will also be small, so I think we can live with it.

Also, to clarify our intended use case: in situations where we use a Snakemake workflow to implement a DVC stage, I expect that the entire workflow will run to completion before the … (*) If this were possible, the advantage for us would only be apparent on subsequent reproductions (**). Specifically, if we modify one of the rules in the Snakemake workflow then the workflow will need to run again, since it is itself one of the stage's `deps`.

You make a good point about potential conflicts arising from inconsistent timestamps amongst multiple links to the same cached file. This makes me think that perhaps timestamp preservation should be an "all or nothing" option, specified in the config rather than via options to …

(*) Timing may be critical here in at least two aspects: 1) handover between DVC stages, 2) initiation of subsequent DVC stages that may themselves also be Snakemake workflows, and which include …

(**) "Subsequent reproductions" includes …

There is vigorous debate within our group as to whether we should use Snakemake to coordinate multiple workflow modules (while asking Snakemake to …)
@johnyaku Btw, have you tried …?
Thanks @efiop. I hadn't appreciated … There might also be a way to hack …
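For anyone wanting to experiment with such a hack today, one possibility is to re-stamp each workspace symlink after a checkout so its own mtime matches its cache target. This is only a sketch under assumptions: symlink cache link type, GNU coreutils (`touch -h`, `touch -d`, `readlink -f`, `stat -c`), and illustrative paths:

```shell
set -e
workdir=$(mktemp -d) && cd "$workdir"

# Simulate a cache object carrying an old timestamp, plus a fresh
# workspace symlink such as a checkout would create.
mkdir -p .dvc/cache data
echo "payload" > .dvc/cache/abc123
touch -d '2001-09-09' .dvc/cache/abc123
ln -s ../.dvc/cache/abc123 data/raw.txt   # the link itself has a new mtime

# Re-stamp the link from its target (-h stamps the link, not the target).
target=$(readlink -f data/raw.txt)
touch -h -r "$target" data/raw.txt

stat -c '%y' data/raw.txt   # link's own mtime now matches the cache object
```

Note this only aligns link and cache timestamps; it cannot recover the original pre-`dvc add` timestamp once DVC has overwritten it.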
I am curious whether anyone has considered integrating DVC and Snakemake more deeply. I strongly agree that Snakemake is much stronger for workflow orchestration. The extent of the integration would be using DVC to determine whether to run a specific pipeline step, while still using Snakemake to run it. I think this integration would solve major problems with both tools: snakemake/snakemake#2977, #755. If this sounds feasible and of interest, I might play around with a proof-of-concept.
@zmbc sounds good to explore. Please share your thoughts as you go, and we can help you review ideas. It would be great to detail the proposal a bit more - e.g. a sample project with both tools so that we can "feel" the benefits / try ideas / see limitations / agree on the design. WDYT?
We've been doing this for a year or two now and it works well for our use cases.

Rather than build huge monolithic Snakemake workflows that do everything, we have been creating smaller workflows with simpler rule graphs. We call these "workflow modules", and we include them in our super project as git submodules. We then combine these workflow modules into a "super pipeline" which is controlled by DVC. Each workflow module is a "stage" in `dvc.yaml`. Once a stage has been triggered, the fine details of iterating over samples etc. are handled by Snakemake. Here decisions about which files need to be regenerated are made by Snakemake, and so are based on timestamps (*). For full reproducibility, we let DVC destroy the … When using … From this perspective, this issue can be closed.

We also frequently "freeze" stages with big or numerous dependencies to avoid waiting for hours only to be told that nothing has changed, although recent versions seem to be better at avoiding futile hashing, particularly with imports. Again, full reproducibility requires running everything from the top without …

An ongoing problem is duplication of dependency specifications between workflow modules, and between modules and DVC. Here the difficulty is ensuring alignment. One neat feature of Snakemake that DVC might consider emulating is the idea of an "input function", whereby inputs (= "deps" in DVC parlance) can be determined programmatically. This would enable the inputs/dependencies of a given workflow/stage to be specified in one place, and loaded by each tool as needed.

(*) From v8.0, Snakemake allows for third-party plugins. I think it is only a matter of time before there is a plugin for content-aware (checksum-based) run triggers.
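For concreteness, the "super pipeline" pattern described above might look roughly like this. All stage names, paths, and flags are illustrative, not taken from our actual projects:

```yaml
# dvc.yaml -- each stage wraps one Snakemake workflow module
stages:
  align:
    cmd: snakemake -s modules/align/Snakefile --cores 16 --use-conda
    deps:
      - modules/align/Snakefile
      - data/raw
    outs:
      - data/aligned
  call_variants:
    cmd: snakemake -s modules/call/Snakefile --cores 16 --use-conda
    deps:
      - modules/call/Snakefile
      - data/aligned
    outs:
      - data/variants
    frozen: true   # skip this expensive stage until explicitly unfrozen
```

DVC decides whether each `cmd` runs at all (by checksum of the `deps`); once invoked, Snakemake decides which rules inside the module need to re-run (by timestamp).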
@shcheklein Thanks! The main benefits would be: …
I agree this could be a good way to demonstrate this, but realistically I would only be able to do this for an actual project I work on. I am thinking this one: https://github.com/ihmeuw/person_linkage_case_study
The most likely design would be either a Snakemake plugin, or a fork of Snakemake if the plugin interface is not powerful enough, that does the following: …
Or, since the resulting cache would be Snakemake-specific anyway, we could do it a bit more abstractly: …
I'm not sure whether the second one would require some code modifications on the DVC side to accommodate such general caching. |
@shcheklein Any reactions to the design above? Do you think it would make sense to start prototyping? |
Sounds more like a design spec for a Snakemake plugin than a DVC feature request.
This bit needs some thinking through. DVC pipelines define inputs ('deps') and outputs ('outs') for each stage in `dvc.yaml`. As for checkpoints, I hesitate to call them a "feature" of Snakemake as they are hard to debug and prevent you from doing a proper "dry run" all the way to the end. In practice, I find myself refactoring Snakemake workflows containing checkpoints into two or more sub-workflows, so that each sub-workflow becomes a "stage" in the DVC "super pipeline".
I think it is perfectly in-between, actually: it is probably a project that requires changes to one or both of the existing codebases to get them to play nice with each other. I haven't gotten any traction reaching out on the Snakemake repository, but my goal was to run the idea by contributors on both projects, and see whether either side would be interested in "owning" this if and when I can prove it works.
I am currently assuming we would bypass dvc.yaml and interact with the DVC cache directly. |
Background
DVC pipelines make decisions about whether to execute stages based on the content (checksum) of the dependencies. This is awesome and it is one of the reasons why we are planning to use DVC for top-level pipeline orchestration.
Unfortunately, DVC pipelines lack features found in other workflow managers, such as parallelization and environment switching. This is both a blessing and a curse -- a blessing because it means that DVC pipelines are simple and easy to learn, but a curse because features such as parallelization are central to our existing workflows.
So we are working on using DVC pipelines to coordinate Snakemake workflows. DVC takes care of data integrity, while Snakemake iterates over samples and orchestrates parallel processing, etc.
So far this is going well, at least at the DVC level. But Snakemake makes its decisions about what to execute based on timestamps.
However, when a file is added to a DVC project via `dvc add` or `dvc repro`, both the symlink AND the cached data have a new timestamp corresponding to the time of the DVC operation. As a result, if we tinker with the content of a stage (a Snakemake workflow) we have to re-run the entire stage (workflow) and not just the new bits, unless we fuss around `touch`ing timestamps. This is tedious and error prone, and "rewrites history" by assigning false timestamps.

(Of course, if neither the workflow (stage) nor its dependencies have changed, then the entire workflow (stage) is skipped, which is great.)
We prefer the checksum-based execution decisions as in DVC, but we would like to make this compatible with the timestamp-based decisions in Snakemake workflows.
Feature request:
Add an option to `dvc add` and `dvc repro` to preserve timestamps. Specifically, when this option is specified, then for each file or directory added to a DVC project both the symlink in the workspace and the actual data in the cache should have a timestamp matching that of the original data that was added.
If identical data is added later (identical in content, that is), then the timestamps can be updated to match those of the later file.
In addition, add an option to `dvc checkout` so that the timestamps of the symlinks created in the workspace match those of the target data in the cache.

Together, these two changes should allow DVC and Snakemake to play nicely together :)
Who knows, it might even make sense to make these the default options ...
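In the meantime, the first half of the request can be approximated for the workspace copy with a small wrapper that records timestamps before a DVC command and re-applies them afterwards. This is a sketch only: the helper is hypothetical, no such DVC flag exists today, and the object inside `.dvc/cache` would still carry the time of the caching operation:

```python
import os
import subprocess

def run_preserving_mtime(path, cmd):
    """Record path's timestamps, run cmd, then re-apply the timestamps.

    Hypothetical helper: with cmd=["dvc", "add", path] it approximates
    the requested option for the workspace entry only, not the cache.
    """
    before = os.stat(path)
    subprocess.run(cmd, check=True)
    # os.utime follows symlinks by default, so this also covers the
    # copy/reflink cache link types.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return before.st_mtime
```

Usage would be e.g. `run_preserving_mtime("data/raw.txt", ["dvc", "add", "data/raw.txt"])`, after which Snakemake sees the original mtime on `data/raw.txt`.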
@dlroden