DVC SSH cache of folder has a bug #2542
For the record, discussing on Discord for now: https://discordapp.com/channels/485586884165107732/563406153334128681/627125740072075264
For the record, fresh discussion: https://discordapp.com/channels/485586884165107732/563406153334128681/628242212769103872
For the record: https://discordapp.com/channels/485586884165107732/563406153334128681/628509561644646400 So what happened here is that files in the output directory were still being written to even after the script finished running, probably due to an incorrectly implemented shutdown procedure, or maybe that is just how dask works and it is dask's issue. So when dvc went on to save them, it cached files that were still changing, which is what corrupted the cache. This issue might be relevant not only for the SSH cache.
Also, currently we …
Here is an update on how I got my DVC stage (i.e. Python script) to work. My solution was just to manually close the logging file handle inside the dask.delayed job.
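A minimal sketch of what that cleanup might look like — the task and logger names here are hypothetical (the original comment is truncated), assuming a stage where each dask.delayed task writes its own log file into the DVC output directory:

```python
import logging

import dask


@dask.delayed
def work_and_log(log_path):
    # Hypothetical worker task that logs into the DVC output directory.
    logger = logging.getLogger("worker")
    handler = logging.FileHandler(log_path)
    logger.addHandler(handler)
    try:
        logger.warning("doing the actual work here")
    finally:
        # Flush and close the handle before the task returns, so no dask
        # worker keeps an open fd into the output directory when dvc
        # starts saving the files to the cache.
        handler.flush()
        handler.close()
        logger.removeHandler(handler)
```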
@efiop, will you decide if the issue can be closed? :)
@PeterFogh Yeah, let's not close it for now. This issue has brought up a very good point, so we will consider a better approach to moving files to the cache, one that will not be susceptible to these kinds of issues and a bunch of related ones :) Thank you!
We can add an entry to the documentation about the dangers of running two processes working on the same file. It could also be a hint shown with the error.
Copying won't help in the general case anyway, as several processes might change the output concurrently, or someone might just get in between. Documenting this and maybe providing a HINT, as @MrOutis suggested, is as much as we can do. Ultimately we rely on users and their code to behave ;)
Ok, so we need to add a hint to the corrupted-cache error and note this in the dvc add/run/etc. command reference on dvc.org.
We probably need a concept of a well-behaved command: explain why it is important and then link to it from the command references (see the sketch below).
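As an illustration only — nothing DVC prescribes today — a well-behaved stage script would join its background workers and close every handle into the output directory before exiting:

```python
import multiprocessing
import os


def write_part(path):
    # The context manager guarantees the fd is closed when the worker exits.
    with open(path, "w") as f:
        f.write("data\n")


if __name__ == "__main__":
    os.makedirs("out", exist_ok=True)
    workers = [
        multiprocessing.Process(target=write_part, args=(f"out/part-{i}",))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        # Do not exit while background writers are still running, so dvc
        # never caches a file that is still being written to.
        w.join()
```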
@PeterFogh, I'm wondering why the logger affected the output. Could you come up with a minimal example to reproduce this with Dask? Maybe there's something else that we might be overlooking 🤔 I tried to reproduce it with the following example:

```sh
dvc init --no-scm
dvc config cache.type symlink
dvc config cache.protected true
dvc run -o foo 'echo -n f > foo && { sleep 3 && echo oo >> foo; } &'
```

But since the cache is protected, the second echo couldn't continue writing. Now that I'm thinking about it, it looks like the protected cache isn't implemented for SSH:

```sh
rm -rf /tmp/{foo,dvc-cache}
mkdir -p /tmp/dvc-cache
dvc init --no-scm
dvc remote add ssh ssh://localhost/tmp
dvc remote add cache remote://ssh/dvc-cache
dvc remote modify cache type symlink
dvc remote modify cache protected true
dvc config cache.ssh cache
dvc run -o remote://ssh/foo 'echo -n f > /tmp/foo && { sleep 3 && echo oo >> /tmp/foo; } &'
```

It would be nice to have such a feature, since it would help prevent the user from accidentally corrupting the cache. What do you think, @PeterFogh?
@PeterFogh, also, consider using XFS as your remote's file system to enable reflinks.
For the record, we've discussed this privately with @MrOutis, and protecting won't help at all, as those workers have an open fd that is not affected by subsequent chmod changes. So there is not much we can do on our side, except copying to the cache and only then computing the checksum, which would be slow and won't help with the similar case of dependencies that other workers are writing to.
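To see why chmod-based protection can't stop a worker that already has the file open — a generic sketch, not DVC code, assuming a non-root user and a hypothetical path:

```python
import os
import stat

# A "worker" opens the file and holds on to the descriptor.
f = open("/tmp/demo.txt", "w")

# The file is now made read-only, as a protected cache would do.
os.chmod("/tmp/demo.txt", stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444

# The chmod does not revoke the already-open descriptor: this write
# succeeds, so a lingering worker can still corrupt the file.
f.write("still writable through the old fd\n")
f.close()

# Only a fresh open is rejected:
try:
    open("/tmp/demo.txt", "a")
except PermissionError:
    print("new writers are blocked; existing fds are not")
```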
Hi @efiop and @MrOutis, sorry that I have not answered your messages. I have worked on other projects for the last month, where I have not used DVC. Thanks for the advice about XFS, I will have a talk with my team about that 👍
@MrOutis @PeterFogh For the record, XFS won't help there: we move the file to the cache, so it keeps the same inode, to which the workers would still be writing, corrupting it. XFS and reflinks only come into play during the checkout phase, which happens after we have the cache, so they can't help here at all.
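And why moving to the cache doesn't sever the worker's handle either — a sketch with hypothetical paths, relying on the fact that a same-filesystem rename keeps the inode:

```python
import os

# A worker opens the output file and keeps the descriptor around.
f = open("/tmp/out.txt", "w")

# "Saving to cache" as a same-filesystem move: the inode is unchanged.
os.rename("/tmp/out.txt", "/tmp/cache-entry")
assert os.fstat(f.fileno()).st_ino == os.stat("/tmp/cache-entry").st_ino

# The worker's late write lands in the renamed file, i.e. directly in
# the cache entry, after its checksum has already been computed.
f.write("late write\n")
f.close()
```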
What about copying as you go BUT only deleting the sources after everything has been copied? Not sure about the checksum-computing part, though. Is there a separate issue to address this, BTW? Just curious.
@jorgeorpinel There is a comment later stating that it would bring too much storage overhead (e.g. you might run out of disk space), so copying is not the best solution.
But if the system runs out of space this way, DVC can still catch the error and delete all the copies. I guess this is also true for copying as you go. Maybe keep a log of the "transaction" so it can be reverted if the file system gives any errors?
@jorgeorpinel But it would also bring copying overhead in terms of the time it takes. Currently, we do a move instead of a copy.
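For illustration only, a rough sketch of the copy-then-checksum-then-delete idea discussed above — not DVC's actual implementation; the helper name and chunk size are made up:

```python
import hashlib
import os
import shutil


def save_to_cache_by_copy(src, cache_dir):
    """Copy first, checksum the copy, only then delete the source.

    The cached data always matches its checksum even if `src` is still
    being written to, but the copy is slow and temporarily doubles the
    disk usage, which is the overhead objection raised above.
    """
    tmp = os.path.join(cache_dir, ".tmp-copy")
    shutil.copy2(src, tmp)  # full data copy; slow for large outputs

    md5 = hashlib.md5()
    with open(tmp, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)

    final = os.path.join(cache_dir, md5.hexdigest())
    os.rename(tmp, final)  # atomic within the cache file system
    os.remove(src)  # drop the source only after the cache copy is safe
    return final
```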
Hi, I have just tried out the new SSH cache-of-folder feature. However, I get an error that the cache does not exist for the folder.
First, I reproduce the stage when the output folder does not exist. Thus, the cache is created, as seen in the verbose log below:
However, when reproducing the stages again, the folder does not exist in the cache, as seen in the verbose log below.
I suspect that the "WARNING: corrupted cache file" message points to the bug here, but why the cache is corrupted, I do not know.
My dvc version is
But as seen in the Discord messages (https://discordapp.com/channels/485586884165107732/563406153334128681/627079206760874005), I have also tried version 0.54.1, but still get the same error.
I know this bug report is difficult to understand, so I welcome any questions on this issue, and I am also open to a call on Discord.