
DVC SSH cache of folder has a bug #2542

Closed
PeterFogh opened this issue Sep 27, 2019 · 20 comments · Fixed by iterative/dvc.org#839
Labels: bug, p1-important

Comments

PeterFogh (Contributor) commented Sep 27, 2019

Hi, I have just tried out the new SSH cache-of-folder feature. However, I get an error that the cache does not exist for the folder.

First, I reproduce the stage while the output folder does not yet exist, so the cache gets created, as seen in the verbose log below:

DEBUG: Computed stage 'extract_sathub_dataset.dvc' md5: '2eb5b77a4f820a5e53ea0b5ddca6d529'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
Saving 'ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/' to 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir'.
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
DEBUG: Created 'copy': ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa -> ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.log
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual 'None'
DEBUG: Created 'copy': ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a -> ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.nc
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual 'None'
--------[the same log for all of the other files in the folder]-----------

However, when reproducing the stage again, the folder does not exist in the cache, as seen in the verbose log below.

DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual '38b2609acfc2b2f24a998a3f9e290aaa'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual '947fd0c5d28f7e3d9f7d0f349092944a'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual '0aed37f9195162d51ed798b9ac3eb082'
WARNING: corrupted cache file 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c'.
WARNING: Output 'remote://ahsoka_project_data/sathub_datasets/' of 'extract_sathub_dataset.dvc' changed because it is 'not in cache'
WARNING: Stage 'extract_sathub_dataset.dvc' changed.

I suspect that the "WARNING: corrupted cache file" is what creates the bug here, but I do not know why the cache is corrupted.

My DVC version is:

(py37_v7) fogh@ubuntuVM:~/work/seges/the_digital_national_trials/dvc_pipeline$ dvc version
DVC version: 0.60.0
Python version: 3.7.4
Platform: Linux-5.0.0-29-generic-x86_64-with-debian-buster-sid
Binary: False
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/sda1')

But as seen in the Discord messages https://discordapp.com/channels/485586884165107732/563406153334128681/627079206760874005 I have also tried version 0.54.1, but still got the same error.

I know this bug report is difficult to understand, so I welcome any questions on this issue, and I am also open to a call on Discord.

efiop added the p0-critical and bug labels Sep 27, 2019
efiop (Contributor) commented Sep 27, 2019

For the record, we are discussing this on Discord for now: https://discordapp.com/channels/485586884165107732/563406153334128681/627125740072075264

efiop (Contributor) commented Oct 1, 2019

For the record: https://discordapp.com/channels/485586884165107732/563406153334128681/628509561644646400

So what happened here is that files in the output directory were still being written to even after the script finished running, probably due to an incorrectly implemented shutdown procedure, or maybe that is just how dask works and it is dask's issue. So when dvc went on to save them, it sftp.rename'd them, so the workers still had a valid fd pointing to each renamed file that was now in the cache. When a subsequent dvc status was called, it detected that one log file had an incorrect checksum and deleted it as corrupted, which triggered other workers to fail and corrupt the cache for other log files, since they were also still attached to those.

This issue might be relevant not only for ssh, but for local too, as we use shutil.move there, which might rename the file, causing the same type of issue. What we need to do is change the inode when we move files to cache, so that no other process can keep access to our cache files.
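
To illustrate the idea, here is a minimal sketch for the local case (move_to_cache is a hypothetical helper made up for this example, not DVC's actual implementation): copying into a fresh file gives the cache object a new inode, so a process that still holds a descriptor to the source can only keep writing to the orphaned original, never to the cache copy.

import os
import shutil

def move_to_cache(src, cache_path):
    # Hypothetical helper, shown only to illustrate the "new inode" idea.
    tmp = cache_path + ".tmp"
    shutil.copyfile(src, tmp)   # copyfile writes into a newly created file,
                                # so the data lands on a fresh inode
    os.rename(tmp, cache_path)  # atomic publish within the same filesystem
    os.unlink(src)              # lingering writers keep their fd on the old,
                                # now-orphaned inode and cannot touch the cache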

efiop (Contributor) commented Oct 1, 2019

Also, currently we move things to the cache as we go, which might be fragile if the destination doesn't have enough storage space left. What we should really do is copy everything we need to cache and only then delete the sources, so if we break at some point, at least we have the original files intact. That approach would use additional storage space temporarily, but I think it is totally worth it.
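
As a rough illustration of that copy-first idea (a sketch only; save_outputs and the {source_path: cache_path} layout are made up for this example, not DVC's actual code): copy every output into the cache first, roll back on failure, and delete the sources only once all copies have succeeded, so a mid-way failure leaves the original files intact.

import os
import shutil

def save_outputs(outputs):
    # outputs: {source_path: cache_path}
    copied = []
    try:
        # Phase 1: copy everything; the sources stay untouched.
        for src, cache_path in outputs.items():
            shutil.copyfile(src, cache_path)
            copied.append(cache_path)
    except OSError:
        # e.g. the destination ran out of space: drop the partial copies
        # and keep the sources so nothing is lost.
        for cache_path in copied:
            os.unlink(cache_path)
        raise
    # Phase 2: all copies succeeded, now it is safe to remove the sources.
    for src in outputs:
        os.unlink(src)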

PeterFogh (Contributor, Author) commented:

Here is an update on how I got my DVC stage (i.e. Python script) to work.
The problem was that the dask workers still had an open file handle to the log files (created as part of the dask.delayed jobs) after the job was finished. I do not think the file content changed after the job was finished; however, I assume a file that is still open and one that has been closed do not necessarily have the same checksum?

My solution was simply to close the logging file handler manually in the dask.delayed job:

import logging

import dask

@dask.delayed
def task(path_log):  # plus whatever other arguments the task needs
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    fh = get_log_filehandler(path_log)  # my helper that returns a logging.FileHandler
    logger.addHandler(fh)

    # ... the actual code for the task ...

    # Close and detach the file handler explicitly, so no worker keeps an
    # open fd to the log file after the task finishes.
    for handler in logger.handlers[:]:
        handler.close()
        logger.removeHandler(handler)

@efiop, will you decide if the issue can be closed? :)

efiop (Contributor) commented Oct 1, 2019

@PeterFogh Yeah, let's not close it for now. This issue has brought up a very good point, so we will consider a better approach to moving files to cache, one that will not be susceptible to these kinds of issues and a bunch of related ones :) Thank you!

efiop (Contributor) commented Oct 8, 2019

As @MrOutis and @pared noticed, the copying approach would slow dvc down by a lot, so we might want to consider a different approach. Maybe we could come up with another way? Needs research.

efiop added the research and p1-important labels and removed the p0-critical label Oct 8, 2019
ghost commented Oct 8, 2019

We can add an entry to the documentation about the dangers of having two processes working on the same file. It could also be a HINT under the cache corruption warning.

Suor (Contributor) commented Oct 22, 2019

Copying won't help anyway in the general case, as several processes might change an out concurrently, or someone might just get in between. Documenting this and maybe providing a HINT, as @MrOutis suggested, is as much as we can do.

Ultimately we rely on users and their code to behave ;)

efiop (Contributor) commented Oct 22, 2019

OK, so we need to add a hint to the corrupted cache error and to note this in the dvc add/run/etc. command reference on dvc.org.

Suor (Contributor) commented Oct 22, 2019

We probably need a concept of a well-behaved command:

  • only reads from specified deps
  • only writes to specified outs
  • completely rewrites outs (no appending, editing, etc)
  • stops reading/writing on command exit

Explain why this is important and then link to it from the commands.

ghost self-assigned this Oct 25, 2019
ghost commented Oct 26, 2019

@PeterFogh, I'm wondering why the logger affected the output. Could you come up with a minimal example to reproduce this with Dask?

Maybe there's something else that we might be overlooking 🤔

I tried to reproduce it with the following example:

dvc init --no-scm
dvc config cache.type symlink
dvc config cache.protected true
dvc run -o foo 'echo -n f > foo && { sleep 3 && echo oo >> foo; } &'

But since the cache is protected, the second echo couldn't continue writing to it.

Now that I'm thinking about it, it looks like the protected cache isn't implemented for SSH.

rm -rf /tmp/{foo,dvc-cache}
mkdir -p /tmp/dvc-cache

dvc init --no-scm
dvc remote add ssh ssh://localhost/tmp
dvc remote add cache remote://ssh/dvc-cache
dvc remote modify cache type symlink
dvc remote modify cache protected true
dvc config cache.ssh cache
dvc run -o remote://ssh/foo 'echo -n f > /tmp/foo && { sleep 3 && echo oo >> /tmp/foo; } &'

It would be nice to have such a feature, since it helps prevent the user from accidentally corrupting the cache.

What do you think, @PeterFogh ?

ghost commented Oct 26, 2019

@PeterFogh, also, consider using XFS as your remote's file system to enable reflinks; you'll have a better experience with DVC :)

ghost added the awaiting response label Oct 28, 2019
efiop (Contributor) commented Nov 4, 2019

For the record, we've discussed this privately with @MrOutis and protecting won't help at all, as those workers have an open fd that is not affected by subsequent chmod changes. So there is not much we can do on our side, except copying to cache and only then computing the checksum, which would be slow and won't help us with the similar case of dependencies that other workers are writing to.
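
For anyone wondering why protecting doesn't help, here is a minimal, self-contained illustration of the underlying POSIX behaviour (the file names are made up; this is not DVC code): an fd opened before the move and chmod keeps pointing at the same inode and can still write to it.

import os
import stat

# A lingering worker opens the file and keeps the descriptor.
f = open("worker.log", "w")
f.write("first line\n")
f.flush()

# Simulate DVC moving the file into the cache and making it read-only.
os.rename("worker.log", "cache_object")
os.chmod("cache_object", stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# The worker's fd still refers to the same inode, so this write succeeds
# and silently modifies the "protected" cache object.
f.write("late write after the move\n")
f.close()

print(open("cache_object").read())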

efiop removed the awaiting response label Nov 6, 2019
PeterFogh (Contributor, Author) commented:

Hi @efiop and @MrOutis, sorry that I have not answered your messages. I have been working on other projects for the last month, where I have not used DVC.
I think it sounds like a fine solution to give the user a hint when the cache is corrupted. And/or perhaps a link to the documentation where the error and the hint are explained in detail.

Thanks for the advice about XFS, I will have a talk with my team about that 👍

efiop (Contributor) commented Dec 7, 2019

@MrOutis @PeterFogh For the record, XFS won't help here, as we move the file to cache, so it will still have the same inode, to which the workers would still be writing, corrupting it. XFS and reflinks only come into play during the checkout phase, which happens after we have the cache, so they won't be able to help here at all.

jorgeorpinel (Contributor) commented:

What we should really do is copy everything we need to cache and only then delete the sources... copying approach will slow dvc by a lot... So there is not much we can do on our side, except copying to cache and only then computing the checksum...

What about copying as you go BUT only deleting the sources after everything has been copied? Not sure about the computing-checksums part, though. Is there a separate issue to address this, BTW? Just curious.

efiop (Contributor) commented Dec 19, 2019

@jorgeorpinel There is a comment later stating that it would bring too much storage overhead (e.g. you might run out of disk space), so copying is not the best solution.

jorgeorpinel (Contributor) commented:

But if the system runs out of space this way, DVC can still catch the error and delete all the copies. I guess this is also true for copying as you go. Maybe keep a log of the "transaction" so it can be reverted if the file system gives any errors?

efiop (Contributor) commented Dec 22, 2019

@jorgeorpinel But it would also bring copying overhead in terms of the time it takes. Currently, we do a move, which doesn't take any time if we are moving within the same FS, whereas if we start copying, it will always be slow. But even if we were to sacrifice that, your workers might keep writing to that workspace file, and so when you checkout next time, you might get an incomplete file from the cache. So you still need to ensure that your program behaves well.
