
DVC SSH cache of folder has a bug #2542

Closed
PeterFogh opened this issue Sep 27, 2019 · 20 comments · Fixed by iterative/dvc.org#839
Labels: bug, p1-important

Comments

PeterFogh (Contributor) commented Sep 27, 2019

Hi, I have just tried out the new SSH cache-of-folder feature. However, I get an error that the cache does not exist for the folder.

First, I reproduce the stage while the output folder does not yet exist, so the cache gets created, as seen in the verbose log below:

DEBUG: Computed stage 'extract_sathub_dataset.dvc' md5: '2eb5b77a4f820a5e53ea0b5ddca6d529'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
Saving 'ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/' to 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir'.
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
DEBUG: Created 'copy': ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa -> ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.log
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual 'None'
DEBUG: Created 'copy': ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a -> ssh://[email protected]/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.nc
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual 'None'
--------[the same log for all of the other files in the folder]-----------

However, when reproducing the stage again, the folder does not exist in the cache, as seen in the verbose log below.

DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual '38b2609acfc2b2f24a998a3f9e290aaa'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual '947fd0c5d28f7e3d9f7d0f349092944a'
DEBUG: cache 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual '0aed37f9195162d51ed798b9ac3eb082'
WARNING: corrupted cache file 'ssh://[email protected]/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c'.
WARNING: Output 'remote://ahsoka_project_data/sathub_datasets/' of 'extract_sathub_dataset.dvc' changed because it is 'not in cache'
WARNING: Stage 'extract_sathub_dataset.dvc' changed.

I suspect that the "WARNING: corrupted cache file" is what creates the bug here, but I do not know why the cache is corrupted.

My DVC version is:

(py37_v7) fogh@ubuntuVM:~/work/seges/the_digital_national_trials/dvc_pipeline$ dvc version
DVC version: 0.60.0
Python version: 3.7.4
Platform: Linux-5.0.0-29-generic-x86_64-with-debian-buster-sid
Binary: False
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/sda1')

But as seen in the Discord messages https://discordapp.com/channels/485586884165107732/563406153334128681/627079206760874005 I have also tried version 0.54.1, but still got the same error.

I know this bug report is difficult to understand, so I welcome any questions on this issue, and I am also open to a call on Discord.

efiop added the p0-critical and bug labels Sep 27, 2019
efiop (Contributor) commented Sep 27, 2019

For the record, we are discussing this on Discord for now: https://discordapp.com/channels/485586884165107732/563406153334128681/627125740072075264

efiop (Contributor) commented Oct 1, 2019

For the record: https://discordapp.com/channels/485586884165107732/563406153334128681/628509561644646400

So what happened here is that files in the output directory were still being written to even after the script finished running, probably due to an incorrectly implemented shutdown procedure, or maybe that is just how dask works and it is dask's issue. So when dvc went on to save them, it sftp.rename'd them, so the workers still had a valid fd pointing to each renamed file that was now in the cache. When a subsequent dvc status was called, it detected that one log file had an incorrect checksum and deleted it as corrupted, which triggered other workers to fail and corrupt the cache for other log files, since they were also still attached to those.

This issue might be relevant not only for ssh, but for local too, as we use shutil.move there, which might rename the file, causing the same type of issue. What we need to do is change the inode when we move files to cache, so that no other process can keep access to our cache files.
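
To illustrate the idea, here is a minimal sketch for the local case (move_to_cache is a hypothetical helper made up for this example, not DVC's actual implementation): copying into a fresh file gives the cache object a new inode, so a process that still holds a descriptor to the source can only keep writing to the orphaned original, never to the cache copy.

import os
import shutil

def move_to_cache(src, cache_path):
    # Hypothetical helper, shown only to illustrate the "new inode" idea.
    tmp = cache_path + ".tmp"
    shutil.copyfile(src, tmp)   # copyfile writes into a newly created file,
                                # so the data lands on a fresh inode
    os.rename(tmp, cache_path)  # atomic publish within the same filesystem
    os.unlink(src)              # lingering writers keep their fd on the old,
                                # now-orphaned inode and cannot touch the cache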

efiop (Contributor) commented Oct 1, 2019

Also, currently we move things to the cache as we go, which might be fragile if the destination doesn't have enough storage space left. What we should really do is copy everything we need to cache and only then delete the sources, so if we break at some point, at least we have the original files intact. That approach would use additional storage space temporarily, but I think it is totally worth it.
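
As a rough illustration of that copy-first idea (a sketch only; save_outputs and the {source_path: cache_path} layout are made up for this example, not DVC's actual code): copy every output into the cache first, roll back on failure, and delete the sources only once all copies have succeeded, so a mid-way failure leaves the original files intact.

import os
import shutil

def save_outputs(outputs):
    # outputs: {source_path: cache_path}
    copied = []
    try:
        # Phase 1: copy everything; the sources stay untouched.
        for src, cache_path in outputs.items():
            shutil.copyfile(src, cache_path)
            copied.append(cache_path)
    except OSError:
        # e.g. the destination ran out of space: drop the partial copies
        # and keep the sources so nothing is lost.
        for cache_path in copied:
            os.unlink(cache_path)
        raise
    # Phase 2: all copies succeeded, now it is safe to remove the sources.
    for src in outputs:
        os.unlink(src)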

PeterFogh (Contributor, Author) commented:

Here is an update on how I got my DVC stage (i.e. Python script) to work.
The problem was that the dask workers still had an open file handle to the log files (created as part of the dask.delayed jobs) after the job was finished. I do not think the file content changed after the job was finished; however, I assume a file that is still open and one that has been closed do not necessarily have the same checksum?

My solution was simply to close the logging file handler manually in the dask.delayed job:

import logging

import dask

@dask.delayed
def task(path_log):  # plus whatever other arguments the task needs
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    fh = get_log_filehandler(path_log)  # my helper that returns a logging.FileHandler
    logger.addHandler(fh)

    # ... the actual code for the task ...

    # Close and detach the file handler explicitly, so no worker keeps an
    # open fd to the log file after the task finishes.
    for handler in logger.handlers[:]:
        handler.close()
        logger.removeHandler(handler)

@efiop, will you decide if the issue can be closed? :)

efiop (Contributor) commented Oct 1, 2019

@PeterFogh Yeah, let's not close it for now. This issue has brought up a very good point, so we will consider a better approach to moving files to cache, one that will not be susceptible to these kinds of issues and a bunch of related ones :) Thank you!

efiop (Contributor) commented Oct 8, 2019

As @MrOutis and @pared noticed, the copying approach would slow dvc down by a lot, so we might want to consider a different approach. Maybe we could come up with another way? Needs research.

efiop added the research and p1-important labels and removed the p0-critical label Oct 8, 2019
ghost commented Oct 8, 2019

We can add an entry to the documentation about the dangers of having two processes working on the same file. It could also be a HINT under the cache corruption warning.

Suor (Contributor) commented Oct 22, 2019

Copying won't help anyway in the general case, as several processes might change an out concurrently, or someone might just get in between. Documenting this and maybe providing a HINT, as @MrOutis suggested, is as much as we can do.

Ultimately we rely on users and their code to behave ;)

efiop (Contributor) commented Oct 22, 2019

OK, so we need to add a hint to the corrupted cache error and to note this in the dvc add/run/etc. command reference on dvc.org.

Suor (Contributor) commented Oct 22, 2019

We probably need a concept of a well-behaved command:

  • only reads from specified deps
  • only writes to specified outs
  • completely rewrites outs (no appending, editing, etc)
  • stops reading/writing on command exit

Explain why this is important and then link to it from the commands.

ghost self-assigned this Oct 25, 2019
ghost commented Oct 26, 2019

@PeterFogh, I'm wondering why the logger affected the output. Could you come up with a minimal example to reproduce this with Dask?

Maybe there's something else that we might be overlooking 🤔

I tried to reproduce it with the following example:

dvc init --no-scm
dvc config cache.type symlink
dvc config cache.protected true
dvc run -o foo 'echo -n f > foo && { sleep 3 && echo oo >> foo; } &'

But since the cache is protected, the second echo couldn't continue writing to it.

Now that I'm thinking about it, it looks like the protected cache isn't implemented for SSH.

rm -rf /tmp/{foo,dvc-cache}
mkdir -p /tmp/dvc-cache

dvc init --no-scm
dvc remote add ssh ssh://localhost/tmp
dvc remote add cache remote://ssh/dvc-cache
dvc remote modify cache type symlink
dvc remote modify cache protected true
dvc config cache.ssh cache
dvc run -o remote://ssh/foo 'echo -n f > /tmp/foo && { sleep 3 && echo oo >> /tmp/foo; } &'

It would be nice to have such a feature, since it helps prevent the user from accidentally corrupting the cache.

What do you think, @PeterFogh ?

ghost commented Oct 26, 2019

@PeterFogh, also, consider using XFS as your remote's file system to enable reflinks; you'll have a better experience with DVC :)

ghost added the awaiting response label Oct 28, 2019
efiop (Contributor) commented Nov 4, 2019

For the record, we've discussed this privately with @MrOutis and protecting won't help at all, as those workers have an open fd that is not affected by subsequent chmod changes. So there is not much we can do on our side, except copying to cache and only then computing the checksum, which would be slow and won't help us with the similar case of dependencies that other workers are writing to.
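
For anyone wondering why protecting doesn't help, here is a minimal, self-contained illustration of the underlying POSIX behaviour (the file names are made up; this is not DVC code): an fd opened before the move and chmod keeps pointing at the same inode and can still write to it.

import os
import stat

# A lingering worker opens the file and keeps the descriptor.
f = open("worker.log", "w")
f.write("first line\n")
f.flush()

# Simulate DVC moving the file into the cache and making it read-only.
os.rename("worker.log", "cache_object")
os.chmod("cache_object", stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# The worker's fd still refers to the same inode, so this write succeeds
# and silently modifies the "protected" cache object.
f.write("late write after the move\n")
f.close()

print(open("cache_object").read())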

efiop removed the awaiting response label Nov 6, 2019
PeterFogh (Contributor, Author) commented:

Hi @efiop and @MrOutis, sorry that I have not answered your messages. I have been working on other projects for the last month, where I have not used DVC.
I think it sounds like a fine solution to give the user a hint when the cache is corrupted. And/or perhaps a link to the documentation where the error and the hint are explained in detail.

Thanks for the advice about XFS, I will have a talk with my team about that 👍

efiop (Contributor) commented Dec 7, 2019

@MrOutis @PeterFogh For the record, XFS won't help here, as we move the file to cache, so it will still have the same inode, to which the workers would still be writing, corrupting it. XFS and reflinks only come into play during the checkout phase, which happens after we have the cache, so they won't be able to help here at all.

jorgeorpinel (Contributor) commented:

What we should really do is copy everything we need to cache and only then delete the sources... copying approach will slow dvc by a lot... So there is not much we can do on our side, except copying to cache and only then computing the checksum...

What about copying as you go BUT only deleting the sources after everything has been copied? Not sure about the computing-checksums part, though. Is there a separate issue to address this, BTW? Just curious.

efiop (Contributor) commented Dec 19, 2019

@jorgeorpinel There is a comment later stating that it would bring too much storage overhead (e.g. you might run out of disk space), so copying is not the best solution.

jorgeorpinel (Contributor) commented:

But if the system runs out of space this way, DVC can still catch the error and delete all the copies. I guess this is also true for copying as you go. Maybe keep a log of the "transaction" so it can be reverted if the file system gives any errors?

efiop (Contributor) commented Dec 22, 2019

@jorgeorpinel But it would also bring copying overhead in terms of the time it takes. Currently, we do a move, which doesn't take any time if we are moving within the same FS, whereas if we start copying, it will always be slow. But even if we were to sacrifice that, your workers might keep writing to that workspace file, and so when you checkout next time, you might get an incomplete file from the cache. So you still need to ensure that your program behaves well.
