
add: incredibly slow #6977

Closed
wdixon opened this issue Nov 12, 2021 · 8 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@wdixon
wdixon commented Nov 12, 2021

Bug Report

add: incredibly slow

Description

I have a dataset that consists of videos (large files) that sit alongside some metadata in JSON files. If the metadata (JSON files) is updated and the dataset directory is re-added, it seems to be rehashing everything again.

dataset

  • A
    a_1.avi
    a_2.avi
    a.json
  • B
    b_1.avi
    b_2.avi
    b.json
    ....

After updating a handful of JSON files, a dvc add of the dataset takes ~1 hour to re-compute the MD5 hashes. That doesn't make sense, given that all the large files are untouched and already in a local DVC cache.

Reproduce

  1. Existing DVC dataset directory with several large files
  2. Update metadata JSON files (alongside the large files)
  3. dvc add dataset

Expected

I would expect it to take just a few minutes; however, it's taking an hour.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.8.2 (pip)
---------------------------------
Platform: Python 3.8.0 on Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.27
Supports:
	webhdfs (fsspec = 2021.10.1),
	http (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2021.10.1, boto3 = 1.17.106)
Cache types: hardlink, symlink
Cache directory: nfs on LEB1MLNAS.company.com:/leb1mlnas_projects
Caches: local
Remotes: s3
Workspace directory: nfs on LEB1MLNAS.company.com:/leb1mlnas_projects
Repo: dvc, git
@skshetry
Member

I see that you are using NFS, which could be the issue. Could you please update to version 2.8.3 and try again?
Also, would you be able to share some profiling information with us?

pip install yappi
dvc add --yappi ...

This will generate a callgrind.dvc-*.out file that'll provide us with profiling information.
This issue may be related to nfs issues similar to #5132 and #5562, and to other general issues like #5746, #5594, and #6227.

@skshetry skshetry added the awaiting response we are waiting for your reply, please respond! :) label Nov 15, 2021
@wdixon
Author

wdixon commented Nov 15, 2021

I updated to 2.8.3 and added a newline to the end of each JSON file in the directory:

for i in */*/*.json; do
  dvc unprotect "$i"
  echo '' >> "$i"
done

dvc add --yappi DIRECTORY

(This updated 32 JSON files.)

The output is attached here. This run took only a few minutes.

callgrind.dvc-20211115_070423.zip

I'm unclear whether this had to do with the DVC update from 2.8.2 to 2.8.3.

This time, it seems to have only recalculated hashes for the updated JSON files, as I would expect.

The previous time (the reason for this submission), the dvc add took ~1 hour, and it seemed to be hashing more than just the updated JSON files. The progress indicator showed roughly the same file count as the complete directory contents, even though the video files had already been added and were already linked to the shared cache. The files had previously been added from the same project/folder structure.

Any explanation for the much faster add? Was it 2.8.3, or should something else be investigated?

@skshetry
Member

I am not really sure what changed between those versions that could explain the performance improvement for dvc add. cc @efiop.

@Mastergalen

Mastergalen commented Nov 15, 2021

For what it's worth, I'm also experiencing very long processing times while running dvc add on a directory.

  • The directory is mounted over nfs
  • Total number of files in directory: 3920
  • Total disk usage: 235.4 MiB
  • It has been over an hour now, and it has only processed 372 files...

I'm a first time user here, so I can't say if this would be faster with an older version.

$ dvc doctor
DVC version: 2.8.3 (conda)
---------------------------------
Platform: Python 3.8.8 on Linux-5.4.0-84-generic-x86_64-with-glibc2.10
Supports:
        webhdfs (fsspec = 2021.11.0),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: nfs4 on nas.local:/mnt/nas
Caches: local
Remotes: None
Workspace directory: nfs4 on nas.local:/mnt/nas
Repo: dvc, git
$ dvc add -v target_dir

...
2021-11-15 16:55:07,247 DEBUG: state save (35452974, 1576199516000000000, 109159) 83f0f7d129afd8426b64afe03b676e832594385c519ef4d2be9b2c3cc659f8dbbb5cd1bb
2021-11-15 16:55:59,000 DEBUG: state save (35455177, 1576199535000000000, 31173) e6d3345363cbf81bd82505419a15e8cdf32df9b3c35e3e43f3bfebacfd4f91f69f0f28d5
2021-11-15 16:55:59,010 DEBUG: state save (35453121, 1576199517000000000, 29493) 055c1b3c556d63ddbc619a2f341470a0a35e436e368260aa514a6f4e09012285917e3eafd
2021-11-15 16:55:59,034 DEBUG: state save (35453035, 1576199516000000000, 35135) e288a1ef2b79b735f968562a855078b19d42cc60f454a95ec73c260f367cb8aa151e8d82
2021-11-15 16:57:00,709 DEBUG: state save (35453248, 1576199518000000000, 87356) e0cce6ff4fb5b08735b9ab4055853213cde3ae88fe077577304b32594385c519ef4d2be9b
2021-11-15 16:57:00,845 DEBUG: state save (35453739, 1576199522000000000, 32705) 3f9118a5226af32df9b3c35e3e43f3bff3e879b03d2faf83746335e436e368260aa514a6
...

@Mastergalen

Mastergalen commented Nov 15, 2021

I copied the directory to the local file system and ran the same dvc add command. This time it only took 1 minute.

Seems like it's something to do with nfs?

EDIT: Looks like the team is already aware of this issue: #5562

@wdixon
Author

wdixon commented Nov 15, 2021

The other thing we have seen: using --to-remote -j xx also speeds up the process quite a bit (though it bypasses copying the files into the cache). It does seem to compute the hashes in parallel; it's not obvious whether there is a way to do that without going directly to the remote store, and I'm not sure why those options are tied together.

The infrastructure we use is on a SAN, so we are stuck with NFS. Sometimes the directory-level add takes minutes, other times hours. It seems (though unverified) that under different conditions the hashes are being computed for more than just what was updated.

@efiop
Contributor

efiop commented Nov 17, 2021

Hey guys. Great to know that 2.8.3 has sped up the operations for you. The reason is that we now save one copy operation during dvc add for every file in the dataset, as well as skipping some checks on the dataset as a whole. There are more optimizations coming in the near future, so please stay tuned 🙂

The other thing we have seen: using --to-remote -j xx also speeds up the process quite a bit (though it bypasses copying the files into the cache). It does seem to compute the hashes in parallel; it's not obvious whether there is a way to do that without going directly to the remote store, and I'm not sure why those options are tied together.

Hashes are always computed in parallel. Using --to-remote makes us skip the local cache and skip linking things back into the workspace (the latter is currently done in one thread).
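To make the "hashes are always computed in parallel" point concrete, here is a minimal sketch of hashing a batch of files with a thread pool. This is illustrative only, not DVC's actual implementation; the `jobs` parameter plays the role of `-j xx`, and the function names are invented for this example.

```python
# Illustrative sketch only -- not DVC's actual implementation.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5sum(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large videos don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_files(paths, jobs=8):
    """Return {path: md5} computed by a pool of worker threads (like -j)."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(md5sum, paths)))
```

Threads work here despite the GIL because the workers spend most of their time in file I/O and in hashlib's C code, both of which release the lock.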

The infrastructure we use is on a SAN, so we are stuck with NFS. Sometimes the directory-level add takes minutes, other times hours. It seems (though unverified) that under different conditions the hashes are being computed for more than just what was updated.

We try to compute hashes once and then save the result along with the mtime, so we don't need to recompute them. We will still compute hashes for new files, though. And if something messes with mtimes, our old cached hashes become invalid, so we have to compute them again.

@wdixon It looks like 2.8.3 has improved the performance for your scenario significantly. Is this issue still relevant?

@wdixon
Author

wdixon commented Nov 17, 2021

We can close, thank you for the work and the response!
