add: incredibly slow #6977
Comments
Could you try profiling this with yappi? Install it with pip install yappi, then run dvc add --yappi ... This will generate a callgrind profile that you can attach here.
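As a concrete sketch of that profiling workflow (the dataset path here is a placeholder):

pip install yappi
dvc add --yappi path/to/dataset
# the run writes a callgrind.dvc-<timestamp> profile in the working
# directory, which can be opened with a callgrind viewer such as kcachegrind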
I updated to 2.8.3 and added a newline to the end of each json file in a directory:

for i in */*/*.json
do
dvc unprotect $i
echo '' >> $i
done

Then I ran dvc add --yappi DIRECTORY (32 json files were updated); the output is attached here. This took only a few minutes. callgrind.dvc-20211115_070423.zip

Unclear if this had to do with the dvc update from 2.8.2 to 2.8.3? This time, it seems to have only recalculated hashes for the updated json files, as I would expect. The previous time (the reason for this submission), the dvc add took ~1 hour, and it seemed to be hashing more than just the updated json files: the progress indicator showed about the same count as the complete directory contents, even though the video files had already been added and their links already pointed at a shared cache. The files had previously been added from the same project/folder structure. Any explanation for the much faster add? Was it 2.8.3, or should something else be investigated?
I am not really sure what changed between those versions that could explain a performance improvement for dvc add. cc @efiop.
For what it's worth, I'm also experiencing very long processing times while running dvc add. I'm a first-time user here, so I can't say whether this would be faster with an older version.
I copied the directory to the local file system and ran the same command. Seems like it's something to do with NFS? EDIT: Looks like the team is already aware of this issue: #5562
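A sketch of that comparison (mount points and paths are hypothetical):

cd /nfs/mount/myrepo
time dvc add dataset        # slow when the workspace sits on NFS

cp -r /nfs/mount/myrepo /local/scratch/myrepo
cd /local/scratch/myrepo
time dvc add dataset        # same command against a local-disk copy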
One other thing we have seen: using --to-remote -j xx also speeds things up quite a bit. Although it bypasses putting the files into the local cache, it does seem to compute the hashes in parallel (it isn't obvious whether there is a way to do that without going directly to the remote store; not sure why those options are tied together?). The infrastructure we use is on a SAN, so we are stuck with NFS. Sometimes the directory-level add takes minutes, other times hours? It seems (though unverified) that under different conditions the hashing covers more than just what was updated.
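For reference, a sketch of that invocation (the job count is a placeholder, and a default remote is assumed to be configured):

dvc add --to-remote -j 8 dataset
# transfers the data straight to the remote instead of the local cache,
# hashing and uploading with 8 parallel jobs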
Hey guys. Great to know that 2.8.3 has sped up the operations for you. The reason is that we started saving one copy operation during dvc add.
Hashes are always computed in parallel. Using --to-remote skips saving files into the local cache, which is likely why it appears faster.
We try to compute hashes once and then save the result along with the file's mtime, so we don't need to recompute them again. We will compute hashes for new files, though. And if something messes with mtimes, our old cached hashes become invalid and we have to compute them again. @wdixon Looks like 2.8.3 has improved the performance for your scenario significantly, or is this issue still relevant?
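A sketch of that caching behavior (file names follow the dataset layout in the report below):

dvc add dataset         # hashes computed, cached together with each file's mtime
dvc add dataset         # mtimes unchanged, so the cached hashes are reused
touch dataset/a.json    # content is the same, but the mtime is bumped
dvc add dataset         # a.json is re-hashed because its recorded mtime no longer matches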
We can close, thank you for the work and the response!
Bug Report
add: incredibly slow
Description
I have a dataset that consists of videos (large files) that sit alongside some metadata in json files. If the metadata (json files) is updated and the dataset directory is re-added, it seems that everything is rehashed again:
dataset
    a_1.avi
    a_2.avi
    a.json
    b_1.avi
    b_2.avi
    b.json
    ...
After updating a handful of json files, dvc add of the dataset takes ~1 hour to re-compute the md5 hashes. This doesn't make sense, given that all the large files are untouched and already in a local dvc cache.
Reproduce
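A minimal sketch of the scenario, assuming the dataset layout above is already tracked by dvc:

dvc unprotect dataset/a.json   # make the linked file writable before editing
echo '' >> dataset/a.json      # update metadata only; the large .avi files are untouched
dvc add dataset                # observed: re-hashes roughly the whole directory (~1 hour)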
Expected
I would expect it to take just a few minutes; however, it's taking an hour.
Environment information
Output of dvc doctor: