Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not check for directory when looking for cache files #2877

Closed
skshetry opened this issue Dec 1, 2019 · 6 comments
Closed

Do not check for directory when looking for cache files #2877

skshetry opened this issue Dec 1, 2019 · 6 comments
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks refactoring Factoring and re-factoring

Comments

@skshetry
Copy link
Member

skshetry commented Dec 1, 2019

Recent refactoring in #2873 and #2853 implemented isdir(), isfile() and exists(). exists() checks if the given path exists using isdir as well as isfile. isfile simply checks for existence of the blob (in case of s3 and gs), whereas isdir checks if any files with the prefix exists.

.cache_exists() uses .exists(), but does not care about directory paths. So, to reduce number of ._list_paths() remote calls, it's better to refactor .cache_exists() to use isfile.

As @Suor notes here #2873 (comment), isfile might not have been implemented in all remotes, or might be slower in some remotes.

Further infomation: #2873 (comment)

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Dec 1, 2019
@efiop efiop added enhancement Enhances DVC refactoring Factoring and re-factoring labels Dec 1, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 1, 2019
@efiop efiop added performance improvement over resource / time consuming tasks triage Needs to be triaged labels Dec 1, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 1, 2019
@efiop efiop added the p1-important Important, aka current backlog of things to do label Dec 1, 2019
@Suor
Copy link
Contributor

Suor commented Dec 1, 2019

BTW, this is a bit more complicated.

.exists() being more complex and slower is true for s3 and gs, but might be not true for other remotes likes local and ssh. So we should bench things and might even require using different calls for different remotes to get top performance everywhere.

@ghost
Copy link

ghost commented Dec 2, 2019

@skshetry, I was thinking about leveraging S3's CommonPrefixes to check if a file exist. That should handle it in one call, instead of the isfile or isdir check.

@dberenbaum
Copy link
Collaborator

@skshetry It looks like cache_exists() is no longer around, although I'm not sure if the underlying issue is still relevant. Can this be closed?

@dberenbaum
Copy link
Collaborator

Ping @skshetry

@dberenbaum
Copy link
Collaborator

@skshetry I'm lowering priority on this one for now.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Apr 5, 2022
@skshetry
Copy link
Member Author

Closing as stale.

@skshetry skshetry closed this as not planned Won't fix, can't repro, duplicate, stale May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks refactoring Factoring and re-factoring
Projects
None yet
Development

No branches or pull requests

4 participants