Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc.api.DVCFileSystem not working with rev=... #8705

Closed
PythonFZ opened this issue Dec 16, 2022 · 9 comments · Fixed by #9037
Closed

dvc.api.DVCFileSystem not working with rev=... #8705

PythonFZ opened this issue Dec 16, 2022 · 9 comments · Fixed by #9037
Assignees
Labels
A: api Related to the dvc.api bug Did we break something? research

Comments

@PythonFZ
Copy link
Contributor

Bug Report

Description

I am unable to use the dvc.api.DVCFileSystem(".", rev="da6..") if I set ref=....

Reproduce

git init
dvc init

echo "something" > deps.txt
dvc stage add -n mystage -d deps.txt -o outs.txt "echo '$(date)' > outs.txt"
dvc repro
git add .
git commit -m "..."

echo " else" >> deps.txt
dvc repro
git add .
git commit -m "..."

then use the following script main.py:

import dvc.api
import sys

rev = sys.argv[1] if len(sys.argv) == 2 else None

fs = dvc.api.DVCFileSystem(".", rev=rev)
for file in fs.find("/", detail=False, dvc_only=True):
    print(file)
    print(fs.read_text(file))

python main.py works fine for me but python main.py <any revision from git log> yields:

/outs.txt
Traceback (most recent call last):
  File "/ssd/fzills/Examples/dvc_fs_bug/main.py", line 11, in <module>
    print(fs.read_text(file))
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/fsspec/spec.py", line 674, in read_text
    with self.open(
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/fsspec/spec.py", line 1094, in open
    self.open(
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/fsspec/spec.py", line 1106, in open
    f = self._open(
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/dvc/fs/dvc.py", line 264, in _open
    return dvc_fs.open(dvc_path, mode=mode)
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 197, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/dvc_data/fs.py", line 70, in open
    fs, fspath = self._get_fs_path(path, **kwargs)
  File "/tikhome/fzills/miniconda3/envs/znlib/lib/python3.10/site-packages/dvc_data/fs.py", line 62, in _get_fs_path
    raise FileNotFoundError
FileNotFoundError

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.10.6 on Linux-5.4.0-135-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.2.2
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda6
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda6
Repo: dvc, git
@rlamy rlamy added bug Did we break something? A: api Related to the dvc.api labels Dec 16, 2022
@karanjakhar
Copy link
Contributor

Hi, @rlamy I would like to work on this issue. I have set up everything and reproduced the error. So far I think it is a file full path-related problem.

@rlamy rlamy moved this to Backlog in DVC Dec 20, 2022
@rlamy rlamy added this to DVC Dec 20, 2022
@rlamy
Copy link
Contributor

rlamy commented Dec 20, 2022

@karanjakhar I'm investigating this as well, but feel free to send a PR, as I probably won't have time to fix this before the holidays.

So the issue happens in dvc_data.fs.DataFileSystem._get_fs_path(). In that function, entry.odb.path has the wrong value /.dvc/cache, which is interpreted as an absolute path (note the leading slash) and causes the cache lookup to fail.

That wrong value comes from Repo.__init__ and specifically the initialisation of the ODBManager. When the rev parameter is used, the repo uses GitFileSystem and there is a confusion between repo.config["cache"]["dir"] being /.dvc/cache (which seems correct for gitfs) and ODBManager using that value verbatim as the path to the cache, even though it uses localfs. I'm not sure where exactly is the error in that process, but, in any case, the following code is enough to demonstrate the issue:

from dvc.repo import Repo
repo = Repo(rev="HEAD")
odb = repo.odb._odb["repo"]
print(odb.fs, odb.path)

Result:

<dvc_objects.fs.local.LocalFileSystem object at 0x102bdf0a0> /.dvc/cache

@karanjakhar
Copy link
Contributor

@rlamy why /.dvc/cache is correct for gitfs?

@karanjakhar
Copy link
Contributor

@rlamy so far I can't find any bug or problem with the process, I think we need to pass the full path in the initialization of ODBManger in case the rev parameter is used.

@karanjakhar
Copy link
Contributor

@rlamy Should FileSystem be gitfs or path should be absolute?

@efiop
Copy link
Contributor

efiop commented Feb 9, 2023

@karanjakhar sorry for the delay, got lost in notifications.

@rlamy why /.dvc/cache is correct for gitfs?

He meant that it is "formally" correct, because the root of the repo in gitfs is /. But obviously that is not correct at the same time because we can't have cache right in the gitfs.

@rlamy so far I can't find any bug or problem with the process, I think we need to pass the full path in the initialization of ODBManger in case the rev parameter is used.

That sounds correct.

@rlamy Should FileSystem be gitfs or path should be absolute?

gitfs. From DvcFileSystem perspective it's root is / and it is completely detached from the local filesystem that it resides on.

@karanjakhar
Copy link
Contributor

@efiop So when "rev" is passed. odb.fs should be gitfs right?

@efiop
Copy link
Contributor

efiop commented Feb 10, 2023

@karanjakhar Ah, looks like I misunderstood you. odb.fs should be localfs, not gitfs and odb.path should be an absolute local path, not gitfs path.

@karanjakhar
Copy link
Contributor

As odb is renamed to cache.
#8987 (comment)

To reproduce this issue now:

from dvc.repo import Repo
repo = Repo(rev="HEAD")
odb = repo.cache._odb["repo"]
print(odb.fs, odb.path)

karanjakhar added a commit to karanjakhar/dvc that referenced this issue Feb 12, 2023
karanjakhar added a commit to karanjakhar/dvc that referenced this issue Feb 13, 2023
karanjakhar added a commit to karanjakhar/dvc that referenced this issue Feb 13, 2023
karanjakhar added a commit to karanjakhar/dvc that referenced this issue Feb 13, 2023
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
@efiop efiop moved this from Backlog to Review In Progress in DVC Feb 16, 2023
@efiop efiop self-assigned this Feb 16, 2023
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
efiop added a commit to efiop/dvc that referenced this issue Feb 16, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes iterative#8705
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in DVC Feb 17, 2023
efiop added a commit that referenced this issue Feb 17, 2023
Setting default values is very untypical for all other applications and the
info config has is not enough to make a decision on the default cache dir,
especially when dealing with gitfs.

Fixes #8705
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: api Related to the dvc.api bug Did we break something? research
Projects
No open projects
Archived in project
5 participants