dvc data status reports imported directories as "not in remote" #9346

Closed
johnyaku opened this issue Apr 19, 2023 · 11 comments
Labels
A: status Related to the dvc diff/list/status awaiting response we are waiting for your reply, please respond! :) bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments

@johnyaku

Bug Report

Description

dvc data status reports imported directories as "not in remote".

Technically, this is correct, as the data is in the remote for the source repo, not the current repo. But this is a bit confusing.

Reproduce

dvc import <source-repo> <directory>
dvc data status
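The two commands above can also be scripted so the flagged paths are easy to inspect. This is a minimal sketch, not part of the original report. It assumes the `--not-in-remote` and `--json` flags of `dvc data status` found in recent DVC releases (their behaviour on 2.53 is not verified here), and it assumes the JSON output is an object with an optional `not_in_remote` list. The function names are hypothetical.

```python
import json
import subprocess


def not_in_remote_paths(status_json: str) -> list[str]:
    """Return the paths listed under "not_in_remote" in the JSON status.

    Assumption: the JSON is an object with an optional "not_in_remote" list.
    """
    status = json.loads(status_json)
    return sorted(status.get("not_in_remote", []))


def check_import(source_repo: str, path: str) -> list[str]:
    """Import `path` from `source_repo`, then list what DVC flags as not in remote."""
    subprocess.run(["dvc", "import", source_repo, path], check=True)
    result = subprocess.run(
        ["dvc", "data", "status", "--not-in-remote", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return not_in_remote_paths(result.stdout)
```

Calling `check_import("<source-repo>", "<directory>")` inside a DVC repository would run both steps. If the imported directory shows up in the returned list, the behaviour described in this report is present.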

Expected

Either no message about the data not being in the remote,
or, perhaps more helpfully, dvc data status could look up the remote for the source repo, check whether the data is there, and report a problem only if it is not found.

Environment information

dvc 2.53

@daavoo daavoo added A: status Related to the dvc diff/list/status bug Did we break something? labels Apr 26, 2023
@daavoo
Contributor

daavoo commented Apr 26, 2023

@efiop looks like imports are not being considered at all in the DataIndex and thus in dvc data status.

Is there a plan to include info about an entry being part of an import stage in the data index or should this "filter" happen at the UI level (i.e. post-process repo.data_status() output to account for import stages)?

@daavoo daavoo added the p1-important Important, aka current backlog of things to do label Apr 26, 2023
@daavoo
Contributor

daavoo commented Apr 26, 2023

@dberenbaum I am putting p1 here as it creates quite some noise in repos using dvc import which is an important scenario + we are pushing for dvc data status, but feel free to re-assign a different priority.

@efiop
Contributor

efiop commented Apr 27, 2023

imports are recorded in the index, but as a source for the outputs, which they really are.

The issue here is really that in data status we check against a particular remote, which, as expected, doesn't have imports. We should check against all corresponding remotes instead (e.g. if there are per-output remotes), which for imports might also mean that we should skip them.
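To make the idea concrete, here is a rough sketch of checking each output against its own remote and skipping imports. It is an illustration only and not DVC's actual implementation: the `Output` class, its fields, and the `remote_contents` mapping are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Output:
    """Hypothetical stand-in for a tracked output; not a DVC class."""

    path: str
    hash_value: str
    remote: Optional[str] = None  # per-output remote, if one is configured
    is_import: bool = False  # True when the output comes from `dvc import`


def not_in_remote(outputs, remote_contents, default_remote):
    """Return paths whose data is missing from the remote they belong to.

    `remote_contents` maps a remote name to the set of hashes stored there.
    Imports are skipped because their data lives in the source repo's remote.
    """
    missing = []
    for out in outputs:
        if out.is_import:
            continue
        remote = out.remote or default_remote
        if out.hash_value not in remote_contents.get(remote, set()):
            missing.append(out.path)
    return missing
```

With this shape, an imported output never appears in the result, and an output with a per-output remote is only reported when that specific remote lacks its hash.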

@dberenbaum dberenbaum added this to DVC May 18, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC May 18, 2023
@daavoo daavoo added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels May 23, 2023
@johnyaku
Author

johnyaku commented Jun 5, 2023

We've updated to v2.58.2 and we are no longer seeing "not in remote".

Instead, the status of imports is "deleted", which again feels misleading.

@dlroden

@dberenbaum
Collaborator

@johnyaku when does it show as "deleted"? When they are missing locally, or even when they exist locally?

@johnyaku
Author

johnyaku commented Jun 6, 2023

The only common denominator in the spurious "deleted" reports is that all of these files have been imported from a dvc data registry.

The files reported as "deleted" all exist in the local workspace, as links to a shared external cache.
The checksums in the link paths match the checksums in the .dvc files.
The files all exist on the remote for the source registry, but not in the remote of the destination dataset (as we would expect).

So this looks to me like another twist on the "not in remote" message, which has been fixed by no longer using this as the default message, but I suspect that the basic problem is the same. Namely, dvc data status does not seem to distinguish "imports" from "indigenous" data.
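The link check described above (the hash in the link path versus the hash in the .dvc file) could be scripted roughly as follows. This is a sketch under assumptions: it assumes the workspace file is a symlink into a cache that stores each object under a two-character prefix directory, and it reads `md5:` values with a plain regex rather than a YAML parser. The function names are hypothetical.

```python
import os
import re
from pathlib import Path


def hash_from_cache_link(link: Path) -> str:
    """Rebuild the hash from a symlink target such as <cache>/.../ab/cdef...

    Assumption: the hash is the prefix directory name plus the file name.
    """
    target = Path(os.readlink(link))
    return target.parent.name + target.name


def md5s_in_dvc_file(dvc_file: Path) -> set[str]:
    """Collect every `md5:` value found in a .dvc file."""
    pattern = r"^\s*-?\s*md5:\s*(\S+)\s*$"
    return set(re.findall(pattern, dvc_file.read_text(), flags=re.M))


def link_matches_dvc_file(link: Path, dvc_file: Path) -> bool:
    """True when the hash encoded in the link target is recorded in the .dvc file."""
    return hash_from_cache_link(link) in md5s_in_dvc_file(dvc_file)
```

This only covers single-file outputs. For a directory output the .dvc file records a hash ending in `.dir`, so the per-file hashes would need to be looked up elsewhere.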

@efiop efiop self-assigned this Jun 12, 2023
@efiop
Contributor

efiop commented Jun 23, 2023

@johnyaku Are you still able to reproduce the issues? These days dvc data status checks against relevant cache/remote. I can't reproduce your problem so far though. If you could come up with a reproducible script - that would help.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jun 23, 2023
@efiop
Contributor

efiop commented Jun 23, 2023

@johnyaku Or, if you are still able to reproduce, I'm happy to jump on a quick call to figure it out on the spot.

@efiop efiop closed this as not planned Won't fix, can't repro, duplicate, stale Jul 17, 2023
@github-project-automation github-project-automation bot moved this from Backlog to Done in DVC Jul 17, 2023
@johnyaku
Author

Apologies for the slow response on this one. Thanks to your help the other day we have our index mirroring sorted out now and I can confirm that I can reproduce what my colleague was seeing.
@dlroden

@efiop efiop reopened this Jul 31, 2023
@github-project-automation github-project-automation bot moved this from Done to Todo in DVC Jul 31, 2023
@johnyaku
Author

I tried to create a reprex using DVC v3.15.2 to check if this had been fixed since v2.58.2.

I made a toy registry here: https://github.com/johnyaku/test_reg

This contains one file (test.txt) which I have pushed to a local "remote" at ../test_reg_remote.

I then created a toy dataset here: git@github.com:johnyaku/imp_test.git

Nothing to see there yet, because dvc import failed:

dvc import git@github.com:johnyaku/test_reg.git test.txt
Importing 'test.txt (git@github.com:johnyaku/test_reg.git)' -> 'test.txt'
ERROR: unexpected error - [Errno 2] No storage files available: 'test.txt'

This seems reminiscent of this issue: iterative/dvc-gdrive#29

I can paste a full stack trace if you like, but the main take-aways are as follows:

  File "/home/johree/miniconda3/envs/dmdb/lib/python3.11/site-packages/dvc_data/fs.py", line 73, in _get_fs_path
    raise FileNotFoundError(
FileNotFoundError: [Errno 2] No storage files available: 'test.txt'

DVC version: 3.15.2 (conda)
---------------------------
Platform: Python 3.11.4 on Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 2.13.1
        dvc_objects = 0.25.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.2.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.7.0)
Config:
        Global: /home/johree/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh, local
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/b0f5a2f92faec80920e3bf13e3e8daa

I know we have veered off into a new issue here, but I'll need to work thru this in order to create a reprex.
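For anyone who wants to retrace the setup, the steps described above can be written down as a small script. This is only a sketch: the remote name `local_remote` and the commit message are placeholders, the registry directory is assumed to already contain test.txt, and `registry_url` is whatever URL or local path the registry is reachable at. Only standard `git` and `dvc` subcommands are used.

```python
import subprocess
from pathlib import Path


def reprex_steps(registry: Path, dataset: Path, registry_url: str):
    """Return (cwd, argv) pairs mirroring the setup described above.

    Assumptions: "local_remote" and the commit message are placeholders,
    and `registry` already contains test.txt.
    """
    return [
        (registry, ["git", "init"]),
        (registry, ["dvc", "init"]),
        (registry, ["dvc", "add", "test.txt"]),
        (registry, ["dvc", "remote", "add", "-d", "local_remote", "../test_reg_remote"]),
        (registry, ["dvc", "push"]),
        (registry, ["git", "add", "."]),
        (registry, ["git", "commit", "-m", "Add test.txt"]),
        (dataset, ["git", "init"]),
        (dataset, ["dvc", "init"]),
        (dataset, ["dvc", "import", registry_url, "test.txt"]),
        (dataset, ["dvc", "data", "status"]),
    ]


def run_steps(steps):
    """Run each command in its working directory, stopping at the first failure."""
    for cwd, argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)
```

Keeping the steps as data makes it easy to print them for a bug report before running anything, for example with `run_steps(reprex_steps(Path("test_reg"), Path("imp_test"), "<registry-url>"))`.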

@efiop
Contributor

efiop commented Oct 4, 2023

@johnyaku Can't reproduce original issue anymore with newest dvc.

Regarding import, there was probably something else wrong there, as I also can't reproduce it.

Feel free to create a new issue if you run into anything still not working.

@efiop efiop closed this as completed Oct 4, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Oct 4, 2023