Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull: -R does not check immediate target #7756

Closed
mattseddon opened this issue May 17, 2022 · 34 comments
Closed

pull: -R does not check immediate target #7756

mattseddon opened this issue May 17, 2022 · 34 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push feature request Requesting a new feature product: VSCode Integration with VSCode extension

Comments

@mattseddon
Copy link
Member

mattseddon commented May 17, 2022

Bug Report

Description

Firstly, from the docs I realise that pull -R <target> is probably working exactly as advertised.

In the VS Code extension, we show a tracked tree which can be used to selectively pull files from the remote.

We currently use the output of dvc list . -R --show-json --dvc-only to generate this tree (we will shortly be using the output from the new data:status command). We mark everything provided by the list output as tracked.

When calling pull against these tracked paths we check to see if the path exists in the list output. If it does then we call dvc pull <target>. If it does not we call dvc pull -R <target>.

When calling dvc pull -R we get mixed results. Here is an example of -R stating that everything is up to date when things clearly haven't changed:

Screen.Recording.2022-05-17.at.3.35.58.pm.mov

dvc.yaml for the above project is here. training_metrics is tracked but there is no way currently for us to easily/consistently tell this from the combined output of list, status & diff.

Reproduce

  1. Open demo project for the first time.
  2. Run dvc pull -R training_metrics from the root.
  3. “everything is up to date” will be returned by the command
  4. No data will have been updated.

Expected

dvc pull -R target checks the target as well as all searching inside the target.

We could take the alternative approach of including the appropriate information in the new data:status command. I.e training_metrics/ would be provided as part of the output to let us know that it is tracked.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

Additional Information (if any):

Please let me know if you need anything else from me. Thanks

@mattseddon mattseddon added product: VSCode Integration with VSCode extension A: data-sync Related to dvc get/fetch/import/pull/push labels May 17, 2022
@efiop efiop added the feature request Requesting a new feature label May 17, 2022
@skshetry
Copy link
Member

skshetry commented May 17, 2022

Related: #5326

EDIT: not related, it's a bit different.

@dberenbaum
Copy link
Collaborator

Reproduce

1. Open demo project for the first time.

2. Run `dvc pull -R training_metrics` from the root.

3. “everything is up to date” will be returned by the command

4. No data will have been updated.

I'm either unable to reproduce or misunderstanding:

$ git clone [email protected]:iterative/vscode-dvc.git
Cloning into 'vscode-dvc'...
remote: Enumerating objects: 28550, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (132/132), done.
remote: Total 28550 (delta 87), reused 106 (delta 56), pack-reused 28354
Receiving objects: 100% (28550/28550), 10.73 MiB | 32.90 MiB/s, done.
Resolving deltas: 100% (20665/20665), done.
dave@davids-air:/tmp 11:23:05
$ cd vscode-dvc/demo
$ dvc pull -R training_metrics
A       training_metrics/
1 file added and 2 files fetched
$ dvc doctor
DVC version: 2.10.3.dev54+g570e3a535
---------------------------------
Platform: Python 3.9.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.8.0),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.2.0),
        hdfs (fsspec = 2022.2.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.20.24),
        ssh (sshfs = 2021.11.2),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.5),
        webdavs (webdav4 = 0.9.5)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

@mattseddon
Copy link
Member Author

mattseddon commented May 18, 2022

@dberenbaum in order to test this I upgraded from 2.10.2 to the latest version of dvc from main. It appears that the files in our subrepo have been renamed:

image

I cannot get the repo out of this renamed state. I have tried:

  1. dvc pull
  2. dvc commit
  3. dvc commit -f
  4. dvc exp run

Still stuck here:

image

Reverting to 2.10.2 brings things back to normal. I can also recreate on a fresh clone.

> dvc doctor            
DVC version: 2.10.3.dev66+g71a4b482 
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

@shcheklein
Copy link
Member

Reading the description one more time it might be a bigger issue for DVC extension. -R semantics is clearly can lead that if there is a dvc.yaml inside that directory that describes a dataset outside we'll also pull it. And, yes this semantics mean that we can't do dvc pull -R target where target is a DVC-tracked directory.

  1. is there a workaround on the extension side?
  2. DVC product question - how consistent -R is across commands?
  3. how useful this semantics vs looking for outputs inside the specified directory? (can we find some tickets where we can find the motivation? when we first introduced this?).

My guess would be that it is as old as dvc pull accepting as a target only .dvc files. (I would double check this). If were adding -R before expanding targets to .dvc + actual paths - it makes sense to me, and we can reconsider -R (or introduce --recusrsive?)

@mattseddon
Copy link
Member Author

mattseddon commented May 18, 2022

@efiop looks like the change that introduce the renamed issue was this one: a80a85e.

Was there anything in there that would make a subrepo exhibit this diff no matter what (~/vscode-dvc is the git repo, ~/vscode-dvc/demo is the dvc project):

~/vscode-dvc/demo ❯  dvc diff
Renamed:                                                              
    demo/data/MNIST/raw/ -> data/MNIST/raw/
    demo/data/MNIST/raw/t10k-images-idx3-ubyte -> data/MNIST/raw/t10k-images-idx3-ubyte
    demo/data/MNIST/raw/t10k-images-idx3-ubyte.gz -> data/MNIST/raw/t10k-images-idx3-ubyte.gz
    demo/data/MNIST/raw/t10k-labels-idx1-ubyte -> data/MNIST/raw/t10k-labels-idx1-ubyte
    demo/data/MNIST/raw/t10k-labels-idx1-ubyte.gz -> data/MNIST/raw/t10k-labels-idx1-ubyte.gz
    demo/data/MNIST/raw/train-images-idx3-ubyte -> data/MNIST/raw/train-images-idx3-ubyte
    demo/data/MNIST/raw/train-images-idx3-ubyte.gz -> data/MNIST/raw/train-images-idx3-ubyte.gz
    demo/data/MNIST/raw/train-labels-idx1-ubyte -> data/MNIST/raw/train-labels-idx1-ubyte
    demo/data/MNIST/raw/train-labels-idx1-ubyte.gz -> data/MNIST/raw/train-labels-idx1-ubyte.gz
    demo/misclassified.jpg -> misclassified.jpg
    demo/model.pt -> model.pt
    demo/predictions.json -> predictions.json
    demo/training_metrics.json -> training_metrics.json
    demo/training_metrics/ -> training_metrics/
    demo/training_metrics/scalars/acc.tsv -> training_metrics/scalars/acc.tsv
    demo/training_metrics/scalars/loss.tsv -> training_metrics/scalars/loss.tsv

I can see that scmrepo got bumped from 0.0.19 to 0.0.22 in that commit.

LMK if you want a separate issue for this.

Thanks.

@dberenbaum
Copy link
Collaborator

@mattseddon Thanks, I can confirm the renaming issue.

I'm still not following the original issue since it seems that pull -R is working as expected for me. Can you clarify?

@mattseddon
Copy link
Member Author

I can confirm that #7756 (comment) is correct (did not work for 2.10.2 but since fixed).

In testing I think have found a new issue where if a .dvc files is present within a directory then -R will pull that file but not any other directory deps/outs available within that directory.

Using example-get-started:

~/example-get-started master ❯ tree data/ 
data/
├── data.xml
├── data.xml.dvc
├── features
│   ├── test.pkl
│   └── train.pkl
└── prepared
    ├── test.tsv
    └── train.tsv

2 directories, 6 files
~/example-get-started master ❯ rm -rf data/features  
~/example-get-started master ❯ rm -rf data/prepared
~/example-get-started master ❯ rm data/data.xml
~/example-get-started master ❯ tree data/ 
data/
└── data.xml.dvc

0 directories, 1 file
~/example-get-started master ❯ dvc pull -R data
A       data/data.xml                                                                                                                                
1 file added
~/example-get-started master ❯ tree data/  
data/
├── data.xml
└── data.xml.dvc

0 directories, 2 files
~/example-get-started master ❯ dvc doctor
DVC version: 2.10.3.dev71+gf23d31af 
---------------------------------
Platform: Python 3.9.9 on macOS-12.4-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Does this make sense? Is this the intended behaviour of -R?

@shcheklein
Copy link
Member

@mattseddon yes, that's expected, it's not a bug. This is the current behavior of the -R . I've tried to explain it here. Now we need to find a workaround with the DVC team.

@shcheklein
Copy link
Member

shcheklein commented May 19, 2022

@dberenbaum

I'm still not following the original issue since it seems that pull -R is working as expected for me. Can you clarify?

It's not a bug, but behavior might be confusing to be honest. But even that doesn't matter - we need to find a way to all tracked files inside a directory. -R gives something completely different (good question how useful it is, but it's a separate topic). Question - do we have some workaround for the behavior we are looking for?

@dberenbaum
Copy link
Collaborator

For VS Code, can you use dvc list --dvc-only to get the appropriate targets and pass those to dvc pull?

@shcheklein
Copy link
Member

@dberenbaum we would have to run make it recursive as well (to get all DVC-tracked files and directories inside a specific directory)- that can make it slow. But on the other hand we already do dvc list --dvc-only or something, right? @mattseddon how do we create the initial DVC-tracked tree?

@mattseddon
Copy link
Member Author

@mattseddon how do we create the initial DVC-tracked tree?

We can create the necessary data. Would it be correct to pull -

  1. All files in directories that do not have a corresponding .dvc file
  2. All directories that have a .dvc file.

Would these commands be correct for these structures?

1.

Structure:

data/
├── data.xml
├── data.xml.dvc
├── features
│   ├── test.pkl
│   └── train.pkl
└── prepared
    ├── test.tsv
    └── train.tsv

On pull we run:
dvc pull data/data.xml data/features/test.pkl data/features/train.pkl data/prepared/test.tsv data/prepared/train.tsv

2.

Structure:

├── data
│   └── MNIST
│       ├── raw
│       │   ├── t10k-images-idx3-ubyte
│       │   ├── t10k-images-idx3-ubyte.gz
│       │   ├── t10k-labels-idx1-ubyte
│       │   ├── t10k-labels-idx1-ubyte.gz
│       │   ├── train-images-idx3-ubyte
│       │   ├── train-images-idx3-ubyte.gz
│       │   ├── train-labels-idx1-ubyte
│       │   └── train-labels-idx1-ubyte.gz
│       └── raw.dvc
├── misclassified.jpg
├── model.pt
├── predictions.json
├── training_metrics
│   └── scalars
│       ├── acc.tsv
│       └── loss.tsv

On pull we run: dvc pull data/MNIST/raw misclassified.jpg model.pt predictions.json training_metrics/scalars/acc.tsv training_metrics/scalars/loss.tsv

@shcheklein
Copy link
Member

data/features/test.pkl

I should be just data/features (it should be defined in dvc.yaml like a directory)

We can create the necessary data.

can you describe how and how do we get it now?

One thing to be mindful and try to avoid are very large directories with a lot of files. No matter what you do if we start (on some VS Code code) traversing them that can become painful.

@mattseddon
Copy link
Member Author

I should be just data/features (it should be defined in dvc.yaml like a directory)

How do I get this information without parsing all of the yaml files and generating the DAG?

can you describe how and how do we get it now?

We use dvc list . -R --dvc-only to get all of the absolute paths. We then build a tree. We split each path by the OS separator and then add each path into a new Map<string, PathItem[]>() structure (with some optimisations).

For the data held in the a PathItem:

export type PathItem = {
  dvcRoot: string
  resourceUri: Uri
  isDirectory: boolean
  isTracked: boolean
}

Note: All of this will be getting rewritten once we get the new data from data:status.

@shcheklein
Copy link
Member

shcheklein commented May 20, 2022

How do I get this information without parsing all of the yaml files and generating the DAG?

dvc dag, or something else ... ideally this should be part of the dvc status, list, etc. Btw, this also should be available now in dvc exp since we include "outputs". what is the progress there @daavoo ?

We use dvc list . -R --dvc-only to get all of the absolute paths

( It's slow :( on the example-get-started it takes 10s - probably bc it goes inside .venv? @dberenbaum )

(that's why btw -R behavior in this case is not very friendly, or at least no consistent cc @dberenbaum )

It returns the whole tree - including all the files inside tracked directories, we don't need this. At least for this specific case. We might need this for SCM ...

@daavoo
Copy link
Contributor

daavoo commented May 20, 2022

what is the progress there @daavoo ?

Waiting for approval from someone #7690

@dberenbaum
Copy link
Collaborator

( It's slow :( on the example-get-started it takes 10s - probably bc it goes inside .venv? @dberenbaum )

Hm, not sure. For me it's <0.6 seconds. Still not sure if that's fast enough for bigger repos, but 10s seems odd.

(that's why btw -R behavior in this case is not very friendly, or at least no consistent cc @dberenbaum )

It returns the whole tree - including all the files inside tracked directories, we don't need this. At least for this specific case. We might need this for SCM ...

Got it. Seems like we need some flag for whether to traverse into DVC-tracked paths?

Can you explain more how dvc pull -R is being used by VS Code? In what circumstances is it called?

@shcheklein
Copy link
Member

shcheklein commented May 20, 2022

Hm, not sure. For me it's <0.6 seconds. Still not sure if that's fast enough for bigger repos, but 10s seems odd.

Hmm .. do you have a virtualenv setup with all the deps installed? I have it in the .venv in the root of the project.
#7756 (comment)

Can you explain more how dvc pull -R is being used by VS Code? In what circumstances is it called?

yep, I right click on data ask to pull and exepct it to bring me data/MNIST/raw inside. W/o me going two level down (I just don't even know which one is tracked). Or in the example-get-started I'd like to do dvc pull -R data to download data.xml and some intemediate things inside ... again, it'a all about simple and intuitive interface to manipulate data.

Current behavior is not useful at all to my mind and comes from some legacy (when dvc pull was taking only .dvc files as targets).

@shcheklein
Copy link
Member

Got it. Seems like we need some flag for whether to traverse into DVC-tracked paths?

Like --shallow - yes, may be. I would say it'a perf optimization, first I would solve the main use case somehow.

@skshetry
Copy link
Member

@shcheklein, --dvc-only was optimized in #7659. It is not released yet.

@shcheklein
Copy link
Member

@skshetry yep, usually I'm testing on the dev version, in this case dvc was coming from a different project. It is improved to 0.21! 🎉

@dberenbaum
Copy link
Collaborator

yep, I right click on data ask to pull and exepct it to bring me data/MNIST/raw inside. W/o me going two level down (I just don't even know which one is tracked). Or in the example-get-started I'd like to do dvc pull -R data to download data.xml and some intemediate things inside ... again, it'a all about simple and intuitive interface to manipulate data.

Current behavior is not useful at all to my mind and comes from some legacy (when dvc pull was taking only .dvc files as targets).

Yup, makes sense. We need to move all our commands towards operating on all DVC-tracked data within a path without the users worrying about where the paths are specified in .dvc, dvc.yaml, etc. I think the current dvc pull -R logic is how most DVC commands work, so I would like to have a more systematic effort to change it across commands rather than have inconsistent behavior.

I think it's somewhat related to the goal to "auto manage directories," which is currently planned for Q3, and dvc data status is planned with this in mind.

@shcheklein @mattseddon What is the priority for VS Code (when do you need it)?

@efiop Any thoughts?

@mattseddon
Copy link
Member Author

@dberenbaum we need to come up with a workaround for the release and a long term solution. This suggestion was for a short term workaround.

@shcheklein
Copy link
Member

@mattseddon if I understand correctly, that solution won't work - we can't pass the full list into dvc pull. It will start breaking even on mid-size repos. At least if I understand the solution correctly.

Probably dvc exp show is our best shot here.

@mattseddon
Copy link
Member Author

It would be full list for everything that doesn't have a .dvc files. In example 2 we only send data/MNIST/raw to pull not each file inside the directory. Still might not be enough but the exp show data currently won't work either because of #7790.

@shcheklein
Copy link
Member

Tbh, I would wait for a proper fix in this case.

@dberenbaum
Copy link
Collaborator

AFAIK, the commands that would need to be updated to have consistent behavior are:

  • pull
  • push
  • fetch
  • checkout
  • commit
  • status

It would be a breaking change for those commands, so it probably can't be changed in the short term even though the suggested behavior does seem like it's more expected.

We could introduce a new, possibly hidden, command like dvc data pull with the suggested behavior. Then, we could decide later whether to replace the existing command, keep both, or find some other solution.

Thoughts?

@mattseddon
Copy link
Member Author

Sounds good to me.

The current VS Code solution/workaround/hack is to get the required information from exp show's deps and outs. On pull of a directory we will walk down the tree until we find some targets to pull.

@shcheklein
Copy link
Member

@dberenbaum do we plan to deprecate global pull, push, etc later? or just use these hidden commands as a playground and then during 3.0 replace those? does it makes sense for us to start releasing major versions a bit more often? (not trying to squeeze everything)?

@mattseddon
Copy link
Member Author

Update on the VS Code patch:

I am using the rel paths keys in outs from exp show where use_cache: true to identify paths that can be pushed/pulled directly. Would it be possible to add this information into the data:status json output ?

@dberenbaum
Copy link
Collaborator

@dberenbaum do we plan to deprecate global pull, push, etc later? or just use these hidden commands as a playground and then during 3.0 replace those? does it makes sense for us to start releasing major versions a bit more often? (not trying to squeeze everything)?

I would like to see the commands first and evaluate whether they could easily replace the existing commands. To start, we could copy the existing functionality to a new command and add in support for -R to cover paths specified in pipelines. Then, hopefully we can test whether we could replace the existing commands and update the major version.

Releasing major versions more often is a good idea. We can work to come up with a streamlined release process and then decide how much release-related activity needs to be associated with each major version.

I am using the rel paths keys in outs from exp show where use_cache: true to identify paths that can be pushed/pulled directly. Would it be possible to add this information into the data:status json output ?

How does this differ from the --dvc-only flag? I think this will only include outputs that use the cache.

@mattseddon
Copy link
Member Author

@dberenbaum as long as all of the tracked directories are returned in the output then we will be covered 👍🏻.

Examples of all the paths we need:

VS Code demo project

data/MNIST/raw
data/MNIST/raw/t10k-images-idx3-ubyte
data/MNIST/raw/t10k-images-idx3-ubyte.gz
data/MNIST/raw/t10k-labels-idx1-ubyte
data/MNIST/raw/t10k-labels-idx1-ubyte.gz
data/MNIST/raw/train-images-idx3-ubyte
data/MNIST/raw/train-images-idx3-ubyte.gz
data/MNIST/raw/train-labels-idx1-ubyte
data/MNIST/raw/train-labels-idx1-ubyte.gz
training_metrics
training_metrics/scalars
training_metrics/scalars/acc.tsv
training_metrics/scalars/loss.tsv
model.pt
misclassified.jpg
predictions.json

example-get-started

data/features
data/features/test.pkl
data/features/train.pkl
data/prepared
data/prepared/test.tsv
data/prepared/train.tsv
data/data.xml
evaluation/importance.png
model.pkl

LMK if I need to expand on anything or give more context/clarification.

Thanks

Note: This should include cached plot data/directories which we cannot currently get from exp show.

@dberenbaum
Copy link
Collaborator

Related/duplicate: #1705?

@mattseddon
Copy link
Member Author

Problem has been solved in the extension. Can close as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push feature request Requesting a new feature product: VSCode Integration with VSCode extension
Projects
None yet
Development

No branches or pull requests

6 participants