pull: -R does not check immediate target #7756
Comments
EDIT: not related, it's a bit different. |
I'm either unable to reproduce or misunderstanding:
$ git clone git@github.com:iterative/vscode-dvc.git
Cloning into 'vscode-dvc'...
remote: Enumerating objects: 28550, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (132/132), done.
remote: Total 28550 (delta 87), reused 106 (delta 56), pack-reused 28354
Receiving objects: 100% (28550/28550), 10.73 MiB | 32.90 MiB/s, done.
Resolving deltas: 100% (20665/20665), done.
dave@davids-air:/tmp 11:23:05
$ cd vscode-dvc/demo
$ dvc pull -R training_metrics
A training_metrics/
1 file added and 2 files fetched
$ dvc doctor
DVC version: 2.10.3.dev54+g570e3a535
---------------------------------
Platform: Python 3.9.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.8.0),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2022.2.0),
hdfs (fsspec = 2022.2.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.2.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.2.0, boto3 = 1.20.24),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.5),
webdavs (webdav4 = 0.9.5)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git |
@dberenbaum in order to test this I upgraded from I cannot get the repo out of this renamed state. I have tried:
Still stuck here: Reverting to
|
Reading the description one more time it might be a bigger issue for DVC extension.
My guess would be that it is as old as |
@efiop looks like the change that introduced the renamed issue was this one: a80a85e. Was there anything in there that would make a subrepo exhibit this diff no matter what (
~/vscode-dvc/demo ❯ dvc diff
Renamed:
demo/data/MNIST/raw/ -> data/MNIST/raw/
demo/data/MNIST/raw/t10k-images-idx3-ubyte -> data/MNIST/raw/t10k-images-idx3-ubyte
demo/data/MNIST/raw/t10k-images-idx3-ubyte.gz -> data/MNIST/raw/t10k-images-idx3-ubyte.gz
demo/data/MNIST/raw/t10k-labels-idx1-ubyte -> data/MNIST/raw/t10k-labels-idx1-ubyte
demo/data/MNIST/raw/t10k-labels-idx1-ubyte.gz -> data/MNIST/raw/t10k-labels-idx1-ubyte.gz
demo/data/MNIST/raw/train-images-idx3-ubyte -> data/MNIST/raw/train-images-idx3-ubyte
demo/data/MNIST/raw/train-images-idx3-ubyte.gz -> data/MNIST/raw/train-images-idx3-ubyte.gz
demo/data/MNIST/raw/train-labels-idx1-ubyte -> data/MNIST/raw/train-labels-idx1-ubyte
demo/data/MNIST/raw/train-labels-idx1-ubyte.gz -> data/MNIST/raw/train-labels-idx1-ubyte.gz
demo/misclassified.jpg -> misclassified.jpg
demo/model.pt -> model.pt
demo/predictions.json -> predictions.json
demo/training_metrics.json -> training_metrics.json
demo/training_metrics/ -> training_metrics/
demo/training_metrics/scalars/acc.tsv -> training_metrics/scalars/acc.tsv
demo/training_metrics/scalars/loss.tsv -> training_metrics/scalars/loss.tsv
I can see that scmrepo got bumped from LMK if you want a separate issue for this. Thanks. |
@mattseddon Thanks, I can confirm the renaming issue. I'm still not following the original issue since it seems that |
I can confirm that #7756 (comment) is correct (did not work for In testing I think I have found a new issue where if a Using example-get-started:
~/example-get-started master ❯ tree data/
data/
├── data.xml
├── data.xml.dvc
├── features
│ ├── test.pkl
│ └── train.pkl
└── prepared
├── test.tsv
└── train.tsv
2 directories, 6 files
~/example-get-started master ❯ rm -rf data/features
~/example-get-started master ❯ rm -rf data/prepared
~/example-get-started master ❯ rm data/data.xml
~/example-get-started master ❯ tree data/
data/
└── data.xml.dvc
0 directories, 1 file
~/example-get-started master ❯ dvc pull -R data
A data/data.xml
1 file added
~/example-get-started master ❯ tree data/
data/
├── data.xml
└── data.xml.dvc
0 directories, 2 files
~/example-get-started master ❯ dvc doctor
DVC version: 2.10.3.dev71+gf23d31af
---------------------------------
Platform: Python 3.9.9 on macOS-12.4-x86_64-i386-64bit
Supports:
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git
Does this make sense? Is this the intended behaviour of |
@mattseddon yes, that's expected, it's not a bug. This is the current behavior of the |
It's not a bug, but the behavior might be confusing, to be honest. But even that doesn't matter - we need to find a way to pull all tracked files inside a directory. |
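To make the difference concrete, here is a minimal Python sketch of the two selection rules being discussed. It is a toy model, not DVC's implementation: STAGE_OUTPUTS is a hypothetical, simplified map of example-get-started (stage file to outputs), and both function names are made up for illustration. The first rule mirrors the documented -R behaviour of searching the target directory for dvc.yaml and .dvc files; the second is the behaviour asked for in this thread, selecting every tracked output located under the target.

```python
from pathlib import PurePosixPath

# Hypothetical, simplified model of example-get-started:
# each stage file mapped to the outputs it declares.
STAGE_OUTPUTS = {
    "data/data.xml.dvc": ["data/data.xml"],
    "dvc.yaml": ["data/prepared", "data/features", "model.pkl"],
}


def is_inside(path: str, directory: str) -> bool:
    """True if path equals directory or lies somewhere below it."""
    p, d = PurePosixPath(path), PurePosixPath(directory)
    return p == d or d in p.parents


def outs_by_stage_file_location(target: str) -> list:
    """Toy model of -R: only stage files found under target are inspected."""
    return sorted(
        out
        for stage_file, outs in STAGE_OUTPUTS.items()
        if is_inside(stage_file, target)
        for out in outs
    )


def outs_by_output_location(target: str) -> list:
    """Requested behaviour: every tracked output located under target."""
    return sorted(
        out
        for outs in STAGE_OUTPUTS.values()
        for out in outs
        if is_inside(out, target)
    )


print(outs_by_stage_file_location("data"))
print(outs_by_output_location("data"))
```

Under this model the first call only finds data/data.xml, which matches the transcript above, because the stages producing data/prepared and data/features are declared in the root dvc.yaml rather than inside data/.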
For VS Code, can you use |
@dberenbaum we would have to run make it recursive as well (to get all DVC-tracked files and directories inside a specific directory)- that can make it slow. But on the other hand we already do |
We can create the necessary data. Would it be correct to pull -
Would these commands be correct for these structures? 1.Structure:
On pull we run: 2.Structure:
On pull we run: |
I should be just
can you describe how and how do we get it now? One thing to be mindful of and try to avoid is very large directories with a lot of files. No matter what you do, if we start traversing them (in some VS Code code) that can become painful. |
How do I get this information without parsing all of the yaml files and generating the DAG?
We use For the data held in the a
Note: All of this will be getting rewritten once we get the new data from |
(that's why btw It returns the whole tree - including all the files inside tracked directories, we don't need this. At least for this specific case. We might need this for SCM ... |
Hm, not sure. For me it's <0.6 seconds. Still not sure if that's fast enough for bigger repos, but 10s seems odd.
Got it. Seems like we need some flag for whether to traverse into DVC-tracked paths? Can you explain more how |
yep, I right click on Current behavior is not useful at all to my mind and comes from some legacy (when |
Like |
@shcheklein, |
@skshetry yep, usually I'm testing on the dev version, in this case dvc was coming from a different project. It is improved to 0.21! 🎉 |
Yup, makes sense. We need to move all our commands towards operating on all DVC-tracked data within a path without the users worrying about where the paths are specified in I think it's somewhat related to the goal to "auto manage directories," which is currently planned for Q3, and @shcheklein @mattseddon What is the priority for VS Code (when do you need it)? @efiop Any thoughts? |
@dberenbaum we need to come up with a workaround for the release and a long term solution. This suggestion was for a short term workaround. |
@mattseddon if I understand correctly, that solution won't work - we can't pass the full list into Probably |
It would be full list for everything that doesn't have a |
Tbh, I would wait for a proper fix in this case. |
AFAIK, the commands that would need to be updated to have consistent behavior are:
It would be a breaking change for those commands, so it probably can't be changed in the short term even though the suggested behavior does seem like it's more expected. We could introduce a new, possibly hidden, command like Thoughts? |
Sounds good to me. The current VS Code solution/workaround/hack is to get the required information from |
@dberenbaum do we plan to deprecate global pull, push, etc. later? Or just use these hidden commands as a playground and then replace those during 3.0? Does it make sense for us to start releasing major versions a bit more often (not trying to squeeze everything in)? |
Update on the VS Code patch: I am using the rel paths keys in |
I would like to see the commands first and evaluate whether they could easily replace the existing commands. To start, we could copy the existing functionality to a new command and add in support for Releasing major versions more often is a good idea. We can work to come up with a streamlined release process and then decide how much release-related activity needs to be associated with each major version.
How does this differ from the |
@dberenbaum as long as all of the tracked directories are returned in the output then we will be covered 👍🏻. Examples of all the paths we need: VS Code demo project
example-get-started
LMK if I need to expand on anything or give more context/clarification. Thanks Note: This should include cached plot data/directories which we cannot currently get from |
Related/duplicate: #1705? |
Problem has been solved in the extension. Can close as duplicate. |
Bug Report
Description
Firstly, from the docs I realise that pull -R <target> is probably working exactly as advertised.
In the VS Code extension, we show a tracked tree which can be used to selectively pull files from the remote.
We currently use the output of dvc list . -R --show-json --dvc-only to generate this tree (we will shortly be using the output from the new data:status command). We mark everything provided by the list output as tracked.
When calling pull against these tracked paths we check to see if the path exists in the list output. If it does then we call dvc pull <target>. If it does not we call dvc pull -R <target>.
When calling dvc pull -R we get mixed results. Here is an example of -R stating that everything is up to date when things clearly haven't changed:
[Screen recording: Screen.Recording.2022-05-17.at.3.35.58.pm.mov]
dvc.yaml for the above project is here.
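The target-selection rule described above can be sketched roughly as follows. This is a Python illustration rather than the extension's actual code: the function names are hypothetical, the sample listing is made up, and it assumes each entry of the parsed JSON listing carries a path key and that the recursive listing contains file paths only.

```python
from typing import Iterable, List, Set


def tracked_paths(list_entries: Iterable[dict]) -> Set[str]:
    """Collect the path of every entry in the parsed JSON listing."""
    return {entry["path"] for entry in list_entries}


def pull_args(target: str, tracked: Set[str]) -> List[str]:
    """Choose the dvc arguments following the rule described in the report."""
    if target in tracked:
        # The path appears in the listing: pull it directly.
        return ["pull", target]
    # Otherwise fall back to a recursive pull of the target.
    return ["pull", "-R", target]


# Hypothetical listing entries; a real listing would contain more.
sample = [
    {"path": "model.pt"},
    {"path": "training_metrics/scalars/acc.tsv"},
    {"path": "training_metrics/scalars/loss.tsv"},
]
tracked = tracked_paths(sample)

print(pull_args("model.pt", tracked))
print(pull_args("training_metrics", tracked))
```

With this sample, training_metrics itself is absent from the listing, so the rule falls through to the -R branch, which is the case the report is about.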
training_metrics is tracked but there is currently no way for us to easily/consistently tell this from the combined output of list, status & diff.
Reproduce
dvc pull -R training_metrics from the root.
Expected
dvc pull -R target checks the target itself as well as searching inside the target.
We could take the alternative approach of including the appropriate information in the new data:status command. I.e. training_metrics/ would be provided as part of the output to let us know that it is tracked.
Environment information
Output of dvc doctor:
Additional Information (if any):
Please let me know if you need anything else from me. Thanks