new command to list data artifacts in a DVC project #2509
Comments
+1 for |
@efiop @jorgeorpinel another option is to do On the other hand it can be good to show just a list of all DVC outputs. It can be done with What do you think? In general I'm +100 for |
Clarification: |
@efiop yes, exactly. Extended example:
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_left.bmp
scripts/innosetup/dvc_up.bmp
This makes me think, maybe an output that combines the DVC-files (similar to
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico (from scripts/innosetup/dvc.ico.dvc)
scripts/innosetup/dvc_left.bmp (from scripts/innosetup/dvc_left.bmp.dvc)
scripts/innosetup/dvc_up.bmp (from scripts/innosetup/dvc_up.bmp.dvc)
UPDATE: Thought of yet another 😓 name for the command above: |
@jorgeorpinel I think showing the full project in an ls-like way is just more natural than creating our own type of output. There are a few benefits:
The idea is that by default it's not recursive, and it's not limited to outputs only. You descend further on your own, if you need to, by specifying a path, the same way you do with ls, aws ls, etc.
I really don't like making piping and less and complex interfaces part of the tool. You can always use |
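The non-recursive, ls-like default described above can be sketched in Python roughly as follows. This is only an illustration of the listing behavior, not DVC code: the function name list_entries and its parameters are hypothetical, and it walks a plain local directory rather than a Git repo or DVC outputs.

```python
import os


def list_entries(root, path=".", recursive=False):
    """Return sorted paths (relative to root) found under root/path.

    Hypothetical helper: by default only one level is listed and
    directories are shown with a trailing slash, as `ls` would do.
    With recursive=True, directories are expanded into their files.
    """
    base = os.path.join(root, path)
    entries = []
    for entry in os.scandir(base):
        # Normalize so that "./README.md" becomes "README.md".
        rel = os.path.normpath(os.path.join(path, entry.name))
        rel = rel.replace(os.sep, "/")
        if entry.is_dir():
            if recursive:
                entries.extend(list_entries(root, rel, recursive=True))
            else:
                entries.append(rel + "/")
        else:
            entries.append(rel)
    return sorted(entries)
```

Calling list_entries(root) lists only the top level; passing a path such as "tutorial" narrows the listing to that directory, which is the "go down on your own" behavior from the comment.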
@shcheklein I concur only with benefit #1, so I can agree with showing all files but with an optional flag (which can be developed later, with less priority), not as the default behavior. Thus I wouldn't call the command
Because it can already be done with
The "full picture" could still be achieved from CLI by separately using git ls-tree.
I can also agree with this: Non-recursive by default sounds easier to implement. So, it would definitely also need an optional
Also agree. I just listed it as an alternative. Anyway, unless anyone else has relevant comments, I suggest Ivan decides the spec for this new command based on all the comments above, so we can get it onto a dev sprint. |
This would actually need to be Also I think we've established we want the list to be of all the regular files along with "data files" (outputs and their DVC-files), not just the latter.
Please open a separate issue to decide on adding new options to
This could be useful actually. Optionally specifying target DVC-files to the |
OK I hid some resolved comments and updated the proposed command spec in #2509 (comment) based on all the good feedback. Does it look good to get started on? |
@jorgeorpinel sounds good to me, let's move the spec to the initial message? Also, it would be great to specify what the output looks like in different cases. Again, I suggest it behave the same way as if the Git repo were a bucket you are listing files in with |
OK spec moved to initial issue comment.
I also think copying the
Hypothetical example outputs
Using file system paths as
$ cd ~
$ dvc list # Default url = .
ERROR: The current working directory doesn't seem to be part of a DVC project.
$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc list .
INFO: Listing LOCAL DVC project files, directories, and data at
/home/uname/example-get-started/
17B 2019-09-20 .gitignore
6.0K 2019-09-20 README.md
...
339B 2019-09-20 train.dvc
5.8M └ out: (model.pkl)
...
$ dvc pull featurize.dvc
$ dvc list featurize.dvc # With target DVC-file
...
INFO: Limiting list to data outputs from featurize.dvc stage.
367B 2019-09-20 featurize.dvc
2.7M └ out: data/features/test.pkl
11M └ out: data/features/train.pkl
Note that the
Using network Git
$ dvc list git@github.com:iterative/example-get-started.git # SSH URL
17B 2019-09-03 .gitignore
...
339B 2019-09-03 train.dvc
5.8M └ out: model.pkl
$ dvc list https://github.com/iterative/dataset-registry # HTTP URL
1.9K 2019-08-27 README.md
160B 2019-08-27 get-started/
128B 2019-08-27 tutorial/
$ dvc list --recursive https://github.com/iterative/dataset-registry tutorial # Recursive inside target dir
...
INFO: Expanding list recursively.
29B 2019-08-29 tutorial/nlp/.gitignore
178B 2019-08-29 tutorial/nlp/Posts.xml.zip.dvc
10M └ out: tutorial/nlp/Posts.xml.zip
177B 2019-08-29 tutorial/nlp/pipeline.zip.dvc
4.6K └ out: tutorial/nlp/pipeline.zip
...
Going through these made me realize this command can easily get very complicated, so all feedback is welcome to try and simplify it as much as possible, to a point where it still resolves the main problem (listing project outputs in order to know what's available for |
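For reference, the row layout used in the hypothetical outputs above (short size, date, name, with outputs nested under their DVC-file) could be produced by something like the Python sketch below. Both human_size and render_entry are made-up names for illustration, and the rounding rule (one decimal below 10, none above) is an assumption inferred from values such as 6.0K and 11M in the examples, not part of any agreed spec.

```python
def human_size(num_bytes):
    """Format a byte count in the short style of the examples (17B, 6.0K, 11M)."""
    if num_bytes < 1024:
        return f"{num_bytes}B"
    size = float(num_bytes)
    for unit in ("K", "M", "G", "T"):
        size /= 1024
        if size < 1024 or unit == "T":
            break
    # Assumed rule: one decimal for small values, none otherwise.
    return f"{size:.1f}{unit}" if size < 10 else f"{size:.0f}{unit}"


def render_entry(size, date, name, outs=()):
    """Return the lines for one listed file plus its nested outputs.

    `outs` is a sequence of (size_in_bytes, path) pairs; it is only
    non-empty for DVC-files in this sketch.
    """
    lines = [f"{human_size(size)} {date} {name}"]
    for out_size, out_path in outs:
        lines.append(f"{human_size(out_size)} └ out: {out_path}")
    return lines
```

Keeping the formatting in one small function like this would make it easy to drop the size and date columns later if the plain one-path-per-line output turns out to be enough.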
@jorgeorpinel this is great! Can you put it in a gist or a commit, so that we can leave comments line by line? |
OK, added to https://gist.github.com/jorgeorpinel/61719795628fc0fe64e04e4cc4c0ca1c and updated #2509 (comment) above. |
@jorgeorpinel I can't comment on a specific line in them. I wonder if there is a better tool for this? Maybe create a temporary PR with these files to dvc.org? |
As a first step, we could simply print lines with all outputs. E.g.
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_up.bmp
scripts/innosetup/dvc_left.bmp
and then move on to polishing. |
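The "first step" above, printing one line per output, could look roughly like this in Python. It is a sketch under stated assumptions, not DVC's implementation: list_outputs is a hypothetical helper, the DVC-files are assumed to be already parsed into dicts with an "outs" list of "path" entries, and each path is assumed to be relative to the directory of its DVC-file. The show_source flag mirrors the "(from ...)" variant suggested earlier in the thread.

```python
import posixpath


def list_outputs(dvc_files, show_source=False):
    """Return one sorted line per output referenced by the given DVC-files.

    dvc_files maps a DVC-file path (relative to the project root) to its
    parsed contents. Both the mapping and this function are illustrative
    assumptions, not an existing DVC API.
    """
    lines = []
    for dvc_path, contents in dvc_files.items():
        base = posixpath.dirname(dvc_path)
        for out in contents.get("outs", []):
            # Assumption: output paths are relative to the DVC-file's directory.
            out_path = posixpath.normpath(posixpath.join(base, out["path"]))
            if show_source:
                lines.append(f"{out_path} (from {dvc_path})")
            else:
                lines.append(out_path)
    return sorted(lines)
```

With show_source left at False this gives the plain list of outputs; turning it on adds the originating DVC-file to each line without changing how the outputs are collected.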
Reopening since there are some details that we need to follow up on. |
@jorgeorpinel Ok, let's close then 🙂 Thanks for the heads up! |
Especially useful for "browsing" external DVC projects on Git hosting before using dvc get or dvc import. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.
Perhaps dvc list or dvc artifacts? (Or/and both dvc get list and dvc import list)
UPDATE: Proposed spec (from #2509 (comment)):
UPDATE: Don't forget to update docs AND tab completion scripts when this is implemented.