new command to list data artifacts in a DVC project #2509

Closed
jorgeorpinel opened this issue Sep 17, 2019 · 45 comments · Fixed by #3246 or #3462
Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Sep 17, 2019

Especially useful for "browsing" external DVC projects on Git hosting before using dvc get or dvc import. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.

Perhaps dvc list or dvc artifacts? (Or/and both dvc get list and dvc import list)

As mentioned in iterative/dvc.org#611 (comment) and other discussions.


UPDATE: Proposed spec (from #2509 (comment)):

usage: dvc list [-h] [-q | -v] [--recursive [LEVEL]] [--rev REV | --versions]
                url [target [target ...]]

positional arguments:
  url         URL of Git repository with DVC project to download from.
  target      Paths to DVC-files or directories within the repository to list outputs
              for.
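
As a rough illustration of the usage line above, here is a minimal `argparse` sketch. It is not DVC's actual parser: the function name `build_parser`, the long option names `--quiet`/`--verbose`, and the choice of `-1` to mean "no depth limit" when `--recursive` is given without a LEVEL are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the proposed `dvc list` usage line."""
    parser = argparse.ArgumentParser(prog="dvc list")

    # `[-q | -v]`: the two verbosity flags exclude each other.
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument("-q", "--quiet", action="store_true")
    verbosity.add_argument("-v", "--verbose", action="store_true")

    # `--recursive [LEVEL]`: LEVEL is optional. Using -1 for "unlimited"
    # is an assumption; the proposal does not define a default depth.
    parser.add_argument(
        "--recursive", nargs="?", type=int, const=-1, default=None,
        metavar="LEVEL",
    )

    # `[--rev REV | --versions]`: pick one revision or ask for all versions.
    revision = parser.add_mutually_exclusive_group()
    revision.add_argument("--rev", metavar="REV")
    revision.add_argument("--versions", action="store_true")

    parser.add_argument("url")
    parser.add_argument("target", nargs="*")
    return parser
```

One design wrinkle this exposes: with `argparse`, an optional LEVEL placed directly before `url` is ambiguous, so `--recursive <url>` would try to read the URL as LEVEL. In this sketch the flag has to be written as `--recursive=2` or placed after the positional arguments.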

UPDATE: Don't forget to update docs AND tab completion scripts when this is implemented.

@efiop
Contributor

efiop commented Sep 17, 2019

+1 for dvc list 🙂

@efiop efiop added the p2-medium Medium priority, should be done, but less important label Sep 17, 2019
@shcheklein
Member

@efiop @jorgeorpinel another option is to do dvc ls and it should behave exactly like a regular ls or aws s3 ls. Show all the files (including hidden data) by specifying a Git url. This way you can control the scope to show (by not going into all directories by default) - also you can see your data in the context (other files) with an option to filter them out.

On the other hand it can be good to show just a list of all DVC outputs. It can be done with dvc ls --recursive --outputs-only for example.

What do you think?

In general I'm +100 for dvc list or something similar :)

@efiop
Contributor

efiop commented Sep 17, 2019

Clarification: dvc list should work on dvc repositories. E.g. dvc list https://github.com/iterative/dvc should list scripts/innosetup/dvc.ico, etc.

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Sep 18, 2019
@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 18, 2019

Clarification: dvc list should work on dvc repositories. E.g. dvc list https://github.com/iterative/dvc should list scripts/innosetup/dvc.ico, etc.

@efiop yes, exactly. Extended example:

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_left.bmp
scripts/innosetup/dvc_up.bmp

This makes me think, maybe an output that combines the DVC-files (similar to dvc pipeline list) with their outputs could be most informative (providing some of the context Ivan was looking for). Something like:

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico	(from scripts/innosetup/dvc.ico.dvc)
scripts/innosetup/dvc_left.bmp	(from scripts/innosetup/dvc_left.bmp.dvc)
scripts/innosetup/dvc_up.bmp	(from scripts/innosetup/dvc_up.bmp.dvc)
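
A minimal sketch of how that combined listing could be produced, assuming the DVC-files have already been parsed into a mapping from DVC-file path to the output paths it declares. The function name `list_outputs` and the input shape are hypothetical, and output paths are taken to be relative to the directory of their DVC-file.

```python
import posixpath


def list_outputs(dvc_files: dict[str, list[str]]) -> list[str]:
    """Return one line per output, tagged with the DVC-file that declares it.

    `dvc_files` maps a DVC-file path to the output paths it declares,
    written relative to that DVC-file's directory (hypothetical input shape).
    """
    lines = []
    for dvc_file in sorted(dvc_files):
        base = posixpath.dirname(dvc_file)
        for out in dvc_files[dvc_file]:
            # Resolve the output against the DVC-file's own directory.
            path = posixpath.normpath(posixpath.join(base, out))
            lines.append(f"{path}\t(from {dvc_file})")
    return lines
```

Calling it with `{"scripts/innosetup/dvc.ico.dvc": ["dvc.ico"]}` yields a single tab-separated line in the shape shown above.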

UPDATE: Thought of yet another 😓 name for the command above: dvc stage list --outs

@shcheklein
Member

@jorgeorpinel I think showing the full project in an ls-way is just more natural as opposed to creating our own type of output. There are a few benefits:

  1. You don't have to use two interfaces to see the full picture - Github + dvc list. Instead you just use dvc and see the workspace. And you can filter it if needed.
  2. It looks like it's beneficial for dvc get to handle regular Git files. Why not? It can be useful.
  3. Like I mentioned - single place in CLI, no need to go to Github to get the full picture.
  4. Just easier to understand since people are familiar with ls and intuitively can expect the result.

The idea is that by default it's not recursive, it's not limited to outputs only. You go down on your own if you need by clarifying path - the same way you do with ls, aws ls, etc.

ls, aws ls, etc - they are all not recursive by default for a reason. In a real case the output can be huge and just won't make sense for you. People tend to go down level by level, or use recursive option when it's exactly clear that it's what they need.
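
To make the non-recursive default concrete, here is a small sketch that works on a flat list of repository paths. The function name `list_dir` and its parameters are hypothetical, and the paths in the usage note are made up for illustration.

```python
def list_dir(paths: list[str], target: str = "", recursive: bool = False) -> list[str]:
    """List entries under `target` from a flat list of repository paths.

    Non-recursive mode collapses everything below the first level into a
    single `name/` entry, the way `ls` shows a directory without entering it.
    """
    prefix = target.strip("/")
    prefix = prefix + "/" if prefix else ""
    entries = set()
    for path in paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        if recursive or "/" not in rest:
            # A file at this level, or any file when expanding recursively.
            entries.add(prefix + rest)
        else:
            # Something deeper: show only its first-level directory.
            entries.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(entries)
```

With paths such as `README.md` and `tutorial/nlp/Posts.xml.zip.dvc`, the default call shows `README.md` and `tutorial/`. Passing `target="tutorial"` goes one level down, and `recursive=True` expands everything below the target.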

I really don't like making piping and less and complex interfaces part of the tool. You can always use less if it's needed.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 19, 2019

@shcheklein I concur only with benefit #1, so I can agree with showing all files but with an optional flag (which can be developed later, with less priority), not as the default behavior. Thus I wouldn't call the command dvc ls – since it's not that similar to GNU ls. I would vote for dvc outs or dvc list.

  2. It looks like it's beneficial for dvc get to handle regular Git files. Why not?

Because it can already be done with git archive (as explained in this SO answer).

3... single place in CLI, no need to go to Github to get the full picture.

The "full picture" could still be achieved from CLI by separately using git ls-tree.

The idea is that by default it's not recursive... You go down on your own if you need by clarifying path...

I can also agree with this: Non-recursive by default sounds easier to implement. So, it would definitely also need an optional dir (path) argument and a --recursive option.

I really don't like making piping and less and complex interfaces part of the tool.

Also agree. I just listed it as an alternative.


Anyway, unless anyone else has relevant comments, I suggest Ivan decides the spec for this new command based on all the comments above, so we can get it onto a dev sprint.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 20, 2019

# ...list of data files, along with the corresponding `.dvc` file
dvc get <url> list

This would actually need to be dvc get list [url] to adhere to DVC syntax but no, we're talking about a new, separate command (dvc list), not a new subcommand for dvc get. (It also affects dvc import, for example.)

Also I think we've established we want the list to be of all the regular files along with "data files" (outputs and their DVC-files), not just the latter.

dvc get <url> list --show-hashes
...
dvc get <url> list --show-download-url

Please open a separate issue to decide on adding new options to dvc get @dashohoxha.

# limit listing only to certain DVC-files
dvc get <url> list <file1.dvc> <file2.dvc>

This could be useful actually. Optionally specifying target DVC-files to the dvc list command could be an alternative for limiting the number of results.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 20, 2019

OK I hid some resolved comments and updated the proposed command spec in #2509 (comment) based on all the good feedback. Does it look good to get started on?

@shcheklein
Member

@jorgeorpinel sounds good to me, let's move the spec to the initial message?

Also, it would be great to specify what the output looks like in different cases. Again, I suggest it behave the same way as if the Git repo were a bucket you are listing files in with aws s3 ls. Maybe it would be helpful to come up with a few different outputs and compare them.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Sep 20, 2019

OK spec moved to initial issue comment.

specify how does output look like in different cases...

I also think copying the aws s3 ls output is a good starting point.

Hypothetical example outputs

Using file system paths as url:

See full example outputs in dvc list proposal: output examples Gist.

$ cd ~
$ dvc list  # Default url = .
ERROR: The current working directory doesn't seem to be part of a DVC project.

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc list .
INFO: Listing LOCAL DVC project files, directories, and data at
      /home/uname/example-get-started/

 17B 2019-09-20 .gitignore
6.0K 2019-09-20 README.md
...
339B 2019-09-20 train.dvc
5.8M            └ out: (model.pkl)
...

$ dvc pull featurize.dvc
$ dvc list featurize.dvc  # With target DVC-file
...
INFO: Limiting list to data outputs from featurize.dvc stage.

367B 2019-09-20 featurize.dvc
2.7M            └ out: data/features/test.pkl
 11M            └ out: data/features/train.pkl

NOTE: The latter case brings up several questions about how to display outputs located in a different dir than the target DVC-file, especially vs. using that location as the target instead. In this case I listed them even without --recursive.

Note that the .dvc/ dir is omitted from the output. Also, the dates above come from the file system, same as ls. In the following examples, they come from git history.
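
The row layout in these hypothetical outputs (right-aligned size, a date column left blank for outputs, then the name) could be sketched as follows. Both `human_size` and `format_row` are made-up helper names, and the rounding rule is an assumption, since the proposal does not say how sizes are rounded.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count compactly, e.g. 17 -> '17B', 6144 -> '6.0K'."""
    if num_bytes < 1024:
        return f"{num_bytes}B"
    value = float(num_bytes)
    for unit in "KMGTP":
        value /= 1024
        # Below 10 keep one decimal; otherwise show a whole number.
        text = f"{value:.1f}" if value < 9.95 else f"{value:.0f}"
        if float(text) < 1024:
            return text + unit
    return text + unit


def format_row(size: int, name: str, date: str = "", is_output: bool = False) -> str:
    """Lay out one listing row: right-aligned size, date column, name."""
    # Outputs are drawn under their DVC-file and carry no date of their own.
    label = f"└ out: {name}" if is_output else name
    return f"{human_size(size):>4} {date:<10} {label}"
```

For example, `format_row(339, "train.dvc", "2019-09-20")` gives a row in the same shape as the `train.dvc` line above, and passing `is_output=True` with no date produces the indented `└ out:` form.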

Using network Git urls:

See full example outputs in dvc list proposal: output examples Gist.

$ dvc list git@github.com:iterative/example-get-started.git  # SSH URL
 17B 2019-09-03 .gitignore
...
339B 2019-09-03 train.dvc
5.8M            └ out: model.pkl

$ dvc list https://github.com/iterative/dataset-registry  # HTTP URL
1.9K 2019-08-27 README.md
160B 2019-08-27 get-started/
128B 2019-08-27 tutorial/

$ dvc list --recursive https://github.com/iterative/dataset-registry tutorial  # Recursive inside target dir
...
INFO: Expanding list recursively.

 29B 2019-08-29 tutorial/nlp/.gitignore
178B 2019-08-29 tutorial/nlp/Posts.xml.zip.dvc
 10M            └ out: tutorial/nlp/Posts.xml.zip
177B 2019-08-29 tutorial/nlp/pipeline.zip.dvc
4.6K            └ out: tutorial/nlp/pipeline.zip
...

NOTE: Another question is whether outputs having no date is OK (just associated to the date of their DVC-files) or whether we should also get that from the default remote. Or maybe we don't need dates at all...
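
For the "dates come from Git history" case, one possible way to look up the date of a file's last commit is to ask `git log` for it. This is only a sketch of the idea, not how DVC does it: the helper names are hypothetical, and it assumes a local clone and a `git` executable on the PATH.

```python
import subprocess


def last_commit_date_cmd(path: str, rev: str = "HEAD") -> list[str]:
    """Build the git command that prints the last commit date for `path`."""
    return ["git", "log", "-1", "--format=%cd", "--date=short", rev, "--", path]


def last_commit_date(repo_dir: str, path: str, rev: str = "HEAD") -> str:
    """Return the YYYY-MM-DD date of the last commit touching `path`.

    Returns an empty string when no commit at `rev` touches the path.
    """
    result = subprocess.run(
        last_commit_date_cmd(path, rev),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Since outputs are not tracked by Git, this only answers the question for DVC-files and regular files, which matches the outputs above having no date of their own.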

Going through these made me realize this command can easily get very complicated, so all feedback is welcome to try and simplify it as much as possible, to a point where it still resolves the main problem (listing project outputs in order to know what's available for dvc get and dvc import) but doesn't explode in complexity.

@shcheklein
Member

@jorgeorpinel this is great! can you make it as a gist or a commit - so that we can leave some comments line by line?

@jorgeorpinel
Contributor Author

@shcheklein
Member

shcheklein commented Sep 25, 2019

@jorgeorpinel I can't comment on a specific line in them. I wonder if there is a better tool for this? Maybe create a temporary PR with these files to dvc.org?

@efiop
Contributor

efiop commented Oct 1, 2019

As a first step, we could simply print lines with all outputs. E.g.

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_up.bmp
scripts/innosetup/dvc_left.bmp

and then move on to polishing.

gurobokum added a commit to gurobokum/dvc that referenced this issue Feb 12, 2020
efiop pushed a commit that referenced this issue Feb 12, 2020
@efiop
Contributor

efiop commented Feb 12, 2020

Reopening since there are some details that we need to follow up on.

@jorgeorpinel
Contributor Author

@efiop but we have #3381 for that. Maybe OK to close this one?

@efiop
Contributor

efiop commented Mar 15, 2020

@jorgeorpinel Ok, let's close then 🙂 Thanks for the heads up!

7 participants