Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get: show url to cache #2994

Closed
dmpetrov opened this issue Dec 22, 2019 · 26 comments · Fixed by #3156
Closed

get: show url to cache #2994

dmpetrov opened this issue Dec 22, 2019 · 26 comments · Fixed by #3156
Assignees
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature p2-medium Medium priority, should be done, but less important product: VSCode Integration with VSCode extension

Comments

@dmpetrov
Copy link
Member

In some cases, users need a direct link to a data file in the cloud. It might be just a matter of convenience or when DVC is used from automation tools like CD4ML scenarios. We need to provide a way to get a direct link.

I suggest creating a new command dvc resolve. Other options can be also be considered. For example, it might be a part of dvc status (when it start supporting data files as argument).

$ dvc resolve file.txt df.csv
file.txt s3://dvc-temp/dataset1/2d10f00a05f1fe70eaa3c42aa9f44b95
df.csv s3://dvc-temp/dataset1/99370a5386cac00c93c9fdc836076a7d

It should support revisions (checksum, tag, branch or HEAD^^):

$ dvc resolve file.txt df.csv --rev ada9d973
file.txt s3://dvc-temp/dataset1/d30abac6950583e6a5f1dcd31cb1043b
df.csv gs://project42/cache/2290babda371e52eeca2a2065a358783

Note, the commands with revision options should resolve old remotes as well (see s3:// and gs:// above).

@dmpetrov dmpetrov added feature request Requesting a new feature product: VSCode Integration with VSCode extension and removed product: VSCode Integration with VSCode extension labels Dec 22, 2019
@efiop efiop added the p2-medium Medium priority, should be done, but less important label Dec 22, 2019
@efiop
Copy link
Contributor

efiop commented Dec 22, 2019

@dmpetrov Sounds like some automation helper, that can be worked around by simply reading the dvc file, extracting the checksum and then accessing the remote. Sounds like that wouldn't be a problem for anyone needing this for some automation script, not sure it is useful enough for many users though.

@dmpetrov
Copy link
Member Author

Sounds like some automation helper, that can be worked around by simply reading the dvc file

This feature is required in automation scenarios where people prefer to write bash code. Also, in the case of using revisions, it won't be as simple to implement. A single command - would be really great to have.

In terms of convenience, if you ask a user to do that in a manual way - it requires quite deep DVC knowledge and it is easy to make a mistake. The user should do 4 operations: find the corresponding dvc-file, find a right checksum, find default remote in config, combine them. 2x more complex for old revisions - file copies, need to know get command.

This command will hide a lot of internals and will simplify the usage. It is about lowering the bar.

@efiop efiop added the discussion requires active participation to reach a conclusion label Dec 24, 2019
@Suor
Copy link
Contributor

Suor commented Dec 24, 2019

This issue has a remote config complication - do we use a remote from config from specified rev or current one? Both might be useful:

  • use current if there was no default remote back then or if we changed default remote and moved all the data
  • use the one from rev if we intentionally use different default remotes in e.g. different branches

@jorgeorpinel
Copy link
Contributor

I think this already exists in dvc.api.get_url so may be a matter of wrapping it in a command 🙂 (and figuring the command arguments and their logic, etc.)

@efiop
Copy link
Contributor

efiop commented Jan 14, 2020

Change of plans. Talked with @dmpetrov @shcheklein , and they suggested that --show-url(or something similar) for dvc get would be better than introducing a separate command. Let's consider that instead.

Also, what to show for directories?

@efiop efiop changed the title dvc resolve: link to data-file source get: show url to cache Jan 14, 2020
@skshetry
Copy link
Member

@efiop, maybe I am missing something, but as per docs, get command downloads a file/directory from a DVC and Git repository. And, adding another responsibility to get, i.e. getting path/URL to the files in the cache, which in my view, is completely different from previous responsibility.

@shcheklein
Copy link
Member

@skshetry good point! Let me clarify the way I see it. The reasons behind this decision to piggy-back on dvc get for this feature is to avoid introducing a new global command (dvc resolve) + it feels like a "dry-run" dvc get (show an URL it would download, but do not actually download it), especially considering that they have a common interface - URL + path.

Btw, do you have a better place in mind to implement this?

@skshetry
Copy link
Member

@shcheklein, my argument is for Unix philosophy, and have a better interface and not become something similar to git checkout.

git-checkout - Switch branches or restore working tree files

But, my opinion is based on the discussions here on the issue. I'll check the implementation to see what get_url() really does and will make up my mind.

@shcheklein
Copy link
Member

my argument is for Unix philosophy, and have a better interface and not become something similar to git checkout

💯, but we should be reasonable about adding new top level commands. Otherwise we will become like git from a different angle - https://git-man-page-generator.lokaltog.net/

I'll check the implementation to see what get_url() really does and will make up my mind.

yes, please take a look. Open to other options.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 15, 2020

Will dvc import also have the same flag? I also think these commands already do too many things tbh. And won't it be confusing that dvc list will show paths inside the repo url but when you actually run dvc get/import --show-url on them, the URL is to the remote instead?

What about adding the option in the new dvc list command instead? See #2509
This would give people the alternative to download their tools of choice e.g. wget or aws s3 cp

@shcheklein
Copy link
Member

@jorgeorpinel dvc list is a good candidate! (we can consider migrating this option from get when it's ready).

@jorgeorpinel
Copy link
Contributor

OK. Another Q: Will this new option also say whether the file actually exists in that URL? Related to #3155

@shcheklein
Copy link
Member

@jorgeorpinel no, that's out of scope for now, I think.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 15, 2020

Yeah makes sense. I'm going to add a note in the dvc.api.get_url docs to emphasize that the existence of the file is not implied by having the URL and that users should keep that in mind when they use the URL (wrap in try/catch statement). So the same note would apply to this option when we write its docs.

@efiop
Copy link
Contributor

efiop commented Jan 17, 2020

Also, what to show for directories?

Since we are not able to access urls, we can't really parse and show directories. We will only show url to .dir file that can then be downloaded and parsed by the user.

Also, git files will be handled separately when we add support for them to api itself.

@skshetry
Copy link
Member

@efiop, opened #3182 for further discussion regarding directories.

@Suor
Copy link
Contributor

Suor commented Jan 18, 2020

I agree with @skshetry here - an option that completely hijacks what the command does is confusing. Are we trying to beat git in bad command line UI? ;)

@shcheklein
Copy link
Member

@Suor mention alternatives? what do you think about other options mentioned above? (please, let's be constructive a little bit 🙏).

@Suor
Copy link
Contributor

Suor commented Jan 19, 2020

@shcheklein dvc show-url <repo-url> <path>

@efiop
Copy link
Contributor

efiop commented Jan 19, 2020

@Suor Suffers from the same issue of dedicating a full command to this functionality, same as dvc resolve. Let's keep as is for now.

@Suor
Copy link
Contributor

Suor commented Jan 19, 2020

I don't understand why is that the issue.

@shcheklein
Copy link
Member

@Suor every command has a lot of boilerplate (code, docs, etc) that we would need to support - this is not nice. This approach is not scalable in terms of even reading the list of commands (we can't do 100 commands, for example).

@ghost
Copy link

ghost commented Jan 19, 2020

@shcheklein, why not trying to reduce the boilerplate instead) ?

@shcheklein
Copy link
Member

why not trying to reduce the boilerplate instead

sure, but it's never zero anyway right? )

@skshetry
Copy link
Member

skshetry commented Jan 19, 2020

I'd like to add another point. If it's just to show URL, it does not make much sense to introduce a new command if it will be going to a new home anyway (see: #2994 (comment)) and it makes more sense there imo.

And, i think, it'll be problematic to deprecate a global command compared to an option?

EDIT: Simplified a bit.

@shcheklein
Copy link
Member

And, i think, it'll be problematic to deprecate a global command compared to an option?

💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature p2-medium Medium priority, should be done, but less important product: VSCode Integration with VSCode extension
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants