Local config user respected in "dvc pull" but not "dvc list" #4604

Closed
Erotemic opened this issue Sep 23, 2020 · 8 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments

@Erotemic
Contributor

Bug Report

My computer username (joncrall) differs from the username I use to access the dvc remote cache. In my local config I've set the remote's user to my remote username (jon.crall).

Running:

dvc list -v [email protected]/myrepo.git path/to/data

I get:

2020-09-23 16:26:45,710 DEBUG: Collecting information from remote cache...                                                            
2020-09-23 16:26:45,712 DEBUG: Establishing ssh connection with '<internal-remote-store>' through port '22' as user 'joncrall'                       
2020-09-23 16:26:46,412 ERROR: unexpected error - No existing session                                                                 

followed by a long paramiko.ssh_exception.SSHException traceback. The failure ultimately stems from dvc list not picking up the username in my local config, which looks like this:

['remote "public-storage"']
    user = jon.crall

For reference my normal dvc/config looks like:

[cache]
    type = "reflink,symlink,copy"
[core]
    remote = public-storage
['remote "public-storage"']
    url = ssh://an/internal/url/dvc-caches/public

However, if I run something like dvc pull -v, it works:

2020-09-23 16:31:15,614 DEBUG: Establishing ssh connection with '<internal-remote-store>' through port '22' as user 'jon.crall'                      

Also, if I modify my global $HOME/.cache/dvc/config to include the custom "user" it works.

Please provide information about your setup

Output of dvc version:

DVC version: 1.7.4+e1b344.mod 
---------------------------------
Platform: Python 3.8.2 on Linux-5.4.0-47-generic-x86_64-with-glibc2.10
Supports: http, https, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdd1
Workspace directory: ext4 on /dev/sdd1
Repo: dvc, git

I've also tested on pypi latest of 1.7.4

I'm currently trying to trace where local credentials are loaded in the pull command and see if I can port that to the list command, but I figured I'd open an issue and check whether this is a known bug first.

@Erotemic
Contributor Author

After diving into the source code, I think I found why this occurs. The list command, which corresponds to dvc.repo.Repo.ls, is a static method and assumes that it's not running within an existing repo. In the CLI, the CmdList object inherits from CmdBaseNoRepo, whereas the pull command CmdDataPull inherits from CmdDataBase, which seems to populate itself with information about the local repo that list does not have.

I think a fix to this issue would involve having some way to run list as if it was in a repo while being able to run it outside of a repo as well.
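To make the idea concrete, here is a minimal sketch of a lookup that works both inside and outside a repo. It is not DVC's code: `resolve_ssh_user` and the nested-dict config layout are hypothetical stand-ins, and only the user names and the remote name come from the report above.

```python
from typing import Optional

# Hypothetical stand-in for a repo's parsed local config; not DVC's real data model.
LocalConfig = dict


def resolve_ssh_user(
    system_user: str,
    remote_name: str,
    local_config: Optional[LocalConfig] = None,
) -> str:
    """Pick the SSH user for a remote, preferring the local repo config if present."""
    if local_config is None:
        # "No repo" mode: nothing to consult, so fall back to the OS user.
        return system_user
    remote = local_config.get("remote", {}).get(remote_name, {})
    return remote.get("user", system_user)


local_config = {"remote": {"public-storage": {"user": "jon.crall"}}}

# Behaviour reported for `dvc list`: the local config is never consulted.
no_repo_user = resolve_ssh_user("joncrall", "public-storage")
# Behaviour reported for `dvc pull`: the local config overrides the OS user.
in_repo_user = resolve_ssh_user("joncrall", "public-storage", local_config)

print(no_repo_user, in_repo_user)  # joncrall jon.crall
```

The optional argument is the whole point of the sketch: the same code path serves both cases, and the caller decides whether a surrounding repo exists.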

Also, the implementation of list currently seems a bit wasteful. If you already have the repo checked out, why do we need to create a temporary external repo to run the command? Sure, maybe you want to run ls on a URL, but when you want to run list on your local repo (which should be a read-only operation), it would be much faster to avoid creating the external temp repo.

@pmrowla pmrowla added the A: data-sync Related to dvc get/fetch/import/pull/push label Jul 26, 2022
@rick-van-veen

Sorry if I misunderstand. This issue was moved to done 1.5 years ago, but it is still a problem (#8016). What is the current status and plan for this issue?

@pmrowla
Contributor

pmrowla commented Aug 8, 2022

@rick-van-veen the issue is still open and there is currently no estimate on when the core team will be able to get to it.

The in-progress/done status messages are just auto-generated from the github project boards, and they mean that some planned work related to this issue (whether it was on internal prerequisites or separate but related issues) was done in the given sprint.

@rick-van-veen

@pmrowla Thanks!

Just want to add my +1 for this issue with this comment then :-)

@rick-van-veen

rick-van-veen commented Aug 31, 2022

My workaround for this issue (details here: #8016) is to add my registry as a submodule and then dvc import the local repo (after adding the config.local in the submodule; otherwise it doesn't work). However, I ran into issues when queueing my experiments, related to #7186.

@sisp
Contributor

sisp commented Nov 10, 2022

This issue is a major blocker for us because it makes DVC useless for managing data in private data registries. See #8544.

This issue has had almost no activity since its creation more than 2 years ago. Any chance it gets prioritized soon?

@sisp
Contributor

sisp commented Nov 10, 2022

To elaborate on the blocker regarding private data registries a bit:

In the DVC documentation on dvc import about chained imports, it says at the very end:

The default DVC remotes for all repos in the import chain must also be accessible (repo C needs to have all the appropriate credentials).

We're in the process of building private data registries (A) on which a downstream project (B) can depend via dvc import. As the DVC documentation states correctly, a downstream project (B) needs to have the credentials for accessing the DVC remotes of the imported data from the upstream data registry (A). As it turns out, dvc import (or later dvc pull with .dvc files that reference an upstream data registry) fails to access the upstream DVC remote because DVC doesn't use the credentials provided in the downstream project (B).

This behavior differs from what is described in the documentation and prevents importing data via dvc import from a private data registry.

I believe DVC should merge the configs while pulling (and perhaps other operations like dvc list/dvc get/...), so that the config of A gets merged with the config of B, which contains the credentials for accessing the DVC remote of A. I'd be happy to contribute a fix, but it would certainly help to get a rough overview of the necessary changes as I'm not deeply familiar with the DVC code base.
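As a rough illustration of the proposed merge (not DVC's actual implementation), the sketch below overlays the downstream config of B onto the upstream config of A. The `merge_config` helper and the nested-dict layout are assumptions made for this example; the remote name, URL, and user are taken from the original report.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where `override` is recursively layered on top of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides have a section here: merge key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and new sections from the override win outright.
            merged[key] = deepcopy(value)
    return merged


# Config committed in the upstream registry (A): remote URL, no credentials.
upstream_a = {
    "core": {"remote": "public-storage"},
    "remote": {"public-storage": {"url": "ssh://an/internal/url/dvc-caches/public"}},
}
# Config held by the downstream project (B): credentials only.
downstream_b = {"remote": {"public-storage": {"user": "jon.crall"}}}

effective = merge_config(upstream_a, downstream_b)
print(effective["remote"]["public-storage"])
```

Returning a new dict rather than mutating `upstream_a` keeps the upstream config untouched, so the credentials of B never leak back into the config object of A.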

@dberenbaum dberenbaum added this to DVC Nov 21, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Nov 21, 2022
@dberenbaum dberenbaum removed this from DVC Nov 21, 2022
@efiop
Contributor

efiop commented Oct 13, 2023

list now supports --config and --remote-config (see https://dvc.org/doc/command-reference/list), which should be enough to work around this. I don't think it would be expected behaviour for dvc list to automatically pick up local config like that based on remote name, and probably a better mechanism for that would be #9922
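A sketch of how the workaround might be scripted for the original report. It assumes `--remote-config` accepts `name=value` pairs as described in the linked command reference; `build_list_command` is a hypothetical helper, and `<repo-url>` is a placeholder for the repository URL, which is not recoverable from the report.

```python
import shlex
from typing import Dict, List


def build_list_command(repo_url: str, path: str, remote_config: Dict[str, str]) -> List[str]:
    """Build an argv list for `dvc list` with per-invocation remote overrides."""
    cmd = ["dvc", "list", repo_url, path]
    if remote_config:
        # Assumption: the flag takes one or more name=value pairs.
        cmd.append("--remote-config")
        cmd.extend(f"{name}={value}" for name, value in remote_config.items())
    return cmd


cmd = build_list_command("<repo-url>", "path/to/data", {"user": "jon.crall"})
print(shlex.join(cmd))  # shell-quoted form, for copy/paste
# To actually run it: subprocess.run(cmd, check=True)
```

The flag is placed after the positional arguments so that a multi-value option cannot swallow the URL or path; check the command reference for the exact syntax before relying on this.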

@efiop efiop closed this as completed Oct 13, 2023

6 participants