get/list/import: not working with company github + webdavs storage #8016

Closed
rick-van-veen opened this issue Jul 14, 2022 · 2 comments
Labels
fs: webdav Related to the Webdav filesystem

Comments

@rick-van-veen

rick-van-veen commented Jul 14, 2022

Bug Report

get/list/import: not working with (company) github + webdavs storage.

See also the discussion here.

Description

I'm trying to import data from a data registry (URL: [email protected]:company/data-registry.git). The data registry stores its data on a remote server accessible over WebDAVS (URL: webdavs://company.com/webdav/data-registry-storage). To access this server, I created a .dvc/config.local file (using the dvc command) containing the username and password. I added the data with dvc add and pushed it to the remote server.

In a second repo, where I want to run my experiments, I want to import a dataset from the data registry. The dataset is a folder, and I want to put it inside data/:

dvc import [email protected]:company/data-registry.git dataset -o data

I have added the exact same config and config.local to this second repo, but I get the following error:

Importing 'dataset ([email protected]:company/data-registry.git)' -> 'data/dataset'
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                                                                                         
name: None, md5: 218ff9c6f73767c8126b66f7213cd0d2.dir
ERROR: unexpected error - received 401 (Unauthorized): Client error '401 Unauthorized' for url 'https://company.com/webdav/data-registry-storage/21/8ff9c6f73767c8126b66f7213cd0d2.dir'
For more information check: https://httpstatuses.com/401

First, the warning is interesting, but it may just be a consequence of the error, because the path /21/8ff9c6f73767c8126b66f7213cd0d2.dir does exist on the remote storage.

Secondly, I know I am authorized: that is how I got the data onto the remote in the first place.

Interestingly, when I get/list/import the data by pointing at the local repo instead, it does work.

Reproduce

Folder structure: repos/repo1, repos/repo2.

Repo 1 (Data registry)

git init
dvc init
mkdir dataset
touch dataset/data.txt
dvc add dataset

git add dataset.dvc

git remote add git-remote [email protected]:company/data-registry.git

dvc remote add webdavs-remote webdavs://company.com/webdav/data-registry-storage
dvc remote --local modify webdavs-remote username <username>
dvc remote --local modify webdavs-remote password <password>

git add .dvc/config 
git commit -m "Setup data registry containing dataset and config with webdavs remote"

dvc push
git push

Repo 2 (Experiments)

git init
dvc init
cp ../repo1/.dvc/config .dvc/config
cp ../repo1/.dvc/config.local .dvc/config.local

git add .dvc/config 
git commit -m "Setup experiments repo"

mkdir data

Fails: With the error in the description (Unauthorized 401)

dvc import [email protected]:company/data-registry.git dataset -o data

FYI 218ff9c6f73767c8126b66f7213cd0d2.dir is the md5 hash in the dataset.dvc
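
Comparing this hash with the URL in the 401 error suggests how the requested path is formed: the first two characters of the hash become a directory, the rest becomes the file name, and the webdavs:// scheme is requested over https://. The sketch below only reproduces that observed layout. The function name `remote_object_url` is hypothetical and this is not DVC's own code; it can help when checking by hand that the object really exists on the server.

```python
def remote_object_url(remote_url: str, md5: str) -> str:
    """Rebuild the object URL seen in the error message from a remote URL and a hash.

    Assumption: the layout is <remote>/<first 2 hash chars>/<rest of hash>,
    as observed in the 401 error above. This is an illustration, not DVC's API.
    """
    # Map the WebDAV schemes onto the HTTP scheme that is actually requested.
    schemes = {"webdavs://": "https://", "webdav://": "http://"}
    for dav_scheme, http_scheme in schemes.items():
        if remote_url.startswith(dav_scheme):
            base = http_scheme + remote_url[len(dav_scheme):]
            break
    else:
        raise ValueError(f"not a WebDAV URL: {remote_url}")
    # Split the hash into a two-character directory and the remaining file name.
    return f"{base.rstrip('/')}/{md5[:2]}/{md5[2:]}"


url = remote_object_url(
    "webdavs://company.com/webdav/data-registry-storage",
    "218ff9c6f73767c8126b66f7213cd0d2.dir",
)
print(url)
```

The printed URL matches the one in the error message, so the request targets the right object and the failure is in authentication rather than in the path.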

Works:

dvc import ../repo1 dataset -o data

The same holds for get and list, and possibly other commands.

Expected

I expected no difference: both methods (local path and remote Git URL) should import the data.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Supports:
        azure (adlfs = 2022.7.0, knack = 0.9.0, azure-identity = 1.10.0),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.5.0),
        hdfs (fsspec = 2022.5.0, pyarrow = 8.0.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.0),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.0),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
        ssh (sshfs = 2022.6.0),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: webdavs
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

Please ask and I can edit/add.

EDIT

When the credentials are missing, the output of dvc status and dvc pull is not very informative. I cloned a fresh copy of the data registry (repo 1 above) and tried to download the data with dvc pull, which failed, and then ran dvc status --cloud. Interestingly, neither command fails with an "unauthorized" error. Both commands work fine after adding config.local again.

$ dvc status --cloud
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                                                                                          
name: None, md5: 218ff9c6f73767c8126b66f7213cd0d2.dir
Cache and remote 'dataremote' are in sync.
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                                                                                          
name: None, md5: 218ff9c6f73767c8126b66f7213cd0d2.dir
WARNING: No file hash info found for '/workspaces/project/dataset'. It won't be created.                                                                                           
1 file failed                                                                                                                                                                                              
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/workspaces/project/dataset
Is your cache up to date?
<https://error.dvc.org/missing-files>

karajan1001 added the fs: webdav (Related to the Webdav filesystem) label on Jul 14, 2022

@pmrowla
Contributor

pmrowla commented Jul 26, 2022

This is the same as #4604. Currently, the local config is not used when you provide a repo URL to get/list/import; it is only used when you provide a local path to a repo (as you already noted).
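
The behaviour described here can be pictured with a toy model of layered configuration. This is a sketch under assumptions, not DVC's implementation: the function `effective_remote_config` and the dictionaries are hypothetical, and the level names are used only as labels. The point it illustrates is that the credentials live only in the local level, so any code path that skips that level sends requests without them.

```python
def effective_remote_config(levels: dict, available: list) -> dict:
    """Merge config levels in order; later levels override earlier ones.

    Hypothetical helper for illustration only. `available` lists the levels
    that the current code path actually reads.
    """
    merged = {}
    for name in ("system", "global", "repo", "local"):
        if name in available:
            merged.update(levels.get(name, {}))
    return merged


# The URL is committed in .dvc/config; the credentials are only in .dvc/config.local.
levels = {
    "repo": {"url": "webdavs://company.com/webdav/data-registry-storage"},
    "local": {"username": "<username>", "password": "<password>"},
}

# Importing from a local path: the local level is read.
from_local_path = effective_remote_config(levels, ["repo", "local"])
# Importing from a repo URL: the local level is not used.
from_repo_url = effective_remote_config(levels, ["repo"])

print("password" in from_local_path)
print("password" in from_repo_url)
```

In this model the URL-based import still knows where the remote is but has no username or password, which is consistent with the server answering 401 for an object that does exist.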

@efiop
Contributor

efiop commented Jul 27, 2022

Closing in favor of #4604
