dvc import-url fails for folders in azure remote. #4693

Closed
EigenJT opened this issue Oct 9, 2020 · 4 comments · Fixed by #6238 or #6245
Labels: bug (Did we break something?), fs: azure (Related to the Azure filesystem), p2-medium (Medium priority, should be done, but less important), research, upstream (Issues which need to be resolved in an upstream dependency)


EigenJT commented Oct 9, 2020

Bug Report

I've been trying to import a URL from a remote Azure container using dvc import-url. It works for single files but fails for folders. Below is the output from setting up the DVC remote, importing a single file, and then failing to import a folder.

Please provide information about your setup

Output of dvc version:

DVC version: 1.8.1 (brew)
---------------------------------
Platform: Python 3.8.5 on macOS-10.15.6-x86_64-i386-64bit
Supports: azure, gdrive, gs, http, https, s3, ssh, oss, webdav, webdavs
Cache types: reflink, hardlink, symlink
Repo: dvc (no_scm)

Additional Information (if any):

Context: https://discord.com/channels/485586884165107732/485596304961962003/764182584883937312

~$ dvc remote add dvc_file_import_test azure://dvc-file-import-test/

~$ dvc remote modify --local dvc_file_import_test connection_string $remote_connection_string -v
2020-10-09 10:32:15,453 DEBUG: Writing '/Users/julien/dvc_file_import_test/.dvc/config.local'.
...

~$ dvc import-url remote://dvc_file_import_test/test_file.txt -v
2020-10-09 10:33:20,255 DEBUG: Check for update is enabled.
2020-10-09 10:33:20,257 DEBUG: fetched: [(3,)]
2020-10-09 10:33:20,582 DEBUG: Removing output 'test_file.txt' of stage: 'test_file.txt.dvc'.
Importing 'remote://dvc_file_import_test/test_file.txt' -> 'test_file.txt'
2020-10-09 10:33:20,583 DEBUG: Computed stage: 'test_file.txt.dvc' md5: 'ac01a9a397f504d0298fe8c7e1aef689'
2020-10-09 10:33:20,584 DEBUG: 'md5' of stage: 'test_file.txt.dvc' changed.
2020-10-09 10:33:20,584 DEBUG: URL azure://dvc-file-import-test/
2020-10-09 10:33:20,584 DEBUG: Using connection string 'remote_connection_string'
2020-10-09 10:33:20,592 DEBUG: Container name dvc-file-import-test
2020-10-09 10:33:20,908 DEBUG: Downloading 'azure://dvc-file-import-test/test_file.txt' to 'test_file.txt'
2020-10-09 10:33:21,065 DEBUG: Path '/Users/julien/dvc_file_import_test/test_file.txt' inode '8580528'
2020-10-09 10:33:21,066 DEBUG: fetched: []
2020-10-09 10:33:21,067 DEBUG: Path 'test_file.txt' inode '8580528'
2020-10-09 10:33:21,067 DEBUG: fetched: []
2020-10-09 10:33:21,067 DEBUG: {'test_file.txt': 'modified'}
2020-10-09 10:33:21,068 DEBUG: Path '/Users/julien/dvc_file_import_test/test_file.txt' inode '8580528'
2020-10-09 10:33:21,068 DEBUG: fetched: [('1602264800961265920', '2', 'b026324c6904b2a9cb4b88d6d61c81d1', '1602264801067587072')]
2020-10-09 10:33:21,068 DEBUG: Computed stage: 'test_file.txt.dvc' md5: 'e81fc216a3f58038ede5b08d93fdedf0'
2020-10-09 10:33:21,069 DEBUG: Saving 'test_file.txt' to '.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1'.
2020-10-09 10:33:21,070 DEBUG: Assuming '/Users/julien/dvc_file_import_test/.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1' is unchanged since it is read-only
2020-10-09 10:33:21,071 DEBUG: Created 'reflink': .dvc/cache/.cache_type_test_file -> .3GdfawJ8jYYwCRaeDNTSqh
2020-10-09 10:33:21,071 DEBUG: Removing '/Users/julien/dvc_file_import_test/.3GdfawJ8jYYwCRaeDNTSqh'
2020-10-09 10:33:21,071 DEBUG: Removing '/Users/julien/dvc_file_import_test/.dvc/cache/.cache_type_test_file'
2020-10-09 10:33:21,072 DEBUG: Removing '/Users/julien/dvc_file_import_test/test_file.txt'
2020-10-09 10:33:21,073 DEBUG: Created 'reflink': .dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1 -> test_file.txt
2020-10-09 10:33:21,073 DEBUG: Path 'test_file.txt' inode '8580531'
2020-10-09 10:33:21,073 DEBUG: Path 'test_file.txt' inode '8580531'
2020-10-09 10:33:21,073 DEBUG: fetched: []
2020-10-09 10:33:21,073 DEBUG: Path '.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1' inode '8580365'
2020-10-09 10:33:21,074 DEBUG: fetched: [('1602264561767203584', '2', 'b026324c6904b2a9cb4b88d6d61c81d1', '1602264561886681088')]
2020-10-09 10:33:21,075 DEBUG: Saving information to 'test_file.txt.dvc'.
2020-10-09 10:33:21,080 DEBUG: fetched: [(2,)]
2020-10-09 10:33:21,083 DEBUG: Analytics is enabled.
2020-10-09 10:33:21,263 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp6ctys6ja']'
2020-10-09 10:33:21,265 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp6ctys6ja']'

~$ dvc import-url remote://dvc_file_import_test/test_folder/ -v
2020-10-09 10:35:45,312 DEBUG: Check for update is enabled.
2020-10-09 10:35:45,314 DEBUG: fetched: [(3,)]
2020-10-09 10:35:45,634 DEBUG: Removing output 'test_folder' of stage: 'test_folder.dvc'.
Importing 'remote://dvc_file_import_test/test_folder/' -> 'test_folder'
2020-10-09 10:35:45,635 DEBUG: Computed stage: 'test_folder.dvc' md5: 'e60962bd5c5965cfb3b297a753187e5b'
2020-10-09 10:35:45,636 DEBUG: 'md5' of stage: 'test_folder.dvc' changed.
2020-10-09 10:35:45,636 DEBUG: URL azure://dvc-file-import-test/
2020-10-09 10:35:45,636 DEBUG: Using connection string 'remote_connection_string'
2020-10-09 10:35:45,643 DEBUG: Container name dvc-file-import-test
2020-10-09 10:35:45,858 DEBUG: fetched: [(4,)]
2020-10-09 10:35:45,859 ERROR: failed to import remote://dvc_file_import_test/test_folder/. You could also try downloading it manually, and adding it with `dvc add`. - dependency 'remote://dvc_file_import_test/test_folder/' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/command/imp_url.py", line 14, in run
    self.repo.imp_url(
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/__init__.py", line 51, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/imp_url.py", line 54, in imp_url
    stage.run()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/__init__.py", line 429, in run
    sync_import(self, dry, force)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/imports.py", line 29, in sync_import
    stage.save_deps()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/__init__.py", line 392, in save_deps
    dep.save()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/output/base.py", line 254, in save
    raise self.DoesNotExistError(self)
dvc.dependency.base.DependencyDoesNotExistError: dependency 'remote://dvc_file_import_test/test_folder/' does not exist
------------------------------------------------------------
2020-10-09 10:35:45,870 DEBUG: Analytics is enabled.
2020-10-09 10:35:46,049 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp3xozxwul']'
2020-10-09 10:35:46,051 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp3xozxwul']'

Trying without a trailing / as suggested by @jorgeorpinel:

~$ dvc import-url remote://dvc_file_import_test/test_folder -v
2020-10-09 10:57:29,184 DEBUG: Check for update is enabled.
2020-10-09 10:57:29,186 DEBUG: fetched: [(3,)]
2020-10-09 10:57:29,536 DEBUG: Removing output 'test_folder' of stage: 'test_folder.dvc'.
Importing 'remote://dvc_file_import_test/test_folder' -> 'test_folder'
2020-10-09 10:57:29,537 DEBUG: Computed stage: 'test_folder.dvc' md5: 'ead4610f755d1be0125823348c9bd4ae'
2020-10-09 10:57:29,537 DEBUG: 'md5' of stage: 'test_folder.dvc' changed.
2020-10-09 10:57:29,537 DEBUG: URL azure://dvc-file-import-test/
2020-10-09 10:57:29,537 DEBUG: Using connection string 'remote_connection_string'
2020-10-09 10:57:29,544 DEBUG: Container name dvc-file-import-test
2020-10-09 10:57:29,784 DEBUG: fetched: [(4,)]
2020-10-09 10:57:29,785 ERROR: failed to import remote://dvc_file_import_test/test_folder. You could also try downloading it manually, and adding it with `dvc add`. - dependency 'remote://dvc_file_import_test/test_folder' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/command/imp_url.py", line 14, in run
    self.repo.imp_url(
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/__init__.py", line 51, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/repo/imp_url.py", line 54, in imp_url
    stage.run()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/__init__.py", line 429, in run
    sync_import(self, dry, force)
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/imports.py", line 29, in sync_import
    stage.save_deps()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/stage/__init__.py", line 392, in save_deps
    dep.save()
  File "/usr/local/Cellar/dvc/1.8.1/libexec/lib/python3.8/site-packages/dvc/output/base.py", line 254, in save
    raise self.DoesNotExistError(self)
dvc.dependency.base.DependencyDoesNotExistError: dependency 'remote://dvc_file_import_test/test_folder' does not exist
------------------------------------------------------------
2020-10-09 10:57:29,790 DEBUG: Analytics is enabled.
2020-10-09 10:57:29,966 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp64qzqzwm']'
2020-10-09 10:57:29,968 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/v4/m3drx4jj7dgcg94kgkjwh6zc0000gp/T/tmp64qzqzwm']'

Confirming that the files are in the remote:

~$ az storage blob list --account-name remote_storage_account --container dvc-file-import-test
There are no credentials provided in your command and environment, we will query for the account key inside your storage account.
Please provide --connection-string, --account-key or --sas-token as credentials, or use `--auth-mode login` if you have required RBAC roles in your command. For more information about RBAC roles in storage, visit https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad-rbac-cli.
Setting the corresponding environment variables can avoid inputting credentials in your command. Please use --help to get more information.
[
  {
    "container": "dvc-file-import-test",
    "content": "",
    "deleted": null,
    "encryptedMetadata": null,
    "encryptionKeySha256": null,
    "encryptionScope": null,
    "isAppendBlobSealed": null,
    "isCurrentVersion": null,
    "metadata": {},
    "name": "test_file.txt", ----------------------------------------> This is the test file.
    "objectReplicationDestinationPolicy": null,
    "objectReplicationSourceProperties": [],
    "properties": {
      "appendBlobCommittedBlockCount": null,
      "blobTier": "Hot",
      "blobTierChangeTime": null,
      "blobTierInferred": true,
      "blobType": "BlockBlob",
      "contentLength": 2,
      "contentRange": null,
      "contentSettings": {
        "cacheControl": null,
        "contentDisposition": null,
        "contentEncoding": null,
        "contentLanguage": null,
        "contentMd5": "sCYyTGkEsqnLS4jW1hyB0Q==",
        "contentType": "text/plain"
      },
      "copy": {
        "completionTime": null,
        "destinationSnapshot": null,
        "id": null,
        "incrementalCopy": null,
        "progress": null,
        "source": null,
        "status": null,
        "statusDescription": null
      },
      "creationTime": "2020-10-09T17:28:37+00:00",
      "deletedTime": null,
      "etag": "0x8D86C78C1FE5BF7",
      "lastModified": "2020-10-09T17:28:37+00:00",
      "lease": {
        "duration": null,
        "state": "available",
        "status": "unlocked"
      },
      "pageBlobSequenceNumber": null,
      "pageRanges": null,
      "rehydrationStatus": null,
      "remainingRetentionDays": null,
      "serverEncrypted": true
    },
    "rehydratePriority": null,
    "requestServerEncrypted": null,
    "snapshot": null,
    "tagCount": null,
    "tags": null,
    "versionId": null
  },
  {
    "container": "dvc-file-import-test",
    "content": "",
    "deleted": null,
    "encryptedMetadata": null,
    "encryptionKeySha256": null,
    "encryptionScope": null,
    "isAppendBlobSealed": null,
    "isCurrentVersion": null,
    "metadata": {},
    "name": "test_folder/test.txt", -------------------------------------> This is a file in the test folder
    "objectReplicationDestinationPolicy": null,
    "objectReplicationSourceProperties": [],
    "properties": {
      "appendBlobCommittedBlockCount": null,
      "blobTier": "Hot",
      "blobTierChangeTime": null,
      "blobTierInferred": true,
      "blobType": "BlockBlob",
      "contentLength": 2,
      "contentRange": null,
      "contentSettings": {
        "cacheControl": null,
        "contentDisposition": null,
        "contentEncoding": null,
        "contentLanguage": null,
        "contentMd5": "HcyiM1UnIFbwT+i/IO384A==",
        "contentType": "text/plain"
      },
      "copy": {
        "completionTime": null,
        "destinationSnapshot": null,
        "id": null,
        "incrementalCopy": null,
        "progress": null,
        "source": null,
        "status": null,
        "statusDescription": null
      },
      "creationTime": "2020-10-09T17:28:23+00:00",
      "deletedTime": null,
      "etag": "0x8D86C78B942233A",
      "lastModified": "2020-10-09T17:28:23+00:00",
      "lease": {
        "duration": null,
        "state": "available",
        "status": "unlocked"
      },
      "pageBlobSequenceNumber": null,
      "pageRanges": null,
      "rehydrationStatus": null,
      "remainingRetentionDays": null,
      "serverEncrypted": true
    },
    "rehydratePriority": null,
    "requestServerEncrypted": null,
    "snapshot": null,
    "tagCount": null,
    "tags": null,
    "versionId": null
  },
  {
    "container": "dvc-file-import-test",
    "content": "",
    "deleted": null,
    "encryptedMetadata": null,
    "encryptionKeySha256": null,
    "encryptionScope": null,
    "isAppendBlobSealed": null,
    "isCurrentVersion": null,
    "metadata": {},
    "name": "test_folder/test_2.txt", ----------------------------------> This is the second file in the test folder
    "objectReplicationDestinationPolicy": null,
    "objectReplicationSourceProperties": [],
    "properties": {
      "appendBlobCommittedBlockCount": null,
      "blobTier": "Hot",
      "blobTierChangeTime": null,
      "blobTierInferred": true,
      "blobType": "BlockBlob",
      "contentLength": 2,
      "contentRange": null,
      "contentSettings": {
        "cacheControl": null,
        "contentDisposition": null,
        "contentEncoding": null,
        "contentLanguage": null,
        "contentMd5": "sCYyTGkEsqnLS4jW1hyB0Q==",
        "contentType": "text/plain"
      },
      "copy": {
        "completionTime": null,
        "destinationSnapshot": null,
        "id": null,
        "incrementalCopy": null,
        "progress": null,
        "source": null,
        "status": null,
        "statusDescription": null
      },
      "creationTime": "2020-10-09T17:28:23+00:00",
      "deletedTime": null,
      "etag": "0x8D86C78B93416F5",
      "lastModified": "2020-10-09T17:28:23+00:00",
      "lease": {
        "duration": null,
        "state": "available",
        "status": "unlocked"
      },
      "pageBlobSequenceNumber": null,
      "pageRanges": null,
      "rehydrationStatus": null,
      "remainingRetentionDays": null,
      "serverEncrypted": true
    },
    "rehydratePriority": null,
    "requestServerEncrypted": null,
    "snapshot": null,
    "tagCount": null,
    "tags": null,
    "versionId": null
  }
]
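
As a cross-check on the listing above: the `contentMd5` values that Azure reports are base64-encoded MD5 digests, while the DVC debug log prints MD5 hashes as hex. A minimal sketch of the conversion follows; the helper name `azure_md5_to_hex` is hypothetical and is not part of DVC or the Azure SDK.

```python
import base64


def azure_md5_to_hex(content_md5: str) -> str:
    """Convert a base64-encoded contentMd5 value to a hex digest.

    Hypothetical helper for illustration only.
    """
    return base64.b64decode(content_md5).hex()


# contentMd5 of test_file.txt, copied from the `az storage blob list` output.
print(azure_md5_to_hex("sCYyTGkEsqnLS4jW1hyB0Q=="))
```

For `test_file.txt` this gives `b026324c6904b2a9cb4b88d6d61c81d1`, the same hash that appears in the cache path `.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1` in the first log, so the single-file import fetched exactly what the container holds.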

@jorgeorpinel
Contributor

I suspect it's a bug and hopefully only with remote:// URL aliases. Please evaluate, engineering team 🙂

@pared
Contributor

pared commented Oct 10, 2020

For the record, original discussion

@efiop
Contributor

efiop commented Oct 14, 2020

Most likely we are lacking some logic for directories in AzureTree.exists; we need to take a closer look.
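
To illustrate the kind of logic that may be missing: unless a hierarchical namespace is enabled, Azure Blob Storage keeps blobs in a flat namespace, so `test_folder` is not itself a blob and only exists as a prefix of names such as `test_folder/test.txt`. The sketch below is not DVC's `AzureTree` code. It works on a plain list of blob names taken from the report, and the helpers `exists_exact` and `exists_with_dirs` are hypothetical names used only to contrast an exact-name check with one that also accepts a directory prefix.

```python
from typing import Iterable

# Blob names copied from the `az storage blob list` output in the report.
BLOBS = ["test_file.txt", "test_folder/test.txt", "test_folder/test_2.txt"]


def exists_exact(blob_names: Iterable[str], path: str) -> bool:
    """True only if a blob with exactly this name exists (hypothetical helper)."""
    return path in set(blob_names)


def exists_with_dirs(blob_names: Iterable[str], path: str) -> bool:
    """True for an exact blob name or for a 'directory' prefix (hypothetical helper)."""
    names = list(blob_names)
    key = path.rstrip("/")
    if key in names:
        return True
    # A directory exists if at least one blob name starts with "<key>/".
    prefix = key + "/"
    return any(name.startswith(prefix) for name in names)


print(exists_exact(BLOBS, "test_folder"))       # exact-name check
print(exists_with_dirs(BLOBS, "test_folder/"))  # prefix-aware check
```

An exact-name check reports `test_folder` as missing, with or without the trailing slash, which would be consistent with the `DependencyDoesNotExistError` above; whether this is the actual cause in DVC is an assumption.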

@efiop efiop added bug Did we break something? p2-medium Medium priority, should be done, but less important research labels Oct 14, 2020
@isidentical isidentical added the upstream Issues which need to be resolved in an upstream dependency label Mar 15, 2021
@efiop efiop added the fs: azure Related to the Azure filesystem label May 3, 2021
@isidentical
Contributor

This is finally fixed upstream in adlfs, so we only need to add a test and bump the adlfs requirement. See fsspec/adlfs#237.
