
add --external: fails using Azure remote #8759

Closed
rmlopes opened this issue Jan 4, 2023 · 24 comments
Labels
A: external-data (Related to external dependencies and outputs), feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important)

Comments

@rmlopes

rmlopes commented Jan 4, 2023

Bug Report

Description

I am trying to track existing data from a storage account in Azure following current documentation.

Reproduce

  1. dvc init
  2. dvc remote add azcore azure://core-container
  3. dvc remote add azdata azure://data-container
  4. dvc add --external remote://azdata/existing-data

Expected

I'm not sure what is expected, but the output is:

ERROR: unexpected error - : 'azure'

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.6 on macOS-13.1-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.3.1
	scmrepo = 0.1.4
Supports:
	azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: azure, azure
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information:

2023-01-04 18:58:46,616 ERROR: unexpected error - : 'azure'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/odbmgr.py", line 65, in __getattr__
    return self._odb[name]
KeyError: 'azure'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/commands/add.py", line 53, in run
    self.repo.add(
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/utils/collections.py", line 164, in inner
    result = func(*ba.args, **ba.kwargs)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 156, in run
    return method(repo, *args, **kw)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/add.py", line 190, in add
    stage.save(merge_versioned=True)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 469, in save
    self.save_outs(
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 512, in save_outs
    out.save()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/output.py", line 643, in save
    self.odb,
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/output.py", line 450, in odb
    odb = getattr(self.repo.odb, odb_name)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/odbmgr.py", line 67, in __getattr__
    raise AttributeError from exc
AttributeError
------------------------------------------------------------
2023-01-04 18:58:46,711 DEBUG: Version info for developers:
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.6 on macOS-13.1-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.3.1
	scmrepo = 0.1.4
Supports:
	azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure, azure
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-01-04 18:58:46,714 DEBUG: Analytics is enabled.
2023-01-04 18:58:46,911 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/st/05s6bkj55r9cw3hbrrdfvfqh0000gp/T/tmpoxhcmxev']'
2023-01-04 18:58:46,913 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/st/05s6bkj55r9cw3hbrrdfvfqh0000gp/T/tmpoxhcmxev']'
@karajan1001
Contributor

Hi @rmlopes, could you replace dvc add --external remote://azdata/existing-data with dvc add --external azure://azdata/existing-data and try it again?
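
For reference, a minimal sketch of the two forms using the remotes from the report above (note that an azure:// URL takes the container name, data-container here, rather than the remote name):

# form that currently fails with "unexpected error - : 'azure'"
dvc add --external remote://azdata/existing-data
# addressing the container URL directly
dvc add --external azure://data-container/existing-data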

@karajan1001 karajan1001 added the awaiting response we are waiting for your reply, please respond! :) label Jan 5, 2023
@rmlopes
Author

rmlopes commented Jan 5, 2023

Hi @karajan1001

If I replace remote:// with azure://, it cannot authenticate.

@karajan1001
Contributor

I looked into the docs and didn't find Azure support for external data; it looks like we haven't implemented it yet. So it's a new feature request instead of a bug? @dberenbaum

@dberenbaum
Collaborator

Yes, unfortunately Azure is not currently supported for external outputs. It has come up before, but I can't find an existing issue, so let's keep this one as a feature request.

@dberenbaum dberenbaum added feature request Requesting a new feature p2-medium Medium priority, should be done, but less important and removed awaiting response we are waiting for your reply, please respond! :) labels Jan 5, 2023
@rmlopes
Author

rmlopes commented Jan 5, 2023

Thanks for the clarification @karajan1001 and @dberenbaum. I guess none of what that documentation section covers supports Azure, including the external cache.

I think this may be loosely (or tightly, not sure) related to an old open issue of mine #5899, which has been linked to a feature request from CML as well.

@skshetry
Member

skshetry commented Jan 5, 2023

We can't support Azure external dependencies because, IIRC, not all blobs get Content-MD5 set.
See #3540 (comment) for previous attempt.

@rmlopes
Author

rmlopes commented Jan 5, 2023

It looks like there have been some changes in the Azure CLI over time related to setting the Content-MD5 on uploads (see for instance here and here).

Any ideas if this feature could be a possibility nowadays?

@efiop
Contributor

efiop commented Jan 5, 2023

External outputs are an old experimental feature that we consider broken as a scenario. I don't think we will attempt to implement support for Azure any time soon, only after rethinking this scenario fundamentally.

@rmlopes
Author

rmlopes commented Jan 9, 2023

I am not sure what you mean by "old experimental feature that we consider broken as a scenario" for external outputs, @efiop. But what about the external cache? It would be nice to have support for it in Azure and not only AWS.

@efiop
Contributor

efiop commented Jan 9, 2023

@rmlopes It is the same thing (https://dvc.org/doc/user-guide/data-management/managing-external-data#managing-external-data). It is a feature we developed a long time ago that is useful enough for us to keep for people that need it, but it is definitely something we don't usually recommend, and new users get confused by it and try to use it when they don't really need to. There are no plans to support Azure any time soon, because it simply wouldn't work as-is and because the scenario needs to be reworked.

Please feel free to elaborate on your particular scenario, maybe there is something we could suggest instead.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 9, 2023
@rmlopes
Author

rmlopes commented Jan 9, 2023

@efiop the reason I am insisting on this is that it seems to be the only viable scenario when you have very large data (it doesn't even need to be huge). For instance, when I posted #5899 we were working on a video analytics project: my local repo had grown to 100+ GiB of storage when the actual dataset was only about 20 GiB. In the current project I am dabbling with external data and dependencies yet again, because the dataset will grow and it would make sense to keep everything in storage instead of having local copies.

@shcheklein
Member

@rmlopes it's a very interesting scenario for us. I would love to better understand your needs. Even if we fix the external outputs/deps scenario for Azure, it'll have an overhead of doing these checks, an overhead of caching files (time + duplication), and an overlap if multiple people are using the same project.

Could you please describe your data lifecycle a bit better (who can add, update, delete files, etc.) and the number of files? What are you trying to achieve with DVC here, what kind of user scenarios and workflows?

I'm asking not to push back on the feature, but rather to try to find a workaround for the system together (if possible).

@rmlopes
Author

rmlopes commented Jan 10, 2023

@shcheklein I will describe my current use-case. The other one I mentioned before is archived and it will take some time to revive the info on it (but it is the case where the data is actually very large).

Use-Case: Integration with Azure Applied AI Services (Form Recognizer)

Notes on Form Recognizer:

  • We need a storage account with a folder for the training data (mostly PDF files).
  • Once the files are annotated, Form Recognizer will save the annotation files in the same folder (not configurable).

What we are trying to achieve:

  1. We have a storage folder for raw files from where we extract metadata, and a stage where we were trying to use import-url so that if a new file is added the metadata result is updated. This is processed locally so we can have the data locally as well (hence the import), but files must be added through the storage portal.
  2. We have a storage folder for the original files which are input to the models. We have a stage that will split these into train/val/test, thus we think we need an external dependency so that when a file is added we can re-run the split stage. There is no need to have these files locally as they are not processed locally (we use azure storage sdk), and again files must be added through the storage portal.
  3. Then, we want to track two things for the validation and/or testing stage: a local model registry (mostly a mapping from model type to modelId that will have to be updated manually, we will use a params file most probably) and the val/test folders so that if there are files added/removed we rerun the stage.

This is the essential part for us right now. Currently the number of files is low: 30 for metadata and 120 for the model input files. The first takes only 80 MB and is not expected to grow significantly. The second is actually smaller at only 40 MB but is expected to grow by at least maybe 20x (but as no processing is done locally there is no need to have a local copy).

Let me know if something is not clear or more info is needed.

@dberenbaum
Collaborator

@rmlopes Do you/could you enable blob versioning? We have been developing support in DVC for this in Azure, so that the versioning is handled natively in the cloud, in which case DVC would only serve to record the metadata in your repo.
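
If it helps, blob versioning can be switched on for the storage account with the Azure CLI; a rough sketch (the account and resource group names below are placeholders):

# enable blob versioning on the storage account
az storage account blob-service-properties update \
  --account-name mystorageaccount \
  --resource-group my-resource-group \
  --enable-versioning true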

@rmlopes
Author

rmlopes commented Jan 10, 2023

@dberenbaum yes, enabling blob versioning is a possibility (I don't see anything against it)

@daavoo daavoo removed the awaiting response we are waiting for your reply, please respond! :) label Jan 10, 2023
@rmlopes
Author

rmlopes commented Jan 11, 2023

@dberenbaum I am trying the new feature; I have set the remote with version_aware = true. I am using import-url for step 1 (see edited previous post), and this is OK since we need the files locally, hence we update (with download) and run the stage.
Now for step 2 I thought I could use import-url --no-download with update --no-download, so I get an update to the .dvc file, but how do I make my stage depend on this result (I cannot add the .dvc file as a dependency, and if I add the folder itself as a dependency it will download on repro)?
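
For context, a minimal sketch of the commands being described here (remote names follow the earlier examples; the metadata and model-input paths are made up):

# cloud-versioned remote
dvc remote modify azdata version_aware true

# step 1: import and keep a local copy
dvc import-url remote://azdata/metadata data/metadata
dvc update data/metadata.dvc

# step 2: track without downloading
dvc import-url --no-download remote://azdata/model-input data/model-input
dvc update --no-download data/model-input.dvc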

@shcheklein
Member

Thanks, @rmlopes! A few questions to clarify:

We have a stage that will split these into train/val/test, thus we think we need an external dependency so that when a file is added we can re-run the split stage. There is no need to have these files locally as they are not processed locally (we use azure storage sdk), and again files must be added through the storage portal.

Is it a DVC stage? Do you copy files from the initial location to some subfolders? What does it mean and why do you need to be able to add files into that split manually (if that data comes and is generated by a stage)?

mostly a mapping from model type to modelId

what is model id in your case?

BTW, how are you going to run this pipeline? On cron? Or will someone be running it on their machine?

@rmlopes
Author

rmlopes commented Jan 11, 2023

@shcheklein Thanks for the follow-up.

Is it a DVC stage? Do you copy files from the initial location to some subfolders?

It is a DVC stage where we copy files from an initial location (manually managed, it is a repository where someone - or an external service - may add files) into folders in the same container for train/val/test.

What does it mean and why do you need to be able to add files into that split manually (if that data comes and is generated by a stage)?

That data is not generated by a stage. Stage 1 is processing metadata files and the result is used during evaluation. Stage 2 is about splitting the files that will be input for the models during train/inference (not metadata files).

what is model id in your case?

The model ID will be a manual input when training a model in Azure Form Recognizer Studio (we are using custom models).

How are you going to run this pipeline? On cron? Or will someone be running it on their machine?

It will be run both by someone on their machine and by the CI/CD system.

@dberenbaum
Collaborator

That data is not generated by a stage. Stage 1 is processing metadata files and the result is used during evaluation. Stage 2 is about splitting the files that will be input for the models during train/inference (not metadata files).

So stage 1 imports and operates locally, and everything works there as expected? And stage 2 should happen entirely on Azure, and that's where you encounter problems?

@dberenbaum I am trying the new feature; I have set the remote with version_aware = true. I am using import-url for step 1 (see edited previous post), and this is OK since we need the files locally, hence we update (with download) and run the stage.
Now for step 2 I thought I could use import-url --no-download with update --no-download, so I get an update to the .dvc file, but how do I make my stage depend on this result (I cannot add the .dvc file as a dependency, and if I add the folder itself as a dependency it will download on repro)?

Sorry to get your hopes up @rmlopes. #8411 is in our backlog but not implemented yet. The hope is that external deps and outs will "just work" for you once this is implemented.

@shcheklein
Member

Thanks again, @rmlopes!

How about these options (see questions below though, maybe I still don't understand the whole picture):

If the input to stage 2 is immutable / append only (you don't remove / edit files) and the split happens based on metadata and the current set of files, then instead of actually saving (duplicating) the val/train/test folders, you can save the metadata and the list of files in the raw dataset at that moment. That should be enough for the stage to produce the same split as at the moment of building that version of the model. That would give you reproducibility.

The stage itself could use the az command to copy files into val/train/test.
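
As a rough illustration of that idea (the account, container, and file names below are made up), the copy could be driven from the saved file list with the Azure CLI:

# copy each file selected for training into the folder Form Recognizer expects
while read -r blob; do
  az storage blob copy start \
    --account-name mystorageaccount \
    --source-container data-container \
    --source-blob "$blob" \
    --destination-container data-container \
    --destination-blob "train/$(basename "$blob")"
done < train_list.txt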

Would that work for you?

In general, this approach is to my mind better for large datasets: ideally you don't manipulate / move the objects themselves - you manipulate metadata - the list of files, their params, etc. It requires making the storage immutable / append only, which is a good practice anyway.

Let me know if some of these ideas resonate with you or go in the right direction.


Some questions / suggestions to keep clarifying:

So stage 1 imports and operates locally, and everything works there as expected? And stage 2 should happen entirely on Azure, and that's where you encounter problems?

So, these stages don't depend on each other (at least not in DVC-sense - where you have a pipeline of stages, etc)?

Do they run on the same data, the same raw storage?

We have a stage that will split these into train/val/test

Is it a physical split? Like actually copying / moving files? Is it a requirement from the Azure system?

It will be run both by someone on their machine and by the CI/CD system.

How would we prevent multiple people from running it simultaneously? (If we use external outs, it means users will affect each other, or we need a mechanism to separate them.) Or is it not a problem in your case?

mostly a mapping from model type to modelId that will have to be updated manually, we will use a params file most probably

I understand now, thanks. Can some API be used to fetch the ID and / or model, so that this could be wrapped into a stage of its own in the DVC pipeline? (Sorry, I'm not very familiar with Azure custom models.)

@rmlopes
Author

rmlopes commented Jan 12, 2023

Hi guys, thanks so much for this discussion,

@dberenbaum

So stage 1 imports and operates locally, and everything works there as expected?

Stage 1 imports from the cloud and operates locally, so as long as we do the dvc update it will be fine. As a side note, from the logs it looks like it always downloads all the files when we do the update (I would expect it to only download what is new). Because of this, even if there are no new files, stage 1 will always run after an update.

And stage 2 should happen entirely on Azure, and that's where you encounter problems? Sorry to get your hopes up (...)

It's OK, we will have to find some other way around it or download the files.

@shcheklein

If the input to stage 2 is immutable / append only (you don't remove / edit files) and the split happens based on metadata and the current set of files, then instead of actually saving (duplicating) the val/train/test folders, you can save the metadata and the list of files in the raw dataset at that moment. That should be enough for the stage to produce the same split as at the moment of building that version of the model. That would give you reproducibility.

The input for stage 2 is indeed immutable; it will only have appends. I already save the metadata and basically the list of files, so it is reproducible. I can avoid the duplication of the val/test data as you suggest, but I do have to copy the train data because Form Recognizer will create annotation files (which may change later) and it needs a folder with the training data only (as many people will be labelling and we cannot have the other files in there). This is already being copied using az (the SDK, not the CLI, but it is the same thing).

Would that work for you?

Summing up, it kind of does, but I would still need to be able to track external dependencies (#8783) so that I know when files are added to the stage 2 input folder and so that DVC would be able to rerun the stage. It does solve the dependencies for stage 3, which will now be on the saved list of files instead of needing the external dependency yet again.

So stage 1 imports and operates locally, and everything works there as expected? And stage 2 should happen entirely on Azure, and that's where you encounter problems?

So, these stages don't depend on each other (at least not in DVC-sense - where you have a pipeline of stages, etc)?

As I posted originally they didn't; I have changed it though (more or less in the direction you proposed), and now there is a pipeline of stages:

1a. Extract the metadata (local)
1b. Split files using info from metadata (local)
2. Copy the files according to split into train/val/test
3. Same as before

Do they run on the same data, the same raw storage?

No, stage 1 uses one set of files, stage 2 uses another set of files.

Is it a physical split? Like actually copying / moving files? Is it a requirement from the Azure system?

Yes, it can be done only for the training folder though, "as Form Recognizer will create annotation files (which may change later) and it needs a folder with the training data only (as many people will be labelling and we cannot have the other files in there)".

How would we prevent multiple people from running it simultaneously? (If we use external outs, it means users will affect each other, or we need a mechanism to separate them.) Or is it not a problem in your case?

The outputs of stages 1a/b are regular outputs tracked by DVC; there are no external outputs in the sense described in the DVC documentation (although there are external operations - stage 2). As I write this I think I get the point: if someone, for instance, changes the parameters for the split, that will create a new version of the training folder. Would a functional integration with cloud versioning solve the problem in this case?

Can some API be used to fetch the ID and / or model? And this way it can be wrapped into a stage by itself in the DVC pipeline?

Yes, we will use the model registry as a params file precisely to map model type to model ID (and version) and track this mapping; there is no need to fetch the model itself as it runs as a service in the cloud. The (manual) versioning is included in the model ID (for instance, doc-type: doc-type-0.0.1).
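
For illustration only, such a params file could look roughly like this (everything except the doc-type example is made up):

# hypothetical model registry kept in params.yaml; model IDs are entered by hand
cat > params.yaml <<'EOF'
models:
  doc-type: doc-type-0.0.1
  other-type: other-type-0.1.0
EOF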

Hopefully I was able to clarify everything and not make it worse :)

@shcheklein
Member

it looks like it always downloads all the files when we do the update (I would expect it to only download what is new). Because of this, even if there are no new files, stage 1 will always run after an update.

I think you can overcome this by introducing an extra stage that would list the files into a "list.txt" and making the stage that downloads them locally depend on this list. If the storage is append-only / immutable, I would even prefer this way over import-url since it should be faster.
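
A minimal sketch of that setup (the account/container names and download.py are placeholders):

# stage that snapshots the current list of blobs; --always-changed forces it to re-list on every repro
dvc stage add -n list_raw --always-changed \
  -o list.txt \
  "az storage blob list --account-name mystorageaccount --container-name data-container --query '[].name' -o tsv > list.txt"

# stage that downloads whatever the list contains; it reruns only when list.txt changes
dvc stage add -n download_raw \
  -d list.txt -d download.py \
  -o data/raw \
  "python download.py list.txt data/raw"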

The bug should still be fixed though, for import-url downloading things every time (cc @dberenbaum).

Would a functional integration with cloud versioning solve the problem in this case?

No, as far as I understand it won't help. Let's imagine multiple people want to train something simultaneously: since Azure expects a specific layout in the folder and it's the same one folder, I can't see how to make two different splits simultaneously in the same location. A better way would be to make the output folder on Azure a param (and you can use dvc.yaml templating to substitute it with a value when the pipeline is running) that each person can specify in params.yaml. You should then be prepared to end up with many folders on Azure storage with different splits (they can be removed after training is done, btw).
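
A rough sketch of what that could look like (the folder and param names are invented, and split.py with its --dest flag is a placeholder); DVC picks up values from params.yaml for ${} templating in dvc.yaml:

# shown as heredocs just to keep the sketch in shell
cat > params.yaml <<'EOF'
split_prefix: splits/alice-exp-001
EOF

cat > dvc.yaml <<'EOF'
stages:
  split:
    cmd: python split.py --dest azure://data-container/${split_prefix}
    deps:
      - split.py
      - list.txt
EOF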

Let me know if that makes sense, I can explain better or show an example if needed.

Hopefully I was able to clarify everything and not make it worse :)

Yes, it's clear now what's going on. Thanks 🙏

@rmlopes
Author

rmlopes commented Jan 13, 2023

Thanks for the input, guys; it helped me settle on the pipeline that we will use and work around some of the limitations of using Azure. I think it makes sense to keep this issue open as a feature request, but of course I'll leave that at your discretion.

@shcheklein
Member

No, @rmlopes. Thanks, glad to see that we've settled on something after all :)

@daavoo daavoo added the A: external-data Related to external dependencies and outputs label Jan 16, 2023
@efiop efiop closed this as not planned (Won't fix, can't repro, duplicate, stale) Jul 27, 2023