add --external: fails using Azure remote #8759
Hi @rmlopes, could you replace
Hi @karajan1001, if I replace
I looked into the documentation and didn't find Azure support for external data; it looks like we haven't implemented it yet. So is this a new feature request instead of a bug? @dberenbaum
Yes, unfortunately Azure is not currently supported for external outputs. It has come up before, but I can't find an existing issue, so let's keep this one as a feature request.
Thanks for the clarification @karajan1001 and @dberenbaum. I guess everything pertaining to that documentation section does not support Azure, including the external cache. I think this may be loosely (or tightly, not sure) related to an old open issue of mine, #5899, which has been linked to a feature request from CML as well.
We can't support Azure external dependencies because, IIRC, not all blobs get
External outputs are an old experimental feature that we consider broken as a scenario. I don't think we will attempt to implement support for Azure any time soon, only after rethinking this scenario fundamentally.
I am not sure what you mean by "old experimental feature that we consider broken as a scenario" for external outputs, @efiop. But what about the external cache? It would be nice to have support for it in Azure and not only AWS.
@rmlopes It is the same thing: https://dvc.org/doc/user-guide/data-management/managing-external-data#managing-external-data. It is a feature we developed a long time ago that is useful enough for us to keep for the people who need it, but it is definitely not something we usually recommend; new users get confused by it and try to use it when they don't really need to. There are no plans to support Azure any time soon, because it simply wouldn't work as is and because the scenario needs to be reworked. Please feel free to elaborate on your particular scenario; maybe there is something we could suggest instead.
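For context, a minimal sketch of the external cache / external outputs setup being discussed (bucket, container, and path names below are placeholders); the S3 form is what the linked documentation describes, and the `azure://` equivalent is what this issue is about:

```bash
# External cache on S3 (supported): configure a remote and point DVC's cache at it
dvc remote add s3cache s3://my-bucket/dvc-cache
dvc config cache.s3 s3cache

# Track data that already lives in the bucket, without copying it locally
dvc add --external s3://my-bucket/raw-data

# The Azure equivalent is what currently fails
dvc remote add azcache azure://my-container/dvc-cache
dvc config cache.azure azcache
dvc add --external azure://my-container/raw-data
```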
@efiop The reason I am insisting on this is that it seems to be the only viable scenario when you have very large data (it doesn't even need to be huge). For instance, when I posted #5899 we were working on a video analytics project: my local repo had grown to 100+ GiB of storage when the actual dataset was only about 20 GiB. In the current project I am dabbling with external data and dependencies yet again, because the dataset will grow and it would make sense to keep everything in storage instead of having local copies.
@rmlopes It's a very interesting scenario for us, and I would love to understand your needs better. Even if we fix the external outputs/deps scenario for Azure, it'll have the overhead of doing these checks, the overhead of caching files (time + duplication), and an overlap if multiple people are using the same project. Could you please describe your data lifecycle in a bit more detail (who can add, update, or delete files, etc.) and the number of files? What are you trying to achieve with DVC here, and what kinds of user scenarios and workflows do you have? I'm asking not to push back on the feature, but rather to try to find a workaround together (if possible).
@shcheklein I will describe my current use case. The other one I mentioned before is archived and it will take some time to get and revive the info on it (but that is the case where the data is actually very large). Use case: integration with Azure Applied AI Services (Form Recognizer). Notes on the Form Recognizer:
What we are trying to achieve:
This is what's essential for us right now. Currently the number of files is low: 30 for metadata and 120 for the model input files. The first set takes only 80 MB and is not expected to grow significantly; the second is actually smaller at only 40 MB but is expected to grow by at least maybe 20x (though as no processing is done locally there is no need to have a local copy). Let me know if something is not clear or more info is needed.
@rmlopes Do you/could you enable blob versioning? We have been developing support in DVC for this in Azure, so that the versioning is handled natively in the cloud, in which case DVC would only serve to record the metadata in your repo.
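In case it helps, a rough sketch of what that setup looks like (storage account, resource group, container, and remote names below are placeholders; double-check the exact flags against the current az and DVC docs):

```bash
# Turn on blob versioning for the storage account (done once, on the Azure side)
az storage account blob-service-properties update \
    --account-name mystorageaccount \
    --resource-group my-resource-group \
    --enable-versioning true

# Tell DVC the remote keeps its own object versions, so DVC records
# version IDs in the repo instead of copying data into a content-addressed cache
dvc remote add -d myazure azure://my-container/data
dvc remote modify myazure version_aware true
```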
@dberenbaum Yes, enabling blob versioning is a possibility (I don't see anything against it).
@dberenbaum I am trying the new feature. I have set the remote with
Thanks @rmlopes ! A few questions to clarify:
Is it a DVC stage? Do you copy files from the initial location to some subfolders? What does that mean, and why do you need to be able to add files into that split manually (if that data comes from and is generated by a stage)?
What is the model ID in your case? BTW, how are you going to run this pipeline? On cron? Or will someone be running it on their machine?
@shcheklein Thanks for the follow-up.
It is a DVC stage where we copy files from an initial location (manually managed, it is a repository where someone - or an external service - may add files) into folders in the same container for train/val/test.
That data is not generated by a stage. Stage 1 is processing metadata files and the result is used during evaluation. Stage 2 is about splitting the files that will be input for the models during train/inference (not metadata files).
The model id will be a manual input when training a model in Azure Form Recognizer Studio (we are using custom models)
It will be run both by someone on their machine and by the CI/CD system.
So stage 1 imports and operates locally, and everything works there as expected? And stage 2 should happen entirely on Azure, and that's where you encounter problems?
Sorry to get your hopes up @rmlopes. #8411 is in our backlog but not implemented yet. The hope is that external deps and outs will "just work" for you once this is implemented.
Thanks again, @rmlopes! How about this option (see the questions below, though; maybe I still don't understand the whole picture): if the input to stage 2 is immutable / append-only (you don't remove or edit files) and the split happens based on metadata and the current set of files, then instead of actually saving (duplicating) the val/train/test folders, you can save the metadata and the list of files in the raw dataset at that moment. That should be enough for the stage to produce the same split as it was at the moment of building that version of the model, which gives you reproducibility. The stage itself could then work from that saved list. Would that work for you? In general, this approach is, to my mind, better for large datasets: ideally you don't manipulate or move objects, you manipulate metadata (the list of files, their params, etc.). It requires making the storage immutable / append-only, which is a good practice anyway. There is a rough sketch of this idea after the questions below. Let me know if some of these ideas resonate with you or go in the right direction. Some questions / suggestions to keep clarifying:
So, these stages don't depend on each other (at least not in the DVC sense, where you have a pipeline of stages, etc.)? Do they run on the same data, the same raw storage?
Is it a physical split? Like actually copying / moving files? Is it a requirement from the Azure system?
How would we prevent multiple people from running it simultaneously (if we use external outs, users will affect each other, or we need a mechanism to separate them)? Or is that not a problem in your case?
I understand now, thanks. Can some API be used to fetch the ID and/or model? That way it could be wrapped into a stage by itself in the DVC pipeline? (Sorry, I'm not very familiar with Azure custom models.)
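A minimal sketch of the list-based split idea, assuming a hypothetical `split.py` script and placeholder account, container, and file names. The split stage depends only on the metadata and the saved file list, and its outputs are per-split file lists rather than copies of the data:

```bash
# Snapshot the current contents of the raw (append-only) container
az storage blob list --account-name mystorageaccount --container-name raw \
    --query '[].name' -o tsv > raw-files.txt

# The split stage consumes metadata + the file list and writes lists, not copies
dvc stage add -n split \
    -d metadata.json -d raw-files.txt -d split.py \
    -o splits/train.txt -o splits/val.txt -o splits/test.txt \
    "python split.py --metadata metadata.json --files raw-files.txt --out splits/"
```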
Hi guys, thanks so much for this discussion,
Stage 1 imports from the cloud and operates locally, so as long as we do the dvc update it will be fine. As a side note, from the logs it looks like it always downloads all the files when we do the update (I would expect it to only download what is new). Because of this, even if there are no new files, stage 1 will always run after an update.
It's ok, we will have to find some other way around it or download the files.
The input for stage 2 is indeed immutable; it will only have appends. I already save the metadata and basically the list of files, so it is reproducible. I can avoid the duplication of the val/test data as you suggest, but I do have to copy the train data because Form Recognizer will create annotation files (which may change later) and it needs a folder with the training data only (as many people will be labelling and we cannot have the other files in there). This is already being copied using az (the SDK, not the CLI, but it is the same thing).
Summing up, it kind of does, but I would still need to be able to track external dependencies (#8783) so that I know when files are added to the stage 2 input folder and so that DVC would be able to rerun the stage. It does solve the dependencies for stage 3, which will depend on the saved list of files instead of needing the external dependency yet again.
As I posted originally they didn't, but I have changed it (more or less in the direction you proposed) and now there is a pipeline of stages: 1a. Extract the metadata (local)
No, stage 1 uses one set of files, stage 2 uses another set of files.
Yes, it can be done only for the training folder though "as FormRecognizer will create annotation files (which may change later) and it needs a folder with the training data only (as many people will be labelling and we cannot have the other files in there)".
The outputs of stages 1a/b are regular outputs tracked by DVC; there are no external outputs in the sense described in the DVC documentation (although there are external operations in stage 2). As I write this I think I get the point: if someone, for instance, changes the parameters for the split, that will create a new version of the training folder. Would a functional integration with cloud versioning solve the problem in this case?
Yes, we will use the model registry as a params file precisely to map model type to model ID (and version) and track this mapping; there is no need to fetch the model itself as it runs as a service in the cloud. The (manual) versioning is included in the model ID (for instance, doc-type: doc-type-0.0.1). Hopefully I was able to clarify everything and not make it worse :)
I think you can overcome this by introducing an extra stage that lists the files into a "list.txt" and making the stage that downloads them locally depend on this list; see the sketch below. If the storage is append-only / immutable, I would even prefer this over import-url, since it should be faster. The bug should still be fixed though, where import-url downloads everything every time (cc @dberenbaum).
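A rough sketch of that extra stage (account, container, and local paths are placeholders):

```bash
# Stage that snapshots the container listing; it is re-evaluated on every run,
# but downstream stages only re-run when list.txt actually changes
dvc stage add -n list-raw --always-changed -o list.txt \
    "az storage blob list --account-name mystorageaccount --container-name raw --query '[].name' -o tsv > list.txt"

# Download stage depends on list.txt, so it only re-runs when new files appear
dvc stage add -n download -d list.txt -o data/raw \
    "az storage blob download-batch --account-name mystorageaccount --source raw --destination data/raw"
```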
No, as far as I understand it won't help. Let's imagine multiple people want to train something simultaneously: since Azure expects a specific layout in the folder and it's the same single folder, I can't see how to make two different splits simultaneously in the same location. A better way would be to make the output folder on Azure a param. Let me know if that makes sense; I can explain better or show an example if needed.
Yes, it's clear now what's going on. Thanks 🙏
Thanks for the input guys, it helped me settle on the pipeline that we will use and work around some of the limitations using Azure. I think it makes sense to keep this issue open as a feature request, but of course I'll leave that at your discretion.
No, @rmlopes. Thanks, glad to see that we've settled on something after all :)
Bug Report
Description
I am trying to track existing data from a storage account in Azure following current documentation.
Reproduce
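A minimal sketch of the setup being described, assuming a placeholder container name and following the managing-external-data workflow from the docs:

```bash
# Set up an external cache in the same Azure container (per the docs)
dvc remote add azcache azure://my-container/cache
dvc config cache.azure azcache

# Try to track data that already lives in Azure
dvc add --external azure://my-container/existing-data
```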
Expected
I'm not sure what is expected but the output is:
ERROR: unexpected error - : 'azure'
Environment information
Output of `dvc doctor`:
Additional Information: