guide: consolidate external data mgmt guides #520
I agree. After thinking about it a bit, that structure seems suitable for a page/section about "Remote Storage Types", like this:
Exactly. For example, I am not sure when a remote cache might be useful. I can think of two cases:
Have I missed any other cases where it might be the preferred choice? The case currently used as an example (copying data files to a remote storage) does not seem convincing.
I think that Remote Storage Types by itself can and should be part of the "Data Sharing" section you made.
In general, you need all the "external stuff" when some processing system writes to or reads from external storage directly, e.g. Spark running on top of S3 and crunching data that is too large to bring to the local machine. This includes mixed cases, e.g. the raw data is large and you read it directly from S3, but intermediate artifacts, models, etc. are stored locally. So, as a rule of thumb, think about a script in any language: if it internally has a pointer to S3, or to any other cloud/remote storage (relative to the location you run the script from), you need some external data management.
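To make the rule of thumb concrete, here is a rough sketch (not DVC code) that flags a script as needing external data management if any path it touches points at remote storage. The helper names and the set of URL schemes are my own assumptions for illustration, loosely based on the remote types mentioned in this thread; they are not taken from DVC itself.

```python
from urllib.parse import urlparse

# Assumption: illustrative URL schemes only, loosely based on the remote
# types named in this thread. This is NOT DVC's list of supported remotes.
REMOTE_SCHEMES = {"s3", "gs", "hdfs", "ssh", "azure", "oss"}


def is_external(path: str) -> bool:
    """Return True if the path points at remote storage rather than local disk."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def needs_external_data_management(paths) -> bool:
    """Rule of thumb: one remote pointer in the script is enough."""
    return any(is_external(p) for p in paths)


# Mixed case from the comment above: raw data is read directly from S3,
# while intermediate artifacts and models stay local.
# (Hypothetical paths, for illustration only.)
script_paths = [
    "s3://mybucket/raw/events.parquet",
    "artifacts/features.csv",
    "models/model.pkl",
]

print(needs_external_data_management(script_paths))
```

A script that only touches local paths would come out as not needing any of this, which matches the "relative to the location you run the script from" part of the rule.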
I have a question. If you check the new section External Data and Remotes, you will notice that some remote types (SSH, Amazon S3, GS, HDFS) have subsections about external dependencies and outputs, while the others (S3 API, Azure, Aliyun OSS) do not have such subsections. This is simply because their information is missing from the pages that are being consolidated:
I suspect that external dependencies and outputs work the same way for these remote types as well. However, I cannot test this for each remote type (I am just reorganizing and rewriting the information that is already available on these pages and on the man pages of [...]). Can someone confirm that external dependencies and outputs work the same way for all the remote types (in particular for S3 API, Azure and Aliyun OSS)?
@efiop has more up-to-date info on which specific remotes support which specific features. But in general, that information is not present because we do not support external deps/outs on certain types of remotes yet. Maybe even a table summarizing the features supported by different remotes would be beneficial.
Added a checkbox about the external cache section (per #654 (comment))
Not sure where it is best to add this, but we continue to see confusion about how to use DVC with externally stored data, especially when working in cloud-based notebook environments like Databricks, SageMaker, and Colab. See the discussions below:
Another report of this confusion: https://discord.com/channels/485586884165107732/485596304961962003/1095518763228610570. Making this p1. I will try to reword/reorganize the info here.
done
External dependencies and outputs.next
remote modify
params #764.later?