guide: consolidate external data mgmt guides #520

Closed
8 tasks
shcheklein opened this issue Aug 3, 2019 · 17 comments · Fixed by #4574
Assignees
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide p1-important Active priorities to deal within next sprints

Comments

@shcheklein
Member

shcheklein commented Aug 3, 2019

done

  • External dependencies and outputs.

next

later?

@shcheklein shcheklein added type: enhancement Something is not clear, small updates, improvement suggestions A: docs Area: user documentation (gatsby-theme-iterative) user-guide labels Aug 3, 2019
@dashohoxha dashohoxha mentioned this issue Oct 25, 2019
10 tasks
@dashohoxha dashohoxha self-assigned this Nov 14, 2019
@dashohoxha
Contributor

I think this structure looks too complicated and is not correct.

I agree. After thinking a bit about it, it seems that this structure might be suitable for a page/section about "Remote Storage Types", like this:

  • Remote Storage Types
    • Local Files and Directories
    • SSH
    • Amazon S3
    • Google Cloud Storage
    • HDFS
    • HTTP

Most important here is to explain when and why the external data management is needed.

Exactly. For example I am not sure when a remote cache might be useful. I can think of two cases:

  1. When the remote cache is actually a local directory.
  2. When the project is used just for dataset management (there are no stages).

Have I missed any other cases when it might be a preferred choice? The case that is used currently as an example (copying datafiles to a remote storage) does not seem convincing.

@shcheklein
Member Author

I think that Remote Storage Types by itself can and should be part of the "Data Sharing" section you made.

For example I am not sure when a remote cache might be useful.

In general, you need all the "external stuff" when some processing system reads from or writes to remote storage directly. For example, Spark running on top of S3 and crunching data that is too large to bring to a local machine. This also covers mixed cases, e.g. the raw data is large and you read it directly from S3, but intermediate artifacts, models, etc. are stored locally.

So, as a rule of thumb, think about a script in any language. If it internally has a pointer to S3 or any other cloud/remote storage (remote relative to the location you run the script from), you need some external data management.
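
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not DVC code: the scheme list, the function names, and the example paths are all assumptions made for the sketch, and DVC's actual handling of each storage type may differ.

```python
from urllib.parse import urlparse

# Assumption: schemes loosely based on the storage types listed in this thread.
REMOTE_SCHEMES = {"s3", "gs", "hdfs", "ssh", "http", "https"}


def split_paths(paths):
    """Split the paths a script touches into (local, external) lists."""
    local, external = [], []
    for path in paths:
        scheme = urlparse(path).scheme
        (external if scheme in REMOTE_SCHEMES else local).append(path)
    return local, external


def needs_external_data_management(paths):
    """Rule of thumb: any pointer to remote storage means external data."""
    _, external = split_paths(paths)
    return bool(external)


# Hypothetical mixed case: large raw data read directly from S3,
# intermediate artifacts and models kept locally.
script_paths = ["s3://example-bucket/raw/events.parquet", "models/model.pkl"]
```

Here `needs_external_data_management(script_paths)` is true because one of the two paths points at S3, even though the model file stays local.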

@dashohoxha
Contributor

I have a question.

If you check the new section External Data and Remotes, you will notice that some remote types (SSH, Amazon S3, GS, HDFS) have subsections about external dependencies and outputs, while the others (S3 API, Azure, Aliyun OSS) do not have such subsections.

This is simply because their information is missing from the pages that are being consolidated:

I suspect that external dependencies and outputs work the same way for these remote types as well. However, I cannot test this for each remote type (I am just reorganizing and rewriting the information that is already available on these pages and on the man pages of dvc remote).

Can someone confirm that the external dependencies and outputs work the same way for all the remote types (in particular for S3 API, Azure and Aliyun OSS)?

@shcheklein
Member Author

@efiop has more up-to-date info on which remotes support which features. But in general, that information is not present because we do not support external deps/outs on certain types of remotes yet. Maybe a table summarizing the features supported by different remotes would even be beneficial.

The name "S3 API" is a bit confusing, by the way. It should be something like "S3 Compatible", or maybe there is a specific term for that.

@jorgeorpinel jorgeorpinel changed the title consolidate external data management guides properly user-guide: consolidate external data management guides properly Jan 20, 2020
@jorgeorpinel jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label May 7, 2020
@jorgeorpinel jorgeorpinel added the p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. label Jul 24, 2020
@jorgeorpinel jorgeorpinel added p1-important Active priorities to deal within next sprints and removed p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. type: enhancement Something is not clear, small updates, improvement suggestions ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement labels Aug 22, 2020
@jorgeorpinel jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Sep 8, 2020
@jorgeorpinel
Contributor

Added a checkbox about the external cache section (per #654 (comment))

@jorgeorpinel jorgeorpinel added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels May 13, 2021
@iesahin iesahin added the C: guide Content of /doc/user-guide label Oct 21, 2021
@jorgeorpinel jorgeorpinel added the ⌛ status: wait-core-merge Waiting for related product PR merge/release label Jan 14, 2022
@jorgeorpinel jorgeorpinel removed p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement labels Apr 27, 2022
@jorgeorpinel jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Jul 28, 2022
@jorgeorpinel jorgeorpinel removed the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Jul 28, 2022
@dberenbaum
Contributor

dberenbaum commented Mar 27, 2023

Not sure where it is best to add this, but we continue to see confusion about how to use DVC with externally stored data, especially when working in cloud-based notebook environments like Databricks, SageMaker, and Colab. See the discussions below:

@dberenbaum dberenbaum added p1-important Active priorities to deal within next sprints and removed ⌛ status: wait-core-merge Waiting for related product PR merge/release labels Apr 14, 2023
@dberenbaum
Contributor

Another report of this confusion: https://discord.com/channels/485586884165107732/485596304961962003/1095518763228610570.

Making this p1. I will try to reword/reorganize the info here.
