guide: consolidate external data mgmt guides #520

shcheklein · 2019-08-03T02:28:56Z

done

~~External dependencies and outputs.~~

Add proper links throughout other doc sections.
Explain external cache here or in the cache docs but not in the shared dev server use case, because shared cache isn't compatible with external data (see support adding/transfering data straight to cache/remote dvc#4520 (comment))
Clarify about cache link types per remote location as discussed in mention that cache options could be used besides remote modify params #764.
Decide whether how-to is also needed per how: use DVC when data is stored in an external drive #563 (comment)

later?

Move them under a single Manage External Data section (rel docs: "definitive" organization #144) with some motivation in the index. [or rewrite completely?]
Clarify differences and use cases for external storage mechanisms (see user-guide: clarify differences and use cases for external storage mechanisms? #566) [terminology]
Add info about NFS (see guide: using NFS as a remote storage #103) [remote-related]
Probably also address guide: explain about encrypted buckets in external data mgmt #774

dashohoxha · 2019-11-14T15:35:16Z

I think this structure looks too complicated and is not correct.

I agree. After thinking a bit about it, it seems that that structure might be suitable for a page/section about "Remote Storage Types", like this:

Remote Storage Types
- Local Files and Directories
- SSH
- Amazon S3
- Google Cloud Storage
- HDFS
- HTTP

Most important here is to explain when and why the external data management is needed.

Exactly. For example I am not sure when a remote cache might be useful. I can think of two cases:

- When the remote cache is actually a local directory.
- When the project is used just for dataset management (there are no stages).

Have I missed any other cases when it might be a preferred choice? The case that is used currently as an example (copying datafiles to a remote storage) does not seem convincing.

shcheklein · 2019-11-16T00:06:53Z

I think that Remote Storage Types by itself can and should be part of the "Data Sharing" section you made.

For example I am not sure when a remote cache might be useful.

In general, you need all the "external stuff" when some processing system reads from or writes to external storage directly, e.g. Spark running on top of S3 and crunching data that is too large to bring to a local machine. This also covers mixed cases, e.g. the raw data is large and you read it directly from S3, but intermediate artifacts, models, etc. are stored locally.

So, as a rule of thumb, think about a script in any language. If it internally has a pointer to S3 or to any other cloud/remote storage (relative to the location you run the script from), you need some external data management.
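
To make the rule of thumb concrete, here is a rough sketch in Python that scans a script's source for URLs pointing at remote storage. It is only an illustration: the scheme list and the function names `find_external_pointers` and `needs_external_data_management` are hypothetical, not part of DVC, and a plain text scan will miss locations that are built at runtime or read from config.

```python
import re

# Assumption: URL schemes treated as "external" storage for this illustration.
EXTERNAL_SCHEMES = ("s3", "gs", "hdfs", "ssh", "http", "https")

# Match "<scheme>://..." up to the next whitespace or quote character.
_POINTER = re.compile(r"\b(?:" + "|".join(EXTERNAL_SCHEMES) + r")://[^\s'\"]+")


def find_external_pointers(script_text: str) -> list[str]:
    """Return every remote-storage URL that appears literally in the script."""
    return _POINTER.findall(script_text)


def needs_external_data_management(script_text: str) -> bool:
    """Apply the rule of thumb: any hard-coded remote pointer counts."""
    return bool(find_external_pointers(script_text))


if __name__ == "__main__":
    # Mixed case: raw data is read from S3, the model is stored locally.
    script = (
        'df = spark.read.parquet("s3://bucket/raw/data.parquet")\n'
        'model.save("models/model.pkl")\n'
    )
    print(find_external_pointers(script))
    print(needs_external_data_management(script))
```

A script that only touches paths inside the workspace yields no pointers, so by this heuristic it would not need any external data management.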

dashohoxha · 2019-11-22T21:55:17Z

I have a question.

If you check the new section External Data and Remotes, you will notice that some remote types (SSH, Amazon S3, GS, HDFS) have subsections about external dependencies and outputs, while the others (S3 API, Azure, Aliyun OSS) do not have such subsections.

This is simply because their information is missing from the pages that are being consolidated:

I suspect that external dependencies and outputs work the same way for these remote types as well. However, I cannot test this for each remote type (I am just reorganizing and rewriting the information that is already available on these pages and on the man pages of dvc remote).

Can someone confirm that the external dependencies and outputs work the same way for all the remote types (in particular for S3 API, Azure and Aliyun OSS)?

shcheklein · 2019-11-22T23:37:44Z

@efiop has more up-to-date info on which specific remotes support which features. But in general, that information is not present because we do not support external deps/outs on certain types of remotes yet. Maybe even a table summarizing the features supported by different remotes would be beneficial.

The "S3 API" name is a bit confusing, btw. It should be something like "S3 Compatible", or maybe there is a specific term for that.

jorgeorpinel · 2021-02-09T03:37:30Z

Added a checkbox about the external cache section (per #654 (comment))

dberenbaum · 2023-03-27T19:18:16Z

Not sure where it is best to add this, but we continue to see confusion about how to use DVC with externally stored data, especially when working in cloud-based notebook environments like Databricks, Sagemaker, and Colab. See the discussions below:

dberenbaum · 2023-04-14T22:04:39Z

Another report of this confusion: https://discord.com/channels/485586884165107732/485596304961962003/1095518763228610570.

Making this p1. I will try to reword/reorganize the info here.

shcheklein added type: enhancement Something is not clear, small updates, improvement suggestions A: docs Area: user documentation (gatsby-theme-iterative) user-guide labels Aug 3, 2019

dashohoxha mentioned this issue Oct 25, 2019

user-guide: restructure #745

Closed

10 tasks

dashohoxha self-assigned this Nov 14, 2019

dashohoxha mentioned this issue Nov 20, 2019

External Data and Remotes (user-guide/external-data) #807

Closed

jorgeorpinel changed the title ~~consolidate external data management guides properly~~ user-guide: consolidate external data management guides properly Jan 20, 2020

jorgeorpinel unassigned dashohoxha Jan 20, 2020

jorgeorpinel removed the user-guide label Jan 20, 2020

jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label May 7, 2020

This was referenced May 7, 2020

user-guide: clarify differences and use cases for external storage mechanisms? #566

Closed

how: new (sub)section (How To's) #899

Closed

dashohoxha mentioned this issue May 7, 2020

how: use DVC when data is stored in an external drive #563

Closed

jorgeorpinel added the p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. label Jul 24, 2020

jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Sep 8, 2020

jorgeorpinel mentioned this issue Sep 8, 2020

mention that cache options could be used besides remote modify params #764

Closed

jorgeorpinel mentioned this issue Feb 11, 2021

concepts: create DVC Remote page #2174

Closed

jorgeorpinel mentioned this issue Feb 22, 2021

term: ambiguous use of "external" and "workspace" #1127

Closed

jorgeorpinel added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels May 13, 2021

This was referenced May 17, 2021

how-to: setup a shared cache (extracted from use cases) #2482

Merged

docs: "definitive" organization #144

Closed

cmd: --to-cache/remote updates (add, import-url, update) #2121

Closed

shcheklein mentioned this issue Jun 4, 2021

clarify external data examples #2538

Closed

0x2b3bfa0 mentioned this issue Jun 6, 2021

tutorial: NFS/volumes iterative/cml#561

Closed

2 tasks

casperdcl mentioned this issue Jun 7, 2021

guide: deobfuscate Managing External Data #2542

Closed

iesahin added the C: guide Content of /doc/user-guide label Oct 21, 2021

jorgeorpinel added the ⌛ status: wait-core-merge Waiting for related product PR merge/release label Jan 14, 2022

jorgeorpinel removed p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement labels Apr 27, 2022

jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Jul 28, 2022

jorgeorpinel mentioned this issue Jul 28, 2022

guide: Data Management #2856

Closed

jorgeorpinel removed the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Jul 28, 2022

dberenbaum added p1-important Active priorities to deal within next sprints and removed ⌛ status: wait-core-merge Waiting for related product PR merge/release labels Apr 14, 2023

dberenbaum mentioned this issue May 3, 2023

3.0 checklist #4513

Closed

dberenbaum self-assigned this May 24, 2023

dberenbaum added this to DVC May 24, 2023

github-project-automation bot moved this to Backlog in DVC May 24, 2023

dberenbaum mentioned this issue Jun 7, 2023

Drops external outputs and updates external data guides #4574

Merged

efiop closed this as completed in #4574 Jun 8, 2023

github-project-automation bot moved this from Backlog to Done in DVC Jun 8, 2023
