use-cases: improve Data Registry case #795

Suor · 2019-11-15T15:04:11Z

I took a look at the new Data Registry page. Congrats @jorgeorpinel on compiling it, and please don't be angry if you get a bit more than you expected :)

Here go some considerations big and small as they occur in the text.

Furthermore, the version of the data file imported to B can be an older iteration than what's currently used in A.

Time is somewhat non-linear when branches are involved, so it might not be correct to talk about an "older iteration". Should we just say that B might refer to any version of an artifact from A?

Keeping this in mind, we could build a

could -> can

This way we would have a repository that has all the metadata and change history for the project's data.

I don't understand this sentence. "we would ... change history", what does it mean? Why do we want to change any history? What does "history of the project's data" mean?

Also, why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry. I understand that this use case is named "Data Registry", but maybe we can lay some trail to it?

features such as change history

What does "change history feature" mean?

Example

This story about wget and the past brings confusion. May we just state the problem (the current state of it) and solve it? The whole story then starts from the middle, we present commands that do something else than expected and then say so in some bubble. I needed to reread it several times to get what this is about.

A dataset we use for several of our examples and tutorials is one containing

... tutorials contains ...

We partitioned the dataset in two ...

We split the dataset into two ... (See use-cases: improve Data Registry case #795 (comment))

jorgeorpinel · 2019-11-20T00:38:36Z

could -> can

I would agree if this was a tutorial or get started chapter with reproducible commands but here we are actually talking in a more hypothetical way.

jorgeorpinel · 2019-11-20T00:48:58Z

why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry.

Actually, we first mention that "Instead of adding it to both projects, B can simply import it from A." (implying simple project dependency). Like you noticed, this use case is about data registries so that's why we focus on them.

As for the list of advantages, I had the exact same concern with @shcheklein at first haha. Most of them are not specific to data registries. The reason we have them all listed there, though, is that we hope use cases can serve as a bit of marketing, since we imagine they can be the landing pages for some users, linked to the use case directly from a search engine (the first web page they ever see on our website). So, we are selling DVC as a whole in here.

jorgeorpinel · 2019-11-20T00:57:06Z

We split the dataset into two ...

What's wrong with "partitioned"? We then use the word "parts" several times.

jorgeorpinel · 2019-11-20T01:02:39Z

@Suor please notice my comments on some of your feedback above. I've also started addressing your comments in #805 but maybe wait a little before reviewing that, until we have some more agreements in the discussions here. (Only the larger point about the example I haven't addressed or replied to yet.)

Suor · 2019-11-20T17:08:15Z

I would agree if this was a tutorial or get started chapter with reproducible commands but here we are actually talking in a more abstract/hypothetical way.

"can" is used all over the place there, this "could" stands out.

Actually, we first mention that "Instead of adding it to both projects, B can simply import it from A." (implying simple project dependency). Like you noticed, this use case is about data registries so that's why we focus on them.

It is presented in too abstract a way, I guess: you read through it and have nothing to fix your mind on. And then you make a huge conceptual jump with "Keeping this in mind ...". There is no way that, keeping only this in mind (the mere possibility of import), I would arrive at a Data Registry in one step. So the whole Data Registry looks like a solution without a problem.

Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy when you have lots of dvc repos depending on each other chaotically.
Or a need to have some data catalog, if that is literally a data registry and not, say, a model one.

What's wrong with "partitioned"? We then use the word "parts" several times.

Partitioning has complicated uses like partitioning disk or database tables. Maybe it's just me, but it brings all those connotations for no use.
The phrase train/test split is a common pattern in data science, e.g. it's called train_test_split() in sklearn.
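
For readers unfamiliar with the term, here is a minimal sketch of a train/test split. It uses only the Python standard library and is not sklearn's implementation; the function name `split_dataset` and its `test_fraction` and `seed` parameters are hypothetical, chosen for illustration.

```python
import random


def split_dataset(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists.

    Hypothetical helper for illustration; not part of sklearn or DVC.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled items form the test part, the rest the train part.
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_dataset(range(8), test_fraction=0.25)
print(len(train), len(test))
```

The two returned lists are disjoint and together contain every input item exactly once, which is what "split" conveys here.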

jorgeorpinel · 2019-11-21T00:31:03Z

"can" is used all over the place there, this "could" stands out.

OK. I'm changing the later "can" words in this same hypothetical context to "could". Notice that there's also a "would". I rephrased other parts of the paragraph too. See #805 (review).

jorgeorpinel · 2019-11-21T00:39:03Z

It is presented in a too abstract way I guess (nothing to fix your mind on). And then you make a huge conceptual jump with "Keeping this in mind"... the whole Data Registry looks like a solution without a problem.

@Suor "Keeping this in mind" is supposed to refer to "DVC also includes the dvc get, dvc import, and dvc update commands." (Before the abstract project A and B example). I don't think it's a huge jump, but I guess it could definitely be rewritten more clearly. We just want to avoid making this text too long, so a separate paragraph to give a full concrete example of simple project dependency (which is not the topic of the use case) is a bit problematic. Also because we want to insert a data registry diagram near the top but it may not make sense until the paragraph where it's actually introduced.

Perhaps we should just remove the abstract example altogether and find another place to talk about project dependency?

Or change the dependency mention so it's not confusing...

jorgeorpinel · 2019-11-21T00:45:58Z

I see you also left a clear suggestion for this @Suor... (I missed that comment before. 😅)

Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy when you have lots of dvc repos depending on each other chaotically.

So yes, the problem here is that we don't want to talk about project dependency but about data registries. And keep it as short as possible. I think we're assuming people will know/understand the problems that can use this solution. What do you think @shcheklein? Also about my suggested way to address this concern:

Perhaps we should just remove the abstract example altogether and find another place to talk about project dependency? Or change the dependency mention so it's not confusing...

jorgeorpinel · 2019-11-21T01:02:12Z

Me again. OK, while you guys think about it, I simplified the project inter-dependency mention per the following ideas above:

Maybe we should present/state a problem in-between, like a graph of dependencies becoming messy
change the dependency mention so it's not confusing

You may see the exact changes and continue this discussion on #805 (review).

jorgeorpinel · 2019-11-21T19:07:48Z

Last item pending here @Suor:

This story about wget and the past brings confusion... I needed to reread it several times to get what this is about. May we just state the problem (the current state of it) and solve it?

I agree that the intro to the example is a bit weird... It's similar to the old note project A and B example where we tried to just kind of mention something but in no more than one paragraph, so it ended up being too brief perhaps. I like how your suggestion of just stating the problem and solving it sounds, but I'm not sure how exactly that would look. In a way this story is meant to state the problem. I'll think about this...

The whole story then starts from the middle

It actually starts from scratch though: 1) dataset split in 2 on a storage server, parts downloaded with wget 2) same files moved to dvc repo, to download with dvc get instead. Then the next paragraph goes into changing from 2 files to 2 versions instead (proper data registry).

we present commands that do something else than expected and then say so in some bubble

Again, since it's not a tutorial or get started chapter, we don't intend to provide end-to-end reproducible commands. We decided to add the expandable sections in case someone actually ran them and didn't get the expected results.

jorgeorpinel · 2019-11-23T06:11:49Z

Alright. I tried to improve the logic of the example, not a major rewriting but significant rephrasing involved. Please review PR#805 and let's move this discussion over there. Please open reviews/comments there as needed.

shcheklein · 2019-11-25T00:05:06Z

Also, why do we push data-registry from the start? Can't one repo simply depend on the other repo? Registry is only one scenario. The majority of advantages listed do not require data registry. I understand that this use case is named "Data Registry", but maybe we can lay some trail to it?

@jorgeorpinel has answered this already, but I think the problem also comes from the way we introduce it. We go from using regular imports/gets to setting up a dedicated data registry. Instead, we should be comparing using no DVC at all for data management (that is, ad-hoc conventions and a total mess on S3) with the DVC Data Registry, which effectively provides some "meta" information for the same data on S3.

It means that I would not emphasize that it's a mess to chain multiple imports/gets. It's a mess to not use anything to organize data. And a data registry is just one of the ways to organize it.

jorgeorpinel · 2019-11-25T02:00:38Z

I would not emphasize that it's a mess to chain multiple imports/gets. It's a mess to not use anything to organize data.

Good catch. Will review.

shcheklein changed the title ~~Notes on Data Registry page~~ notes on Data Registry page Nov 15, 2019

shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions use-cases labels Nov 15, 2019

shcheklein changed the title ~~notes on Data Registry page~~ improve Data Registry use case Nov 15, 2019

shcheklein assigned jorgeorpinel Nov 15, 2019

weekly-digest bot mentioned this issue Nov 17, 2019

Weekly Digest (10 November, 2019 - 17 November, 2019) #798

Closed

jorgeorpinel changed the title ~~improve Data Registry use case~~ use-cases: improve Data Registry case Nov 19, 2019

jorgeorpinel added a commit that referenced this issue Nov 20, 2019

use-cases: address smaller points from review (#795)

c31d971

jorgeorpinel mentioned this issue Nov 20, 2019

use-cases: improvements to data-registry case per Alex' review #805

Closed

jorgeorpinel added a commit that referenced this issue Nov 21, 2019

use-cases: reinforce hypothetical phrasing in data registry intro par…

6002cba

…agraph per #795 (comment)

jorgeorpinel added a commit that referenced this issue Nov 21, 2019

use-cases: partitioned->split in data registry case

47ebae5

per #795 and #795 (comment)

jorgeorpinel added a commit that referenced this issue Nov 21, 2019

use-cases: geatly simplify mention about project inter-dependency in …

a578c15

…data reg per #795 (comment) and #795 (comment)

jorgeorpinel added a commit that referenced this issue Nov 23, 2019

use-cases: rephrase much of the data registry example to improve its …

50b772e

…logic and readability per #795 (comment)

jorgeorpinel added a commit that referenced this issue Nov 25, 2019

use-cases: remove remark about imports getting messy

d125437

per #795 (comment) (and #805 (review))

jorgeorpinel mentioned this issue Nov 25, 2019

use-cases: second iteration of Data Registry case #818

Merged

3 tasks

shcheklein closed this as completed in #818 Dec 16, 2019

iesahin added the C: cases Content of /doc/use-cases label Oct 21, 2021
