start: Data Management Trail #2894

Closed
wants to merge 4 commits into from

Contributor

@iesahin iesahin commented Oct 5, 2021

Adds Data Management Trail to Get Started.

Closes #2856

@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue28-1l1qfv October 5, 2021 16:03 Inactive
@iesahin iesahin changed the title Iesahin/issue2856 start: Data Management Trail Oct 5, 2021
@iesahin iesahin self-assigned this Oct 5, 2021
---

As its name implies, DVC is used to control versions of data. It enables to keep
track of multiple versions of your datasets.
Member

always better to include models ... in this case we might even include just "large files"?

Contributor Author

I think the trail name can be "Data and Model Management", BTW. Rename at this early stage?

Member

sounds good to me. Even though "model management" should be about metrics to some extent ...

@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue28-1l1qfv October 6, 2021 15:48 Inactive
@@ -0,0 +1,272 @@
---
title: Data and Model Management Trail
Contributor

@jorgeorpinel jorgeorpinel Oct 8, 2021

Model Management seems like a very different thing e.g. https://www.dominodatalab.com/solutions/model-management/

I'd say either keep it simple with "Data Management" (well known and understood term) or use another word like "Artifact".

Contributor Author

@iesahin iesahin Oct 8, 2021

The current titles include "Model" as well. We consider models as another kind of file. The link you shared adds some more stuff to it, but most of those aspects of model management are covered in `dvc exp show` or deployment.

Contributor

But the phrase "model management" is in that title and by itself has a different meaning, which may be confusing for readers and search engines.

@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue28-1l1qfv October 11, 2021 10:14 Inactive
As its name implies, DVC is used to control versions of data. It enables to keep
track of multiple versions of your datasets.

## Initialize a DVC project
Member

@iesahin please, let's not make any substantial changes to the existing data management trail for now. Let's keep the previous project, keep the structure, and wrap it up in the section (trail). Maybe remove some parts, like experiments.

It's not the highest priority at the moment to rewrite it to use MNIST or to include things like remove/gc (which, to my mind, may even be too much for Get Started).

I would rather focus on expanding the experiments trail with the next steps (metrics, etc.) and on connecting the trails properly.

Contributor Author

Ok. That's fine with me.

Before seeing this comment, I began to write #2919 as a replacement for data-pipelines and it updates the underlying project as well. Should I revert it as well?

I thought our initial decision was to create projects suitable for each of these trails.

What's the scope of changes in your mind? @shcheklein @jorgeorpinel

Member

> I thought our initial decision was to create projects suitable for each of these trails.

Yes, and I'm not opposed to this. I would just try to start with some simple steps: wrap up the existing project into the trail, move metrics properly to the experiments (or keep them here as well, I'm fine with either), and wrap up the experiments trail.

I wish we can try to keep two projects at most - deep learning (experiments, checkpoints, live metrics) and pipelines (nlp / some data processing is a better fit here probably).

Contributor Author

> I wish we can try to keep two projects at most - deep learning (experiments, checkpoints, live metrics) and pipelines (nlp / some data processing is a better fit here probably).

I believe we can mostly get away with a single project. `example-dvc-experiments` already has a 2-stage pipeline suitable for explaining pipelines.

Member

Yep, that would be fine. But let's first keep it as is as much as possible in terms of content/projects. Just rename/move existing sections under the "Data (and Model?) Management Trail", and keep iterating on the experiments for now. It doesn't look like data management lacks any content or needs any immediate rewrite, to be honest.

@shcheklein shcheklein changed the title start: Data Management Trail fix #2856: Data Management Trail Oct 13, 2021
@iesahin iesahin changed the title fix #2856: Data Management Trail guide: Data Management Trail Oct 14, 2021
@iesahin iesahin changed the title guide: Data Management Trail start: Data Management Trail Oct 14, 2021
Contributor Author

iesahin commented Oct 19, 2021

I'm closing this. I'll do a quick review of the current docs instead.

@iesahin iesahin closed this Oct 19, 2021
@jorgeorpinel jorgeorpinel deleted the iesahin/issue2856 branch July 29, 2022 17:08
Development

Successfully merging this pull request may close these issues.

guide: Data Management
3 participants