start: Data Management Trail #2894
---

As its name implies, DVC is used to control versions of data. It enables to keep
track of multiple versions of your datasets.
always better to include models ... in this case we might even include just "large files"?
I think the trail name can be "Data and Model Management", BTW. Rename at this early stage?
sounds good to me. Even though "model management" should be about metrics to some extent ...
@@ -0,0 +1,272 @@
---
title: Data and Model Management Trail
Model Management seems like a very different thing e.g. https://www.dominodatalab.com/solutions/model-management/
I'd say either keep it simple with "Data Management" (well known and understood term) or use another word like "Artifact".
The current titles include "Model" as well. We consider models as another kind of file. The link you shared adds some more stuff to it, but most of those aspects of model management are covered in `dvc exp show` or deployment.
But the phrase "model management" is in that title and by itself has a different meaning, which may be confusing for readers and search engines.
As its name implies, DVC is used to control versions of data. It enables to keep
track of multiple versions of your datasets.

## Initialize a DVC project
@iesahin please, let's not make any substantial changes to the existing data management trail for now. Let's keep the previous project, keep the structure, and wrap it up in the section (trail). Maybe remove some parts, like experiments.
Rewriting it to use MNIST, or including things like remove/gc (which, to my mind, may even be too much for Get Started), is not the top priority at the moment.
I would rather focus on expanding the experiments trail with the next steps (metrics, etc.), connecting the trails properly, and so on.
Ok. That's fine with me.
Before seeing this comment, I began to write #2919 as a replacement for `data-pipelines`, and it updates the underlying project as well. Should I revert that too?
I thought our initial decision was to create projects suitable for each of these trails.
What's the scope of changes in your mind? @shcheklein @jorgeorpinel
> I thought our initial decision was to create projects suitable for each of these trails.

Yes, and I'm not opposed to this. I would just start with some simple steps: wrap up the existing project into the trail, move metrics properly to the experiments (or keep them here as well, I'm fine with either), and wrap up the experiments trail.
I'd like us to try to keep two projects at most: deep learning (experiments, checkpoints, live metrics) and pipelines (NLP or some data processing is probably a better fit here).
> I'd like us to try to keep two projects at most: deep learning (experiments, checkpoints, live metrics) and pipelines (NLP or some data processing is probably a better fit here).

I believe we can mostly get away with a single project. `example-dvc-experiments` already has a 2-stage pipeline suitable for explaining pipelines.
Yep, that would be fine. But let's first keep it as is as much as possible in terms of content/projects. Just rename/move existing sections under the "Data (and Model?) Management Trail", and keep iterating on the experiments for now. It doesn't look like data management lacks any content or needs an immediate rewrite, to be honest.
I'm closing this. I'll do a quick review of the current docs instead.
Adds Data Management Trail to Get Started.
Closes #2856