guide: Data Management #2856

Closed
iesahin opened this issue Sep 27, 2021 · 15 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement


@iesahin
Contributor

iesahin commented Sep 27, 2021

UPDATE: #2856 (comment)


This is the plan for a data management trail that covers:

Adding data to DVC projects

  • Initialize a DVC repository and use dvc add to add files.

  • We'll assume the MNIST data already exists in a folder and add that folder.

Versioning data in DVC projects

  • Overwrite Fashion-MNIST data on top of MNIST and update the dataset.
  • Go back and forth in Git history to get different datasets in the same folder.
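A minimal sketch of the two steps above, written as Python helpers that only build and print the Git/DVC commands (a dry run by default). The folder name `data`, the commit messages, and the revision `HEAD~1` are illustrative assumptions, not part of the plan.

```python
import shlex
import subprocess

DATA_DIR = "data"  # hypothetical folder that already holds the MNIST files


def initial_version(data_dir=DATA_DIR):
    """Commands that start tracking the folder and commit the first version."""
    return [
        ["git", "init"],
        ["dvc", "init"],
        # `dvc add` writes <folder>.dvc and lists the folder in .gitignore.
        ["dvc", "add", data_dir],
        ["git", "add", f"{data_dir}.dvc", ".gitignore"],
        ["git", "commit", "-m", "Track MNIST with DVC"],
    ]


def updated_version(data_dir=DATA_DIR):
    """Commands to run after Fashion-MNIST has been copied over the folder."""
    return [
        ["dvc", "add", data_dir],
        ["git", "add", f"{data_dir}.dvc"],
        ["git", "commit", "-m", "Switch to Fashion-MNIST"],
    ]


def restore_version(rev, data_dir=DATA_DIR):
    """Commands that bring back the dataset recorded at Git revision `rev`."""
    return [
        # Restore only the .dvc file from the chosen revision...
        ["git", "checkout", rev, "--", f"{data_dir}.dvc"],
        # ...then let DVC sync the folder contents with that .dvc file.
        ["dvc", "checkout"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(initial_version() + updated_version() + restore_version("HEAD~1"))
```

Calling `run(..., dry_run=False)` inside an empty project folder would execute the commands for real, assuming `git` and `dvc` are installed.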

Creating remotes

  • Add a Google Drive folder as a remote.

  • Make it the default remote.

Pushing to/pulling from remotes

  • Push the cache to the remote we created.
  • Clone the repository somewhere else (e.g. over SSH or into a local folder).
  • Pull the cache into the clone.
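The remote and sync steps above could look roughly like this. It is a sketch under assumptions: the remote name `storage`, the `<folder-id>` placeholder, the commit message, and the clone arguments are all hypothetical, and pushing to Google Drive additionally needs DVC's Google Drive support installed and an authorization step that is not shown. The helpers only print the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess

REMOTE_NAME = "storage"           # hypothetical remote name
GDRIVE_FOLDER_ID = "<folder-id>"  # placeholder, taken from the Drive folder URL


def remote_setup(name=REMOTE_NAME, folder_id=GDRIVE_FOLDER_ID):
    """Commands that register the Drive folder, make it default, and push."""
    return [
        ["dvc", "remote", "add", name, f"gdrive://{folder_id}"],
        # The two commands above and below can be merged with `dvc remote add -d`.
        ["dvc", "remote", "default", name],
        # The remote is stored in .dvc/config, so commit it for future clones.
        ["git", "add", ".dvc/config"],
        ["git", "commit", "-m", "Configure default DVC remote"],
        ["dvc", "push"],
    ]


def clone_commands(repo_url, dest):
    """Command that clones the Git repository (no data yet) into `dest`."""
    return [["git", "clone", repo_url, dest]]


def pull_commands():
    """Command that downloads the tracked data; run it inside the clone."""
    return [["dvc", "pull"]]


def run(commands, cwd=None, dry_run=True):
    """Print each command; execute it in `cwd` only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    run(remote_setup())
    run(clone_commands("<repo-url>", "project-copy"))
    run(pull_commands(), cwd="project-copy")
```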

Accessing public datasets and registries

  • Get the Fashion-MNIST data from dataset-registry.

Removing data from DVC projects

  • Remove certain folders from workspace
  • Delete the corresponding cache files
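The last two topics (downloading from a registry and removing data) might be sketched as follows. The registry URL is assumed to be the public `iterative/dataset-registry` repository on GitHub. The path inside it is left as a placeholder because the plan does not name it, and the output folder and `.dvc` file names are hypothetical. The helpers print the commands and execute them only with `dry_run=False`.

```python
import shlex
import subprocess

REGISTRY_URL = "https://github.com/iterative/dataset-registry"  # assumed URL
REGISTRY_PATH = "<path-to-fashion-mnist>"  # placeholder, not a real path


def get_commands(path=REGISTRY_PATH, out="data", url=REGISTRY_URL):
    """Command that downloads a tracked path from a DVC repo without cloning it."""
    return [["dvc", "get", url, path, "-o", out]]


def remove_commands(dvc_file="data.dvc"):
    """Commands that stop tracking a dataset and clean up its cached copy."""
    return [
        # Deletes the .dvc file and its .gitignore entry; --outs also deletes
        # the tracked folder from the workspace.
        ["dvc", "remove", dvc_file, "--outs"],
        ["git", "commit", "-a", "-m", "Stop tracking dataset"],
        # Drops cache entries not referenced by the current workspace. This
        # asks for confirmation and also discards data that only older
        # commits refer to.
        ["dvc", "gc", "--workspace"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(get_commands())
    run(remove_commands())
```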

UPDATE: start with a reorg, see #2856 (comment) below (may be enough).

@iesahin iesahin self-assigned this Sep 27, 2021
@dberenbaum

This comment was marked as resolved.

@jorgeorpinel jorgeorpinel added the A: docs Area: user documentation (gatsby-theme-iterative) label Sep 28, 2021
@iesahin iesahin added the p1-important Active priorities to deal within next sprints label Sep 29, 2021
@iesahin iesahin mentioned this issue Oct 5, 2021
7 tasks
@iesahin

This comment was marked as outdated.

iesahin added a commit that referenced this issue Oct 5, 2021
@iesahin

This comment was marked as resolved.

@shcheklein

This comment was marked as resolved.

@iesahin

This comment was marked as resolved.

@shcheklein

This comment was marked as resolved.

@jorgeorpinel
Contributor

Re the title: I'd just avoid the phrase "model management" since it has a specific meaning (not what this is about), but if you want to include "model", maybe use "data and model file management". I don't think we need to include "model" in the title, but "model file(s)" can be included in the content.

@shcheklein
Member

since it has a specific meaning

I'm not sure we have it written somewhere? :) What kind of meaning do you have in mind? What is so different between model management and data management?

(I can see, for example, that if we include model management we could keep some parts about metrics, which is also fine.)

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 15, 2021

Like, related to the ML model lifecycle? I mentioned this in the PR (https://www.dominodatalab.com/solutions/model-management/) and it wasn't contested, so I assumed I was correct, but you guys are the experts! If it doesn't have a special meaning, then it doesn't matter. But if it does, users and search engines could get confused.

@iesahin
Contributor Author

iesahin commented Oct 15, 2021

I'm not an expert on naming things :) I put "model" in the title because the current docs have it, and we added "model" there after a user requested it. I understand the meaning described in https://www.dominodatalab.com/solutions/model-management/ and how it differs from the way we use it, but in this new domain people often use the same words to mean different things.

I have no strong opinion here, and honestly, writing a specific "model management" document for the UG might be more appropriate. But until then, we can have "model" in the title and say in the text that "models are files that can be tracked by DVC".

@iesahin iesahin added the C: start Content of /doc/start label Oct 20, 2021
@jorgeorpinel jorgeorpinel added the ✨ epic Placeholder ticket for multi-sprint direction, use story, improvement label Mar 10, 2022
@jorgeorpinel jorgeorpinel removed the p1-important Active priorities to deal within next sprints label Mar 30, 2022
@jorgeorpinel

This comment was marked as resolved.

@jorgeorpinel jorgeorpinel removed the C: start Content of /doc/start label Mar 30, 2022
@jorgeorpinel jorgeorpinel changed the title start: Data Management Trail start: Data Managemen Apr 27, 2022
@jorgeorpinel
Contributor

jorgeorpinel commented Apr 27, 2022

Data Mgmt is simple enough that covering it in the Get Started and Command Reference has thus far been enough. But having a group of existing content under this "Category" could achieve some goals:

So just doing that reorg of existing content could be a good and quick first step, I think. Then we reconsider all the material proposed above. WDYT?

@jorgeorpinel jorgeorpinel added the C: guide Content of /doc/user-guide label Apr 27, 2022
@jorgeorpinel jorgeorpinel changed the title start: Data Managemen guide: Data Managemen Apr 27, 2022
@jorgeorpinel jorgeorpinel changed the title guide: Data Managemen guide: Data Management Apr 27, 2022
@jorgeorpinel
Contributor

re Help reorg existing content

Based on the OP here's what we currently have for all the topics mentioned:

Adding data to DVC projects
& Versioning data in DVC projects

The cache (local)

& Shared cache (external)

Removing data from DVC projects

Creating remotes (link to config/remotes)
& Sync with remotes

Accessing public datasets and data registries (get, import)

External data topics

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 14, 2022

Given that all that is already covered (albeit maybe disorganized) and not really the general goal of the UG (explanation-type docs), here's a new plan for the Data Management user guide:

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 14, 2022

Idea: Emphasize the value by contrasting typical/ad hoc methods vs DVC project structures (before/after)

My only problem with this idea is that we should convey the value of the product and its features earlier than the user guide. This should be in the use cases, in the Get Started if needed, or even in the README. It's OK to repeat it in the UG too, but only as a quick recap. @shcheklein


4 participants