guide: Basic Operations (Data Mgmt) #4053

Closed
wants to merge 39 commits into from

jorgeorpinel
Contributor

@jorgeorpinel jorgeorpinel commented Oct 19, 2022

To finish addressing the 2nd check box in #2856 (comment). Planned structure:

  • Tracking
  • Synchronizing
    • Other forms of access? (list, gets, imports & update)
  • Versioning (unifying aspect)

Main file to review: UG/data-management/track-sync-version.md

In review app: https://dvc-org-guide-data-mgmt-epdkkq.herokuapp.com/doc/user-guide/data-management/track-sync-data

UPDATE: Above done ✅



@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide labels Oct 19, 2022
@github-actions
Contributor

github-actions bot commented Oct 19, 2022

dba203e

Link Check Report

All 32 links passed!

comments with the commands related to Sync'ing and Versioning
figure placeholder in Basic Ops page
@jorgeorpinel jorgeorpinel marked this pull request as ready for review October 20, 2022 06:55
@jorgeorpinel jorgeorpinel removed the p1-important Active priorities to deal within next sprints label Feb 18, 2023
[more details].

Putting it all together, we can get an overview of the data in a project with
`dvc data status`. This will list changes to DVC-tracked data as well as files
@jorgeorpinel jorgeorpinel marked this pull request as draft February 20, 2023 02:37
@jorgeorpinel
Contributor Author

Alright. So this one is done as originally intended (see OP).

Questions:

I'll leave it up to you, @dberenbaum @shcheklein. Turned it into a draft for now. Thanks

@jorgeorpinel

This comment was marked as resolved.

jorgeorpinel added a commit that referenced this pull request Feb 20, 2023
* guide: draft structure of Data Mgmt and
some updates around the topic in existing docs

* guide: full text for draft intro to DM

* guide: hide cloud versioning info
per #4042 (review)

* guide: clarify Data Mgmt parts and
add prospective figure titles

* guide: add figure drafts to Data Mgmt

* guide: SCM->VC (Data Mgmt)

* guide: update 2 figs and add 1 more (Data Mgmt)

* guide: roll back unrelated changes
per #4042 (review)

* guide: mention clouds first (DM) and

and update fig. 1
per #4042 (review)

* guide: flatten DM index
per #4042 (review)

* guide: updates to DM/ DV
moved from #4053 (review)

* guide: add DM/ Data Versioning page

per #4042 (comment)

* guide: update outdated link

* guide: revert more unrelatedly changed files

per #4042 (review)

* guide: remove unused ref link

* guide: DM/ Remote Storage (not just Setup) and

and some links from cmd refs
and avoid term "data remote"
and some admons nearby...

* guide: remove a comment

* guide: draft for DM/ Remote Storage content

* ref: expand config.remote and link to/from Remotes guide

* ref: fix remote config file examples

* guide: complete Remote Config section and

and add Project config section to DM/ DV guide

* ref: rewrite remote add and modify Descs

* guide: complete list of supported storage types

* ref: rewrite remote index page from

extracted from #4053

* guide: clarify `remote modify` phrase in

in the Remote config section of DM/ Remote Storage

* Update content/docs/user-guide/data-management/data-versioning.md

* guide: update versioning config

per #4058 (review)

* guide: don't call remote storage "additional" here

(in the DM/ Remote Storage guide)
per #4058 (review)

Co-authored-by: Dave Berenbaum <[email protected]>

* guide: pull -> download (DM/ RS intro)

* guide: remove "optional" from Remote Storage nav & title

per #4058 (review)

* guide: splits and notes around Data Mgmt index page

rel. #4042 (comment)

* guide: Data Mgmt intro + note updates

* guide: draft of all contents +

+ remove comments

* guide: small impros to Data Mgmt

in prep for #4042 (review)

* guide: rewrite Data Mgmt index in before/after form

per #4042 (review)

* guide: add draft figure for Data Mgmt

* guide: simplify/refocus data mgmt index

per #4042 (review)

* work around commented header bug

* guide: drop DM/ DV page

* guide: rewrite DM intro and

- hide benefits (for now)
- remove codification comment block

* guide: use DM table instead of figure for now

* guide: rewrite Data Mgmt story

* guide: add draft figures to Data Mgmt

* guide: simplify Data Mgmt story and benefits

* guide: remove unused images (DM)

* guide: update Data Mgmt figures (v1)

* guide: rewrite text of Data Mgmt index

* guide: update Data Mgmt figures

* guide: iterate on Data Mgmt again

* guide: update Data Mgmt figs

* guide: more supporting info about Data Mgmt

* guide: update figures (much more concrete) and

and matching text updates

* guide: edits to How it works (Data Mgmt)

* guide: update Data Mgmt figures

Rel. #4042 (comment)

* guide: emphasize dataset versions in UG fig 1

Rel. #4042 (comment)

* guide: update Data Mgmt figures (with notes),

expand img captions,
and update text accordingly.

* guide: more updates to text and figure styles,

esp. to the first half
and comment some stuff out (temporary)

* guide: update figures and text (Data Mgmt) ...

Using a tabs toggle for the 2nd fig.

* guide: Data Management text (section 1)

finalized for this version of figures

* guide: Data Management (main text)

finalized for this version of figures

* guide: Data Management (secondary text)

pending diagram and code sample(s)

* guide: add DVC data mgmt technical diagram &

dummy sample CLI blocks

* guide: update Data Mgmt text

* guide: update text and 2nd figure (Data Mgmt)

* guide: draft 2nd and 3rd figures

* guide: rewrite Data Mgmt/ How it works &

and Benefits/ Tradeoffs

Probably still unfinished... Missing more data versioning info? See HTML comments.

* guide: update drafts of Data Mgmt figures 2, 3

* guide: Data Mgmt improvements and

hide the benefits list for now

* guide: separate from Data Mgmt work

Rel. #4042

* Apply suggestions from code review

* Merge branch main +

* other: links to Remotes guide

* install: Remote Storage guide links

* start: Remote Storage guide links +

* guide: links to Remote Storage page

* Restyled by prettier (#4323)

Co-authored-by: Restyled.io <[email protected]>

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: rogermparent <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
@shcheklein

This comment was marked as resolved.

@jorgeorpinel

This comment was marked as resolved.

@shcheklein

This comment was marked as resolved.

@jorgeorpinel

This comment was marked as resolved.

@jorgeorpinel

This comment was marked as resolved.

Contributor Author

@jorgeorpinel jorgeorpinel left a comment

Alright the PR looks good now. See the pending questions above though. Summary of changes thus far:

Comment on lines +1 to +4
# Track and Sync Versioned Data & Models

The fundamental workflow of most <abbr>DVC projects</abbr> includes the
following **basic operations**. These can be performed directly (as we cover
Contributor Author

This is the main file being contributed.

Comment on lines +37 to +47
and [data sync operations], and provides data de-duplication at the file level.
However, this comes with the drawback of losing human-readable filenames without
the use of the DVC CLI (`dvc get --show-url`) or API (`dvc.api.get_url()`).

When using cloud versioning, DVC does not provide de-duplication, and certain
remote storage performance optimizations will be unavailable.

[content-addressable storage]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
[data sync operations]:
/doc/user-guide/data-management/track-sync-data#synchronizing-data
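
To make the trade-off in the hunk above concrete (file-level de-duplication versus losing human-readable filenames), here is a minimal sketch of a content-addressable store. It is a toy illustration only, not DVC's implementation: the `ContentStore` class, the example paths, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store (hypothetical, not DVC's actual cache)."""

    def __init__(self):
        self.blobs = {}  # content hash -> raw bytes (one copy per unique content)
        self.index = {}  # human-readable path -> content hash

    def add(self, path, data):
        # The storage key is derived from the bytes, not from the filename.
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.blobs.setdefault(digest, data)
        self.index[path] = digest
        return digest

    def read(self, path):
        # The original name is recoverable only through the separate index.
        return self.blobs[self.index[path]]


store = ContentStore()
train = store.add("data/train.csv", b"id,label\n1,cat\n")
copy = store.add("backup/train_copy.csv", b"id,label\n1,cat\n")
test = store.add("data/test.csv", b"id,label\n2,dog\n")

print(len(store.index), "tracked paths,", len(store.blobs), "stored blobs")
```

Two of the three paths share one blob because their bytes are identical. The blob keys themselves carry no filename, which is why the text points to a lookup step (the DVC CLI or API) for mapping names to stored objects.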
Contributor Author

Most other changes just add links to the new page (mainly to the Sync section).

- ## Basic workflow: store as peristent commits
+ ## Basic workflow: store as persistent commits
Contributor Author

@jorgeorpinel jorgeorpinel Feb 21, 2023

This is an unrelated fix, oops. Found it when looking for places to link from (but didn't end up linking in this file).

shcheklein pushed a commit that referenced this pull request Mar 9, 2023
* guide: draft structure of Data Mgmt and
some updates around the topic in existing docs

* guide: full text for draft intro to DM

* guide: hide cloud versioning info
per #4042 (review)

* guide: clarify Data Mgmt parts and
add prospective figure titles

* guide: add figure drafts to Data Mgmt

* guide: SCM->VC (Data Mgmt)

* guide: update 2 figs and add 1 more (Data Mgmt)

* guide: roll back unrelated changes
per #4042 (review)

* guide: mention clouds first (DM) and

and update fig. 1
per #4042 (review)

* guide: flatten DM index
per #4042 (review)

* guide: updates to DM/ DV
moved from #4053 (review)

* guide: add DM/ Data Versioning page

per #4042 (comment)

* guide: update outdated link

* guide: revert more unrelatedly changed files

per #4042 (review)

* guide: remove unused ref link

* guide: DM/ Remote Storage (not just Setup) and

and some links from cmd refs
and avoid term "data remote"
and some admons nearby...

* guide: remove a comment

* guide: draft for DM/ Remote Storage content

* ref: expand config.remote and link to/from Remotes guide

* ref: fix remote config file examples

* guide: complete Remote Config section and

and add Project config section to DM/ DV guide

* ref: rewrite remote add and modify Descs

* guide: complete list of supported storage types

* ref: rewrite remote index page from

extracted from #4053

* guide: clarify `remote modify` phrase in

in the Remote config section of DM/ Remote Storage

* Update content/docs/user-guide/data-management/data-versioning.md

* guide: update versioning config

per #4058 (review)

* guide: don't call remote storage "additional" here

(in the DM/ Remote Storage guide)
per #4058 (review)

Co-authored-by: Dave Berenbaum <[email protected]>

* guide: pull -> download (DM/ RS intro)

* guide: remove "optional" from Remote Storage nav & title

per #4058 (review)

* guide: splits and notes around Data Mgmt index page

rel. #4042 (comment)

* guide: Data Mgmt intro + note updates

* guide: draft of all contents +

+ remove comments

* guide: small impros to Data Mgmt

in prep for #4042 (review)

* guide: rewrite Data Mgmt index in before/after form

per #4042 (review)

* guide: add draft figure for Data Mgmt

* guide: simplify/refocus data mgmt index

per #4042 (review)

* work around commented header bug

* guide: drop DM/ DV page

* guide: rewrite DM intro and

- hide benefits (for now)
- remove codification comment block

* guide: use DM table instead of figure for now

* guide: rewrite Data Mgmt story

* guide: add draft figures to Data Mgmt

* guide: simplify Data Mgmt story and benefits

* guide: remove unused images (DM)

* guide: update Data Mgmt figures (v1)

* guide: rewrite text of Data Mgmt index

* guide: update Data Mgmt figures

* guide: iterate on Data Mgmt again

* guide: update Data Mgmt figs

* guide: more supporting info about Data Mgmt

* guide: update figures (much more concrete) and

and matching text updates

* guide: edits to How it works (Data Mgmt)

* guide: update Data Mgmt figures

Rel. #4042 (comment)

* guide: emphasize dataset versions in UG fig 1

Rel. #4042 (comment)

* guide: update Data Mgmt figures (with notes),

expand img captions,
and update text accordingly.

* guide: more updates to text and figure styles,

esp. to the first half
and comment some stuff out (temporary)

* guide: update figures and text (Data Mgmt) ...

Using a tabs toggle for the 2nd fig.

* guide: Data Management text (section 1)

finalized for this version of figures

* guide: Data Management (main text)

finalized for this version of figures

* guide: Data Management (secondary text)

pending diagram and code sample(s)

* guide: add DVC data mgmt technical diagram &

dummy sample CLI blocks

* guide: update Data Mgmt text

* guide: update text and 2nd figure (Data Mgmt)

* guide: draft 2nd and 3rd figures

* guide: rewrite Data Mgmt/ How it works &

and Benefits/ Tradeoffs

Probably still unfinished... Missing more data versioning info? See HTML comments.

* guide: update drafts of Data Mgmt figures 2, 3

* guide: Data Mgmt improvements and

hide the benefits list for now

* guide: separate from Data Mgmt work

Rel. #4042

* Apply suggestions from code review

* Merge branch main +

* ref: update links from API to Remotes guide

* guide: update links around Remote Storage and

and other updates to nearby Markdown (e.g. proper admons)

* Roll back unrelated changes

* Restyled by prettier (#4261)

Co-authored-by: Restyled.io <[email protected]>

* ref: bring cloud versioning copy edits of import-url

from
https://github.com/iterative/dvc.org/pull/4260/files#diff-ef95e18c4bd039757695065a23946dc27e28b4727ce07c670cdc096e34dbe3b3

* ref: clarify import-url with cloud versioning

per #4142 (review)

* ref: updates to import-url --version-aware and

update --rev

* ref: add import-url --version aware to Synopsis

per #4089 (comment)

* Restyled by prettier (#4266)

Co-authored-by: Restyled.io <[email protected]>

* Restyled by prettier (#4322)

Co-authored-by: Restyled.io <[email protected]>

* Update content/docs/command-reference/remote/modify.md

Co-authored-by: Oded Messer <[email protected]>

* Update content/docs/command-reference/remote/modify.md

Co-authored-by: Oded Messer <[email protected]>

* Update content/docs/command-reference/push.md

Co-authored-by: Oded Messer <[email protected]>

* yarn format-all

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: rogermparent <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Oded Messer <[email protected]>
@dberenbaum
Collaborator

My take on this one: I don't see anywhere else in the data management guide that we have the basic workflow explained, so I think this one is useful for that. Happy to try and get it polished if you agree @shcheklein.

I would consider incorporating https://dvc.org/doc/user-guide/how-to/update-tracked-data and/or https://dvc.org/doc/user-guide/how-to/stop-tracking-data.

@shcheklein
Member

@dberenbaum sounds good to me!

@dberenbaum
Collaborator

@jorgeorpinel Do you want to finish this one, or would you rather I take it over?

@jorgeorpinel
Contributor Author

Let me try to wrap it up first @dberenbaum 🙂 (Sorry for the delay)

@dberenbaum dberenbaum mentioned this pull request May 3, 2023
@tapadipti
Contributor

@dberenbaum is this PR still relevant? If yes, maybe you could complete it as you suggested?

@dberenbaum
Collaborator

@tapadipti It is relevant but it doesn't seem like we are able to prioritize it right now, so I'll close.

@dberenbaum dberenbaum closed this Jun 19, 2023
@yathomasi yathomasi deleted the guide/data-mgmt/basic-ops branch July 11, 2023 07:14
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide ⌛ status: wait-core-merge Waiting for related product PR merge/release
5 participants