ref: document import-url cloud versioning changes #4142
file. By default, DVC will automatically capture cloud versioning information
if the URL contains a cloud versioning ID. When `--version-aware` is provided
along with a URL that does not contain a cloud versioning ID, DVC will capture
the latest version of the file.
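The behavior described in this hunk can be sketched as a small decision function. This is only an illustration, not DVC's implementation: the function name `resolve_version`, the `latest_version_id` parameter, and the assumption that the version ID travels in a `versionId` query parameter are all hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def resolve_version(
    url: str,
    version_aware: bool = False,
    latest_version_id: Optional[str] = None,
) -> Optional[str]:
    """Return the version ID an import would record, or None if unversioned.

    Hypothetical helper: `latest_version_id` stands in for whatever the
    cloud provider reports as the newest version of the object.
    """
    # Assumption: the version ID is carried as a `versionId` query parameter.
    ids = parse_qs(urlsplit(url).query).get("versionId")
    if ids:
        # The URL already pins a version, so it is captured automatically.
        return ids[0]
    if version_aware:
        # No ID in the URL, but --version-aware asks for the latest version.
        return latest_version_id
    # Neither condition holds: no cloud versioning information is recorded.
    return None
```

A URL that carries an ID is pinned regardless of the flag, while the flag only matters for URLs without one.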
Let's also explain that DVC will pull that version from the source location even if it's overwritten, and will not push another copy of it to the remote.
cc @jorgeorpinel Is there somewhere in the data management user guide where we want to add this info as well?
We'll def. need UG updates to go over cloud versioning (feel free to make a separate docs issue) -- can't explain everything in an option text. For now I'd focus on what the flag does, and put some explanations in the Description (which in this case is already super long and should be rewritten/ moved to UG eventually).
P.S. Specifically, there is some draft content about this in https://github.com/iterative/dvc.org/pull/4119/files#diff-d01612907e4ab14238625d537eaf42852d8566901d1bfe2fd3f2d4406a2d1dfc ATM.
Co-authored-by: Restyled.io <[email protected]>
@@ -265,6 +265,7 @@ These include a subset of the fields in `.dvc` file
| `persist` | Whether the output file/dir should remain in place during `dvc repro` (`false` by default: outputs are deleted when `dvc repro` starts) |
| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
| `desc` | (Optional) User description for this output. This doesn't affect any DVC operations. |
| `push` | Whether or not this file or directory, when previously <abbr>cached</abbr>, is uploaded to remote storage by `dvc push` (`true` by default). |
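As a rough illustration of the `push` field's default, the selection it implies can be sketched as a filter over output entries. This is not DVC's code: the function `outputs_to_push` and the sample paths are hypothetical, and the entries are plain dictionaries standing in for parsed output definitions.

```python
from typing import Any, Dict, List


def outputs_to_push(outs: List[Dict[str, Any]]) -> List[str]:
    """Return the paths of outputs that would be uploaded.

    Hypothetical helper: an entry without a `push` key is treated as
    `push: true`, matching the default stated in the table row.
    """
    return [out["path"] for out in outs if out.get("push", True)]


# Sample entries (made-up paths): raw data, an intermediate output that
# opts out of uploading, and a final model.
outs = [
    {"path": "data/raw.csv"},
    {"path": "data/features.parquet", "push": False},
    {"path": "model.pkl", "push": True},
]

print(outputs_to_push(outs))  # ['data/raw.csv', 'model.pkl']
```

Only the entry that explicitly sets `push` to `false` is skipped, which is the opt-out pattern for intermediate outputs raised in the comments that follow.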
Echoing iterative/dvc#8581 (comment):
Should we plan to recommend this a lot in Data Pipeline docs? Specifically for intermediate pipeline outputs. Assuming the happy path out there is to push only raw data and likely final ML model files (everything else may be best to `dvc repro` when needed).
If we don't at least emphasize the possibility, users may realize too late they have pushed a bunch of intermediate output versions and they are pretty difficult to clean up with `dvc gc` (support example).
I'm not sure that not pushing is the right default behavior, even for intermediate outputs. If the user wants to take advantage of run-cache to not re-run stages that have already been reproduced, they still need to push/pull intermediate outs
@jorgeorpinel Thinking about it some more, I like the suggestion and think it makes sense as a possible product direction to make it easier to get started with pipelines, so let's brainstorm more on it.
Related to #4089
fbb8a47 to e3255f2
For stages created with `dvc import-url` and a cloud-versioned URL, `--rev`
can be used to specify an object version ID to use. By default, the import will
be updated to the latest version from cloud storage.
Minor: link back to the `--version-aware` flag in `import-url`?
@dberenbaum @pmrowla qq - what is the status of this, and the cloud versioning in general?
It's on my plate to review and merge the outstanding docs PRs (here and #4165, and tracked more broadly in #4089). We discussed in sprint meeting and agreed we could merge now with an admon that it's experimental. In Q1, the plan is to address performance issues and publicly promote it. Initial thoughts on blog post are in Notion.
* guide: draft structure of Data Mgmt and some updates around the topic in existing docs * guide: full text for draft intro to DM * guide: hide cloud versioning info per #4042 (review) * guide: clarify Data Mgmt parts and add prospective figure titles * guide: add figure drafts to Data Mgmt * guide: SCM->VC (Data Mgmt) * guide: update 2 figs and add 1 more (Data Mgmt) * guide: roll back unrelated changes per #4042 (review) * guide: mention clouds first (DM) and and update fig. 1 per #4042 (review) * guide: flatten DM index per #4042 (review) * guide: udpates to DM/ DV moved from #4053 (review) * guide: add DM/ Data Versioning page per #4042 (comment) * guide: update outdated link * guide: revert more unrelatedly chaqnged files per #4042 (review) * guide: remove unused ref link * guide: DM/ Remote Storage (not just Setup) and and some links from cmd refs and avoid term "data remote" and some admons nearby... * guide: remove a comment * guide: draft for DM/ Remote Storage content * ref: expand config.remote and link to/from Remotes guide * ref: fix remote config file examples * guide: complete Remote Config section and and add Project config section to DM/ DV guide * ref: rewrite remote add and modify Descs * guide: complete list of supported storage types * ref: rewrite remote index page from extracted from #4053 * guide: clarify `remote modify` phrase in in the Remote config section of DM/ Remote Storage * Update content/docs/user-guide/data-management/data-versioning.md * guide: update versioning config per #4058 (review) * guide: don't call remote storage "additional" here (in the DM/ Remote Storage guide) per #4058 (review) Co-authored-by: Dave Berenbaum <[email protected]> * guide: pull -> download (DM/ RS intro) * guide: remove "optional" from Remote Storage nav & title per #4058 (review) * guide: splits and notes around Data Mgmt index page rel. 
#4042 (comment) * guide: Data Mgmt intro + note updates * guide: draft of all contents + + remove comments * guide: small impros to Data Mgmt in prep for #4042 (review) * guide: rewrite Data Mgmt index in before/after form per #4042 (review) * guide: add draft figure for Data Mgmt * guide: simplify/refocus data mgmt index per #4042 (review) * work around commented header bug * guide: drop DM/ DV page * guide: rewrite DM intro and - hide benefits (for now) - remove codification comment block * guide: use DM table instead of figure for now * guide: rewrite Data Mgmt story * guide: add draft figures to Data Mgmt * guide: simplify Data Mgmt story and benefits * guide: remove unused images (DM) * guide: update Data Mgmt figures (v1) * guide: rewrite text of Data Mgmt index * guide: update Data Mgmt figures * guide: iterate on Data Mgmt again * guide: update Data Mgmt figs * guide: more supporting info about Data Mgmt * guide: update figures (much more concrete) and and matching text updates * guide: edits to How it works (Data Mgmt) * guide: update Data Mgmt figures Rel. #4042 (comment) * guide: emphaisze dataset versions in UG fig 1 Rel. #4042 (comment) * guide: update Data Mgmt figures (with notes), expand img captions, and update text accordingly. * guide: more updates to text and figure styles, esp. to the first half and comment some stuff out (temporary) * guide: update figures and text (Data Mgmt) ... Using a tabs toggle for the 2nd fig. * guide: Data Management text (section 1) finalized for this version of figures * guide: Data Management (main text) finalized for this version of figures * guide: Data Management (secondary text) pending diagram and code sample(s) * guide: add DVC data mgmt technical diagram & dummy sample CLI blocks * guide: update Data Mgmt text * guide: udpate text and 2nd figure (Data Mgmt) * guide: draft 2nd and 3rd figures * guide: rewrite Data Mgmt/ How it works & and Benefits/ Tradeoffs Probably still unfinished... 
Missing more data versioning info? See HTML comments. * guide: update drafts of Data Mgmt figures 2, 3 * guide: Data Mgmt improvements and hide the benefits list for now * guide: separate from Data Mgmt work Rel. #4042 * Apply suggestions from code review * Merge branch main + * ref: bring cloud versioning copy edits of import-url from https://github.com/iterative/dvc.org/pull/4260/files#diff-ef95e18c4bd039757695065a23946dc27e28b4727ce07c670cdc096e34dbe3b3 * ref: clarify import-url with cloud versioning per #4142 (review) * ref: updates to import-url --version-aware and update --rev * ref: add import-url --version aware to Synopsis per #4089 (comment) * Restyled by prettier (#4266) Co-authored-by: Restyled.io <[email protected]> * ref: updates around worktree updates (cloud versioning) * ref: link from `remote` (index) to storage types * guide: roll back changes to dvc.yaml `rev` field spec * Update content/docs/command-reference/update.md * guide: link refs in .dvc file spec * Restyled by prettier (#4319) Co-authored-by: Restyled.io <[email protected]> * Update content/docs/command-reference/update.md --------- Co-authored-by: Dave Berenbaum <[email protected]> Co-authored-by: rogermparent <[email protected]> Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com> Co-authored-by: Restyled.io <[email protected]>
* guide: draft structure of Data Mgmt and some updates around the topic in existing docs * guide: full text for draft intro to DM * guide: hide cloud versioning info per #4042 (review) * guide: clarify Data Mgmt parts and add prospective figure titles * guide: add figure drafts to Data Mgmt * guide: SCM->VC (Data Mgmt) * guide: update 2 figs and add 1 more (Data Mgmt) * guide: roll back unrelated changes per #4042 (review) * guide: mention clouds first (DM) and and update fig. 1 per #4042 (review) * guide: flatten DM index per #4042 (review) * guide: udpates to DM/ DV moved from #4053 (review) * guide: add DM/ Data Versioning page per #4042 (comment) * guide: update outdated link * guide: revert more unrelatedly chaqnged files per #4042 (review) * guide: remove unused ref link * guide: DM/ Remote Storage (not just Setup) and and some links from cmd refs and avoid term "data remote" and some admons nearby... * guide: remove a comment * guide: draft for DM/ Remote Storage content * ref: expand config.remote and link to/from Remotes guide * ref: fix remote config file examples * guide: complete Remote Config section and and add Project config section to DM/ DV guide * ref: rewrite remote add and modify Descs * guide: complete list of supported storage types * ref: rewrite remote index page from extracted from #4053 * guide: clarify `remote modify` phrase in in the Remote config section of DM/ Remote Storage * Update content/docs/user-guide/data-management/data-versioning.md * guide: update versioning config per #4058 (review) * guide: don't call remote storage "additional" here (in the DM/ Remote Storage guide) per #4058 (review) Co-authored-by: Dave Berenbaum <[email protected]> * guide: pull -> download (DM/ RS intro) * guide: remove "optional" from Remote Storage nav & title per #4058 (review) * guide: splits and notes around Data Mgmt index page rel. 
#4042 (comment) * guide: Data Mgmt intro + note updates * guide: draft of all contents + + remove comments * guide: small impros to Data Mgmt in prep for #4042 (review) * guide: rewrite Data Mgmt index in before/after form per #4042 (review) * guide: add draft figure for Data Mgmt * guide: simplify/refocus data mgmt index per #4042 (review) * work around commented header bug * guide: drop DM/ DV page * guide: rewrite DM intro and - hide benefits (for now) - remove codification comment block * guide: use DM table instead of figure for now * guide: rewrite Data Mgmt story * guide: add draft figures to Data Mgmt * guide: simplify Data Mgmt story and benefits * guide: remove unused images (DM) * guide: update Data Mgmt figures (v1) * guide: rewrite text of Data Mgmt index * guide: update Data Mgmt figures * guide: iterate on Data Mgmt again * guide: update Data Mgmt figs * guide: more supporting info about Data Mgmt * guide: update figures (much more concrete) and and matching text updates * guide: edits to How it works (Data Mgmt) * guide: update Data Mgmt figures Rel. #4042 (comment) * guide: emphaisze dataset versions in UG fig 1 Rel. #4042 (comment) * guide: update Data Mgmt figures (with notes), expand img captions, and update text accordingly. * guide: more updates to text and figure styles, esp. to the first half and comment some stuff out (temporary) * guide: update figures and text (Data Mgmt) ... Using a tabs toggle for the 2nd fig. * guide: Data Management text (section 1) finalized for this version of figures * guide: Data Management (main text) finalized for this version of figures * guide: Data Management (secondary text) pending diagram and code sample(s) * guide: add DVC data mgmt technical diagram & dummy sample CLI blocks * guide: update Data Mgmt text * guide: udpate text and 2nd figure (Data Mgmt) * guide: draft 2nd and 3rd figures * guide: rewrite Data Mgmt/ How it works & and Benefits/ Tradeoffs Probably still unfinished... 
Missing more data versioning info? See HTML comments. * guide: update drafts of Data Mgmt figures 2, 3 * guide: Data Mgmt improvements and hide the benefits list for now * guide: separate from Data Mgmt work Rel. #4042 * Apply suggestions from code review * Merge branch main + * ref: update links from API to Remotes guide * guide: update links around Remote Storage and and other updates to nearby Markdown (e.g. proper admons) * Roll back unrelated changes * Restyled by prettier (#4261) Co-authored-by: Restyled.io <[email protected]> * ref: bring cloud versioning copy edits of import-url from https://github.com/iterative/dvc.org/pull/4260/files#diff-ef95e18c4bd039757695065a23946dc27e28b4727ce07c670cdc096e34dbe3b3 * ref: clarify import-url with cloud versioning per #4142 (review) * ref: updates to import-url --version-aware and update --rev * ref: add import-url --version aware to Synopsis per #4089 (comment) * Restyled by prettier (#4266) Co-authored-by: Restyled.io <[email protected]> * Restyled by prettier (#4322) Co-authored-by: Restyled.io <[email protected]> * Update content/docs/command-reference/remote/modify.md Co-authored-by: Oded Messer <[email protected]> * Update content/docs/command-reference/remote/modify.md Co-authored-by: Oded Messer <[email protected]> * Update content/docs/command-reference/push.md Co-authored-by: Oded Messer <[email protected]> * yarn format-all --------- Co-authored-by: Dave Berenbaum <[email protected]> Co-authored-by: rogermparent <[email protected]> Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com> Co-authored-by: Restyled.io <[email protected]> Co-authored-by: Oded Messer <[email protected]>
Per `files` entry for cloud versioned dir dependencies dvc#8528?