Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release 3.0 #7093

Closed
17 of 18 tasks
efiop opened this issue Dec 5, 2021 · 64 comments
Closed
17 of 18 tasks

Release 3.0 #7093

efiop opened this issue Dec 5, 2021 · 64 comments
Assignees
Labels
epic Umbrella issue (high level). Include here: list of tasks/PRs, details, Qs, timelines, etc p1-important Important, aka current backlog of things to do
Milestone

Comments

@efiop
Copy link
Contributor

efiop commented Dec 5, 2021

Data Management

  1. p1-important
    pmrowla
  2. pmrowla

Experiments and Pipelines

  1. A: experiments discussion
    daavoo dberenbaum

Deprecations

Other release blockers

Studio Readiness for 3.0 release

@efiop efiop changed the title 3.0 checklist 3.0 release checklist Dec 5, 2021
@dmpetrov

This comment was marked as resolved.

@efiop

This comment was marked as resolved.

@pared

This comment was marked as resolved.

@dberenbaum

This comment was marked as resolved.

@pared

This comment was marked as resolved.

@daavoo

This comment was marked as resolved.

@pared

This comment was marked as resolved.

@dberenbaum

This comment was marked as resolved.

@karajan1001

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@dberenbaum

This comment was marked as duplicate.

@dtrifiro

This comment was marked as resolved.

@dberenbaum

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@jorgeorpinel

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@daavoo

This comment was marked as resolved.

@dberenbaum

This comment was marked as resolved.

@dberenbaum

This comment was marked as resolved.

@skshetry

This comment was marked as resolved.

@dberenbaum
Copy link
Collaborator

Some possible ideas to add looking through dvc add options:

  • Deprecate dvc add --file (obscure and more likely to confuse than help)
  • Rename dvc add --jobs (frequently gets confused with core.checksum_jobs)

@skshetry
Copy link
Member

skshetry commented May 26, 2023

Two more proposals from my side:

  • Remove Stage-level vars. I haven't seen this being used and makes implementation complex.
  • Make dvc exp push work by default, or remove the need for explicit remote argument.
    (If we make dvc exp push work by default, what should it push?)

@dberenbaum
Copy link
Collaborator

* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used and makes implementation complex.

😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876

It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?

* Make `dvc exp push` work by default, or remove the need for explicit `remote` argument.
  (If we make `dvc exp push` work by default, what should it push?)

I've come around to not minding the explicit remote requirement since it has occasionally saved me from accidentally pushing to origin when it was an upstream remote (like when I clone a demo project but don't want to mess with the studio demo). Especially now that we have VS Code to make pushing easier, I don't mind keeping the CLI explicit.

@dberenbaum

This comment was marked as off-topic.

@dberenbaum
Copy link
Collaborator

* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used and makes implementation complex.

😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876

It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?

@skshetry It turns out they aren't needed in that case, so I'm on board to drop them.

@aschuh-hf

This comment was marked as off-topic.

@dberenbaum

This comment was marked as off-topic.

@aschuh-hf

This comment was marked as off-topic.

@efiop
Copy link
Contributor Author

efiop commented Jun 4, 2023

One last thing we forgot to include is to stop validating cache on every minor operation. This is a leftover from the past where symlink/hardlink were default and you ran a greater risk to corrupt your cache. These days we default to reflink/copy, so there is no reason really to validate cache on every operation like checkout. We should instead only validate cache where we really need to (before push?) and just introduce dvc cache check (or something like that) for manual validation.

In practical terms, I don't think we document that certain operations check cache, so we are not really tied to 3.0 release and can just stop doing it whenever.

@dberenbaum
Copy link
Collaborator

In practical terms, I don't think we document that certain operations check cache, so we are not really tied to 3.0 release and can just stop doing it whenever.

Why don't we create a separate issue for it then and just make it high priority so we get to it soon?

@skshetry
Copy link
Member

skshetry commented Jun 7, 2023

Stage-level vars was created to support loading different parameters for different stages. Take following as an example:

stages:
  train_model:
    foreach: [us, uk]
    do:
      wdir: ${item}
      vars: [params.yaml]  # uk/params.yaml, us/params.yaml, etc.
      cmd: cat params.yaml

But I don't find it very elegant, is cumbersome, has a lot of complexity and haven't seen anyone use this. Hydra support may have made this redundant.

@sjawhar
Copy link
Contributor

sjawhar commented Jun 7, 2023

Stage-level vars was created to support loading different parameters for different stages. Take following as an example:

stages:
  train_model:
    foreach: [us, uk]
    do:
      wdir: ${item}
      vars: [params.yaml]  # uk/params.yaml, us/params.yaml, etc.
      cmd: cat params.yaml

But I don't find it very elegant, is cumbersome, has a lot of complexity and haven't seen anyone use this. Hydra support may have made this redundant.

We actually use this feature extensively. Our piplines tend to have a lot of this:

stages:
  preprocess_feature_one:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_one
          input_dir: load
      <<: &stage_preprocess
        cmd: >-
          python src/preprocess.py
          --params-file=params.yaml
          ${preprocessing_key}
          data/interim/${input_dir}/${key}
          data/interim/${preprocessing_key}/${key}
        params:
          - preprocess.${preprocessing_key}
        deps:
          - ../../poetry.lock
          - data/interim/${input_dir}/${key}
          - src/preprocess.py
        outs:
          - data/interim/${preprocessing_key}/${key}

  preprocess_feature_two:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_two
          input_dir: feature_one
      <<: *stage_preprocess

While removing this functionality might not completely block us from upgrading to DVC 3.x—one could copy/paste the stage definition with small changes—it would definitely increase the maintenance burden of our pipelines and make mistakes much easier to make (oops, forgot to change X in all the stages). It's possible I'm overlooking some functionality made possible by hydra?

I might have even gloated about our use of this pattern in my last DVC office hours 😅

@efiop
Copy link
Contributor Author

efiop commented Jun 7, 2023

@dberenbaum Created #9561

@dberenbaum
Copy link
Collaborator

Thanks @efiop! Does #9561 relate to 3.0 items?

@dberenbaum
Copy link
Collaborator

@sjawhar That's a really creative way to do things! If we were to implement #5172, do you think that would solve the same use case? From what I can tell, you are limited by needing to loop over both sessions and features.

@skshetry WDYT about the comment from @sjawhar above? What do you think would be the best way to do it?

@efiop
Copy link
Contributor Author

efiop commented Jun 8, 2023

@dberenbaum I will try to create dvc cache/remote check if I have time this week and if I will - it will be nice to drop cache checks in a few places right away. If not - i think we can do that later, even though that's not ideal.

@skshetry
Copy link
Member

skshetry commented Jun 13, 2023

@sjawhar, that's an interesting way to do nested looping with merge-key.

How many sessions do you run? If they aren't that many, I'll suggest using a (duplicated) map/dictionary and use that loop.

# params.yaml
sessions:
  feature1_session1:
    key: feature_one
    input_dir: feature_one
  feature1_session2:
    key: feature_one
    input_dir: feature_one
  feature2_session1:
    key: feature_two
    input_dir: feature_two
  feature2_session2:
    key: feature_two
    input_dir: feature_two

This might be more readable, although you can still use the same merge-key to avoid duplications.

sessions:
  feature1_session1: &feature1
    key: feature_one
    input_dir: feature_one
  feature1_session2: *feature1
  ...
  feature2_session1: &feature2
    key: feature_two
    input_dir: feature_two
  feature2_session2: *feature2
  ...

#5172 might be a better solution to fix this.

@sjawhar
Copy link
Contributor

sjawhar commented Jun 13, 2023

@sjawhar, that's an interesting way to do nested looping with merge-key.

How many sessions do you run? If they aren't that many, I'll suggest using a (duplicated) map/dictionary and use that loop.

@skshetry When we use DVC in a data registry pattern, there could be hundreds or thousands of sessions. Yes, though, you're right that we could have an additional step the one runs before running dvc commands which generates this params.yaml, but supporting it natively in DVC seems more logical (to me at least)

@daavoo daavoo closed this as completed Jun 14, 2023
@dberenbaum dberenbaum unpinned this issue Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Umbrella issue (high level). Include here: list of tasks/PRs, details, Qs, timelines, etc p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
Development

No branches or pull requests