Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc.yaml: add matrix-do #9725

Merged
merged 4 commits into from
Aug 4, 2023
Merged

dvc.yaml: add matrix-do #9725

merged 4 commits into from
Aug 4, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jul 12, 2023

Introduces matrix:do, fixes #5172.

stages:
  stage1:
    matrix:
      os: [windows, linux, macos]
      pyv: ["3.9", "3.10", "3.11"]
      dict:
      - arg1: 1
      - arg2: 2
      lst:
      - [1, 2]
      - [3, 4]
    cmd: echo ${item.os} ${item.pyv} ${item.dict} ${item.lst.0} ${item.lst.1}

DVC will build the name for the generated stages based on their variables.
If the variables are numbers, strings, booleans, etc, they will use their string representation, and for others, dvc will use the variable name and index.

For example, for above, os and pyv will use their respective values, whereas dict/lst will use their variable name plus index. For example, one of the instance will look like follows;

This name will be it's address (i.e. how you invoke it in dvc repro and other places), and it's identity. Like foreach:do, you can invoke the whole group with just dvc repro stage1 as well.

Example output of above

dvc stage list --all
[email protected]   No outputs or dependencies
[email protected]   No outputs or dependencies
[email protected]   No outputs or dependencies
[email protected]   No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]  No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]     No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies
[email protected]    No outputs or dependencies

@skshetry skshetry requested a review from a team July 12, 2023 06:45
@skshetry skshetry self-assigned this Jul 12, 2023
@skshetry skshetry added the feature is a feature label Jul 12, 2023
@codecov
Copy link

codecov bot commented Jul 12, 2023

Codecov Report

Patch coverage: 91.57% and project coverage change: -0.01% ⚠️

Comparison is base (bc86fa0) 90.45% compared to head (bf6b0c2) 90.45%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9725      +/-   ##
==========================================
- Coverage   90.45%   90.45%   -0.01%     
==========================================
  Files         482      483       +1     
  Lines       36719    36797      +78     
  Branches     5299     5309      +10     
==========================================
+ Hits        33213    33283      +70     
- Misses       2900     2907       +7     
- Partials      606      607       +1     
Files Changed Coverage Δ
dvc/parsing/__init__.py 96.72% <90.00%> (-2.44%) ⬇️
dvc/repo/stage.py 94.44% <100.00%> (ø)
dvc/schema.py 100.00% <100.00%> (ø)
dvc/stage/loader.py 97.56% <100.00%> (ø)
tests/func/parsing/test_matrix.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dberenbaum
Copy link
Collaborator

Do you think it makes sense to keep foreach or should this eventually replace it?

@skshetry
Copy link
Member Author

skshetry commented Jul 12, 2023

Do you think it makes sense to keep foreach or should this eventually replace it?

We did not implement matrix initially, because we thought foreach..do would be more intuitive to users, and simple compared to matrix or even nested foreach..do.

I am not sure if a lot has changed in the past 2 years. Maybe GitHub Actions has standardized the syntax/concepts now? So I am not sure about foreach in the short term. But in the future, if the users don't have difficulty picking it up, we can think of deprecating foreach.

@dberenbaum
Copy link
Collaborator

Thanks @skshetry! Looks really nice, and appreciate the example. It might be good to also include an example of accessing subitems like model below (just a thought for docs):

stages:
  stage1:
    matrix:
      os: [windows, linux, macos]
      pyv: ["3.9", "3.10", "3.11"]
      dict:
      - arg1: 1
      - arg2: 2
      lst:
      - [1, 2]
      - [3, 4]
      model:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: echo ${item.os} ${item.pyv} ${item.dict} ${item.lst.0} ${item.lst.1} --model epochs=${item.model.epochs},thresh=${item.model.thresh}

@dberenbaum
Copy link
Collaborator

We did not implement matrix initially, because we thought foreach..do would be more intuitive to users, and simple compared to matrix or even nested foreach..do.

My feeling is that having both will be more complex and confusing than asking users to learn matrix. The simple case of foreach that accepts a list instead means two different ways to use it, and I would rather just have a single way of doing things (matrix that expects a dict).

@daavoo
Copy link
Contributor

daavoo commented Jul 12, 2023

We did not implement matrix initially, because we thought foreach..do would be more intuitive to users, and simple compared to matrix or even nested foreach..do.

My feeling is that having both will be more complex and confusing than asking users to learn matrix. The simple case of foreach that accepts a list instead means two different ways to use it, and I would rather just have a single way of doing things (matrix that expects a dict).

I would personally prefer to only document matrix in the docs once this is merged

@skshetry
Copy link
Member Author

I would prefer waiting a couple of months before we do that. If we do want to promote matrix, and deprecate foreach soon, let's scrutinize the syntax a bit more, as we don't really need to do things the way foreach does?

  1. Should matrix inject item variable? Or, should it be named matrix? Or, should variables be directly addressed? i.e. ${os}, etc.
  2. We did not know anything better than foreach .. do (we started with foreach .. in which I still prefer which binds variables in-side a block). Is there a better structure than this that we can do for matrix?

Maybe the current way is fine, but I do want to touch on these issues and discuss explicitly.

@dberenbaum
Copy link
Collaborator

Great points, thanks for raising them.

  1. Or, should variables be directly addressed? i.e. ${os}, etc.

Since matrix requires a dict, I like the simplicity of this, although we would need to handle conflicts with other vars.

${matrix.os} follows GitHub but GitLab directly addresses the variables, and I don't think GitHub Actions syntax is so pervasive that we need to follow it if there's a simpler option available.

2. We did not know anything better than foreach .. do (we started with foreach .. in which I still prefer which binds variables in-side a block). Is there a better structure than this that we can do for matrix?

The simplest solution would be to get rid of the do/in block altogether like:

stages:
  stage1:
    matrix:
      os: [windows, linux, macos]
      pyv: ["3.9", "3.10", "3.11"]
      dict:
      - arg1: 1
      - arg2: 2
      lst:
      - [1, 2]
      - [3, 4]
    cmd: echo ${item.os} ${item.pyv} ${item.dict} ${item.lst.0} ${item.lst.1}

@sjawhar
Copy link
Contributor

sjawhar commented Jul 12, 2023

Wanted to chime in with a random thought that came to me just now. Imagine something like this:

# params.yaml
foo:
  bar: [a,b,c,d,e]
  baz: [1,2,3]
# dvc.yaml
stages:
  foo_bar:
    matrix:
      first: ${foo.bar}
      second: [x, y, z]
    <<: &stage_foo
      cmd: ...
      ...

  foo_baz:
    matrix:
      first: ${foo.baz}
      second: [x, y, z]
    <<: *stage_foo

Could there would be a way to specify such a "ragged matrix" as a single foo stage, where different slices have different numbers of elements? i.e. the full list of stages would be

  • foo@firsta-secondx
  • foo@firsta-secondy
  • foo@firsta-secondz
  • foo@firstb-secondx
  • foo@firstb-secondy
  • ...
  • foo@firste-secondz
  • foo@first1-secondx
  • foo@first1-secondy
  • foo@first1-secondz
  • foo@first2-secondx
  • ...
  • foo@first3-secondz

@dberenbaum
Copy link
Collaborator

Hey @sjawhar, I think it's not possible even with this PR because you want to treat foo.bar and foo.baz like a single concatenated list, right? It's outside the scope of this PR, but it is something that's been discussed in #7151 and #9148.

@sjawhar
Copy link
Contributor

sjawhar commented Jul 13, 2023

Good point that this is essentially just concatenating the two lists. I wonder if it could be accomplished using hydra.

dvc/parsing/__init__.py Outdated Show resolved Hide resolved
data = {
"matrix": matrix,
"do": {
"cmd": "echo ${item.os} ${item.pyv} ${item.dict}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q) Is dict unpacking expected/supposed to work here (i.e. echo ${item])?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker, I think would be good to add a test for that

@dberenbaum
Copy link
Collaborator

@skshetry Are you okay with the suggestions in #9725 (comment)? Once we agree that, it's ready from a product perspective.

@skshetry
Copy link
Member Author

skshetry commented Jul 14, 2023

On both variable addressing, and matrix..do block, one advantage that I see with what you proposed, is that it makes it easy to convert a templated stage to a matrix.

Since matrix requires a dict, I like the simplicity of this, although we would need to handle conflicts with other vars.

item or matrix variable logically groups them, and avoids conflicts. Not only that, I think this logical grouping may come useful in the future if we ever implement env. and secrets. like namespaces. They are not so that complicated, either. If we implement directly addressing variables, we'd probably spend more time explaining "shadowing of variables".

I think the arguments are weak on both sides, but the grouping under item is a bit safer, so I am leaning towards it unless I hear more (strong) arguments for or against this.

The simplest solution would be to get rid of the do/in block altogether like:

matrix..do makes it a bit clear on what is a data source and what is being templated.
Does that structure help? I don't have a strong opinion here, though. Syntactically, we are also enforcing order, as otherwise, the flat map in yaml can be defined in any order.

On both of these issues, I don't have a strong opinion. But I do want to hear pros and cons of those changes before deciding on this.

Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @skshetry, agreed neither are true blockers, so I'm approving.

On both variable addressing, and matrix..do block, one advantage that I see with what you proposed, is that it makes it easy to convert a templated stage to a matrix.

Yes, I think it's enough reason to at least drop do since it saves you from having to make significant modifications to even a non-templated stage.

@daavoo
Copy link
Contributor

daavoo commented Jul 14, 2023

I think the arguments are weak on both sides, but the grouping under item is a bit safer, so I am leaning towards it unless I hear more (strong) arguments for or against this.

Maybe is just a plus for me, but on the side of grouping I like the possibility of using dict unpacking on item. Allows me to write less verbose cmd and to update the matrix entry without having to update cmd

@dberenbaum
Copy link
Collaborator

@daavoo I assume you would be fine with either item or matrix for that purpose?

@daavoo
Copy link
Contributor

daavoo commented Jul 14, 2023

@daavoo I assume you would be fine with either item or matrix for that purpose?

yes

@skshetry
Copy link
Member Author

skshetry commented Jul 14, 2023

One argument in favour of matrix is symmetry.

matrix:
  os: [windows, linux, macos]
  pyv: ["3.9", "3.10", "3.11"]

You access os variables with matrix.os.

Is that enough of a reason to move away from item and make it confusing with foreach? That's something I am not quite sure of.

Copy link
Contributor

@daavoo daavoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's agree on the name/syntax (no strong opinion from me) and remember to update https://github.com/iterative/dvcyaml-schema

cc @iterative/vs-code for awareness of something new worth adding to snippets

@dberenbaum
Copy link
Collaborator

Personally I always found item hard to remember since it's completely unrelated to foreach.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jul 14, 2023

Dropping do also significantly simplifies the schema. Snippets and suggestions will be easier to follow.

@mattseddon
Copy link
Member

I think the simplest thing to do in VS Code is to drop the dvc-pipeline-foreach-stage snippet and not add a dvc-pipeline-matrix-stage one. This is because the new matrix syntax will only be supported for the most recent versions of DVC.

I think we can do this because foreach/matrix is a more advanced usage which people can find through the docs.

We can add the new snippet once some more time has passed. LMK what you think.

@skshetry
Copy link
Member Author

I'll leave this up for comments for a few days, and will merge by the end of this week.

@skshetry skshetry merged commit d1160a2 into iterative:main Aug 4, 2023
@skshetry skshetry deleted the matrix-do branch August 4, 2023 07:44
@dberenbaum
Copy link
Collaborator

Thanks @skshetry!

skshetry added a commit to iterative/dvc.org that referenced this pull request Aug 14, 2023
skshetry added a commit to iterative/dvc.org that referenced this pull request Aug 14, 2023
skshetry added a commit to iterative/dvc.org that referenced this pull request Aug 14, 2023
skshetry added a commit to iterative/dvc.org that referenced this pull request Aug 14, 2023
skshetry added a commit to iterative/dvc.org that referenced this pull request Aug 17, 2023
* guide: document matrix in dvc.yaml

fixes #4741
upstream PR: iterative/dvc#9725
available since https://github.com/iterative/dvc/releases/3.12.0

* Apply suggestions from code review

Co-authored-by: Dave Berenbaum <[email protected]>

* Restyled by prettier (#4765)

Co-authored-by: Restyled.io <[email protected]>

* redirect from foreach to matrix; add a complete example for matrix with templating

* format matrix stage list

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
skshetry added a commit to iterative/dvcyaml-schema that referenced this pull request Aug 17, 2023
skshetry added a commit to iterative/dvcyaml-schema that referenced this pull request Aug 18, 2023
* schema for matrix

See iterative/dvc#9725.

* add invalid examples
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature is a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nested Looping in Parameterization
5 participants