Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: fallback order of execution for independent stages #5181

Closed
jorgeorpinel opened this issue Dec 30, 2020 · 15 comments
Closed

repro: fallback order of execution for independent stages #5181

jorgeorpinel opened this issue Dec 30, 2020 · 15 comments
Labels
discussion requires active participation to reach a conclusion question I have a question?

Comments

@jorgeorpinel
Copy link
Contributor

Extracted from #5165
Maybe related to dvc stages - see #3743 (comment)

I understand that DVC needs to build a DAG in order to establish an execution order for stages. And that otherwise, "we don't guarantee any order". That's all good but

  1. What is the underlying order, is it random or unknown to us? and
  2. Should there be a "fallback" exec order for non-connected stages? Esp. considering new features like multiple cmd per stage and 2.0 parameterization of dvc.yaml (foreach stages).

⌛ I'll bring some comments that can be discussed from other tickets shortly... ⌛

@jorgeorpinel jorgeorpinel added question I have a question? discussion requires active participation to reach a conclusion labels Dec 30, 2020
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 30, 2020

Other than that (DAG order), it might be difficult to promise to order (and, not that it matters).
Lock file has its own way of ordering things wherever it matters, it does not have to work the way you expect.

@skshetry what do you mean the default order doesn't matter? How does DVC determine when it matters and when it doesn't? I imagine that kind of semantics can only be decided by each user.

Lock file has its own way of ordering

So there is an order? May I know what that is please? I've noticed that changing stage attributes (cmd, outs, etc.) can result in a different exec orders, for example:

stages:
  a:
    cmd: echo a
  b:
    cmd: echo b

Runs a, then b (intuitive)

stages:
  b:
    cmd: echo b
  a:
    cmd: echo a

Runs b, then a (also intuitive)

But for more complex dvc.yaml files sometimes the order changes unintuitively based on the stage properties (which seems unexpected).

@jorgeorpinel

This comment has been minimized.

@shcheklein
Copy link
Member

@jorgeorpinel DVC does have some order of course internally. But it's an implementation detail. We don't specify any particular order for execution. I doubt that we specify any for dvc.lock (may be only trying to keep it stable to make diffs less painful, but even that should not be a defined/guaranteed behavior).

What is the underlying order, is it random or unknown to us? and

so, this is not "important" in a sense that we should not document it

Should there be a "fallback" exec order for non-connected stages?

as I mentioned, I would wait for a good use case for this. But even in that case we might decide to make ordering explicit via some language constructs.

Esp. considering new features like multiple cmd per stage and 2.0 parameterization of dvc.yaml (foreach stages).

not sure, how is it related tbh. Could you clarify your point please?

@jorgeorpinel jorgeorpinel added the awaiting response we are waiting for your reply, please respond! :) label Dec 30, 2020
@skshetry
Copy link
Member

skshetry commented Dec 30, 2020

what do you mean the default order doesn't matter?

If they are independent nodes, DVC is free to run them in any order it likes. Internally we use set when collecting stages and when running graph checks. So, it might not be easy to support this. By writing stages in dvc.yaml, user does not imply the ordering, as it is implicit based on DAG.

So there is an order?

Lockfile's order should always be independent of the user-facing declarations. Users can easily change the order of deps and outs in the dvc.yaml file. It should not affect the lockfile unless the actual contents were changed.

DVC always prefers to dump alphabetically in the lock file. If they are a list, they are sorted. In the case of params section in lock file, we always keep the params.yaml file at the top and other files are ordered alphabetically. The key, value pairs are also alphabetically sorted (not recursively though). And, the order of keys cmd/deps/params/outs are always the same.

This will be clear if you look at the example-get-started's dvc.lock.

I doubt that we specify any for dvc.lock

@shcheklein, we guarantee the ordering of the entries for the stage and treat it as a bug if they are not stable on repeated generation (except deeply-nested params). But, as you said, it's more for not making diff painful than actual functionality.

But, regarding the ordering of stages' entries, it does not matter and I don't see why it should. As I said, lock file has it's own way of ordering things, it does not have to reflect how you have written in dvc.yaml file.

Also, by having a foreach data, you are not implying any order. What if the user changed the order of the list later? Should we regenerate the ordering of the lock entries? What if user moved the entry in dvc.yaml to somewhere else? Should we reorder the lock entries then too?

@skshetry
Copy link
Member

skshetry commented Jan 4, 2021

Closing as I don't think there's anything for us to do here.

@skshetry skshetry closed this as completed Jan 4, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Jan 6, 2021

I doubt that we specify any (order) for dvc.lock

we guarantee the ordering of the entries for the stage

@shcheklein also for other things like loading vars we respect the order in which the user wrote the file. Same for multiple commands (cmd) and (maybe?) foreach items (that's how those things are related to this discussion, it's kind of about consistency/usability).

trying to keep it stable to make diffs less painful

Agree on this. I think that stability can be an important benefit or even an assumed expectation by most users.

I would wait for a good use case for this

I'm not saying this should be part of 2.0 release of anything like that. It could even be a nice-to-have to my mind. More than a specific use case, I'm saying that it's inconsistent or at least unintuitive to respect the order of some lists in dvc.yaml but not others. I'm assuming (maybe wrong) that users will expect otherwise.

Of course there's deps and outs where it doesn't matter but in that case it IS intuitive IMO (as there can NOT be any order for deps/outs even if we wanted). Cc @skshetry

If they are independent nodes, DVC is free to run them in any order it likes

@skshetry yes, I understand that is the current view. But I'm trying to put myself in the user's shoes and some of them may expect otherwise (part of QA is hypothetical remember? 🙂)

DVC always prefers to dump alphabetically in the lock file. If they are a list, they are sorted..

OK good to know, thanks for the info. Per Ivan's comment I won't be documenting that but at least people can find that detail here now, if they really need it.

by having a foreach data, you are not implying any order. What if the user changed the order of the list later? Should we regenerate the ordering of the lock entries?

Yes, I think so. dvc.lock is completely regenerated by repro anyway right? So ther order of foreach items isn't guaranteed? That seems quite unintuitive to me.

What if user moved the entry in dvc.yaml to somewhere else?

That's a good Q. Multiple dvc.yaml files definitely complicate this matter. I'm not sure.

@skshetry
Copy link
Member

skshetry commented Jan 6, 2021

Multiple dvc.yaml files definitely complicate this matter. I'm not sure.

What if you move/reorder it within the dvc.yaml?

@skshetry skshetry reopened this Jan 6, 2021
@skshetry skshetry removed the awaiting response we are waiting for your reply, please respond! :) label Jan 6, 2021
@jorgeorpinel
Copy link
Contributor Author

I argue that the order in which the user enters stages (As well as foreach items, which become stages) could easily be respected by default, yes.

@skshetry
Copy link
Member

skshetry commented Jan 6, 2021

also for other things like loading vars we respect the order

vars is a different story. And, note that the ordering of vars does not matter at all.

It is true that internally we try to load it as such, but if there's no overlap/overwriting, externally it can be considered a single atomic loading, which has no relation to the ordering.

And, of course, overlap/overwrite is an error.

@pmrowla
Copy link
Contributor

pmrowla commented Jan 6, 2021

We should not guarantee order of execution for independent stages. If/when we eventually support parallel execution for stages within a pipeline, we will not be able to guarantee order of execution and completion for independent stages at all. This is still the case even for foreach items (jobs may be queued "in order", but there is no guarantee that they would complete in order).

If users are conditioned to rely on undefined behavior for order of execution in current DVC releases, it would almost certainly break things for them in the future.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Jan 22, 2021

vars is a different story.

That may be different (internally). But what about multiple commands (cmd)? (Rhetorical) The point is that we respect the user's ordering for some things.

If users are conditioned to rely on undefined behavior for order of execution in current DVC releases...

If the fallback exec order shouldn't matter then changing it shouldn't break anything. The implication that users may indeed care about independent stage exec order seems to support my case 🙂

If/when we eventually support parallel execution for stages within a pipeline, we will not be able to guarantee order of execution and completion

I hear you. I think that the key is indeed in distinguishing between queueing vs. completing execution. I agree that we shouldn't aim to guarantee the order of completion for independent stages. That's not my proposal.

To wrap this up, all I'm saying is that:

Fact: We have a certain (mostly alphabetical but kind of obscure) fallback stage "queuing" order for indie stages (really the order in which they're written to dvc.lock — detailed in #5181 (comment)).

Suggestion: Can we try to use the user's explicit order instead/first? It may be expected or even important for some users, especially in serial (non-parallel) mode. This includes foreach items.

Cc @dberenbaum not sure if you're interested on this one but PING just in case.


All that said, if users haven't shown confusion over this, I understand we may want to leave it for later or close this until that happens. Up to you. Thanks!

@jorgeorpinel jorgeorpinel changed the title repro: default order of execution? repro: fallback order of execution for independent stages Jan 22, 2021
@pmrowla
Copy link
Contributor

pmrowla commented Jan 22, 2021

If the fallback exec order shouldn't matter then changing it shouldn't break anything. The implication that users may indeed care about independent stage exec order seems to support my case 🙂

I hear you. I think that the key is indeed in distinguishing between queueing vs. completing execution. I agree that we shouldn't aim to guarantee the order of completion for independent stages. That's not my proposal.

Queuing order is an implementation detail that should not matter for independent stages.

Suggestion: Can we try to use the user's explicit order instead/first? It may be expected or even important for some users, especially in serial (non-parallel) mode. This includes foreach items.

This should not be expected at all though.

The point is that there should not be a reason for users to depend on specific execution order for independent stages. If the order of execution is important for a user, it means that the stages are not actually independent, and the later stage is dependent on the earlier stage. In this case the user should be directed to build their stage dependency graph properly rather than relying on undefined implementation specific behavior in DVC.

@efiop
Copy link
Contributor

efiop commented Jan 23, 2021

I agree with @skshetry and @pmrowla , this is an implementation detail. Closing for now.

@efiop efiop closed this as completed Jan 23, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 9, 2021

To me this seems like we're being kind of opinionated as to how users should employ DVC, while my impression at least is that DVC tries to avoid that in principle. Why are we determining what should or not matter to users?

Queuing order is an implementation detail that should not matter at all for independent stages.

I assume people will care about how independent stages are queued. Especially in foreach stages etc. (as discussed). But I may be wrong.

It's already implemented like this so OK, I agree we need a better reason to reconsider. Thanks!

@jorgeorpinel
Copy link
Contributor Author

For the record, a scenario this could help in is actually an old issue that has been open for years, see #2378 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion question I have a question?
Projects
None yet
Development

No branches or pull requests

5 participants