repro: fallback order of execution for independent stages #5181
Comments
@skshetry what do you mean the default order doesn't matter? How does DVC determine when it matters and when it doesn't? I imagine that kind of semantics can only be decided by each user.
So there is an order? May I know what that is please? I've noticed that changing stage attributes (

stages:
  a:
    cmd: echo a
  b:
    cmd: echo b

Runs a, then b (intuitive)

stages:
  b:
    cmd: echo b
  a:
    cmd: echo a

Runs b, then a (also intuitive)

But for more complex dvc.yaml files sometimes the order changes unintuitively based on the stage properties (which seems unexpected).
@jorgeorpinel DVC does have some order of course internally. But it's an implementation detail. We don't specify any particular order for execution. I doubt that we specify any for
so, this is not "important" in a sense that we should not document it
as I mentioned, I would wait for a good use case for this. But even in that case we might decide to make ordering explicit via some language constructs.
Not sure how it is related, tbh. Could you clarify your point please?
If they are independent nodes, DVC is free to run them in any order it likes. Internally we use
Lockfile's order should always be independent of the user-facing declarations. Users can easily change the order of DVC always prefers to dump alphabetically in the lock file. If they are a list, they are sorted. In the case of This will be clear if you look at the
@shcheklein, we guarantee the ordering of the entries for the stage and treat it as a bug if they are not stable on repeated generation (except deeply-nested params). But, as you said, it's more for not making But, regarding the ordering of stages' entries, it does not matter and I don't see why it should. As I said, the lock file has its own way of ordering things; it does not have to reflect how you have written in Also, by having a
Closing as I don't think there's anything for us to do here.
@shcheklein also for other things like loading
Agree on this. I think that stability can be an important benefit or even an assumed expectation by most users.
I'm not saying this should be part of the 2.0 release or anything like that. It could even be a Of course there's
@skshetry yes, I understand that is the current view. But I'm trying to put myself in the user's shoes and some of them may expect otherwise (part of QA is hypothetical remember? 🙂)
OK good to know, thanks for the info. Per Ivan's comment I won't be documenting that but at least people can find that detail here now, if they really need it.
Yes, I think so. dvc.lock is completely regenerated by repro anyway, right? So the order of foreach items isn't guaranteed? That seems quite unintuitive to me.
That's a good Q. Multiple dvc.yaml files definitely complicate this matter. I'm not sure.
What if you move/reorder it within the dvc.yaml?
I argue that the order in which the user enters stages (as well as foreach items, which become stages) could easily be respected by default, yes.
It is true that internally we try to load it as such, but if there's no overlap/overwriting, externally it can be considered a single atomic loading, which has no relation to the ordering. And, of course, overlap/overwrite is an error.
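To illustrate the point about loading: a minimal sketch, assuming "loading" means combining key/value definitions from several sources. The function name `merge_definitions` and the sample keys are hypothetical and are not DVC's API. If any overlap is rejected as an error, the merged result does not depend on the order in which the sources are read.

```python
def merge_definitions(*sources):
    """Merge dicts from several sources, treating any overlapping key as an error."""
    merged = {}
    for source in sources:
        overlap = merged.keys() & source.keys()
        if overlap:
            # Overwriting is rejected, so no source can "win" by being loaded later.
            raise ValueError(f"duplicate keys: {sorted(overlap)}")
        merged.update(source)
    return merged


# Hypothetical sample data: the same content is produced in either load order.
first = {"lr": 0.1}
second = {"epochs": 10}
assert merge_definitions(first, second) == merge_definitions(second, first)
```

Because conflicts fail instead of resolving silently, load order stays an internal detail that callers cannot observe.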
We should not guarantee order of execution for independent stages. If/when we eventually support parallel execution for stages within a pipeline, we will not be able to guarantee order of execution and completion for independent stages at all. This is still the case even for If users are conditioned to rely on undefined behavior for order of execution in current DVC releases, it would almost certainly break things for them in the future.
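The parallel-execution concern can be sketched in a few lines. This is not how DVC runs stages; `run_stage` and `run_parallel` are hypothetical stand-ins that only sleep. The sketch assumes a thread pool with one worker per stage, and shows that stages submitted in declaration order can still finish in a different order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_stage(name, seconds):
    """Stand-in for running a stage's command: sleep, then return the stage name."""
    time.sleep(seconds)
    return name


def run_parallel(stages):
    """Submit stages in declaration order; return their names in completion order."""
    with ThreadPoolExecutor(max_workers=len(stages)) as pool:
        futures = [pool.submit(run_stage, name, secs) for name, secs in stages.items()]
        # as_completed yields futures as they finish, not as they were submitted.
        return [future.result() for future in as_completed(futures)]


# "slow" is declared (and submitted) first, but nothing forces it to finish first.
completed = run_parallel({"slow": 0.3, "fast": 0.0})
```

Any workflow that relies on "slow" finishing before "fast" simply because it is declared first would break under a scheme like this.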
That may be different (internally). But what about multiple commands (
If the fallback exec order shouldn't matter then changing it shouldn't break anything. The implication that users may indeed care about independent stage exec order seems to support my case 🙂
I hear you. I think that the key is indeed in distinguishing between queueing vs. completing execution. I agree that we shouldn't aim to guarantee the order of completion for independent stages. That's not my proposal.
To wrap this up, all I'm saying is that:
Fact: We have a certain (mostly alphabetical but kind of obscure) fallback stage "queuing" order for indie stages (really the order in which they're written to dvc.lock; detailed in #5181 (comment)).
Suggestion: Can we try to use the user's explicit order instead/first? It may be expected or even important for some users, especially in serial (non-parallel) mode. This includes
Cc @dberenbaum not sure if you're interested in this one but PING just in case.
All that said, if users haven't shown confusion over this, I understand we may want to leave it for later or close this until that happens. Up to you. Thanks!
Queuing order is an implementation detail that should not matter for independent stages.
This should not be expected at all though. The point is that there should not be a reason for users to depend on a specific execution order for independent stages. If the order of execution is important for a user, it means that the stages are not actually independent, and the later stage is dependent on the earlier stage. In this case the user should be directed to build their stage dependency graph properly rather than relying on undefined, implementation-specific behavior in DVC.
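The argument above can be made concrete with a small sketch. It uses Python's standard `graphlib` module (Python 3.9+) rather than whatever DVC uses internally, and `execution_order` is a hypothetical helper. With no edge between two stages, either order is a valid topological order; once the dependency is declared, only one order remains.

```python
from graphlib import TopologicalSorter


def execution_order(deps):
    """Return one valid execution order for a {stage: set of upstream stages} map."""
    return list(TopologicalSorter(deps).static_order())


# Independent stages: both [a, b] and [b, a] satisfy the (empty) constraints,
# so which one comes back is an implementation detail.
independent = {"a": set(), "b": set()}
any_order = execution_order(independent)

# Explicit dependency: b lists a as upstream, so a must run before b.
dependent = {"a": set(), "b": {"a"}}
order = execution_order(dependent)
assert order.index("a") < order.index("b")
```

In dvc.yaml terms, this corresponds to having the later stage list an output of the earlier stage among its dependencies, so that the required order is encoded in the graph rather than in the order of declaration.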
To me this seems like we're being kind of opinionated as to how users should employ DVC, while my impression at least is that DVC tries to avoid that in principle. Why are we determining what should or not matter to users?
I assume people will care about how independent stages are queued. Especially in foreach stages etc. (as discussed). But I may be wrong. It's already implemented like this so OK, I agree we need a better reason to reconsider. Thanks!
For the record, a scenario this could help in is actually an old issue that has been open for years, see #2378 (comment).
I understand that DVC needs to build a DAG in order to establish an execution order for stages. And that otherwise, "we don't guarantee any order". That's all good but cmd per stage and 2.0 parameterization of dvc.yaml (foreach stages).
⌛ I'll bring some comments that can be discussed from other tickets shortly... ⌛