Cloud execution #3
1. **Run exp.** Allocate a cloud instance (with a specified configuration) and
   execute an ML experiment `dvc exp run` (or `dvc repro`) on it.
2. **Run stage.** Allocate a cloud instance and execute a particular stage of
   an experiment on it.
3. **Hot instance.** Execute on an existing running instance without instance
   allocation.
4. **Remote machine.** Execute on a remote machine (a cloud or on-premises one).
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example)
   in an instance pool or on a single instance.
7. **Web run.** Execute an experiment from SaaS/Web using one of the methods
   above and DVC machinery. (A command sketch for a couple of these scenarios
   follows this list.)
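A rough sketch of how scenarios 1 and 6 might look on the command line. The `--executor` option and the names `my-gpu-tesla`/`my-pool` are hypothetical placeholders from this proposal, not an existing DVC interface; `dvc exp run --queue`/`--run-all` already exist:

```bash
# Scenario 1 (run exp): execute a full experiment on an allocated instance
$ dvc exp run --executor my-gpu-tesla        # --executor is proposed, not shipped

# Scenario 6 (exp queue): queue experiments, then run them on a pool
$ dvc exp run --queue -S train.lr=0.01
$ dvc exp run --queue -S train.lr=0.001
$ dvc exp run --run-all --executor my-pool   # again, --executor is hypothetical
```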
---
Are there different categories of scenarios? Some of these overlap. For example, a web run could apply to any of the other scenarios. To support all of these, it seems like these specs would be needed:

- Compute resource spec
  - Provisioning: new vs. existing
  - Location: cloud vs. on-premises
  - Count: single instance vs. pool
- DVC execution spec
  - Stages: single stage vs. full pipeline
  - Experiments: single vs. multiple
  - Entrypoint: local vs. SaaS/web

Does that seem accurate? And is that a useful way to organize it?
---
@dberenbaum I think the scenarios can definitely be grouped. This is probably a non-exhaustive list of use cases.

I like the categories you summarized but wouldn't see them as specs (if you meant specs to implement the feature). The user will probably care about these distinctions, but they may not be relevant for the tool, except perhaps whether running on a single instance or several (IF this includes parallel execution).
This is great for how to teach this, though!
---
@dberenbaum some regrouping might be needed. But in some cases, such grouping might not be relevant, as @jorgeorpinel mentioned. An example: an on-premises run is a prerequisite of a cloud one. I'd keep the whole list for now and try to cut the scope when we start implementing this.
---
No problem, this just helps me organize it better in my head.
1. **Spot instances.**
2. **Transparent spot instance.** Recover the execution if a spot instance
   was terminated. DVC checkpoints and pipeline stages should be used for
   preserving the state.
3. **Volumes.** Volumes can be attached and reused in instances to minimize
   data cache synchronization time.
4. **Shared volumes.** Separate cloud services (such as Multi-Attach EBS or
   EFS) might be needed for sharing data cache between multiple instances.
   (A configuration sketch follows this list.)
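For illustration, here is how optimizations 1 and 3 might surface in configuration, assuming the hypothetical `dvc executor` subcommand proposed later in this thread; the `spot` and `volume` option names are invented here:

```bash
# Hypothetical: define an executor, request a spot instance, re-attach a volume
$ dvc executor add my-gpu-tesla aws               # proposed subcommand, not shipped
$ dvc executor modify my-gpu-tesla spot true      # optimization 1: spot instances
$ dvc executor modify my-gpu-tesla volume abcdef  # optimization 3: reuse a data volume
```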
---
Should we separate compute resource types (1 and 2) and storage resource types (3 and 4)?
---
Sure. But for now, the list of optimizations seems too small to warrant separation or regrouping. Also, some optimizations might be related - like re-attaching a volume when a spot instance fails.
1. AWS
2. Azure
3. GCP (optional)
4. Remote over SSH (see the sketch after this list)
5. Kubernetes (K8S)
6. HashiCorp Nomad (optional)
7. Container services (ECS and Azure/GCP analogs)
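As a sketch of platform 4 (remote over SSH), setup could mirror how remotes are added today; both the `dvc executor` subcommand and the `--executor` option are hypothetical at this point:

```bash
# Hypothetical: register an existing on-premises machine and run on it
$ dvc executor add on-prem-box ssh://user@10.0.0.5
$ dvc exp run --executor on-prem-box
```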
---
Seems like there are a few different categories here also:
- Instance provisioning (1-3)
- Managing existing resources (4)
- Cluster provisioning (5-7)
---
I list them in priority order from the user's point of view.
Do you think the categories will help? There might be many ways of categorizing. From an implementation point of view:
- (4) is a prerequisite for all others
- among the clouds (1-3), GCP (3) is "special"
- (5)/K8S should be similar to clouds (1-3) with some "specialty"
---
These categories are all helpful for me, especially since these lists have items of different types, which confuses me if I don't break them down. Again, not important as long as all the scenarios are clear to everyone.
Great stuff. A few unresolved questions I have:
Related feature request regarding exp queue management: iterative/dvc#5615
We need to introduce the concept of an **Executor** to DVC.
This concept should not break compatibility and should not require
significant changes in the docs (except command options).
However, it will require a new section in the docs to explain cloud
execution for the users who'd like to use this feature.
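For concreteness, a minimal sketch of the Executor concept from the CLI side; only the first command exists today, and the `--executor` option is this proposal's hypothetical syntax:

```bash
# Today: the experiment runs on the local machine (the implicit default executor)
$ dvc exp run

# Proposed: the same pipeline, executed on a named remote executor
$ dvc exp run --executor my-gpu-tesla
```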
---
If we are to abstract the execution environment, maybe we could start education from the "local" executor - the machine the user is using at the moment. I feel like explaining remotes starting from local ones made it easier for me.
---
> start education from "local" executor

Good idea, unless that implies significant changes to docs (e.g. having to use the local executor concept everywhere `repro`/`run` are mentioned). BTW I don't think we should avoid such big changes at all costs, just not sure it's necessary here. It may make things confusing for people looking for a basic local workflow.
---
The "local" executor already exists (exp run --temp
), but we have already run into potential problems with making this a "default" behavior. Namely that debugging your pipeline becomes very difficult when the changes are not happening in your normal workspace.
So I think we would need to clarify what we want to do regarding educating the user about non-workspace execution environments.
---
The "local" executor already exists (exp run --temp)
Good point. Local (to the env) but external (to the project)? Like "local remotes".
UPDATE: Actually I'm not sure. Maybe the local machine itself IS the executor, whether on a regular repro
or an exp run --temp
. Otherwise what exactly is an executor? 😅
---
@pared could you please clarify? Do you mean abstracting environments like pip/conda packages, or just the directory (with `--temp` as @pmrowla mentioned)?

> Maybe the local machine itself IS the executor

@jorgeorpinel yes, it is the default executor. @pmrowla's point is that we can abstract it out to a separate directory. But this is what `exp run --queue` does.
---
@dmpetrov Not quite, I was rather thinking that we are already executing locally. In a way, our local environment is an executor, so what I thought would be good is to teach this concept by saying "you are already using a (local) executor - when using `--temp`". So as an analogy: when you use remote executors, you will do a similar thing as with `--temp` and the local one - with all its pros and caveats.
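For reference, the existing local-executor behavior this analogy builds on (these commands exist in DVC as of this discussion):

```bash
# Runs the experiment in a temporary directory outside the workspace,
# leaving the workspace untouched; results come back as experiment refs
$ dvc exp run --temp

# The same mechanism backs queued runs
$ dvc exp run --queue -S train.lr=0.01
$ dvc exp run --run-all
```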
cloud-exec.md (outdated):
# Drawbacks

This feature opens new user scenarios, which increases the complexity of DVC;
it will require new sections in the docs, etc.

It might be beneficial to extract as much of this functionality as possible into
external products and tools (Terraform providers, for example).
---
The doc opens with "Users need to run ML experiments in cloud instances or remote machines. DVC should support this out of the box." but this part seems to question that. My first Q is to clarify or decide on that: do we want the `dvc` CLI to have remote execution?

If it were possible to elaborate a little on a hypothetical external tool as an alternative (e.g. a Terraform provider), that could help. And what about putting it into `dvcx` (which, once installed, could extend the `dvc` CLI) or into another (perhaps GUI-based) tool from Iterative?
---
Modified the intro and this section.
---
Thanks. I'd resolve this (and others) but I can't on this repo.
It can look like this:

```bash
$ dvc exp run --executor my-gpu-tesla
...
```
---
Would executors get set up like remotes?

$ dvc executor add my-gpu-tesla ssh://[email protected]
$ dvc executor modify my-gpu-tesla pword 'myp4$$'
$ dvc exp run --on my-gpu-tesla
---
Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?
---
> Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?

My assumption here would be that the results of an experiment are synced back into the workspace ("results" meaning git-tracked repo state). Current `--temp` exp runs are already (local) executors, and experiment execution is already structured so that it would work with any machine we can talk to via git+ssh. So the git-tracked results of an executor run will be retrieved as experiment refs (like with existing local `--temp` runs).

For DVC-tracked (cached) outputs, they would be fetched into the local cache. This could be done by either fetching the data directly from the executor machine, or by using an intermediate DVC remote (so after a run, the executor does `dvc push` to a DVC remote, and then the user's local machine does `dvc pull`). The final `dvc pull` could actually be optional here, since the user may not need the cache data on their local machine at all.

Using the intermediate remote seems like it would fit better to me, but I think that needs some more discussion.
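A sketch of the intermediate-remote flow described above. The executor/user split is from this comment; the remote name `storage` and the experiment-name placeholder are illustrative assumptions:

```bash
# On the executor, after the run finishes:
$ dvc push -r storage                       # upload cached outputs to the remote
$ git push origin 'refs/exps/*:refs/exps/*' # experiment refs travel back over git

# On the user's machine:
$ dvc exp pull origin <exp-name>            # retrieve git-tracked experiment results
$ dvc pull -r storage                       # optional: fetch cached outputs locally
```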
---
Hmmmm... `exp run --temp` seems to behave more like a remote executor to me. Except that "remote" in this case means "external" (again, like "local remotes", which has never been a great name BTW). But what's the point here? That we can reuse that implementation? If so, sounds good!

Also agree about the assumption and proposed mechanics, except: what if the user hasn't set up any remote storage to use as the intermediate step? Will there be some default ones maintained by us (similar to the default Google Cloud Project)? In any case, a direct transfer may have better performance?
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example)
   in an instance pool or on a single instance.
---
To what extent would DVC utilize pools? Are we thinking of parallel execution / distributed computing? Or for now focusing on single instances (which may be obtained from a pool)?
---
Single instance or a pool of instances for parallel execution. But it is not about distributed learning (if I understood the question correctly).
@dberenbaum all questions are great!

Push data constantly 😄 We might need some optimizations in the future, like a tmp remote dir.

An option - `commit`

DVC should support a queue. A more interesting question - how to support a shared queue among multiple users. K8S/Nomad can be an answer, but it seems a bit heavy.

🤷‍♂️ I'd not define the API here.
I tried to address some of the questions. Please take a look at the answers and the new version.

---

Thanks for the responses! What are your thoughts on copying the proposal to Notion? I can see as a reviewer now how it's a bit difficult to view the text and keep up with changes! Ideally, we are editing the document as we resolve comments to either incorporate the feedback or address why it wasn't incorporated (unless suggestions are minor or formatting issues like the list categories above).
The executor definition (a `p3.large` AWS instance in `us-west` with the `abcdef`
volume attached) should be decoupled from the pipeline definition the same way
as remotes are: stage `train` runs on `my-gpu-tesla`, not on an executor
definition with `p3.large`.
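A sketch of that decoupling, using the hypothetical `dvc executor` subcommand from this thread; every option name here is invented for illustration:

```bash
# Infrastructure detail lives only in the executor definition
$ dvc executor add my-gpu-tesla aws
$ dvc executor modify my-gpu-tesla instance p3.large
$ dvc executor modify my-gpu-tesla region us-west
$ dvc executor modify my-gpu-tesla volume abcdef

# The pipeline and experiment history only reference the logical name
$ dvc exp run --executor my-gpu-tesla
```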
---
If we define it in `dvc.yaml`, it makes it hard to run elsewhere (like locally), and it couples experiment history and infrastructure (similar to problems with changing remotes today). Maybe having a local config option or an `exp run --executor` flag is sufficient flexibility here? What would we do differently with remotes if we were starting from scratch?
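For comparison, the "local config option" idea could mirror how a repo-local default remote is set today; `dvc config --local` is real, while the `exec.executor` key is invented here:

```bash
# Real today: a repo-local default remote, kept out of git history
$ dvc config --local core.remote storage

# Hypothetical analog for executors (key name invented for illustration)
$ dvc config --local exec.executor my-gpu-tesla
```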
---

Hi, the option to run (queued) experiments in parallel in the cloud would be amazing. Looking forward to it! Seeing the discussion of executor definitions including AWS EC2 instance types, I wonder if you have considered supporting AWS Batch and/or Ray Clusters as executors? These are the two means we use in our research group for running experiments in the cloud. An integration with DVC would be great, and furthermore would separate management of AWS resources from DVC. Otherwise, we may have to have yet another set of EC2 roles specifically for DVC cloud executors, which is undesirable.