Cloud execution #3
1. **Run exp.** Allocate a cloud instance (with a specified configuration) and
   execute an ML experiment `dvc exp run` (or `dvc repro`) on it.
2. **Run stage.** Allocate a cloud instance and execute a particular stage of
   an experiment on it.
3. **Hot instance.** Execute on an existing running instance without instance
   allocation.
4. **Remote machine.** Execute on a remote machine (a cloud or on-premises one).
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example)
   in an instance pool or on a single instance.
7. **Web run.** Execute an experiment from SaaS/Web using one of the methods
   above and DVC machinery. (A command sketch for a couple of these scenarios
   follows this list.)
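A rough sketch of how scenarios 1 and 6 might look on the command line. The `--executor` option and the names `my-gpu-tesla`/`my-pool` are hypothetical placeholders from this proposal, not an existing DVC interface; `dvc exp run --queue`/`--run-all` already exist:

```bash
# Scenario 1 (run exp): execute a full experiment on an allocated instance
$ dvc exp run --executor my-gpu-tesla        # --executor is proposed, not shipped

# Scenario 6 (exp queue): queue experiments, then run them on a pool
$ dvc exp run --queue -S train.lr=0.01
$ dvc exp run --queue -S train.lr=0.001
$ dvc exp run --run-all --executor my-pool   # again, --executor is hypothetical
```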
---
Are there different categories of scenarios? Some of these overlap. For example, a web run could apply to any of the other scenarios. To support all of these, it seems like these specs would be needed:

- Compute resource spec
  - Provisioning: new vs. existing
  - Location: cloud vs. on-premises
  - Count: single instance vs. pool
- DVC execution spec
  - Stages: single stage vs. full pipeline
  - Experiments: single vs. multiple
  - Entrypoint: local vs. SaaS/web

Does that seem accurate? And is that a useful way to organize it?
---
@dberenbaum I think the scenarios can definitely be grouped. This is probably a non-exhaustive list of use cases.

I like the categories you summarized but wouldn't see them as specs (if you meant specs to implement the feature). The user will probably care about these distinctions, but they may not be relevant for the tool, except perhaps whether running on a single instance or several (IF this includes parallel execution).
This is great for how to teach this, though!
---
@dberenbaum some regrouping might be needed. But in some cases, such grouping might not be relevant, as @jorgeorpinel mentioned. An example: an on-premises run is a prerequisite of a cloud one. I'd keep the whole list for now and try to cut the scope when we start implementing this.
---
No problem, this just helps me organize it better in my head.
1. **Spot instances.**
2. **Transparent spot instance.** Recover the execution if a spot instance
   was terminated. DVC checkpoints and pipeline stages should be used for
   preserving the state.
3. **Volumes.** Volumes can be attached and reused in instances to minimize
   data cache synchronization time.
4. **Shared volumes.** Separate cloud services (such as Multi-Attach EBS or
   EFS) might be needed for sharing data cache between multiple instances.
   (A configuration sketch follows this list.)
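For illustration, here is how optimizations 1 and 3 might surface in configuration, assuming the hypothetical `dvc executor` subcommand proposed later in this thread; the `spot` and `volume` option names are invented here:

```bash
# Hypothetical: define an executor, request a spot instance, re-attach a volume
$ dvc executor add my-gpu-tesla aws               # proposed subcommand, not shipped
$ dvc executor modify my-gpu-tesla spot true      # optimization 1: spot instances
$ dvc executor modify my-gpu-tesla volume abcdef  # optimization 3: reuse a data volume
```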
---
Should we separate compute resource types (1 and 2) and storage resource types (3 and 4)?
---
Sure. But for now, the list of optimizations seems too small to warrant separation or regrouping. Also, some optimizations might be related - like re-attaching a volume when a spot instance fails.
1. AWS
2. Azure
3. GCP (optional)
4. Remote over SSH (see the sketch after this list)
5. Kubernetes (K8S)
6. HashiCorp Nomad (optional)
7. Container services (ECS and Azure/GCP analogs)
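As a sketch of platform 4 (remote over SSH), setup could mirror how remotes are added today; both the `dvc executor` subcommand and the `--executor` option are hypothetical at this point:

```bash
# Hypothetical: register an existing on-premises machine and run on it
$ dvc executor add on-prem-box ssh://user@10.0.0.5
$ dvc exp run --executor on-prem-box
```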
---
Seems like there are a few different categories here also:
- Instance provisioning (1-3)
- Managing existing resources (4)
- Cluster provisioning (5-7)
---
I list them in priority order from the user's point of view.
Do you think the categories will help? There might be many ways of categorizing. From an implementation point of view:
- (4) is a prerequisite for all others
- among the clouds (1-3), GCP (3) is "special"
- (5)/K8S should be similar to clouds (1-3) with some "specialty"
---
These categories are all helpful for me, especially since these lists have items of different types, which confuses me if I don't break them down. Again, not important as long as all the scenarios are clear to everyone.
Great stuff. A few unresolved questions I have:
Related feature request regarding exp queue management: iterative/dvc#5615
We need to introduce the concept of an **Executor** to DVC.
This concept should not break compatibility and should not require
significant changes in the docs (except command options).
However, it will require a new section in the docs to explain cloud
execution for the users who'd like to use this feature.
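For concreteness, a minimal sketch of the Executor concept from the CLI side; only the first command exists today, and the `--executor` option is this proposal's hypothetical syntax:

```bash
# Today: the experiment runs on the local machine (the implicit default executor)
$ dvc exp run

# Proposed: the same pipeline, executed on a named remote executor
$ dvc exp run --executor my-gpu-tesla
```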
---
If we are to abstract the execution environment, maybe we could start education from the "local" executor - the machine the user is using at the moment. I feel like explaining remotes starting from local ones made it easier for me.
---
> start education from "local" executor

Good idea, unless that implies significant changes to docs (e.g. having to use the local executor concept everywhere `repro`/`run` are mentioned). BTW I don't think we should avoid such big changes at all costs, just not sure it's necessary here. It may make things confusing for people looking for a basic local workflow.
---
The "local" executor already exists (exp run --temp
), but we have already run into potential problems with making this a "default" behavior. Namely that debugging your pipeline becomes very difficult when the changes are not happening in your normal workspace.
So I think we would need to clarify what we want to do regarding educating the user about non-workspace execution environments.
---
The "local" executor already exists (exp run --temp)
Good point. Local (to the env) but external (to the project)? Like "local remotes".
UPDATE: Actually I'm not sure. Maybe the local machine itself IS the executor, whether on a regular repro
or an exp run --temp
. Otherwise what exactly is an executor? 😅
---
@pared could you please clarify? Do you mean abstracting environments like pip/conda packages, or just the directory (with `--temp` as @pmrowla mentioned)?

> Maybe the local machine itself IS the executor

@jorgeorpinel yes, it is the default executor. @pmrowla's point is that we can abstract it out to a separate directory. But this is what `exp run --queue` does.
---
@dmpetrov Not quite, I was rather thinking that we are already executing locally. In a way, our local environment is an executor, so what I thought would be good is to teach this concept by saying "you are already using a (local) executor - when using `--temp`". So as an analogy: when you use remote executors, you will do a similar thing as with `--temp` and the local one - with all its pros and caveats.
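For reference, the existing local-executor behavior this analogy builds on (these commands exist in DVC as of this discussion):

```bash
# Runs the experiment in a temporary directory outside the workspace,
# leaving the workspace untouched; results come back as experiment refs
$ dvc exp run --temp

# The same mechanism backs queued runs
$ dvc exp run --queue -S train.lr=0.01
$ dvc exp run --run-all
```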
cloud-exec.md (outdated):
# Drawbacks

This feature opens new user scenarios, which increases the complexity of DVC;
it will require new sections in the docs, etc.

It might be beneficial to extract as much of this functionality as possible into
external products and tools (Terraform providers, for example).
---
The doc opens with "Users need to run ML experiments in cloud instances or remote machines. DVC should support this out of the box." but this part seems to question that. My first Q is to clarify or decide on that: do we want the `dvc` CLI to have remote execution?

If it were possible to elaborate a little on a hypothetical external tool as an alternative (e.g. a Terraform provider), that could help. And what about putting it into `dvcx` (which, once installed, could extend the `dvc` CLI) or into another (perhaps GUI-based) tool from Iterative?
---
Modified the intro and this section.
---
Thanks. I'd resolve this (and others) but I can't on this repo.
It can look like this:

```bash
$ dvc exp run --executor my-gpu-tesla
...
```
---
Would executors get set up like remotes?

$ dvc executor add my-gpu-tesla ssh://[email protected]
$ dvc executor modify my-gpu-tesla pword 'myp4$$'
$ dvc exp run --on my-gpu-tesla
---
Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?
---
> Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?

My assumption here would be that the results of an experiment are synced back into the workspace ("results" meaning git-tracked repo state). Current `--temp` exp runs are already (local) executors, and experiment execution is already structured so that it would work with any machine we can talk to via git+ssh. So the git-tracked results of an executor run will be retrieved as experiment refs (like with existing local `--temp` runs).

For DVC-tracked (cached) outputs, they would be fetched into the local cache. This could be done by either fetching the data directly from the executor machine, or by using an intermediate DVC remote (so after a run, the executor does `dvc push` to a DVC remote, and then the user's local machine does `dvc pull`). The final `dvc pull` could actually be optional here, since the user may not need the cache data on their local machine at all.

Using the intermediate remote seems like it would fit better to me, but I think that needs some more discussion.
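A sketch of the intermediate-remote flow described above. The executor/user split is from this comment; the remote name `storage` and the experiment-name placeholder are illustrative assumptions:

```bash
# On the executor, after the run finishes:
$ dvc push -r storage                       # upload cached outputs to the remote
$ git push origin 'refs/exps/*:refs/exps/*' # experiment refs travel back over git

# On the user's machine:
$ dvc exp pull origin <exp-name>            # retrieve git-tracked experiment results
$ dvc pull -r storage                       # optional: fetch cached outputs locally
```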
---
Hmmmm... `exp run --temp` seems to behave more like a remote executor to me. Except that "remote" in this case means "external" (again, like "local remotes", which has never been a great name BTW). But what's the point here? That we can reuse that implementation? If so, sounds good!

Also agree about the assumption and proposed mechanics, except: what if the user hasn't set up any remote storage to use as the intermediate step? Will there be some default ones maintained by us (similar to the default Google Cloud Project)? In any case, a direct transfer may have better performance?
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example)
   in an instance pool or on a single instance.
---
To what extent would DVC utilize pools? Are we thinking of parallel execution / distributed computing? Or for now focusing on single instances (which may be obtained from a pool)?
---
Single instance or a pool of instances for parallel execution. But it is not about distributed learning (if I understood the question correctly).
@dberenbaum all questions are great!

Push data constantly 😄 We might need some optimizations in the future, like a tmp remote dir.

An option - `commit`

DVC should support a queue. A more interesting question - how to support a shared queue among multiple users. K8S/Nomad can be an answer, but it seems a bit heavy.

🤷‍♂️ I'd not define the API here.
I tried to address some of the questions. Please take a look at the answers and the new version.

---

Thanks for the responses! What are your thoughts on copying the proposal to Notion? I can see as a reviewer now how it's a bit difficult to view the text and keep up with changes! Ideally, we are editing the document as we resolve comments to either incorporate the feedback or address why it wasn't incorporated (unless suggestions are minor or formatting issues like the list categories above).
The executor definition (a `p3.large` AWS instance in `us-west` with the `abcdef`
volume attached) should be decoupled from the pipeline definition the same way
as remotes are: stage `train` runs on `my-gpu-tesla`, not on an executor
definition with `p3.large`.
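A sketch of that decoupling, using the hypothetical `dvc executor` subcommand from this thread; every option name here is invented for illustration:

```bash
# Infrastructure detail lives only in the executor definition
$ dvc executor add my-gpu-tesla aws
$ dvc executor modify my-gpu-tesla instance p3.large
$ dvc executor modify my-gpu-tesla region us-west
$ dvc executor modify my-gpu-tesla volume abcdef

# The pipeline and experiment history only reference the logical name
$ dvc exp run --executor my-gpu-tesla
```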
---
If we define it in `dvc.yaml`, it makes it hard to run elsewhere (like locally), and it couples experiment history and infrastructure (similar to problems with changing remotes today). Maybe having a local config option or an `exp run --executor` flag is sufficient flexibility here? What would we do differently with remotes if we were starting from scratch?
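For comparison, the "local config option" idea could mirror how a repo-local default remote is set today; `dvc config --local` is real, while the `exec.executor` key is invented here:

```bash
# Real today: a repo-local default remote, kept out of git history
$ dvc config --local core.remote storage

# Hypothetical analog for executors (key name invented for illustration)
$ dvc config --local exec.executor my-gpu-tesla
```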
---

Hi, the option to run (queued) experiments in parallel in the cloud would be amazing. Looking forward to it! Seeing the discussion of executor definitions including AWS EC2 instance types, I wonder if you have considered supporting AWS Batch and/or Ray Clusters as executors? These are the two means we use in our research group for running experiments in the cloud. An integration with DVC would be great, and furthermore would separate management of AWS resources from DVC. Otherwise, we may have to have yet another set of EC2 roles specifically for DVC cloud executors, which is undesirable.