Cloud execution #3

Open · wants to merge 4 commits into main

Conversation

dmpetrov (Member)

No description provided.

@dmpetrov changed the title from [WIP] Cloud execution to Cloud execution on Mar 29, 2021
Comment on lines +27 to +38
1. **Run exp.** Allocate a cloud instance (with a specified configuration) and
   execute an ML experiment `dvc exp run` (or `dvc repro`) on it.
2. **Run stage.** Allocate a cloud instance and execute a particular stage of
   an experiment on it.
3. **Hot instance.** Execute on an existing running instance without instance
   allocation.
4. **Remote machine.** Execute on a remote machine (cloud or on-premise).
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example) on
   an instance pool or a single instance.
7. **Web run.** Execute an experiment from SaaS/Web using one of the methods
   above and DVC machinery.
Contributor

Are there different categories of scenarios? Some of these overlap. For example, a web run could apply to any of the other scenarios. To support all of these, it seems like these specs would be needed:

Compute resource spec

  • Provisioning: New vs. existing
  • Location: Cloud vs. on-premises
  • Count: Single instance vs. pool

DVC execution spec

  • Stages: Single stage vs. full pipeline
  • Experiments: Single vs. multiple
  • Entrypoint: Local vs. SaaS/web

Does that seem accurate? And is that a useful way to organize it?

Contributor

@dberenbaum I think the scenarios can def. be grouped. This is probably a non-exhaustive list of use cases.

I like the categories you summarized but wouldn't see them as specs (if you meant specs to implement the feature). The user will prob care about these distinctions, but they may not be relevant for the tool except perhaps whether running on a single or several instances (IF this includes parallel execution).

This is great for how to teach this, though!

Member Author

@dberenbaum some regrouping might be needed. But in some cases such grouping might not be relevant, as @jorgeorpinel mentioned. An example: an on-premise run is a prerequisite for cloud runs. I'd keep the whole list for now and try to cut the scope when we start implementing this.

Contributor

No problem, this just helps me organize it better in my head.

Comment on lines +44 to +51
1. **Spot instances.**
2. **Transparent spot instance.** Recover the execution if a spot instance
was terminated. DVC checkpoints and pipeline stages should be used for
preserving the state.
3. **Volumes.** Volumes can be attached and reused in instances to minimize
data cache synchronization time.
4. **Shared volumes.** Separate cloud services (such as Multi-Attach EBS or
EFS) might be needed for sharing data cache between multiple instances.
Contributor

Should we separate compute resource types (1 and 2) and storage resource types (3 and 4)?

Member Author

Sure. But for now, the list of optimizations seems too small to warrant separation or re-grouping. Also, some optimizations might be related, like re-attaching a volume when a spot instance fails.

Comment on lines +64 to +70
1. AWS
2. Azure
3. GCP (optional)
4. Remote over SSH
5. Kubernetes (K8S)
6. HashiCorp Nomad (optional)
7. Container services (ECS and Azure/GCP analogs)
Contributor

Seems like there are a few different categories here also:

  • Instance provisioning (1-3)
  • Managing existing resources (4)
  • Cluster provisioning (5-7)

Member Author

I list them in priority order from the user's point of view.

Do you think the categories will help? There might be many ways of categorizing. From an implementation point of view:

  • (4) is a prerequisite for all others
  • among the clouds (1-3), GCP (3) is "special"
  • (5)/K8S should be similar to clouds (1-3) with some "specialty"

Contributor

These categories are all helpful for me, especially since these lists have items of different types, which confuses me if I don't break them down. Again, not important as long as all the scenarios are clear to everyone.

@dberenbaum (Contributor)

Great stuff. A few unresolved questions I have:

  • How to share data between driver and executors?
  • How to handle dvc.lock data?
  • What happens when multiple jobs are submitted to the same executor (dvc exp run -S n=5 --executor my_exec; dvc exp run -S n=10 --executor my_exec)?
  • Should provisioning and teardown happen through DVC commands, or should they happen separately?
  • How to save executors for stages (i.e. always run on this executor)?

@pmrowla commented Mar 30, 2021

Related feature request regarding exp queue management: iterative/dvc#5615

Comment on lines +128 to +132
We need to introduce an **Executor** concept to DVC.
This concept should not break compatibility and should not require
significant changes in the docs (except command options).
However, it will require a new section in the docs to explain cloud
execution for the users who'd like to use this feature.
pared

If we are to abstract the execution environment, maybe we could start education from the "local" executor, i.e. the machine the user is using at the moment. I feel like explaining remotes starting from local ones made it easier for me.

@jorgeorpinel (Contributor), Mar 31, 2021

> start education from "local" executor

Good idea unless that implies significant changes to docs (e.g. having to use the local executor concept everywhere repro/run are mentioned). BTW I don't think we should avoid such big changes at all costs, just not sure it's necessary here. It may make things confusing for people looking for a basic local workflow.

pmrowla

The "local" executor already exists (exp run --temp), but we have already run into potential problems with making this a "default" behavior. Namely that debugging your pipeline becomes very difficult when the changes are not happening in your normal workspace.

So I think we would need to clarify what we want to do regarding educating the user about non-workspace execution environments.

@jorgeorpinel (Contributor), Apr 1, 2021

The "local" executor already exists (exp run --temp)

Good point. Local (to the env) but external (to the project)? Like "local remotes".

UPDATE: Actually I'm not sure. Maybe the local machine itself IS the executor, whether on a regular repro or an exp run --temp. Otherwise what exactly is an executor? 😅

Member Author

@pared could you please clarify? Do you mean abstracting environments like pip/conda packages, or just a directory (with `--temp`, as @pmrowla mentioned)?

> Maybe the local machine itself IS the executor

@jorgeorpinel yes, it is the default executor. @pmrowla's point is that we can abstract it out to a separate directory. But this is what `exp run --queue` does.
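For reference, a rough sketch of the existing behavior being referred to here (these `dvc exp run` flags already exist today; nothing below is new):

```bash
# Run an experiment in a temporary directory outside the workspace
# (the closest thing to a "local executor" today)
$ dvc exp run --temp

# Queue experiments and execute them later
$ dvc exp run --queue -S n=5
$ dvc exp run --queue -S n=10
$ dvc exp run --run-all
```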

pared

@dmpetrov Not quite, I was rather thinking that we are already executing locally. In a way, our local environment is an executor, so what I thought would be good is to teach this concept by saying "you are already using a (local) executor when using `--temp`". So, as an analogy: when you use remote executors, you will do a similar thing as with `--temp` and the local one, with all its pros and caveats.

cloud-exec.md Outdated
Comment on lines 134 to 139
# Drawbacks

This feature opens new user scenarios which increase the complexity of DVC;
it will require new sections in the docs, etc.
It might be beneficial to extract as much of this functionality as possible to
external products and tools (Terraform providers, for example).
@jorgeorpinel (Contributor), Mar 31, 2021

The doc opens with "Users need to run ML experiments in cloud instances or remote machines. DVC should support this out of the box." but this part seems to question that. My first Q is to clarify or decide on that. Do we want dvc CLI to have remote execution?

If it were possible to elaborate a little on a hypothetical external tool as an alternative (e.g. a Terraform provider), that could help. And what about putting it into dvcx (which, once installed, could extend the dvc CLI) or in another (perhaps GUI-based) tool from Iterative?

Member Author

Modified the intro and this section.

Contributor

Thanks. I'd resolve this (and others) but I can't on this repo.

Comment on lines +117 to +120
It can look like this:
```bash
$ dvc exp run --executor my-gpu-tesla
...
```
@jorgeorpinel (Contributor), Mar 31, 2021

Would executors get set up like remotes?

```bash
$ dvc executor add my-gpu-tesla ssh://[email protected]
$ dvc executor modify my-gpu-tesla pword 'myp4$$'
$ dvc exp run --on my-gpu-tesla
```

Contributor

Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?


> Are resulting outputs synchronized back into the workspace? Or does the executor keep its own repo clone with separate state?

My assumption here would be that the results of an experiment are synced back into the workspace ("results" meaning git-tracked repo state). Current --temp exp runs are already (local) executors, and experiment execution is already structured so that it would work with any machine we can talk to via git+ssh. So the git-tracked results of an executor run will be retrieved as experiment refs (like with existing local --temp runs).

For DVC-tracked (cached) outputs, they would be fetched into the local cache. This could be done by either fetching the data directly from the executor machine, or by using an intermediate DVC remote (so after a run, the executor does dvc push to a DVC remote, and then the user's local machine does dvc pull). The final dvc pull could actually be optional here, since the user may not need the cache data on their local machine at all.

Using the intermediate remote seems like it would fit better to me, but I think that needs some more discussion.
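A rough sketch of the intermediate-remote flow described above; the `--executor` flag and the executor name are hypothetical (part of this proposal), while `dvc push`/`dvc pull` and the `-r` remote selector already exist:

```bash
# Hypothetical end-to-end flow, assuming an ordinary DVC remote named
# "storage" is used as the intermediary for cached outputs.

# 1. User submits the run; git-tracked results come back as experiment refs
#    (same mechanism as existing --temp runs).
$ dvc exp run --executor my-gpu-tesla

# 2. On the executor, after the run finishes, cached outputs are pushed:
$ dvc push -r storage

# 3. Locally, pulling the cached outputs is optional:
$ dvc pull -r storage
```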

@jorgeorpinel (Contributor), Apr 1, 2021

Hmmmm... exp run --temp seems to behave more like a remote executor to me. Except that "remote" in this case means "external" (again, like "local remotes" which has never been a great name BTW). But what's the point here? That we can reuse that implementation? If so, sounds good!

Also agree about the assumption and proposed mechanics, except: what if the user hasn't set up any remote storage to use as the intermediate step? Will there be some default ones maintained by us (similar to the default Google Cloud Project)? In any case, a direct transfer may have better performance?

Comment on lines +34 to +36
5. **Instance pool.** Execute on one of the existing running instances.
6. **Exp queue.** Execute a set of experiments (from the exp queue, for example) on
   an instance pool or a single instance.
Contributor

To what extent would DVC utilize pools? Are we thinking of parallel execution / distributed computing? Or are we for now focusing on single instances (which may be obtained from a pool)?

Member Author

Single instance or a pool of instances for parallel execution. But it is not about distributed learning (if I understood the question correctly).

@dmpetrov (Member Author) commented Apr 13, 2021

> Great stuff. A few unresolved questions I have:

@dberenbaum all questions are great!

> • How to share data between driver and executors?

Push data constantly 😄 We might need some optimizations in the future, like a tmp remote dir.

> • How to handle dvc.lock data?

One option: commit dvc.lock to an experiment and push the data.

> • What happens when multiple jobs are submitted to the same executor (dvc exp run -S n=5 --executor my_exec; dvc exp run -S n=10 --executor my_exec)?

DVC should support a queue. A more interesting question is how to support a shared queue among multiple users. K8S/Nomad could be an answer, but it seems a bit heavy.

> • Should provisioning and teardown happen through DVC commands, or should they happen separately?
> • How to save executors for stages (i.e. always run on this executor)?

🤷‍♂️ I wouldn't define the API here.
Ideally, a user should be able to define or override executors in config (`dvc.yaml`). `dvc exp run` should execute, provisioning/terminating instances if needed. I'm adding more details in the Executors and naming section.
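A minimal sketch of how that could look from the command line; the `--executor` flag, the `my_exec` name, and the provision-on-demand behavior are all hypothetical parts of this proposal:

```bash
# Hypothetical: `my_exec` is defined once in config, then referenced by name.
# `dvc exp run` would provision the instance if needed, run, and tear it down.
$ dvc exp run -S n=5 --executor my_exec
$ dvc exp run -S n=10 --executor my_exec   # queued behind the first run
```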

@dmpetrov (Member Author)

I tried to address some of the questions. Please take a look at the answers and the new version.

@dberenbaum (Contributor)

> @dberenbaum all questions are great!

Thanks for the responses! What are your thoughts on copying the proposal to Notion? I can see as a reviewer now how it's a bit difficult to view the text and keep up with changes! Ideally, we are editing the document as we resolve comments to either incorporate the feedback or address why it wasn't incorporated (unless suggestions are minor or formatting issues like the list categories above).

Comment on lines +121 to +124
The executor definition (`p3.large` AWS instance in `us-west` with `abcdef`
volume attached) should be decoupled from the pipeline definition, the same way
as remotes are: stage `train` runs on `my-gpu-tesla`, not on an executor
definition with `p3.large`.
Contributor

If we define it in dvc.yaml, it makes it hard to run elsewhere (like locally), and it couples experiment history and infrastructure (similar to problems with changing remotes today). Maybe having a local config option or an `exp run --executor` flag is sufficient flexibility here? What would we do differently with remotes if we were starting from scratch?
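For illustration only, that local config option might look roughly like this; the `exec.executor` setting is hypothetical, modeled on how `core.remote` works for remotes today:

```bash
# Hypothetical local-only default executor, analogous to `dvc config core.remote`:
$ dvc config --local exec.executor my-gpu-tesla

# Per-run override with the proposed flag:
$ dvc exp run --executor my-gpu-tesla
```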

@aschuh-hf

Hi, the option to run (queued) experiments in parallel in the cloud would be amazing. Looking forward to it! Seeing the discussion of executor definition including AWS EC2 instance type, I wonder if you have considered supporting AWS Batch and/or Ray Clusters as executors? These are the two means we use in our research group for running experiments in the cloud. An integration with DVC would be great, and furthermore would separate management of AWS resources from DVC. Otherwise, we may have to have yet another set of EC2 roles specifically for DVC cloud executors, which is undesirable.
