exp run: support naive remote execution via --machine #7173

Merged
merged 7 commits into from
Jan 13, 2022

Conversation

pmrowla
Contributor

@pmrowla pmrowla commented Dec 20, 2021

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Related to #6267

  • Adds dvc exp run --machine for running an experiment on the specified dvc machine instance
  • Adds machine setup_script config option for runtime environment setup (this is separate from the existing startup_script which installs system packages at machine boot/startup).

Note: Until the next release, in order to actually use dvc exp run --machine you will need to override the default startup_script to install DVC from source with your own script:

#!/bin/bash
sudo add-apt-repository --yes ppa:deadsnakes/ppa 
sudo apt-get update
# NOTE: deadsnakes PPA python requires debian/ubuntu python3-pip
sudo apt-get install --yes python3.9 python3.9-dev python3.9-venv python3-pip
sudo -u ubuntu python3.9 -m pip install --upgrade pip --user
sudo -u ubuntu python3.9 -m pip install --upgrade setuptools --user
sudo -u ubuntu python3.9 -m pip install "git+https://github.com/iterative/dvc.git@refs/pull/7173/head#egg=dvc[all]" --user
echo "OK" | sudo tee /var/log/dvc-machine-init.log

Once this PR is merged you can drop the specific refs/pull/... from the pip install command. Also note that you have to install a Python version yourself: the system Python in the default CML machine image is Python 3.6, which is no longer supported in DVC.

asciicast

@pmrowla pmrowla force-pushed the exp-run-machine branch 5 times, most recently from dad6dff to e6de537 Compare December 21, 2021 09:42
@pmrowla pmrowla self-assigned this Dec 21, 2021
@pmrowla pmrowla added A: executors Related to the executors feature A: experiments Related to dvc exp labels Dec 21, 2021
@pmrowla pmrowla force-pushed the exp-run-machine branch 2 times, most recently from ee6f6fe to d44a4f5 Compare December 27, 2021 08:07
@lgtm-com

lgtm-com bot commented Dec 27, 2021

This pull request introduces 2 alerts and fixes 1 when merging d44a4f56adf97c7632bc251a2317c52a367d18e2 into 04e93da - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes
  • 1 for `__init__` method calls overridden method

fixed alerts:

  • 1 for `__eq__` not overridden when adding attributes

@pmrowla pmrowla force-pushed the exp-run-machine branch 2 times, most recently from 19ed112 to 3aadd6f Compare January 11, 2022 07:18
- prerequisite for remote execution
- allow explicitly pushing imports which would normally be skipped
  (internal-use only, does not apply to `dvc push`)
- log startup script finished message
- add setup_script option for runtime env setup
@pmrowla pmrowla changed the title [WIP] exp run: support naive remote execution via --machine exp run: support naive remote execution via --machine Jan 11, 2022
@pmrowla pmrowla marked this pull request as ready for review January 11, 2022 08:07
@pmrowla pmrowla requested a review from a team as a code owner January 11, 2022 08:07
@pmrowla pmrowla requested a review from pared January 11, 2022 08:07
Contributor

@pared pared left a comment


So far LGTM, though I would like to read it one more time tomorrow.

As of now, my questions are more about usage and do not necessarily need to be addressed in this PR:

  1. Waiting for startup script to complete:
    Maybe we could provide some "loading circle" placeholder?

  2. For the example: we had to modify the cmd to use python3.9 {rest of command} right?
    The ubuntu machine defaults to 2.7. I wonder how to handle it, but probably the best thing is to leave it to the user. But we should probably consider changing the default script to create a virtualenv and activate it (if it's possible).

  3. In setup.sh you only had pip install -r requirements, right?

@pmrowla
Contributor Author

pmrowla commented Jan 13, 2022

  1. Waiting for startup script to complete:
    Maybe we could provide some "loading circle" placeholder?

Yeah, eventually we could have some kind of spinner in the UI, but preferably this should really be handled on the CML side. The issue is that the CML/terraform resource does not actually wait for the boot/startup script to finish on terraform create, so the user has to check it themselves.
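Until that is handled upstream, the manual check could be scripted. The following is only a minimal sketch, not part of DVC or CML: it assumes the startup script ends by writing "OK" to /var/log/dvc-machine-init.log (as the example startup script in the PR description does) and that the helper is run on the machine instance itself. The function name `wait_for_marker` and its parameters are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not a DVC/CML command): block until a marker file
# contains a line equal to the expected text, or give up after a timeout.
#   $1 = marker file path
#   $2 = expected line (default: OK)
#   $3 = timeout in seconds (default: 300)
#   $4 = polling interval in seconds (default: 5)
wait_for_marker() {
    local marker_file="$1"
    local expected="${2:-OK}"
    local timeout="${3:-300}"
    local interval="${4:-5}"
    local waited=0

    # Check first, then sleep, so an already finished script returns at once.
    until [ -f "$marker_file" ] && grep -qxF -- "$expected" "$marker_file"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1  # timed out: marker never appeared
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

On the instance, something like `wait_for_marker /var/log/dvc-machine-init.log OK 600 10` would then return success once the startup script has finished, or a non-zero status after roughly ten minutes.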

  2. For the example: we had to modify the cmd to use python3.9 {rest of command} right?
    The ubuntu machine defaults to 2.7. I wonder how to handle it, but probably the best thing is to leave it to the user. But we should probably consider changing the default script to create a virtualenv and activate it (if it's possible).

This is also just the way the default CML image works (it uses ubuntu 18.04). It's possible for the user to specify alternate AWS images that would use a newer base operating system (that come with newer pythons).

  3. In setup.sh you only had pip install -r requirements, right?

The setup script creates a virtualenv and activates it first

#!/bin/bash
python3.9 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r src/requirements.txt

@pmrowla pmrowla merged commit dd5d999 into iterative:main Jan 13, 2022
@pmrowla pmrowla deleted the exp-run-machine branch January 13, 2022 07:34
Contributor

@pared pared left a comment


Post-merge LGTM.

@efiop efiop added the feature is a feature label Jan 14, 2022
@dberenbaum
Collaborator

dberenbaum commented Jan 21, 2022

@pmrowla

Great to see this actually working!

I tried it out, but I hit some issues:

  • Agree with @pared that we will definitely need a better solution for handling Python versions and defaulting to something that is at least still supported. Having to update every cmd to python3.9 is not great. I understand this is an issue with the default image, just wanted to note it.
  • It was way slower from my machine than your demo above. Besides the transfer time, there was a long time before the transfer even started, and it wasn't clear what was happening during that time. My progress bars are also a mess, but that's a separate issue.
  • The experiment ran, but it failed at the end and wasn't collected locally. I got the error: ERROR: Failed to reproduce experiment 'a5bf249': 'ExpRefInfo' object is not iterable.

asciicast

@pmrowla
Contributor Author

pmrowla commented Jan 22, 2022

  • Agree with @pared that we will definitely need a better solution for handling Python versions and defaulting to something that is at least still supported. Having to update every cmd to python3.9 is not great. I understand this is an issue with the default image, just wanted to note it.

The explicit python3.9 requirement should only be needed in your startup/setup scripts.

If the env setup script includes source <path_to_venv>/bin/activate you shouldn't need to update cmd in your dvc.yaml at all (just python will work as long as the venv is activated).
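To illustrate why activation is enough, here is a small sketch of the mechanism. It does not create a real venv: it mimics the main effect of `source <path_to_venv>/bin/activate`, which is to put the venv's bin directory first on PATH, using a temporary directory and a stub `python` executable. The directory layout and variable names are made up for illustration.

```bash
#!/bin/bash
# Stand-in for a venv: a temp directory whose bin/ holds a stub "python".
# (Illustration only; a real venv would come from `python3.9 -m venv .venv`.)
fake_venv="$(mktemp -d)"
mkdir -p "$fake_venv/bin"
printf '#!/bin/sh\necho "stub python from venv"\n' > "$fake_venv/bin/python"
chmod +x "$fake_venv/bin/python"

# The essential effect of activating a venv: its bin/ goes first on PATH.
PATH="$fake_venv/bin:$PATH"
export PATH
hash -r  # forget any previously cached command locations

# A plain `python ...` cmd now resolves to the interpreter inside the venv,
# so the stage commands themselves do not need to name python3.9.
resolved_python="$(command -v python)"
echo "$resolved_python"
```

With a real venv the same lookup applies: once the setup script has activated it, `python` and `pip` come from the venv's bin directory.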

  • The experiment ran, but it failed at the end and wasn't collected locally. I got the error: ERROR: Failed to reproduce experiment 'a5bf249': 'ExpRefInfo' object is not iterable.

Not sure what happened here, I'll have to look into it

Labels
A: executors Related to the executors feature A: experiments Related to dvc exp feature is a feature

4 participants