Make Use of CPU and GPU Queues #668

Merged 4 commits on Feb 1, 2019

@jamesmcclain jamesmcclain commented Jan 25, 2019

Overview

Allows jobs to be run on both CPU and GPU instances on AWS.
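A minimal sketch of the routing idea, not the code in this PR: each command is sent to either a GPU or a CPU queue/definition pair. The set of GPU-bound commands, the `BatchTarget` class, the `pick_target` helper, and the queue and definition names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: which command types need a GPU. Illustrative only.
GPU_COMMANDS = frozenset({"TRAIN", "PREDICT"})


@dataclass(frozen=True)
class BatchTarget:
    """A job queue paired with the job definition to run on it."""
    job_queue: str
    job_definition: str


def pick_target(command_type: str, gpu: BatchTarget, cpu: BatchTarget) -> BatchTarget:
    """Return the GPU target for GPU-bound commands and the CPU target otherwise."""
    return gpu if command_type.upper() in GPU_COMMANDS else cpu


# Hypothetical names; real ones come from the user's Batch setup.
gpu = BatchTarget("exampleGpuJobQueue", "exampleGpuJobDefinition")
cpu = BatchTarget("exampleCpuJobQueue", "exampleCpuJobDefinition")

for command in ("CHIP", "TRAIN", "PREDICT", "EVAL"):
    print(command, "->", pick_target(command, gpu, cpu).job_queue)
```

Keeping the queue and the definition together in one object avoids pairing a CPU definition with a GPU queue by accident.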

Checklist

  • Updated docs/changelog.rst
  • Added needs-backport label if PR is bug fix that applies to previous minor release
  • Ran scripts/format_code and committed any changes
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Closes #634
Closes #649

See also https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L712-L716 and https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L745-L756

Testing

Tested with Vegas SpaceNet, using this command line:

rastervision run aws_batch -e spacenet.vegas -a test True -a use_remote_data True -a root_uri s3://bucket/prefix -a target buildings -a task_type semantic_segmentation

and this patch applied on top of this branch:

diff --git a/rastervision/runner/aws_batch_experiment_runner.py b/rastervision/runner/aws_batch_experiment_runner.py
index b365b04..20d998b 100644
--- a/rastervision/runner/aws_batch_experiment_runner.py
+++ b/rastervision/runner/aws_batch_experiment_runner.py
@@ -51,6 +51,9 @@ class AwsBatchExperimentRunner(OutOfProcessExperimentRunner):
                 cpu_job_definition = job_definition
         self.cpu_job_definition = cpu_job_definition
 
+        self.job_definition = 'jamesmcclain-dockerhub-gpu'
+        self.cpu_job_definition = 'jamesmcclain-dockerhub-cpu'
+
         self.submit = self.batch_submit
         self.execution_environment = 'Batch'
 

@jamesmcclain jamesmcclain requested a review from lewfish January 28, 2019 10:54

@lewfish lewfish left a comment

Tested using azavea/raster-vision-aws#8 and updating ~/.rastervision/default to contain:

[AWS_BATCH]
job_queue=lewfishRasterVisionGpuJobQueue
job_definition=lewfishRasterVisionCustomGpuJobDefinition
cpu_job_queue=lewfishRasterVisionCpuJobQueue
cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition

The only requested change is to update the docs at: https://github.com/azavea/raster-vision/blob/develop/docs/setup.rst#L203-L219 with the new fields.
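For readers who want to see how such a profile could be consumed, here is a hedged sketch using Python's standard `configparser`. The `load_batch_settings` helper and the example values are hypothetical, and falling back to the GPU values when the CPU keys are missing is an assumption (the diff above suggests such a fallback for `cpu_job_definition` only).

```python
import configparser

# Hypothetical profile contents mirroring the four keys shown above.
EXAMPLE_PROFILE = """
[AWS_BATCH]
job_queue=exampleGpuJobQueue
job_definition=exampleGpuJobDefinition
cpu_job_queue=exampleCpuJobQueue
cpu_job_definition=exampleCpuJobDefinition
"""


def load_batch_settings(text: str) -> dict:
    """Parse the [AWS_BATCH] section and return the four queue/definition values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["AWS_BATCH"]
    settings = {
        "job_queue": section["job_queue"],
        "job_definition": section["job_definition"],
    }
    # Assumed behavior: reuse the GPU values when the CPU keys are absent.
    settings["cpu_job_queue"] = section.get(
        "cpu_job_queue", fallback=settings["job_queue"])
    settings["cpu_job_definition"] = section.get(
        "cpu_job_definition", fallback=settings["job_definition"])
    return settings


print(load_batch_settings(EXAMPLE_PROFILE))
```

With this fallback, a profile that only defines `job_queue` and `job_definition` would send every command to the same queue.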

lewfish commented Jan 29, 2019

I thought this worked but when I looked at the Batch console I noticed that the first job is stuck in Runnable. This could be because there's something messed up with the new Batch resources I just created using the new CloudFormation setup. But it also looks like what happened in the past when we had jobs with cross-queue dependencies. When you tested whether this was possible, did you notice if the jobs were actually completed?
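To make "cross-queue dependencies" concrete, the sketch below chains jobs so that each one names its parents in `dependsOn` while being submitted to a different queue. It is an illustration under assumptions, not the runner's code: `FakeBatchClient` is an offline stand-in that only mimics the keyword arguments and response shape of boto3's Batch `submit_job`, and the step, queue, and definition names are made up.

```python
import itertools


class FakeBatchClient:
    """Offline stand-in mimicking the shape of boto3's Batch submit_job call."""

    def __init__(self):
        self.requests = []
        self._ids = itertools.count(1)

    def submit_job(self, **kwargs):
        # Record the request and hand back a made-up job ID.
        self.requests.append(kwargs)
        return {"jobName": kwargs["jobName"], "jobId": f"job-{next(self._ids)}"}


def submit_chain(client, steps):
    """Submit steps in order; each step is (name, queue, definition, parent_names)."""
    job_ids = {}
    for name, queue, definition, parents in steps:
        request = {"jobName": name, "jobQueue": queue, "jobDefinition": definition}
        if parents:
            # Parents may live in a different queue than this job.
            request["dependsOn"] = [{"jobId": job_ids[p]} for p in parents]
        job_ids[name] = client.submit_job(**request)["jobId"]
    return job_ids


# Hypothetical three-step pipeline alternating between CPU and GPU queues.
steps = [
    ("chip", "exampleCpuJobQueue", "exampleCpuJobDefinition", []),
    ("train", "exampleGpuJobQueue", "exampleGpuJobDefinition", ["chip"]),
    ("eval", "exampleCpuJobQueue", "exampleCpuJobDefinition", ["train"]),
]
client = FakeBatchClient()
print(submit_chain(client, steps))
```

Against a real account, a client created with `boto3.client("batch")` could presumably take the fake's place, since the request keys used here follow that API.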

[Screenshot: 2019-01-28 at 7:09:13 PM]

@jamesmcclain

> did you notice if the jobs were actually completed?

All completed.
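One way to check this without the console is to summarize job statuses programmatically. The sketch below works on a dictionary shaped like the response of boto3's Batch `describe_jobs` (a `jobs` list whose entries carry `jobName` and `status`); the `summarize` helper and the sample data are hypothetical and are not results from this PR.

```python
from collections import Counter


def summarize(response: dict) -> dict:
    """Count jobs per status and list any that are still waiting in RUNNABLE."""
    jobs = response["jobs"]
    counts = Counter(job["status"] for job in jobs)
    stuck = [job["jobName"] for job in jobs if job["status"] == "RUNNABLE"]
    all_succeeded = bool(jobs) and all(job["status"] == "SUCCEEDED" for job in jobs)
    return {"counts": dict(counts), "runnable": stuck, "all_succeeded": all_succeeded}


# Made-up sample in the shape of a describe_jobs response.
sample = {
    "jobs": [
        {"jobName": "chip", "status": "SUCCEEDED"},
        {"jobName": "train", "status": "RUNNABLE"},
        {"jobName": "eval", "status": "PENDING"},
    ]
}
print(summarize(sample))
```

A job that stays in `runnable` for a long time is the symptom described above, so listing those names separately makes it easy to spot.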

@jamesmcclain

rastervision run aws_batch -e spacenet.vegas -a test True -a use_remote_data True -a root_uri s3://bucket/prefix -a target buildings -a task_type semantic_segmentation

The first screenshot was taken before the job was submitted.

[Ten screenshots, timestamped 2019-01-30 06:34 through 07:03, not reproduced here.]

@jamesmcclain

> Tested using azavea/raster-vision-cloudformation#8 and updating ~/.rastervision/default to contain:
>
> [AWS_BATCH]
> job_queue=lewfishRasterVisionGpuJobQueue
> job_definition=lewfishRasterVisionCustomGpuJobDefinition
> cpu_job_queue=lewfishRasterVisionCpuJobQueue
> cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition
>
> The only requested change is to update the docs at: https://github.com/azavea/raster-vision/blob/develop/docs/setup.rst#L203-L219 with the new fields.

Updated, but still out of date because the instructions should probably reference raster-vision-cloudformation (see #672).

lewfish commented Jan 30, 2019

After making some changes (for one, lowering the requested RAM), I've got the jobs to move past Runnable in the CPU queue, but they still crash. I think there's something wrong with the CloudFormation setup. I have one more idea to try before I contact Ops.

@jamesmcclain

> After making some changes (for one, lowering the requested RAM), I've got the jobs to move past Runnable in the CPU queue, but they still crash. I think there's something wrong with the CloudFormation setup. I have one more idea to try before I contact Ops.

Okay

@jamesmcclain jamesmcclain merged commit 5ce55ee into azavea:develop Feb 1, 2019
@jamesmcclain jamesmcclain deleted the cpu-gpu branch February 1, 2019 14:55