Make Use of CPU and GPU Queues #668
Tested using azavea/raster-vision-aws#8 and updating ~/.rastervision/default
to contain:
[AWS_BATCH]
job_queue=lewfishRasterVisionGpuJobQueue
job_definition=lewfishRasterVisionCustomGpuJobDefinition
cpu_job_queue=lewfishRasterVisionCpuJobQueue
cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition
The only requested change is to update the docs at: https://github.com/azavea/raster-vision/blob/develop/docs/setup.rst#L203-L219 with the new fields.
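As a rough illustration of how the four profile fields above could be consumed, here is a minimal sketch that parses the `[AWS_BATCH]` section and picks a queue and job definition depending on whether a step needs a GPU. The function name `select_batch_target` and the fallback to the GPU settings when the `cpu_*` keys are missing are assumptions made for this example; they are not taken from the Raster Vision code in this PR.

```python
import configparser

# Profile contents copied from the review comment above.
SAMPLE_PROFILE = """
[AWS_BATCH]
job_queue=lewfishRasterVisionGpuJobQueue
job_definition=lewfishRasterVisionCustomGpuJobDefinition
cpu_job_queue=lewfishRasterVisionCpuJobQueue
cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition
"""


def select_batch_target(profile_text, use_gpu):
    """Return (queue, definition) for a step (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(profile_text)
    section = parser["AWS_BATCH"]

    if use_gpu:
        return section["job_queue"], section["job_definition"]

    # Assumption: when the cpu_* keys are absent, reuse the GPU settings
    # so that older profiles keep working.
    queue = section.get("cpu_job_queue", fallback=section["job_queue"])
    definition = section.get(
        "cpu_job_definition", fallback=section["job_definition"])
    return queue, definition
```

With the sample profile, `select_batch_target(SAMPLE_PROFILE, use_gpu=False)` yields the CPU queue and CPU job definition, while `use_gpu=True` yields the GPU pair.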
I thought this worked but when I looked at the Batch console I noticed that the first job is stuck in Runnable. This could be because there's something messed up with the new Batch resources I just created using the new CloudFormation setup. But it also looks like what happened in the past when we had jobs with cross-queue dependencies. When you tested whether this was possible, did you notice if the jobs were actually completed?
All completed.
Updated, but still out of date because the instructions should probably reference
After making some changes (for one, lowering the requested RAM) I've got the jobs to move past Runnable in the CPU queue but they still crash. I think there's something wrong with the CloudFormation setup. I have one more idea to try before I contact Ops.
Okay
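For readers unfamiliar with the cross-queue dependencies discussed above, the following is a minimal sketch of what such a chain looks like at the AWS Batch API level: each job is submitted to its own queue and lists the previous job's ID under `dependsOn`. The helper names, the step names, and the `echo` commands are hypothetical; only the `submit_job` parameters (`jobName`, `jobQueue`, `jobDefinition`, `dependsOn`, `containerOverrides`) come from the Batch API. This is not the code in this PR.

```python
def build_submit_kwargs(name, command, queue, definition, parent_job_ids=()):
    """Build keyword arguments for the Batch submit_job call."""
    kwargs = {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {"command": list(command)},
    }
    if parent_job_ids:
        # A job may depend on jobs that were submitted to a different queue.
        kwargs["dependsOn"] = [{"jobId": job_id} for job_id in parent_job_ids]
    return kwargs


def submit_chain(steps, batch_client):
    """Submit steps in order, each depending on the one before it.

    `steps` is a list of (name, command, queue, definition) tuples.
    `batch_client` is expected to behave like boto3.client("batch").
    """
    parent_ids = []
    submitted = []
    for name, command, queue, definition in steps:
        kwargs = build_submit_kwargs(
            name, command, queue, definition, parent_job_ids=parent_ids)
        response = batch_client.submit_job(**kwargs)
        parent_ids = [response["jobId"]]
        submitted.append(response["jobId"])
    return submitted


# Hypothetical two-step chain: a CPU step followed by a GPU step.
EXAMPLE_STEPS = [
    ("chip", ["echo", "chip"],
     "lewfishRasterVisionCpuJobQueue",
     "lewfishRasterVisionCustomCpuJobDefinition"),
    ("train", ["echo", "train"],
     "lewfishRasterVisionGpuJobQueue",
     "lewfishRasterVisionCustomGpuJobDefinition"),
]
```

Passing `boto3.client("batch")` as `batch_client` would submit real jobs. A dependent job does not start until its parent has succeeded, so watching the console for whether the second job ever leaves that waiting state is one way to check that the cross-queue dependency is honoured.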
Overview
Allows jobs to be run on both CPU and GPU instances on AWS.
Checklist
docs/changelog.rst
needs-backport label if PR is bug fix that applies to previous minor release
Closes #634
Closes #649
See also https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L712-L716 and https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L745-L756
Testing
Tested with Vegas SpaceNet, using this command line:
rastervision run aws_batch -e spacenet.vegas -a test True -a use_remote_data True -a root_uri s3://bucket/prefix -a target buildings -a task_type semantic_segmentation
and this patch on top of this branch