The Docker container on #98 successfully reproduces MDV5a detections locally. However, when it is deployed as a serverless endpoint on SageMaker, requests initially hang for several minutes and then fail with an OOM error.
Context
We are working with YOLOv5, a standard, widely used object detection model that has a relatively low memory footprint and fast inference (a few seconds per image): https://github.com/ultralytics/yolov5
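For reference, this is roughly how a YOLOv5 weight file is loaded and run via torch.hub. It is a minimal sketch, not our handler code; the weight file name and confidence threshold are placeholders.

```python
# Minimal sketch: load custom YOLOv5 weights via torch.hub and run inference.
# "md_v5a.0.0.pt" and the 0.1 threshold are placeholders, not our exact settings.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")
model.conf = 0.1  # confidence threshold (illustrative value)

results = model("test_image.jpg")   # accepts a path, URL, PIL image, or ndarray
print(results.pandas().xyxy[0])     # detections as a pandas DataFrame
```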
We are deploying it with a custom container built from a TorchServe base image that copies a SageMaker entrypoint and config into the container. That is located here: #98
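The actual entrypoint lives in #98; purely as an illustration of the pattern, a TorchServe-based SageMaker entrypoint typically looks something like the sketch below. The file paths and model name here are assumptions, not our exact configuration.

```python
# Illustrative sketch only (not the entrypoint in #98): start TorchServe when the
# container is launched with the "serve" argument, then keep the container alive.
import subprocess
import sys

def start_torchserve():
    # Paths and model name below are assumed for illustration.
    subprocess.check_call([
        "torchserve",
        "--start",
        "--ts-config", "/home/model-server/config.properties",
        "--model-store", "/home/model-server/model_store",
        "--models", "mdv5a=mdv5a.mar",
    ])

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "serve":
        start_torchserve()
    # torchserve --start daemonizes, so keep the container's main process running
    subprocess.call(["tail", "-f", "/dev/null"])
```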
In the past we successfully deployed this model to SageMaker serverless without a preprocessing step to resize the image, using this commit: 9c0ec84
However, the new model with the two changes described below now triggers SageMaker's memory-limit error, even with an endpoint config that has the maximum 6 GB of memory. For the previous, working deployment I was able to use a 4 GB serverless endpoint. Locally, I've confirmed the container stays under 6 GB, so I suspect SageMaker is not returning the correct error, since the container never exceeds 6 GB locally. docker stats confirms this:
```
→ docker stats --no-stream
CONTAINER ID   NAME             CPU %   MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O       PIDS
9b2035d8d906   confident_saha   1.07%   5.175GiB / 5.807GiB   89.12%   208kB / 28.6kB   352MB / 922kB   62
```
One final data point: the endpoint accepts requests and returns good inferences when I change the instance concurrency to 1, but only after about 30 seconds of inference time. That is much longer than it takes to spin up TorchServe, load the model, and run inference locally (about 7 seconds to start the server and load the model, and another 7 seconds to run inference on my Mac). So lowering the concurrency helps, but a concurrency of 5 wasn't an issue for the old yolov5 deployment without the resize step, so concurrency doesn't seem to be the root of the issue.
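For completeness, the serverless endpoint configuration is created roughly as follows. This is a sketch with assumed names (the endpoint name comes from the CloudWatch log group below; the model and config names are placeholders); the memory and concurrency values are the ones discussed above.

```python
# Sketch of the serverless endpoint configuration (boto3). Names other than the
# endpoint name are placeholders; 6144 MB is the serverless maximum.
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_endpoint_config(
    EndpointConfigName="mdv5a-letterbox-serverless",   # placeholder
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "mdv5a-letterbox-model",           # placeholder
        "ServerlessConfig": {
            "MemorySizeInMB": 6144,   # 6 GB, the maximum for serverless endpoints
            "MaxConcurrency": 1,      # requests only succeed at 1; 5 fails for the new model
        },
    }],
)

sm.create_endpoint(
    EndpointName="mdv5a-letterbox",
    EndpointConfigName="mdv5a-letterbox-serverless",
)
```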
Things we changed in the new deployment
We fully reproduced the preprocessing (letterbox resize) step applied to each image; it has a negligible memory footprint. A sketch of this resize follows this list.
We also changed how we load the model: we now load the model weight file through yolov5 itself instead of a portable TorchScript file. This also has a negligible memory footprint.
Once we added these steps and tested the container locally, inference works as intended, similar to the old deployment but with more accurate results.
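The resize we reproduced follows the letterbox logic from the YOLOv5 repo. Below is a simplified sketch; the exact handler code is in #98, and the 1280x1280 target size and gray padding value are assumptions here.

```python
# Simplified letterbox resize in the style of YOLOv5's utils.augmentations.letterbox:
# scale the image to fit the target canvas while preserving aspect ratio, then pad.
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_shape=(1280, 1280), color=(114, 114, 114)):
    h, w = img.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)               # scale ratio
    new_unpad = (int(round(w * r)), int(round(h * r)))        # (width, height) after resize
    dw = new_shape[1] - new_unpad[0]                          # total horizontal padding
    dh = new_shape[0] - new_unpad[1]                          # total vertical padding
    dw, dh = dw / 2, dh / 2                                   # split padding between both sides
    img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```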
Error logs
Here is the failing endpoint and some sample logs from when I sent a request:
https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FEndpoints$252Fmdv5a-letterbox/log-events/AllTraffic$252Fd9a382cf1da9ba7cafdfe4d5f96e3374-2445ef8648874902bd52bd2d1bfa603c
And the error:
On the client side, this is the response to the first request; subsequent requests report an out-of-memory error: