Run serverless endpoint batch test and record cost and time results #99
Additional meeting notes:
- Testing the impact of automation rules and the addition of Mira models during the test would show how multiple round trips and writes to the DB affect cost.
- Green light from MongoDB to get a dedicated instance for MongoDB Atlas, which should improve DB performance.
- The preference is to reduce cost over inference time.
- Kinesis Firehose or DynamoDB would be more time-performant datastores.
- We agreed to run without Atlas next week and assess whether we need to rerun the test with Atlas later.
With letterboxing and the fully reproduced YOLOv5, we get average inference times of 9 seconds per image on SageMaker Serverless, which only supports CPU.
When we ran the above test last year, we were testing with fixed resizing to 640x640 and a TorchScript model compiled for the CPU, and inference time was closer to 2.5 seconds per image: https://docs.google.com/spreadsheets/d/17t-zgKwWdVSArf7mgu4QJXOvtGVIlcUYTnwYEpNZQsU/edit#gid=0 We'll be exploring how to reduce inference time while preserving reproduced accuracy: #106
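For context on the preprocessing difference: letterboxing scales the image to fit the target size while preserving aspect ratio and pads the remainder, whereas last year's test used a plain 640x640 resize. A minimal sketch of letterbox preprocessing, assuming a 1280 target and a neutral pad value (not the exact code in our pipeline, just illustrative):

```python
import numpy as np
from PIL import Image

def letterbox(img: Image.Image, target: int = 1280, pad_value: int = 114) -> np.ndarray:
    """Resize to fit within target x target preserving aspect ratio, then pad the rest."""
    img = img.convert("RGB")
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)

    # Pad the short side so the output is exactly target x target.
    canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = np.asarray(resized)
    return canvas
```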
@rbavery deployed the ONNX MDv5 (PR here) to a SageMaker Serverless endpoint, and per-image inference looks to be around 3.5-4 seconds. The entire processing time for a test batch of 10,168 images was 11 hrs, 8 mins (3.9 seconds per image). So 1,000 images take roughly an hour to process, and 100k would take about 4.5 days. Not bad for now! Down the road we may explore speeding this up with concurrent processing (two separate Serverless endpoints for Megadetector, one for real-time inference and one for batch, and ditching the FIFO queues for standard SQS queues). There are also endpoint- and model-level optimizations we could explore (#112).
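For reference, the rough shape of per-image timing and the extrapolation above; this is a hedged sketch, and the endpoint name and content type are placeholders rather than the deployed endpoint's actual contract:

```python
import time
import boto3

ENDPOINT_NAME = "megadetector-v5-serverless"  # placeholder, not the real endpoint name
runtime = boto3.client("sagemaker-runtime")

def invoke(image_bytes: bytes) -> tuple[bytes, float]:
    """Send one image to the serverless endpoint and return (response body, latency in seconds)."""
    start = time.perf_counter()
    resp = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/x-image",  # assumed; depends on the serving container
        Body=image_bytes,
    )
    return resp["Body"].read(), time.perf_counter() - start

# Extrapolating the observed batch result (10,168 images in 11 h 8 min):
per_image_s = (11 * 3600 + 8 * 60) / 10_168   # ~3.9 s/image
print(f"{per_image_s:.2f} s/image, ~{1000 * per_image_s / 3600:.1f} h per 1,000 images")
```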
User story
We need to understand the cost of running inference with the current architecture on large archives (25 GB) of imagery, both in terms of time (does it take a week with retries? two days?) and in terms of cost for the serverless MDv5 endpoint that auto-scales with requests. For this first run, we won't include the Mira endpoints in this test.
We will run this test on duplicated images that match the production ratio of animals to no animals: ~60% are empty. All are JPEGs.
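A back-of-the-envelope sizing for the run, assuming the ~3.9 s/image figure from the earlier batch and a guessed average JPEG size (both are assumptions, not measurements):

```python
# Rough sizing for the 25 GB test batch. Average JPEG size is an assumption;
# the per-image latency comes from the earlier 10,168-image batch (~3.9 s/image).
ARCHIVE_GB = 25
AVG_JPEG_MB = 3.0          # assumed average file size; adjust once the sample is picked
SECONDS_PER_IMAGE = 3.9    # observed on the ONNX MDv5 serverless endpoint

n_images = int(ARCHIVE_GB * 1024 / AVG_JPEG_MB)
hours = n_images * SECONDS_PER_IMAGE / 3600
print(f"~{n_images:,} images, ~{hours:.0f} h ({hours / 24:.1f} days) of sequential inference")
```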
Secondarily, we'd like to understand:
Things we need to run the test:
Resolution Criteria
For 25 GB of random imagery, where sample images will be close to 1280x1280 (Natty will pick a representative range), how long does auto-scaled inference take for MDv5?
What was the cost per image? Did this vary throughout the job due to retries? (See the cost sketch after this list.)
Were there any failures not resolved by retries?
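To answer the cost-per-image question above, one approach is to convert billed serverless compute into a per-image figure. This is a sketch only; the per-GB-second and per-request rates are placeholders to be filled in from the current SageMaker Serverless Inference pricing page, not quoted values, and it ignores storage, SQS, and DB costs:

```python
def cost_per_image(
    total_compute_seconds: float,   # billed duration summed over all invocations
    memory_gb: float,               # memory configured on the serverless endpoint
    n_images: int,
    price_per_gb_second: float,     # placeholder: take from the SageMaker pricing page
    price_per_request: float = 0.0, # placeholder: any per-request/data-processing charge
) -> float:
    """Approximate serverless inference cost per image (compute + per-request charges only)."""
    compute_cost = total_compute_seconds * memory_gb * price_per_gb_second
    request_cost = n_images * price_per_request
    return (compute_cost + request_cost) / n_images
```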