Instance creation fails if checkpoint/timeframe are passed #65
Comments
@vgeorge Per conversation this morning, it is eventually set to `active=true`, it's just very slow.
Context

Models are now ~100x bigger than the NAIP model. The initially deployed NAIP model is approx 1.5 MB, while the currently deployed Sentinel models appear to be in the 150 MB range. This large increase in model size led to substantially longer init times in our worker instances. Unfortunately we do not currently surface that an instance has started but is in the process of configuring itself. Instances are only marked as `active=true` once configuration is complete. As such the frontend has no way of knowing that an instance is in a "starting" phase except to poll and wait for the `active=true` state. We have two options to move forward here:
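As a hedged sketch of what surfacing a "starting" phase could look like (the names and fields here are illustrative assumptions, not the actual PEARL backend API), the single `active` boolean could be backed by an explicit lifecycle phase:

```python
from enum import Enum

class InstancePhase(Enum):
    """Hypothetical lifecycle phases for a worker instance.
    The current backend only exposes a boolean `active` flag;
    this enum illustrates the extra "starting" state discussed above."""
    PENDING = "pending"        # k8s pod requested, not yet scheduled
    STARTING = "starting"      # pod running, model still loading
    ACTIVE = "active"          # model loaded, ready for inference
    TERMINATED = "terminated"

def to_legacy_active(phase: InstancePhase) -> bool:
    """Map the richer phase back onto the existing `active` boolean
    so current clients keep working unchanged."""
    return phase is InstancePhase.ACTIVE

print(to_legacy_active(InstancePhase.STARTING))  # False: instance up but configuring
print(to_legacy_active(InstancePhase.ACTIVE))    # True
```

With this shape, the frontend can distinguish "still configuring" from "not running" without breaking clients that only read `active`.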
@ingalls thanks for writing this, it describes the issue very well. Please let me know if you need anything from me.
@ingalls @vgeorge I'm able to reproduce this ~5min init time for instances. @ingalls I think we can do what you suggest in option 2 and can talk through how to test locally. There's already a

@srmsoumya could you look into this briefly and suggest options we have? Of course a 5min wait time for retraining is not a great experience while inference is relatively quick.

@ingalls in the init time, what's the specific action (https://github.com/developmentseed/pearl-backend/blob/develop/services/gpu/lib/ModelSrv.py#L658-L662) that's time consuming? It can't be the download, right?
We are using a DeepLabv3 framework with EfficientNet-b5 as the backbone. EfficientNet-b5 has 28M parameters; I can try changing that to EfficientNet-b2 or EfficientNet-b3, which have 7M & 10M parameters respectively. We could go further down, but there will be an accuracy tradeoff. Inside the model.zip folder we have 2 files:
We can reduce the number of data points in the seed dataset, which will reduce the space (I am currently using 16_000 embeddings). Apart from this, we can look at possible quantization or pruning options for the model, but this would be experimental. For NAIP, I can see two models in the Azure blob storage.
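As a rough back-of-the-envelope check, the parameter counts above line up with the observed model sizes, assuming fp32 weights at 4 bytes per parameter (the embedding dimension used for the seed-dataset estimate is an assumption, not a figure from this thread):

```python
BYTES_PER_PARAM = 4  # fp32 weights

def model_size_mb(num_params: int) -> float:
    """Approximate on-disk size of raw fp32 weights in MB."""
    return num_params * BYTES_PER_PARAM / 1e6

# Backbone parameter counts quoted above
for name, params in [("EfficientNet-b5", 28_000_000),
                     ("EfficientNet-b3", 10_000_000),
                     ("EfficientNet-b2", 7_000_000)]:
    print(f"{name}: ~{model_size_mb(params):.0f} MB")
# EfficientNet-b5 gives ~112 MB of weights alone, consistent with the
# ~150 MB deployed Sentinel model once other artifacts are included.

# Seed dataset: 16_000 embeddings; the dimension (64 here) is assumed.
seed_mb = 16_000 * 64 * BYTES_PER_PARAM / 1e6
print(f"seed dataset: ~{seed_mb:.1f} MB")  # ~4.1 MB at dim=64
```

If the arithmetic holds, the backbone weights dominate the archive size, so swapping to a smaller backbone would cut far more than trimming the seed dataset.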
From the code and Martha's notes, I infer they are using a DeepLabv3 framework with a ResNet18 backbone (11M parameters). I am not sure what is being used for model-1, though. cc @geohacker @ingalls
I just pulled the
@srmsoumya thanks for taking a look. I'm surprised too about the init time. Before we do any optimisations, let's get some numbers on which actions are time consuming. I suspect it's actually the k8s instance creation and not the checkpoint loading. If it's the instance creation, that should be similar to the cold start for any CPU/GPU and we can work around it by running some placeholders.
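To get those numbers, a minimal instrumentation sketch that could be dropped around the init steps might look like this (the step names and sleeps are placeholders, not the actual ModelSrv functions):

```python
import time
from contextlib import contextmanager

timings: dict = {}

@contextmanager
def timed(step: str):
    """Record the wall-clock duration of a named init step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

# Placeholder steps standing in for the real init actions
with timed("download_checkpoint"):
    time.sleep(0.01)  # e.g. blob download from Azure
with timed("load_model"):
    time.sleep(0.02)  # e.g. deserializing weights onto the GPU

# Print slowest steps first so the bottleneck is obvious
for step, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{step}: {seconds:.3f}s")
```

Logging one line per step like this would quickly confirm whether the k8s cold start, the checkpoint download, or the model load dominates the ~5min init.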
At the staging API, if the client does a request to `POST /project/:id/instance` including `checkpoint_id` and `timeframe_id` in the payload, an instance is created and starts running, but it never gets to the `active=true` state and it is not returned in the `GET /project/:id/instance` list. It is possible to get its status at `GET /project/:id/instance/:id`.

cc @ingalls @geohacker
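Given the behaviour above, a client currently has to poll the per-instance endpoint and wait for `active=true`. A hedged sketch of such a polling loop (the `fetch_instance` callable is a stand-in for the real HTTP call to `GET /project/:id/instance/:id`, not an actual PEARL client function):

```python
import time

def wait_until_active(fetch_instance, timeout_s: float = 600.0,
                      interval_s: float = 5.0) -> dict:
    """Poll the instance-status endpoint until the instance reports
    active=true, or raise TimeoutError.

    `fetch_instance` is a hypothetical callable standing in for
    GET /project/:id/instance/:id; it returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        instance = fetch_instance()
        if instance.get("active"):
            return instance
        time.sleep(interval_s)
    raise TimeoutError("instance never became active")

# Usage with a fake endpoint that becomes active on the third poll:
responses = iter([{"active": False}, {"active": False}, {"active": True}])
result = wait_until_active(lambda: next(responses), interval_s=0.0)
print(result)  # {'active': True}
```

A generous timeout matters here because, per the thread, instances can legitimately take ~5 minutes to configure before flipping to `active=true`.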