[BUG] Multiple calls of model deploy API causes exception from Memory Circuit Breaker #2308
Comments
I'm not 100% sure if this is an ml-commons bug. It seems like memory usage in the cluster is still very high.
We do have this set to 100 for neural-search: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L116. Let me try different values for the other setting, for the JVM heap.
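For reference, a minimal sketch (using the OpenSearch low-level Java REST client) of how such a threshold override could be applied as a cluster setting. The setting keys `plugins.ml_commons.native_memory_threshold` and `plugins.ml_commons.jvm_heap_memory_threshold` are assumed from the linked test fixture and the ml-commons documentation, and the host/port values are placeholders; this is an illustration, not the exact test code.

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class MlCircuitBreakerSettings {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; adjust for the cluster under test.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/_cluster/settings");
            // Setting keys assumed from the linked BaseNeuralSearchIT fixture / ml-commons docs;
            // 100 effectively relaxes both the native-memory and JVM-heap circuit breakers.
            request.setJsonEntity("{\n"
                + "  \"persistent\": {\n"
                + "    \"plugins.ml_commons.native_memory_threshold\": 100,\n"
                + "    \"plugins.ml_commons.jvm_heap_memory_threshold\": 100\n"
                + "  }\n"
                + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```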
I found that value of …
I believe the problem is related to the fact that after several load/unload cycles, the un-released memory is held by the PyTorch runtime library, which is used as a black box in DJL. The most common use case for PyTorch is hosting a model server where performance is the No. 1 priority, so it's designed to keep consuming a large amount of memory even after the model is unloaded. Our use case is special, which is why we don't recommend using pre-trained or local models in a production environment. For this integration-test problem, can you reduce the number of load/unload cycles in your tests? In other words, is it possible to finish all the necessary tests within a single model lifecycle? Also, can you try using a smaller model in the IT?
I think we are already using the small model from …
@Zhangxunmt My team has a hypothesis that the Memory CB does not calculate used memory properly; in particular, mmapped files are also counted. That causes leak-like behavior: over time, after multiple undeployments, the amount of memory that is counted goes beyond the memory that is actually used. I've verified this with the following experiment:
Step 1 confirms the issue. Step 3 shows that even with a 100% threshold the CB doesn't count memory usage correctly. To reproduce the issue I set up https://github.com/opensearch-project/opensearch-build/ locally and pointed it to my custom branch of ml-commons. I suggest that ml-commons either add an option or setting to disable the memory CB completely, or skip the CB check if the threshold is set to >= 100%.
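A minimal sketch of the suggested behavior, not ml-commons' actual circuit-breaker implementation: a breaker whose check short-circuits when the configured threshold is 100% or more, so a possibly mmap-skewed usage estimate can no longer block deployments once the operator has opted out. All names below are hypothetical.

```java
/**
 * Hypothetical illustration of the suggestion above (not the real ml-commons
 * memory circuit breaker): a threshold of >= 100% disables the check entirely.
 */
public class ThresholdCircuitBreakerSketch {

    private final int thresholdPercent;

    public ThresholdCircuitBreakerSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    /**
     * @param usedPercent estimated memory usage in percent; as described above,
     *                    this estimate may over-count memory-mapped model files
     * @return true if the breaker should trip and reject the deploy request
     */
    public boolean isOpen(double usedPercent) {
        if (thresholdPercent >= 100) {
            // Operator opted out: ignore the (possibly inaccurate) usage estimate.
            return false;
        }
        return usedPercent > thresholdPercent;
    }
}
```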
The memory CB is disabled with heap threshold == 100. Resolving this issue.
What is the bug?
When uploading a model with the _upload API, the system returns a Memory Circuit Breaker exception.
How can one reproduce the bug?
Steps to reproduce the behavior:
To increase the chance of the error, change the max JVM heap to 1 GB here. This setting is the same one the infra/build team uses for the distribution pipeline run.
The exact tests that fail are random, but the error happens on every execution of the run-tests command; it's always 2 to 6 failing tests.
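For context, a rough sketch of the deploy/undeploy cycling that the test run exercises against the cluster. The model id and host are placeholders, and the `_deploy`/`_undeploy` endpoints are assumed from the ml-commons model APIs; this is an illustration of the access pattern, not the actual test code.

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

public class DeployUndeployLoop {
    public static void main(String[] args) throws Exception {
        String modelId = "<MODEL_ID>"; // placeholder: id obtained from the upload task
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Each test class deploys the model and undeploys it again during cleanup.
            // With a 1 GB heap, a handful of such cycles is enough to trip the memory CB.
            for (int i = 0; i < 6; i++) {
                client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_deploy"));
                // ... run ingest/search against the deployed model here ...
                client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_undeploy"));
            }
        }
    }
}
```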
What is the expected behavior?
No CB error
What is your host/environment?
Do you have any additional context?
We upload a model from the ml-commons repo using the following request payload: https://github.com/opensearch-project/neural-search/blob/main/src/test/resources/processor/UploadModelRequestBody.json
We use the following sequence for model upload (a sketch of the full sequence is shown after the delete steps below):
create model group
upload model, wait for the task to complete, get the model id
Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L146
deploy model by model id
Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L175
We use the following sequence of calls to delete resources:
Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L916
There is no call to delete the model group, as it should be deleted automatically when the last associated model is deleted.
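To make the sequence above concrete, a hedged sketch of the same lifecycle using the OpenSearch low-level Java REST client. The endpoints are taken from the ml-commons model APIs referenced in this issue, while the payloads, task polling, and JSON extraction are simplified placeholders rather than the exact code in BaseNeuralSearchIT.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class ModelLifecycleSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Create a model group (minimal, hypothetical payload).
            Request groupReq = new Request("POST", "/_plugins/_ml/model_groups/_register");
            groupReq.setJsonEntity("{\"name\":\"neural-search-it-group\",\"description\":\"IT models\"}");
            System.out.println(EntityUtils.toString(client.performRequest(groupReq).getEntity()));

            // 2. Upload the model; the payload corresponds to UploadModelRequestBody.json
            //    from the neural-search repo (the file path below is a placeholder).
            Request uploadReq = new Request("POST", "/_plugins/_ml/models/_upload");
            uploadReq.setJsonEntity(Files.readString(Path.of("UploadModelRequestBody.json")));
            Response uploadResp = client.performRequest(uploadReq);
            System.out.println(EntityUtils.toString(uploadResp.getEntity()));
            // The response contains a task_id; the tests poll GET /_plugins/_ml/tasks/<task_id>
            // until it reports COMPLETED and exposes the model_id (JSON parsing omitted here).
            String modelId = "<MODEL_ID_FROM_TASK>"; // placeholder

            // 3. Deploy the model by id (also asynchronous in practice).
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_deploy"));

            // Cleanup: undeploy, then delete the model. No explicit model-group delete is made,
            // since the group is expected to be removed once its last associated model is deleted.
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_undeploy"));
            client.performRequest(new Request("DELETE", "/_plugins/_ml/models/" + modelId));
        }
    }
}
```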
It's somewhat related to #1896, but at that time we lowered the chance of test failures by increasing the max heap size to 4 GB. For 2.14 that is not an option, per this global issue: opensearch-project/neural-search#667