pytorch transfer learning and model hosting failed to load model test case #620
Comments
@quantum-fusion There aren't enough details in the given log, so I will try executing this and let you know.

@dhaniram-kshirsagar I tested the generated .MAR file using the PyTorch archiver, and then ran another test, this time using the multi-model-server (https://github.com/awslabs/multi-model-server). The transcript file is here: https://www.dropbox.com/s/zhh87q0mtmys2zh/modelserverLog.txt?dl=0. It provides a lot more data, and it leads me to believe there may have been an issue with how the .MAR file was generated with the PyTorch archiver.

@dhaniram-kshirsagar I exported the model and converted it to .MAR format (https://www.dropbox.com/s/m29a1h1y0u6haa8/model_save.tar?dl=0).

@dhaniram-kshirsagar Do you see anything wrong with how the .PTH file was saved?

# save checkpoint
checkpointpath = "/Users/hottelet/pytorch/tutorials/beginner_source/model_save/transferlearningmodel.pth"
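A common cause of TorchServe's "Worker died" error with eager-mode models is saving the whole pickled model object instead of its `state_dict`; the eager-mode loader expects a `state_dict` checkpoint plus a `model.py` defining the matching architecture. Below is a minimal sketch of the expected save/load pattern; the `nn.Sequential` here is a hypothetical stand-in for the fine-tuned ResNet-18 (the mechanics are identical), and the path is shortened from the one in the comment above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the fine-tuned model; any nn.Module behaves the same way.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Save only the learned parameters (the state_dict), not the pickled model object.
checkpoint_path = "transferlearningmodel.pth"  # shortened path for this sketch
torch.save(model.state_dict(), checkpoint_path)

# Loading: rebuild the architecture first, then restore the weights.
restored = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
restored.load_state_dict(torch.load(checkpoint_path))
restored.eval()  # inference mode, as a serving worker would use it
```

If the .PTH file was instead produced with `torch.save(model)`, the worker would try to unpickle a full model object and could die at load time if the class definition is not importable.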
Analysis
Solution
This should be added to the TorchServe guide (#215).
@dhaniram-kshirsagar I have tried the following steps.
It looks like you are using the TorchServe 0.1.1 release (per the logs on Dropbox). Can you please upgrade TorchServe to the latest version and try again?
@dhaniram-kshirsagar torch-model-archiver and torchserve were upgraded to 0.2.0. I am trying again; the .MAR file has to be regenerated for 0.2.0.
@dhaniram-kshirsagar Are we not expecting the result to be a tiger cat, like in the torchserve example? This transfer learning experiment was supposed to add ants and bees as patterns, not replace the tiger cat with an ant or a bee. Please let me know why the index_to_name.json class names were only ant for 0 and bee for 1, and why those two class names replace the more than 1000 class names already in the ResNet-18 model. The result is as follows:

torchserve --start --ncs --model-store ./ --models transferlearningmodel.mar
@dhaniram-kshirsagar I had a similar set of results using the BentoML transfer learning example: while the REST API had a similar format, the transfer learning also produced the wrong results, which invalidated the performance of the model (jjmachan/resnet-bentoml#1). Can you please help explain the transfer learning results? Should we not expect to see a tiger cat in the result? Transfer learning should add patterns, not delete them.
@prashantsail Do you see that the results do not show a tiger cat? I don't think this is what we were expecting.
@prashantsail Look at the results below. I used the index_to_name.json from the torchserve repo, and now we see wrong results. It may be that we have the wrong format for index_to_name.json, one that does not include the ants and bees.

Step 1:
Step 2:
Results:
Referring to transfer_learning_tutorial.py
index_to_name.json
To clarify further:
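For reference, one plausible minimal index_to_name.json for the two-class model, assuming the default image_classifier handler looks labels up by stringified class index (the exact file used in the tutorial is not shown in this thread, so treat this as an illustrative sketch):

```json
{
  "0": "ant",
  "1": "bee"
}
```

A mapping like this would produce responses of the form `{"ant": 0.99, "bee": 0.01}`, matching the outputs shown below.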
Also, the model provided valid prediction results when I tested it with random ant and bee images:

sample1.jpg (ant)
{
  "ant": 0.9995238780975342,
  "bee": 0.0004761181480716914
}

sample2.jpg (bee)
{
  "ant": 0.003603871911764145,
  "bee": 0.9963961243629456
}
@prashantsail I thought the whole point of transfer learning is that you can add the 2 additional classes without losing the 1000 classes of the pre-trained model. What they mean by transfer learning is that you keep what the 1000-class pre-trained model already learned, and then add 2 additional classes. Am I missing something? What you are describing is a basic neural net that does not do any transfer learning; what you have described is a simple classifier. Please explain the intent and definition of transfer learning.
@prashantsail I do not know if @devansh20la had a great idea when he tried to add 100 cells to an existing layer by popping 2 modules off (https://discuss.pytorch.org/t/finetuning-the-convnet-question-can-i-change-the-cells-in-a-fc-layer/7096/4). See his write-up:

# Replace vgg16's classifier with this new classifier
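The behavior questioned above follows from how the tutorial fine-tunes the network: the original 1000-way fully connected head is replaced by a new 2-way head, so the ImageNet classes are discarded rather than extended. A minimal sketch of that mechanic, using a tiny hypothetical backbone in place of torchvision's resnet18 (the replacement step is the same):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a 1000-way ImageNet head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 1000)             # original 1000-class head
model = nn.Sequential(backbone, head)

# The tutorial-style fine-tuning step: the old head is *replaced*, so the
# 1000 original classes are gone; only the 2 new classes remain.
num_features = model[1].in_features
model[1] = nn.Linear(num_features, 2)  # new head: 0 = ant, 1 = bee

out = model(torch.randn(1, 4, 4))
print(out.shape)  # torch.Size([1, 2])
```

This is why the served model can only ever answer "ant" or "bee": the output layer has exactly two logits after replacement, regardless of what the backbone learned on ImageNet.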
@prashantsail I tried this example, which trains multiple layers and uses transfer learning (krazygaurav/krazygaurav.github.io#1); however, there were some coding errors (index out of range). I also found the Stanford coursework you are referencing, but no solutions are published for the course examples.
@prashantsail It is really a shame, but the transfer learning appears not to work as I expected, because the pre-trained model was already trained with these patterns. If you simply follow the torchserve example (https://github.com/pytorch/serve) using the densenet161 model, you will see it detects the ant and the bee, and in addition it detects the tiger cat kitten, the whale, and the elephant. I am a bit disappointed, because the transfer learning was supposed to add 2 patterns for the ant and bee to ResNet-18, not delete the patterns of the 1000 classes; in fact, the densenet161 pre-trained model is superior.

curl http://127.0.0.1:8080/predictions/densenet161 -T kitten.jpg
Since you are now able to host both ResNet and DenseNet models using torchserve, we will be closing this issue.
Your issue may already be reported!
Please search on the issue tracker before creating one.
PyTorch transfer learning and model load failure.
I tried out the transfer learning as shown here (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) and in this example (https://github.com/pytorch/tutorials/blob/master/beginner_source/transfer_learning_tutorial.py), then added a model save to the end of the Python script (https://www.dropbox.com/s/jru4p9hbbazm7zn/transfer_learning_tutorial.py?dl=0).
I exported the model and converted it to .MAR format (https://www.dropbox.com/s/m29a1h1y0u6haa8/model_save.tar?dl=0)
using this archiver command:
torch-model-archiver --model-name transferlearningmodel --version 1.0 --model-file ~/torchserve/serve/examples/image_classifier/resnet_18/model.py --serialized-file ~/pytorch/tutorials/beginner_source/model_save/transferlearningmodel.pth --export-path model_save --extra-files ~/torchserve/serve/examples/image_classifier/index_to_name.json --handler image_classifier
Once the archiver conversion to .MAR was complete, I started torchserve:
(base) MacBook-Pro:~/pytorch/tutorials/beginner_source quantum-fusion$ torchserve --start --ncs --model-store ~/pytorch/tutorials/beginner_source/model_save --models transferlearningmodel.mar
The model did not load successfully; see the errors:
Load model failed: transferlearningmodel, error: Worker died.
2020-08-16 16:07:01,994 [DEBUG] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: transferlearningmodel, error: Worker died.
2020-08-16 16:07:01,994 [DEBUG] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - W-9013-transferlearningmodel_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9013-transferlearningmodel_1.0-stdout
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9013-transferlearningmodel_1.0-stderr
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9013-transferlearningmodel_1.0-stderr
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9013-transferlearningmodel_1.0-stdout
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9013 in 55 seconds.
2020-08-16 16:07:02,003 [INFO ] KQueueEventLoopGroup-4-29 org.pytorch.serve.wlm.WorkerThread - 9004 Worker disconnected. WORKER_STARTED
2020-08-16 16:07:02,003 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-08-16 16:07:02,003 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: transferlearningmodel, error: Worker died.
2020-08-16 16:07:02,004 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - W-9004-transferlearningmodel_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-transferlearningmodel_1.0-stdout
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-transferlearningmodel_1.0-stderr
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-transferlearningmodel_1.0-stderr
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-transferlearningmodel_1.0-stdout
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9004 in 55 seconds.
(base) MacBook-Pro:~/pytorch/tutorials/beginner_source quantum-fusion$ torchserve --stop
TorchServe has stopped.
2020-08-16 16:07:22,985 [INFO ] KQueueEventLoopGroup-2-2 org.pytorch.serve.ModelServer - Management model server stopped.
2020-08-16 16:07:22,985 [INFO ] KQueueEventLoopGroup-2-1 org.pytorch.serve.ModelServer - Inference model server stopped.
2020-08-16 16:07:25,194 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.