Does MXNET support training on a multi CPU cluster? #5094
Comments
Actually, MXNet supports training on a multi CPU cluster (we tried it two months ago on MNIST). To my knowledge, MXNet will use CPUs if no GPU is set. You can find this code in [...]. I think you encountered this bug because your MXNet is compiled with [...]. By the way, your [...]
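For readers of this thread, here is a minimal sketch (not the poster's script; the network, data, and hyperparameters are made up for illustration) of CPU-only training with the Module API. With context=mx.cpu() no GPU is probed, and under a distributed launcher the same code works with a distributed kvstore:

```python
import numpy as np
import mxnet as mx

# Illustrative only: with context=mx.cpu() no GPU is required or probed.
# Under a launcher, swapping kvstore='local' for 'dist_sync' runs the same
# CPU-only code across workers (requires a build with USE_DIST_KVSTORE=1).
x = np.random.rand(100, 20).astype('float32')
y = np.random.randint(0, 10, size=(100,)).astype('float32')
train_iter = mx.io.NDArrayIter(x, y, batch_size=10)

data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data, num_hidden=10)
net = mx.sym.SoftmaxOutput(fc, name='softmax')

mod = mx.mod.Module(symbol=net, context=mx.cpu())
mod.fit(train_iter, num_epoch=1, kvstore='local')
```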
@qiyuangong Thanks for the reply. I have actually been trying out multiple AWS AMIs (on Amazon Linux), both community AMIs and ones that I made myself after installing MXNET from the latest GitHub source, using the installation steps provided on the website. The one I posted above is from the deeplearning.template that is linked from the mxnet repository.
When I install MXNET from scratch on an AWS CPU instance (t2.micro/t2.medium) with Amazon Linux, it does not even start training at all (nothing is printed) when the following command is launched: [...]
Also, while installing, I made sure that USE_CUDA=0 and USE_DIST_KVSTORE=1 (used for distributed training). In fact, I used the script at $MXNET_HOME/setup-utils/install-mxnet-amz-linux.sh. Following are the options populated into my config.mk by my installation script: [...]
When you said that you tried out distributed training on a multi CPU cluster using MXNET, can you please share more details like --
Since I have been using the standard installation script for AWS Linux, I am not sure what could be going wrong. With the training not even starting and no log being printed, I am not sure how to debug this. Please help.
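For context, the two build flags mentioned above live in config.mk. A hedged sketch of a CPU-only distributed build (the poster's actual config.mk contents are not preserved in this thread) might look like:

```makefile
# config.mk excerpt (illustrative, not the poster's actual file)
USE_CUDA = 0           # no CUDA/GPU support; t2.* instances have no GPU
USE_CUDNN = 0
USE_DIST_KVSTORE = 1   # build the dist_sync / dist_async kvstore support
USE_BLAS = openblas    # CPU BLAS backend
```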
Please use the latest mxnet master + #5094 (comment)
Maybe it gets stuck when downloading the data file (22 MB). This URL is always unreachable in my env. I suggest downloading the data files and loading them locally in [...]. Here are the details about our env: [...]
In this case, MXNet will sync the train_mnist dir (with the data files) to all remote nodes (/tmp/mxnet) and then launch the training command. Your command is also valid, but you need to install MXNet on each node and make sure your network is fine (for both the client and the nodes).
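Two hedged sketches of the suggestions above (file paths, host names, and hyperparameters are placeholders, not taken from this thread). First, loading the MNIST files from local disk instead of downloading them inside the training script:

```python
import mxnet as mx

# Assumes the four MNIST idx files were downloaded manually into ./data
train_iter = mx.io.MNISTIter(
    image='data/train-images-idx3-ubyte',
    label='data/train-labels-idx1-ubyte',
    batch_size=128,
    shuffle=True)
val_iter = mx.io.MNISTIter(
    image='data/t10k-images-idx3-ubyte',
    label='data/t10k-labels-idx1-ubyte',
    batch_size=128)
```

Second, roughly how the example directory can be synced to /tmp/mxnet on each node and the training launched over ssh with tools/launch.py (a hosts file listing the worker IPs is assumed):

```
cd example/image-classification
python ../../tools/launch.py -n 2 --launcher ssh -H hosts \
    --sync-dst-dir /tmp/mxnet \
    python train_mnist.py --network lenet --kv-store dist_sync
```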
@qiyuangong Thanks for providing me with your configuration details and steps. Installation of MXNET on Ubuntu 16.04 gives an internal g++ compiler error. However, with Ubuntu 14.04 on AWS instances, I am able to install and run MXNET in a multi CPU cluster environment.
Issue #185 has the same question, but it looks like it was not resolved. The link pointed to in that issue is broken:
https://mxnet.readthedocs.org/en/latest/distributed_training.html
However, I see a similar article at:
http://newdocs.readthedocs.io/en/latest/distributed_training.html
Following this, I created an AWS EC2 cluster of t2.micro instances using the deeplearning.template (modified to create a cluster of t2.micro instances).
However, I see the following issue when trying to run distributed training:
Steps to Reproduce:
It seems that MXNET is looking for a GPU whenever distributed training is attempted.
Can anyone please confirm whether distributed training on multiple CPUs is supported at all using MXNET?
Thanks !!
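As a general sanity check (not the poster's reproduction steps, which are not preserved in this thread), the distributed kvstore itself is device-agnostic: creating it does not require a GPU, only a build with USE_DIST_KVSTORE=1 and the environment set up by a launcher such as tools/launch.py.

```python
import mxnet as mx

# Run under tools/launch.py (which sets the DMLC_* environment variables),
# not standalone. Creating a dist_sync kvstore does not touch any GPU.
kv = mx.kvstore.create('dist_sync')
print(kv.type, kv.rank, kv.num_workers)
```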