Training does not end. #26
Comments
Do you have a GPU in your machine? Right now you are running on the CPU, and that is definitely the reason why 'nothing' is happening.
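One way to confirm whether the installed MXNet build can actually use a GPU is to try a tiny allocation on it. The following is a minimal sketch, not part of the repository: the helper name `mxnet_gpu_usable` is hypothetical, and it assumes that a build without CUDA support raises `MXNetError` when asked for a GPU context.

```python
def mxnet_gpu_usable(device_id=0):
    """Return True if MXNet can allocate a small array on the given GPU.

    Hypothetical helper for illustration; not part of stn-ocr or MXNet.
    """
    try:
        import mxnet as mx
    except ImportError:
        # MXNet is not installed in this environment.
        return False
    try:
        # Assumption: this fails with MXNetError if MXNet was built
        # without USE_CUDA or if no GPU is visible to the process.
        mx.nd.zeros((1,), ctx=mx.gpu(device_id)).asnumpy()
    except mx.base.MXNetError:
        return False
    return True


print("GPU usable by MXNet:", mxnet_gpu_usable())
```

If this reports `False` on a machine that has a GPU, the MXNet build itself is the likely culprit rather than the training script.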
Yes sir, absolutely right. I have a GPU but had not enabled the "USE_CUDA" flag in config.mk of "incubator-mxnet". I am recompiling the mxnet repo with "USE_CUDA = 1". It threw an error: 92 errors detected in the compilation of "/tmp/tmpxft_00002feb_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii". Then I enabled "USE_CUDNN=1" in make/config.mk and got an error: 92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii". Then I enabled "USE_NCCL = 1" with the path "USE_NCCL_PATH = /usr/local/cuda/lib64" and got an error: 92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Actually, the errors begin with:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9220): error: argument of type "const void *" is incompatible with parameter of type "const float *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9292): error: argument of type "const void *" is incompatible with parameter of type "const double *"
The rest of the errors are similar. Let me know what other details you need.
I cannot help you with those compile errors 😅, but which version of MXNet are you trying to compile?
I am compiling the one I downloaded here:-
Please check the README of his repo again! It says that you should use version
Yes, fine. Thanks for the reply. Now I get: "Makefile:27: mshadow/make/mshadow.mk: No such file or directory". I re-cloned it using "git clone --recursive" and I still get the same error.
And if I don't check out v0.9.3 and just run make, I get the compilation errors stated above. Please help. Thanks in advance.
You'll need to also check out the submodules 😉
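After switching to a different tag, the submodule directories may no longer match what that tag expects; the usual git command to repopulate them is `git submodule update --init --recursive`. As a rough sketch, a checkout can be inspected for the file named in the Makefile error above. The helper name `missing_submodule_files` and the `REQUIRED_FILES` list are assumptions made for illustration only.

```python
from pathlib import Path

# Path taken from the Makefile error quoted in this thread.
# Hypothetical list; other submodule files could be added to it.
REQUIRED_FILES = ("mshadow/make/mshadow.mk",)


def missing_submodule_files(repo_root, required=REQUIRED_FILES):
    """Return the required files that are absent below repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the root of the MXNet checkout.
    missing = missing_submodule_files(".")
    if missing:
        print("Missing files, submodules are probably not checked out:")
        for rel in missing:
            print("  " + rel)
        print("Try: git submodule update --init --recursive")
    else:
        print("Required submodule files are present.")
```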
I have issued the command for training (svhn) as per the instructions. It does not progress at all.
##########################################################################
Command : python train_svhn.py /home/aditya/stn-ocr/generated/centered/train.csv /home/aditya/stn-ocr/generated/centered/valid.csv --log-dir /home/aditya/stn-ocr -b 400 --lr 1e-5
/home/aditya/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
loading data
2018-10-29 13:53:20,201 Node[0] start with arguments Namespace(batch_size=400, blank_label=0, char_map=None, checkpoint_interval=None, eval_image=None, fix_loc=False, gif=False, gpus=None, ip=None, kv_store='local', load_epoch=None, log_dir='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training', log_file='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training/log', log_level='INFO', log_name='training', lr=1e-05, lr_factor=1, lr_factor_epoch=1, model_prefix=None, num_epochs=10, plot_network_graph=False, port=1337, progressbar=False, save_model_prefix=None, send_bboxes=False, train_file='/home/aditya/stn-ocr/generated/centered/train.csv', val_file='/home/aditya/stn-ocr/generated/centered/valid.csv', video=False, zoom=0.9)
2018-10-29 13:53:20,202 Node[0] EPOCH SIZE: 250
2018-10-29 13:53:20,226 Node[0] Start training with [cpu(0)]
############################################################################
It stops right there. No progress.
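The last log line above already shows which devices the run uses: "Start training with [cpu(0)]". A small sketch for checking a log for this, assuming the line always has the form shown in this issue; the function names are hypothetical and not part of the training script.

```python
import re


def devices_from_log(log_text):
    """Return the device names from the 'Start training with [...]' line."""
    match = re.search(r"Start training with \[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [d.strip() for d in match.group(1).split(",") if d.strip()]


def runs_on_cpu_only(log_text):
    """True if a device line was found and every listed device is a CPU."""
    devices = devices_from_log(log_text)
    return bool(devices) and all(d.startswith("cpu") for d in devices)


sample = "2018-10-29 13:53:20,226 Node[0] Start training with [cpu(0)]"
print(devices_from_log(sample))
print(runs_on_cpu_only(sample))
```

A CPU-only result here is consistent with the `gpus=None` entry in the logged arguments.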