Training does not end. #26

Open
avasisht-celadon opened this issue Oct 29, 2018 · 9 comments

@avasisht-celadon

I ran the SVHN training command as per the instructions, but training does not progress at all.
##########################################################################
Command : python train_svhn.py /home/aditya/stn-ocr/generated/centered/train.csv /home/aditya/stn-ocr/generated/centered/valid.csv --log-dir /home/aditya/stn-ocr -b 400 --lr 1e-5

/home/aditya/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
loading data
2018-10-29 13:53:20,201 Node[0] start with arguments Namespace(batch_size=400, blank_label=0, char_map=None, checkpoint_interval=None, eval_image=None, fix_loc=False, gif=False, gpus=None, ip=None, kv_store='local', load_epoch=None, log_dir='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training', log_file='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training/log', log_level='INFO', log_name='training', lr=1e-05, lr_factor=1, lr_factor_epoch=1, model_prefix=None, num_epochs=10, plot_network_graph=False, port=1337, progressbar=False, save_model_prefix=None, send_bboxes=False, train_file='/home/aditya/stn-ocr/generated/centered/train.csv', val_file='/home/aditya/stn-ocr/generated/centered/valid.csv', video=False, zoom=0.9)
2018-10-29 13:53:20,202 Node[0] EPOCH SIZE: 250
2018-10-29 13:53:20,226 Node[0] Start training with [cpu(0)]

############################################################################

It stops right there. No progress.

@Bartzi
Owner

Bartzi commented Oct 29, 2018

Do you have a GPU in your machine? Right now you are running on CPU... that is definitely the reason why 'nothing' is happening...
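
The log above shows `gpus=None` in the arguments and ends with `Start training with [cpu(0)]`, so the script fell back to the CPU. The fallback can be sketched roughly as follows. This is only an illustration: `parse_devices` is a hypothetical helper, not the code in `train_svhn.py`, and it returns plain strings instead of real MXNet context objects so that it runs without MXNet installed.

```python
def parse_devices(gpus):
    """Map a comma-separated GPU id string (e.g. "0,1") to device names.

    Hypothetical helper: falls back to a single CPU device when `gpus`
    is None or empty, mirroring the `gpus=None` -> `[cpu(0)]` log above.
    """
    if not gpus:
        return ["cpu(0)"]
    devices = ["gpu({})".format(int(i)) for i in gpus.split(",") if i.strip()]
    # An input such as " " yields no ids, so fall back to the CPU as well.
    return devices or ["cpu(0)"]


if __name__ == "__main__":
    print(parse_devices(None))  # CPU fallback
    print(parse_devices("0"))   # single GPU
```

Assuming the script exposes the `gpus` argument as a `--gpus` option (the Namespace suggests it, but `python train_svhn.py --help` is the place to confirm), passing a GPU id there should move training off the CPU once MXNet has been built with CUDA support.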

@avasisht-celadon
Author

Yes sir, absolutely right. I have a GPU but had not enabled the "USE_CUDA" flag in config.mk of "incubator-mxnet". I am recompiling the mxnet repo with "USE_CUDA = 1".

It threw an error:

92 errors detected in the compilation of "/tmp/tmpxft_00002feb_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1

Next I enabled "USE_CUDNN=1" in make/config.mk.

I get an error:

92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1

Then I enabled "USE_NCCL = 1" and set the path "USE_NCCL_PATH = /usr/local/cuda/lib64".

I get the same error:

92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1

@avasisht-celadon
Author

Actually, the errors begin with:

/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9220): error: argument of type "const void *" is incompatible with parameter of type "const float *"

/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9292): error: argument of type "const void *" is incompatible with parameter of type "const double *"

The rest of the errors are similar.

Let me know what other details you need.

@Bartzi
Owner

Bartzi commented Oct 29, 2018

I cannot help you with those compile errors 😅, but which version of MXNet are you trying to compile?

@avasisht-celadon
Author

I am compiling the one I downloaded from here:
https://github.com/apache/incubator-mxnet
It is version 1.3, apparently.

@Bartzi
Owner

Bartzi commented Oct 29, 2018

Please check the README of this repo again! It says that you should use version 0.9.3 of MXNet, because it is not guaranteed to work with newer versions of MXNet...

@avasisht-celadon
Author

Yes, fine. Thanks for the reply.
I checked out v0.9.3, but when I run "make" now, I get this error:

Makefile:27: mshadow/make/mshadow.mk: No such file or directory
Makefile:28: /home/aditya/stn-ocr/incubator-mxnet/dmlc-core/make/dmlc.mk: No such file or directory
Makefile:126: /home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk: No such file or directory
make: *** No rule to make target '/home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk'. Stop.

I then re-cloned it using:

git clone --recursive

I still get the same error.

@avasisht-celadon
Author

And if I don't check out v0.9.3 and just run make,

I get the compilation errors stated above.

Please help. Thanks in advance

@Bartzi
Owner

Bartzi commented Oct 30, 2018

You'll also need to check out the submodules 😉
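
The three files that make reported missing (`mshadow/make/mshadow.mk`, `dmlc-core/make/dmlc.mk`, `ps-lite/make/ps.mk`) live in git submodules. Checking out a tag does not by itself bring the submodules to the commits recorded for that tag. Below is a minimal sketch, assuming it is run from the root of the incubator-mxnet checkout, that tests for those files and prints the git commands usually used to sync submodules. The helper names are hypothetical, and the script only prints the commands rather than running them.

```python
import os

# Makefile includes reported missing in the error output above.
REQUIRED_MAKE_FILES = [
    os.path.join("mshadow", "make", "mshadow.mk"),
    os.path.join("dmlc-core", "make", "dmlc.mk"),
    os.path.join("ps-lite", "make", "ps.mk"),
]


def missing_make_files(repo_root):
    """Return the required submodule files that do not exist under repo_root."""
    return [
        rel for rel in REQUIRED_MAKE_FILES
        if not os.path.isfile(os.path.join(repo_root, rel))
    ]


def submodule_fix_commands(tag="v0.9.3"):
    """Return git commands (as argv lists) that check out a tag and then
    update the submodules to the commits recorded for that tag."""
    return [
        ["git", "checkout", tag],
        ["git", "submodule", "update", "--init", "--recursive"],
    ]


if __name__ == "__main__":
    missing = missing_make_files(".")  # "." = the incubator-mxnet checkout
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try:")
        for cmd in submodule_fix_commands():
            print("  " + " ".join(cmd))
    else:
        print("All submodule make files are present.")
```

If the files are still reported missing after running the printed commands, the submodule directories may contain leftovers from the newer checkout. That part is an assumption to verify locally, for example with `git status`.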
