This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

resnet validation bug?! when using identical validation/training, train acc reaches 1 and val 0.75 #6863

Closed
Pagey opened this issue Jun 28, 2017 · 8 comments

Comments

@Pagey

Pagey commented Jun 28, 2017

Very weird behavior in ResNet validation (when using NDArrayIter):
I was checking my validation error on an image recognition task and tried plugging in the training set as the validation set.
I expected similar training and validation accuracies, but the training accuracy reached 1 while the validation accuracy never got past ~0.75.
This behavior did NOT occur with AlexNet as the network: both accuracies reached 1, as expected.

Does anyone have an idea what could be causing this?

Environment info

Operating System: Windows 10

Compiler: vc14

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.10.1

Or if installed from source: yajiedesign/mxnet mxnet-20170620

MXNet commit hash (git rev-parse HEAD):

If you are using python package, please provide

Python version and distribution: 3.5.3

If you are using R package, please provide

R sessionInfo():

Error Message:


This is the training and validation accuracy log when using AlexNet (validation and training sets are identical):
"C:\Program Files\Python35\python.exe" C:/Users/Admin/Documents/MY/tryinpython230417/mxnet-master/example/image-classification/mvc_ff_module_orig_resenettest.py
started 06/29/17 01:15:09
mean train before: 0.00200461403317
mean val before: 0.00200431063389
(2048, 1, 128, 128)
(2048,)
(2048, 1, 128, 128)
(2048,)
[01:15:17] d:\program files (x86)\jenkins\workspace\mxnet\mxnet\src\operator./cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Train-accuracy=0.509277
INFO:root:Epoch[0] Time cost=2.138
INFO:root:Epoch[0] Validation-accuracy=0.517578
INFO:root:Epoch[1] Train-accuracy=0.517578
INFO:root:Epoch[1] Time cost=0.796
INFO:root:Epoch[1] Validation-accuracy=0.517578
INFO:root:Epoch[2] Train-accuracy=0.517578
INFO:root:Epoch[2] Time cost=0.772
INFO:root:Epoch[2] Validation-accuracy=0.517578
INFO:root:Epoch[3] Train-accuracy=0.517578
INFO:root:Epoch[3] Time cost=0.799
INFO:root:Epoch[3] Validation-accuracy=0.517578
INFO:root:Epoch[4] Train-accuracy=0.517578
INFO:root:Epoch[4] Time cost=0.797
INFO:root:Epoch[4] Validation-accuracy=0.517578
INFO:root:Epoch[5] Train-accuracy=0.517578
INFO:root:Epoch[5] Time cost=0.785
INFO:root:Epoch[5] Validation-accuracy=0.517578
INFO:root:Epoch[6] Train-accuracy=0.517578
INFO:root:Epoch[6] Time cost=0.784
INFO:root:Epoch[6] Validation-accuracy=0.517578
INFO:root:Epoch[7] Train-accuracy=0.517578
INFO:root:Epoch[7] Time cost=0.785
INFO:root:Epoch[7] Validation-accuracy=0.517578
INFO:root:Epoch[8] Train-accuracy=0.517578
INFO:root:Epoch[8] Time cost=0.782
INFO:root:Epoch[8] Validation-accuracy=0.517578
INFO:root:Epoch[9] Train-accuracy=0.517578
INFO:root:Epoch[9] Time cost=0.768
INFO:root:Epoch[9] Validation-accuracy=0.517578
INFO:root:Epoch[10] Train-accuracy=0.517578
INFO:root:Epoch[10] Time cost=0.770
INFO:root:Epoch[10] Validation-accuracy=0.517578
INFO:root:Epoch[11] Train-accuracy=0.517578
INFO:root:Epoch[11] Time cost=0.782
INFO:root:Epoch[11] Validation-accuracy=0.517578
INFO:root:Epoch[12] Train-accuracy=0.517578
INFO:root:Epoch[12] Time cost=0.786
INFO:root:Epoch[12] Validation-accuracy=0.517578
INFO:root:Epoch[13] Train-accuracy=0.524414
INFO:root:Epoch[13] Time cost=0.800
INFO:root:Epoch[13] Validation-accuracy=0.517578
INFO:root:Epoch[14] Train-accuracy=0.550293
INFO:root:Epoch[14] Time cost=0.770
INFO:root:Epoch[14] Validation-accuracy=0.517578
INFO:root:Epoch[15] Train-accuracy=0.553711
INFO:root:Epoch[15] Time cost=0.796
INFO:root:Epoch[15] Validation-accuracy=0.594727
INFO:root:Epoch[16] Train-accuracy=0.570801
INFO:root:Epoch[16] Time cost=0.797
INFO:root:Epoch[16] Validation-accuracy=0.622559
INFO:root:Epoch[17] Train-accuracy=0.568848
INFO:root:Epoch[17] Time cost=0.799
INFO:root:Epoch[17] Validation-accuracy=0.624512
INFO:root:Epoch[18] Train-accuracy=0.584473
INFO:root:Epoch[18] Time cost=0.770
INFO:root:Epoch[18] Validation-accuracy=0.518555
INFO:root:Epoch[19] Train-accuracy=0.579102
INFO:root:Epoch[19] Time cost=0.813
INFO:root:Epoch[19] Validation-accuracy=0.542969
INFO:root:Epoch[20] Train-accuracy=0.596680
INFO:root:Epoch[20] Time cost=0.813
INFO:root:Epoch[20] Validation-accuracy=0.611328
INFO:root:Epoch[21] Train-accuracy=0.637207
INFO:root:Epoch[21] Time cost=0.797
INFO:root:Epoch[21] Validation-accuracy=0.694824
INFO:root:Epoch[22] Train-accuracy=0.676758
INFO:root:Epoch[22] Time cost=0.828
INFO:root:Epoch[22] Validation-accuracy=0.607910
INFO:root:Epoch[23] Train-accuracy=0.667480
INFO:root:Epoch[23] Time cost=0.797
INFO:root:Epoch[23] Validation-accuracy=0.643066
INFO:root:Epoch[24] Train-accuracy=0.684570
INFO:root:Epoch[24] Time cost=0.813
INFO:root:Epoch[24] Validation-accuracy=0.690918
INFO:root:Epoch[25] Train-accuracy=0.700195
INFO:root:Epoch[25] Time cost=0.797
INFO:root:Epoch[25] Validation-accuracy=0.653809
INFO:root:Epoch[26] Train-accuracy=0.691406
INFO:root:Epoch[26] Time cost=0.813
INFO:root:Epoch[26] Validation-accuracy=0.683105
INFO:root:Epoch[27] Train-accuracy=0.702148
INFO:root:Epoch[27] Time cost=0.813
INFO:root:Epoch[27] Validation-accuracy=0.692871
INFO:root:Epoch[28] Train-accuracy=0.711914
INFO:root:Epoch[28] Time cost=0.813
INFO:root:Epoch[28] Validation-accuracy=0.689941
INFO:root:Epoch[29] Train-accuracy=0.716797
INFO:root:Epoch[29] Time cost=0.781
INFO:root:Epoch[29] Validation-accuracy=0.695312
INFO:root:Epoch[30] Train-accuracy=0.717773
INFO:root:Epoch[30] Time cost=0.797
INFO:root:Epoch[30] Validation-accuracy=0.705566
INFO:root:Epoch[31] Train-accuracy=0.729492
INFO:root:Epoch[31] Time cost=0.800
INFO:root:Epoch[31] Validation-accuracy=0.705566
INFO:root:Epoch[32] Train-accuracy=0.727051
INFO:root:Epoch[32] Time cost=0.775
INFO:root:Epoch[32] Validation-accuracy=0.709473
INFO:root:Epoch[33] Train-accuracy=0.727539
INFO:root:Epoch[33] Time cost=0.814
INFO:root:Epoch[33] Validation-accuracy=0.717773
INFO:root:Epoch[34] Train-accuracy=0.733398
INFO:root:Epoch[34] Time cost=0.841
INFO:root:Epoch[34] Validation-accuracy=0.722168
INFO:root:Epoch[35] Train-accuracy=0.733887
INFO:root:Epoch[35] Time cost=0.791
INFO:root:Epoch[35] Validation-accuracy=0.726074
INFO:root:Epoch[36] Train-accuracy=0.733887
INFO:root:Epoch[36] Time cost=0.777
INFO:root:Epoch[36] Validation-accuracy=0.731445
INFO:root:Epoch[37] Train-accuracy=0.735352
INFO:root:Epoch[37] Time cost=0.811
INFO:root:Epoch[37] Validation-accuracy=0.731445
INFO:root:Epoch[38] Train-accuracy=0.735840
INFO:root:Epoch[38] Time cost=0.789
INFO:root:Epoch[38] Validation-accuracy=0.733887
INFO:root:Epoch[39] Train-accuracy=0.734375
INFO:root:Epoch[39] Time cost=0.792
INFO:root:Epoch[39] Validation-accuracy=0.739258
INFO:root:Epoch[40] Train-accuracy=0.744629
INFO:root:Epoch[40] Time cost=0.791
INFO:root:Epoch[40] Validation-accuracy=0.744629
INFO:root:Epoch[41] Train-accuracy=0.747070
INFO:root:Epoch[41] Time cost=0.815
INFO:root:Epoch[41] Validation-accuracy=0.745117
INFO:root:Epoch[42] Train-accuracy=0.746094
INFO:root:Epoch[42] Time cost=0.838
INFO:root:Epoch[42] Validation-accuracy=0.746582
INFO:root:Epoch[43] Train-accuracy=0.746094
INFO:root:Epoch[43] Time cost=0.776
INFO:root:Epoch[43] Validation-accuracy=0.748535
INFO:root:Epoch[44] Train-accuracy=0.749512
INFO:root:Epoch[44] Time cost=0.818
INFO:root:Epoch[44] Validation-accuracy=0.750488
INFO:root:Epoch[45] Train-accuracy=0.750488
INFO:root:Epoch[45] Time cost=0.817
INFO:root:Epoch[45] Validation-accuracy=0.750977
INFO:root:Epoch[46] Train-accuracy=0.747559
INFO:root:Epoch[46] Time cost=0.823
INFO:root:Epoch[46] Validation-accuracy=0.753906
INFO:root:Epoch[47] Train-accuracy=0.754883
INFO:root:Epoch[47] Time cost=0.806
INFO:root:Epoch[47] Validation-accuracy=0.762695
INFO:root:Epoch[48] Train-accuracy=0.757324
INFO:root:Epoch[48] Time cost=0.834
INFO:root:Epoch[48] Validation-accuracy=0.753418
INFO:root:Epoch[49] Train-accuracy=0.758301
INFO:root:Epoch[49] Time cost=0.806
INFO:root:Epoch[49] Validation-accuracy=0.763184
INFO:root:Epoch[50] Train-accuracy=0.762695
INFO:root:Epoch[50] Time cost=0.772
INFO:root:Epoch[50] Validation-accuracy=0.771973
INFO:root:Epoch[51] Train-accuracy=0.772461
INFO:root:Epoch[51] Time cost=0.795
INFO:root:Epoch[51] Validation-accuracy=0.760742
INFO:root:Epoch[52] Train-accuracy=0.766113
INFO:root:Epoch[52] Time cost=0.800
INFO:root:Epoch[52] Validation-accuracy=0.772461
INFO:root:Epoch[53] Train-accuracy=0.773438
INFO:root:Epoch[53] Time cost=0.798
INFO:root:Epoch[53] Validation-accuracy=0.783691
INFO:root:Epoch[54] Train-accuracy=0.777832
INFO:root:Epoch[54] Time cost=0.798
INFO:root:Epoch[54] Validation-accuracy=0.780762
INFO:root:Epoch[55] Train-accuracy=0.786133
INFO:root:Epoch[55] Time cost=0.834
INFO:root:Epoch[55] Validation-accuracy=0.781738
INFO:root:Epoch[56] Train-accuracy=0.782715
INFO:root:Epoch[56] Time cost=0.809
INFO:root:Epoch[56] Validation-accuracy=0.785645
INFO:root:Epoch[57] Train-accuracy=0.783691
INFO:root:Epoch[57] Time cost=0.824
INFO:root:Epoch[57] Validation-accuracy=0.790039
INFO:root:Epoch[58] Train-accuracy=0.791016
INFO:root:Epoch[58] Time cost=0.809
INFO:root:Epoch[58] Validation-accuracy=0.794922
INFO:root:Epoch[59] Train-accuracy=0.786133
INFO:root:Epoch[59] Time cost=0.826
INFO:root:Epoch[59] Validation-accuracy=0.797852
INFO:root:Epoch[60] Train-accuracy=0.795898
INFO:root:Epoch[60] Time cost=0.804
INFO:root:Epoch[60] Validation-accuracy=0.801270
INFO:root:Epoch[61] Train-accuracy=0.803711
INFO:root:Epoch[61] Time cost=0.785
INFO:root:Epoch[61] Validation-accuracy=0.806641
INFO:root:Epoch[62] Train-accuracy=0.809570
INFO:root:Epoch[62] Time cost=0.834
INFO:root:Epoch[62] Validation-accuracy=0.808594
INFO:root:Epoch[63] Train-accuracy=0.811035
INFO:root:Epoch[63] Time cost=0.803
INFO:root:Epoch[63] Validation-accuracy=0.812012
INFO:root:Epoch[64] Train-accuracy=0.814453
INFO:root:Epoch[64] Time cost=0.807
INFO:root:Epoch[64] Validation-accuracy=0.819824
INFO:root:Epoch[65] Train-accuracy=0.821777
INFO:root:Epoch[65] Time cost=0.784
INFO:root:Epoch[65] Validation-accuracy=0.823242
INFO:root:Epoch[66] Train-accuracy=0.829102
INFO:root:Epoch[66] Time cost=0.800
INFO:root:Epoch[66] Validation-accuracy=0.830078
INFO:root:Epoch[67] Train-accuracy=0.833984
INFO:root:Epoch[67] Time cost=0.811
INFO:root:Epoch[67] Validation-accuracy=0.834961
INFO:root:Epoch[68] Train-accuracy=0.839355
INFO:root:Epoch[68] Time cost=0.777
INFO:root:Epoch[68] Validation-accuracy=0.844238
INFO:root:Epoch[69] Train-accuracy=0.854492
INFO:root:Epoch[69] Time cost=0.818
INFO:root:Epoch[69] Validation-accuracy=0.860352
INFO:root:Epoch[70] Train-accuracy=0.863770
INFO:root:Epoch[70] Time cost=0.797
INFO:root:Epoch[70] Validation-accuracy=0.875488
INFO:root:Epoch[71] Train-accuracy=0.881836
INFO:root:Epoch[71] Time cost=0.807
INFO:root:Epoch[71] Validation-accuracy=0.880371
INFO:root:Epoch[72] Train-accuracy=0.889160
INFO:root:Epoch[72] Time cost=0.768
INFO:root:Epoch[72] Validation-accuracy=0.890137
INFO:root:Epoch[73] Train-accuracy=0.901855
INFO:root:Epoch[73] Time cost=0.793
INFO:root:Epoch[73] Validation-accuracy=0.895996
INFO:root:Epoch[74] Train-accuracy=0.906738
INFO:root:Epoch[74] Time cost=0.813
INFO:root:Epoch[74] Validation-accuracy=0.905273
INFO:root:Epoch[75] Train-accuracy=0.917480
INFO:root:Epoch[75] Time cost=0.800
INFO:root:Epoch[75] Validation-accuracy=0.918945
INFO:root:Epoch[76] Train-accuracy=0.921875
INFO:root:Epoch[76] Time cost=0.800
INFO:root:Epoch[76] Validation-accuracy=0.928711
INFO:root:Epoch[77] Train-accuracy=0.932129
INFO:root:Epoch[77] Time cost=0.799
INFO:root:Epoch[77] Validation-accuracy=0.935547
INFO:root:Epoch[78] Train-accuracy=0.939453
INFO:root:Epoch[78] Time cost=0.800
INFO:root:Epoch[78] Validation-accuracy=0.944336
INFO:root:Epoch[79] Train-accuracy=0.946289
INFO:root:Epoch[79] Time cost=0.787
INFO:root:Epoch[79] Validation-accuracy=0.943848
INFO:root:Epoch[80] Train-accuracy=0.950684
INFO:root:Epoch[80] Time cost=0.808
INFO:root:Epoch[80] Validation-accuracy=0.946777
INFO:root:Epoch[81] Train-accuracy=0.954590
INFO:root:Epoch[81] Time cost=0.792
INFO:root:Epoch[81] Validation-accuracy=0.966797
INFO:root:Epoch[82] Train-accuracy=0.962891
INFO:root:Epoch[82] Time cost=0.788
INFO:root:Epoch[82] Validation-accuracy=0.970703
INFO:root:Epoch[83] Train-accuracy=0.965332
INFO:root:Epoch[83] Time cost=0.769
INFO:root:Epoch[83] Validation-accuracy=0.957031
INFO:root:Epoch[84] Train-accuracy=0.966309
INFO:root:Epoch[84] Time cost=0.816
INFO:root:Epoch[84] Validation-accuracy=0.949707
INFO:root:Epoch[85] Train-accuracy=0.971680
INFO:root:Epoch[85] Time cost=0.799
INFO:root:Epoch[85] Validation-accuracy=0.949219
INFO:root:Epoch[86] Train-accuracy=0.983398
INFO:root:Epoch[86] Time cost=0.805
INFO:root:Epoch[86] Validation-accuracy=0.979980
INFO:root:Epoch[87] Train-accuracy=0.989746
INFO:root:Epoch[87] Time cost=0.800
INFO:root:Epoch[87] Validation-accuracy=0.984863
INFO:root:Epoch[88] Train-accuracy=0.990234
INFO:root:Epoch[88] Time cost=0.815
INFO:root:Epoch[88] Validation-accuracy=0.987305
INFO:root:Epoch[89] Train-accuracy=0.991211
INFO:root:Epoch[89] Time cost=0.816
INFO:root:Epoch[89] Validation-accuracy=0.985840
INFO:root:Epoch[90] Train-accuracy=0.994141
INFO:root:Epoch[90] Time cost=0.813
INFO:root:Epoch[90] Validation-accuracy=0.990723
INFO:root:Epoch[91] Train-accuracy=0.993652
INFO:root:Epoch[91] Time cost=0.813
INFO:root:Epoch[91] Validation-accuracy=0.992676
INFO:root:Epoch[92] Train-accuracy=0.996094
INFO:root:Epoch[92] Time cost=0.828
INFO:root:Epoch[92] Validation-accuracy=0.992676
INFO:root:Epoch[93] Train-accuracy=0.995605
INFO:root:Epoch[93] Time cost=0.813
INFO:root:Epoch[93] Validation-accuracy=0.992188
INFO:root:Epoch[94] Train-accuracy=0.996094
INFO:root:Epoch[94] Time cost=0.797
INFO:root:Epoch[94] Validation-accuracy=0.992188
INFO:root:Epoch[95] Train-accuracy=0.996582
INFO:root:Epoch[95] Time cost=0.813
INFO:root:Epoch[95] Validation-accuracy=0.993164
INFO:root:Epoch[96] Train-accuracy=0.996094
INFO:root:Epoch[96] Time cost=0.813
INFO:root:Epoch[96] Validation-accuracy=0.992676
INFO:root:Epoch[97] Train-accuracy=0.998047
INFO:root:Epoch[97] Time cost=0.813
INFO:root:Epoch[97] Validation-accuracy=0.991699
INFO:root:Epoch[98] Train-accuracy=0.994141
INFO:root:Epoch[98] Time cost=0.816
INFO:root:Epoch[98] Validation-accuracy=0.991211
INFO:root:Epoch[99] Train-accuracy=0.985840
INFO:root:Epoch[99] Time cost=0.813

This is the training and validation accuracy log when using ResNet (validation and training sets are identical):

INFO:root:Epoch[0] Train-accuracy=0.570312
INFO:root:Epoch[0] Time cost=13.724
INFO:root:Epoch[0] Validation-accuracy=0.509766
INFO:root:Epoch[1] Train-accuracy=0.996582
INFO:root:Epoch[1] Time cost=10.845
INFO:root:Epoch[1] Validation-accuracy=0.509766
INFO:root:Epoch[2] Train-accuracy=1.000000
INFO:root:Epoch[2] Time cost=10.954
INFO:root:Epoch[2] Validation-accuracy=0.509766
INFO:root:Epoch[3] Train-accuracy=1.000000
INFO:root:Epoch[3] Time cost=10.994
INFO:root:Epoch[3] Validation-accuracy=0.529297
INFO:root:Epoch[4] Train-accuracy=1.000000
INFO:root:Epoch[4] Time cost=11.024
INFO:root:Epoch[4] Validation-accuracy=0.753906
INFO:root:Epoch[5] Train-accuracy=1.000000
INFO:root:Epoch[5] Time cost=11.079
INFO:root:Epoch[5] Validation-accuracy=0.765137
INFO:root:Epoch[6] Train-accuracy=1.000000
INFO:root:Epoch[6] Time cost=11.111
INFO:root:Epoch[6] Validation-accuracy=0.763184
INFO:root:Epoch[7] Train-accuracy=1.000000
INFO:root:Epoch[7] Time cost=11.180
INFO:root:Epoch[7] Validation-accuracy=0.765137
INFO:root:Epoch[8] Train-accuracy=1.000000
INFO:root:Epoch[8] Time cost=11.126
INFO:root:Epoch[8] Validation-accuracy=0.762695
INFO:root:Epoch[9] Train-accuracy=1.000000
INFO:root:Epoch[9] Time cost=11.174
INFO:root:Epoch[9] Validation-accuracy=0.761719
INFO:root:Epoch[10] Train-accuracy=1.000000
INFO:root:Epoch[10] Time cost=11.189
INFO:root:Epoch[10] Validation-accuracy=0.759766
INFO:root:Epoch[11] Train-accuracy=1.000000
INFO:root:Epoch[11] Time cost=11.230
INFO:root:Epoch[11] Validation-accuracy=0.756836
INFO:root:Epoch[12] Train-accuracy=1.000000
INFO:root:Epoch[12] Time cost=11.190
INFO:root:Epoch[12] Validation-accuracy=0.755371
INFO:root:Epoch[13] Train-accuracy=1.000000
INFO:root:Epoch[13] Time cost=11.176
INFO:root:Epoch[13] Validation-accuracy=0.763184
INFO:root:Epoch[14] Train-accuracy=1.000000
INFO:root:Epoch[14] Time cost=11.241
INFO:root:Epoch[14] Validation-accuracy=0.758301
INFO:root:Epoch[15] Train-accuracy=1.000000
INFO:root:Epoch[15] Time cost=11.275
INFO:root:Epoch[15] Validation-accuracy=0.761230
INFO:root:Epoch[16] Train-accuracy=1.000000
INFO:root:Epoch[16] Time cost=11.236
INFO:root:Epoch[16] Validation-accuracy=0.758301

Minimum reproducible example


`import mxnet as mx
import numpy as np
import matplotlib.pyplot as plt
from symbols import resnet_fp16 as nettt
import sys
import time

print('started ' + time.strftime('%x %X'))
startedd = time.time()

target_dim = 128  # 256

batcher = 64

num_images = 2*512  # *20 # was *20

epox = 500  # 00; # was 10

# create images
# rand (0) rand with circles(1)

xcoords = np.ones((target_dim, 1))*np.arange(target_dim)
ycoords = np.transpose(xcoords)

def add_random_circle(da_img, xcoords, ycoords):
    # draw a one-pixel-wide circle with random center and radius into da_img
    tolerance = 1.0

    center_x = int(np.random.rand() * target_dim)
    center_y = int(np.random.rand() * target_dim)
    radius = np.random.rand() * target_dim / 2.0

    distances = np.power(np.power((xcoords-center_x), 2)+np.power((ycoords-center_y), 2), 0.5)

    da_img[np.where(np.abs(distances-radius) < tolerance)] = 1

    return da_img

nn_train = int(num_images*2)
nn_val = int(num_images/2)

#image_list = np.random.rand(nn_train+nn_val,target_dim,target_dim)
llabel = np.round(np.random.rand(nn_train+nn_val))

train_indeces = np.arange(nn_train)
val_indeces = nn_train + np.arange(nn_val)

train_lbl = llabel[train_indeces]
val_lbl = llabel[val_indeces]

train_img = np.zeros((nn_train, 1, target_dim, target_dim))
for indexer in range(nn_train):
    img_now = np.random.rand(target_dim, target_dim)  # image_list[train_indeces[indexer]]
    img_now = add_random_circle(img_now, xcoords, ycoords)

    if train_lbl[indexer] > 0:
        img_now = add_random_circle(img_now, xcoords, ycoords)

    img_now = add_random_circle(img_now, xcoords, ycoords)

    train_img[indexer, 0, :, :] = img_now.astype(np.float32)/255

nn_val = len(val_indeces)
val_img = np.zeros((nn_val, 1, target_dim, target_dim))

for indexer in range(nn_val):
    img_now = np.random.rand(target_dim, target_dim)  # image_list[val_indeces[indexer]]
    img_now = add_random_circle(img_now, xcoords, ycoords)

    if val_lbl[indexer] > 0:
        img_now = add_random_circle(img_now, xcoords, ycoords)

    val_img[indexer, 0, :, :] = img_now.astype(np.float32)/255

# normalizing by removing mean of training set!

meaner = np.mean(train_img)

print('mean train before: '+str(np.mean(train_img)))
print('mean val before: '+str(np.mean(val_img)))

train_img -= meaner
val_img -= meaner

# test!!!! remove this

val_img = train_img
val_lbl = train_lbl

train = mx.io.NDArrayIter(
    train_img, train_lbl, batcher, shuffle=True)
val = mx.io.NDArrayIter(
    val_img, val_lbl, batcher, shuffle=True)

print(train_img.shape)
print(train_lbl.shape)
print(val_img.shape)
print(val_lbl.shape)

da_net = nettt.get_symbol(num_classes=2, num_layers=152,
                          image_shape='1,'+str(target_dim)+','+str(target_dim))

devices = mx.gpu()

import logging

logging.getLogger().setLevel(logging.DEBUG)

batch_end_callbacks = [mx.callback.log_train_metric(5*2)]  # 40)]

model = mx.mod.Module(context=devices, symbol=da_net)

model.fit(train_data=train, eval_data=val, eval_metric='acc', num_epoch=epox,
          optimizer='adam', optimizer_params={'learning_rate': 1e-5, 'wd': 1e-5},
          initializer=mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2))

finishedd = time.time()
print('finished ' + time.strftime('%x %X'))
print('sex:' + str(finishedd - startedd))`

@piiswrong
Contributor

Could be due to dropout or batchnorm.

@kevinthesun
Contributor

kevinthesun commented Jun 28, 2017

Have you tried to train 100 epochs for resnet?

@Pagey
Author

Pagey commented Jun 29, 2017

Here are 60 epochs with batch = 100 (I thought that's what you meant ;) ). I don't think it's going anywhere; this time it's stuck at about 0.82. Can one of you try to reproduce the effect with my code?

"C:\Program Files\Python35\python.exe" C:/Users/Admin/Documents/MY/tryinpython230417/mxnet-master/example/image-classification/mvc_ff_module_orig_resenettest.py
started 06/29/17 12:44:58
mean train before: 0.00200464849855
mean val before: 0.00200578962234
(2048, 1, 128, 128)
(2048,)
(2048, 1, 128, 128)
(2048,)
[12:45:06] d:\program files (x86)\jenkins\workspace\mxnet\mxnet\src\operator./cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Train-accuracy=0.584762
INFO:root:Epoch[0] Time cost=12.633
INFO:root:Epoch[0] Validation-accuracy=0.511429
INFO:root:Epoch[1] Train-accuracy=0.920476
INFO:root:Epoch[1] Time cost=9.791
INFO:root:Epoch[1] Validation-accuracy=0.511429
INFO:root:Epoch[2] Train-accuracy=0.999048
INFO:root:Epoch[2] Time cost=9.829
INFO:root:Epoch[2] Validation-accuracy=0.511429
INFO:root:Epoch[3] Train-accuracy=1.000000
INFO:root:Epoch[3] Time cost=9.878
INFO:root:Epoch[3] Validation-accuracy=0.511429
INFO:root:Epoch[4] Train-accuracy=1.000000
INFO:root:Epoch[4] Time cost=9.954
INFO:root:Epoch[4] Validation-accuracy=0.511429
INFO:root:Epoch[5] Train-accuracy=1.000000
INFO:root:Epoch[5] Time cost=10.017
INFO:root:Epoch[5] Validation-accuracy=0.533333
INFO:root:Epoch[6] Train-accuracy=1.000000
INFO:root:Epoch[6] Time cost=9.985
INFO:root:Epoch[6] Validation-accuracy=0.761905
INFO:root:Epoch[7] Train-accuracy=1.000000
INFO:root:Epoch[7] Time cost=10.105
INFO:root:Epoch[7] Validation-accuracy=0.817143
INFO:root:Epoch[8] Train-accuracy=1.000000
INFO:root:Epoch[8] Time cost=10.097
INFO:root:Epoch[8] Validation-accuracy=0.815714
INFO:root:Epoch[9] Train-accuracy=1.000000
INFO:root:Epoch[9] Time cost=10.059
INFO:root:Epoch[9] Validation-accuracy=0.819048
INFO:root:Epoch[10] Train-accuracy=1.000000
INFO:root:Epoch[10] Time cost=10.251
INFO:root:Epoch[10] Validation-accuracy=0.822381
INFO:root:Epoch[11] Train-accuracy=1.000000
INFO:root:Epoch[11] Time cost=10.209
INFO:root:Epoch[11] Validation-accuracy=0.817143
INFO:root:Epoch[12] Train-accuracy=1.000000
INFO:root:Epoch[12] Time cost=10.064
INFO:root:Epoch[12] Validation-accuracy=0.823333
INFO:root:Epoch[13] Train-accuracy=1.000000
INFO:root:Epoch[13] Time cost=10.173
INFO:root:Epoch[13] Validation-accuracy=0.820476
INFO:root:Epoch[14] Train-accuracy=1.000000
INFO:root:Epoch[14] Time cost=10.271
INFO:root:Epoch[14] Validation-accuracy=0.820476
INFO:root:Epoch[15] Train-accuracy=1.000000
INFO:root:Epoch[15] Time cost=10.146
INFO:root:Epoch[15] Validation-accuracy=0.820000
INFO:root:Epoch[16] Train-accuracy=1.000000
INFO:root:Epoch[16] Time cost=10.173
INFO:root:Epoch[16] Validation-accuracy=0.826190
INFO:root:Epoch[17] Train-accuracy=1.000000
INFO:root:Epoch[17] Time cost=10.220
INFO:root:Epoch[17] Validation-accuracy=0.816667
INFO:root:Epoch[18] Train-accuracy=1.000000
INFO:root:Epoch[18] Time cost=10.125
INFO:root:Epoch[18] Validation-accuracy=0.819524
INFO:root:Epoch[19] Train-accuracy=1.000000
INFO:root:Epoch[19] Time cost=10.158
INFO:root:Epoch[19] Validation-accuracy=0.819048
INFO:root:Epoch[20] Train-accuracy=1.000000
INFO:root:Epoch[20] Time cost=10.327
INFO:root:Epoch[20] Validation-accuracy=0.820000
INFO:root:Epoch[21] Train-accuracy=1.000000
INFO:root:Epoch[21] Time cost=10.113
INFO:root:Epoch[21] Validation-accuracy=0.816667
INFO:root:Epoch[22] Train-accuracy=1.000000
INFO:root:Epoch[22] Time cost=10.034
INFO:root:Epoch[22] Validation-accuracy=0.823333
INFO:root:Epoch[23] Train-accuracy=1.000000
INFO:root:Epoch[23] Time cost=10.032
INFO:root:Epoch[23] Validation-accuracy=0.818571
INFO:root:Epoch[24] Train-accuracy=1.000000
INFO:root:Epoch[24] Time cost=10.126
INFO:root:Epoch[24] Validation-accuracy=0.821905
INFO:root:Epoch[25] Train-accuracy=1.000000
INFO:root:Epoch[25] Time cost=10.064
INFO:root:Epoch[25] Validation-accuracy=0.815238
INFO:root:Epoch[26] Train-accuracy=1.000000
INFO:root:Epoch[26] Time cost=10.092
INFO:root:Epoch[26] Validation-accuracy=0.814762
INFO:root:Epoch[27] Train-accuracy=1.000000
INFO:root:Epoch[27] Time cost=9.985
INFO:root:Epoch[27] Validation-accuracy=0.823333
INFO:root:Epoch[28] Train-accuracy=1.000000
INFO:root:Epoch[28] Time cost=10.001
INFO:root:Epoch[28] Validation-accuracy=0.818571
INFO:root:Epoch[29] Train-accuracy=1.000000
INFO:root:Epoch[29] Time cost=9.892
INFO:root:Epoch[29] Validation-accuracy=0.820000
INFO:root:Epoch[30] Train-accuracy=1.000000
INFO:root:Epoch[30] Time cost=9.977
INFO:root:Epoch[30] Validation-accuracy=0.820952
INFO:root:Epoch[31] Train-accuracy=1.000000
INFO:root:Epoch[31] Time cost=9.954
INFO:root:Epoch[31] Validation-accuracy=0.820476
INFO:root:Epoch[32] Train-accuracy=1.000000
INFO:root:Epoch[32] Time cost=9.954
INFO:root:Epoch[32] Validation-accuracy=0.816190
INFO:root:Epoch[33] Train-accuracy=1.000000
INFO:root:Epoch[33] Time cost=10.052
INFO:root:Epoch[33] Validation-accuracy=0.820476
INFO:root:Epoch[34] Train-accuracy=1.000000
INFO:root:Epoch[34] Time cost=10.021
INFO:root:Epoch[34] Validation-accuracy=0.820476
INFO:root:Epoch[35] Train-accuracy=1.000000
INFO:root:Epoch[35] Time cost=9.950
INFO:root:Epoch[35] Validation-accuracy=0.823810
INFO:root:Epoch[36] Train-accuracy=1.000000
INFO:root:Epoch[36] Time cost=9.939
INFO:root:Epoch[36] Validation-accuracy=0.822381
INFO:root:Epoch[37] Train-accuracy=1.000000
INFO:root:Epoch[37] Time cost=10.035
INFO:root:Epoch[37] Validation-accuracy=0.821429
INFO:root:Epoch[38] Train-accuracy=1.000000
INFO:root:Epoch[38] Time cost=9.935
INFO:root:Epoch[38] Validation-accuracy=0.819048
INFO:root:Epoch[39] Train-accuracy=1.000000
INFO:root:Epoch[39] Time cost=9.954
INFO:root:Epoch[39] Validation-accuracy=0.820476
INFO:root:Epoch[40] Train-accuracy=1.000000
INFO:root:Epoch[40] Time cost=10.126
INFO:root:Epoch[40] Validation-accuracy=0.822381
INFO:root:Epoch[41] Train-accuracy=1.000000
INFO:root:Epoch[41] Time cost=10.001
INFO:root:Epoch[41] Validation-accuracy=0.818571
INFO:root:Epoch[42] Train-accuracy=1.000000
INFO:root:Epoch[42] Time cost=10.252
INFO:root:Epoch[42] Validation-accuracy=0.820476
INFO:root:Epoch[43] Train-accuracy=1.000000
INFO:root:Epoch[43] Time cost=10.017
INFO:root:Epoch[43] Validation-accuracy=0.820952
INFO:root:Epoch[44] Train-accuracy=1.000000
INFO:root:Epoch[44] Time cost=10.112
INFO:root:Epoch[44] Validation-accuracy=0.820952
INFO:root:Epoch[45] Train-accuracy=1.000000
INFO:root:Epoch[45] Time cost=10.079
INFO:root:Epoch[45] Validation-accuracy=0.819524
INFO:root:Epoch[46] Train-accuracy=1.000000
INFO:root:Epoch[46] Time cost=9.990
INFO:root:Epoch[46] Validation-accuracy=0.817143
INFO:root:Epoch[47] Train-accuracy=1.000000
INFO:root:Epoch[47] Time cost=10.110
INFO:root:Epoch[47] Validation-accuracy=0.822381
INFO:root:Epoch[48] Train-accuracy=1.000000
INFO:root:Epoch[48] Time cost=10.115
INFO:root:Epoch[48] Validation-accuracy=0.820000
INFO:root:Epoch[49] Train-accuracy=1.000000
INFO:root:Epoch[49] Time cost=10.148
INFO:root:Epoch[49] Validation-accuracy=0.821429
INFO:root:Epoch[50] Train-accuracy=1.000000
INFO:root:Epoch[50] Time cost=10.000
INFO:root:Epoch[50] Validation-accuracy=0.820476
INFO:root:Epoch[51] Train-accuracy=1.000000
INFO:root:Epoch[51] Time cost=10.081
INFO:root:Epoch[51] Validation-accuracy=0.821429
INFO:root:Epoch[52] Train-accuracy=1.000000
INFO:root:Epoch[52] Time cost=10.106
INFO:root:Epoch[52] Validation-accuracy=0.822857
INFO:root:Epoch[53] Train-accuracy=1.000000
INFO:root:Epoch[53] Time cost=10.084
INFO:root:Epoch[53] Validation-accuracy=0.819048
INFO:root:Epoch[54] Train-accuracy=1.000000
INFO:root:Epoch[54] Time cost=9.950
INFO:root:Epoch[54] Validation-accuracy=0.817143
INFO:root:Epoch[55] Train-accuracy=1.000000
INFO:root:Epoch[55] Time cost=10.025
INFO:root:Epoch[55] Validation-accuracy=0.816667
INFO:root:Epoch[56] Train-accuracy=1.000000
INFO:root:Epoch[56] Time cost=9.874
INFO:root:Epoch[56] Validation-accuracy=0.822857
INFO:root:Epoch[57] Train-accuracy=1.000000
INFO:root:Epoch[57] Time cost=10.067
INFO:root:Epoch[57] Validation-accuracy=0.820952
INFO:root:Epoch[58] Train-accuracy=1.000000
INFO:root:Epoch[58] Time cost=9.929
INFO:root:Epoch[58] Validation-accuracy=0.820476
INFO:root:Epoch[59] Train-accuracy=1.000000
INFO:root:Epoch[59] Time cost=10.036
INFO:root:Epoch[59] Validation-accuracy=0.821905
INFO:root:Epoch[60] Train-accuracy=1.000000
INFO:root:Epoch[60] Time cost=9.990
INFO:root:Epoch[60] Validation-accuracy=0.817619
INFO:root:Epoch[61] Train-accuracy=1.000000
INFO:root:Epoch[61] Time cost=10.170
INFO:root:Epoch[61] Validation-accuracy=0.819048

@Pagey
Author

Pagey commented Jun 30, 2017

@kevinthesun it turns out that after 300 or 500 epochs (note that I'm using a pretty small training set, about 2000 images) the validation score suddenly burps and then jumps to almost 1??!
What's going on? As I said, the validation set is identical to the training set, and the training accuracy reaches 1 within a few epochs, but the validation accuracy was stuck around 0.75 or 0.8 until, as I said, 300 or 500 epochs.
See the attached graph of training and validation accuracy per epoch. Because the two values alternate row by row, the plot looks like a solid band between 1 and 0.75 until row 937 (which represents almost 470 epochs, since I have 2 rows per epoch in this Excel sheet); then the validation accuracy dips for a few epochs and jumps to almost 1 as well.

Why doesn't the validation accuracy match the training accuracy for so long?? (The two sets are identical!)
image
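Editorial sketch: the train/validation gap described above can also be read straight from the `fit()` log instead of a spreadsheet. The function names `parse_accuracy_log` and `accuracy_gaps` are hypothetical helpers made up for this illustration; the sample lines are copied from the ResNet log earlier in this thread.

```python
import re

# Matches lines such as "INFO:root:Epoch[3] Validation-accuracy=0.529297".
LINE_RE = re.compile(r"Epoch\[(\d+)\] (Train|Validation)-accuracy=([0-9.]+)")

def parse_accuracy_log(text):
    """Return {epoch: {"Train": acc, "Validation": acc}} from log text."""
    epochs = {}
    for match in LINE_RE.finditer(text):
        epoch = int(match.group(1))
        kind = match.group(2)
        epochs.setdefault(epoch, {})[kind] = float(match.group(3))
    return epochs

def accuracy_gaps(epochs):
    """Train minus validation accuracy for every epoch that logged both."""
    return {
        epoch: vals["Train"] - vals["Validation"]
        for epoch, vals in sorted(epochs.items())
        if "Train" in vals and "Validation" in vals
    }

sample_log = """\
INFO:root:Epoch[3] Train-accuracy=1.000000
INFO:root:Epoch[3] Time cost=10.994
INFO:root:Epoch[3] Validation-accuracy=0.529297
INFO:root:Epoch[4] Train-accuracy=1.000000
INFO:root:Epoch[4] Time cost=11.024
INFO:root:Epoch[4] Validation-accuracy=0.753906
"""

gaps = accuracy_gaps(parse_accuracy_log(sample_log))
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: train - val = {gap:.6f}")
```

Epochs that logged only one of the two metrics are skipped, so a truncated log does not raise.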

@kevinthesun
Contributor

BatchNorm behaves differently in training and testing.
@cjolivier01 Chris, I believe someone asked a similar question before. Can you give some explanation?
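Editorial sketch: a toy NumPy illustration of that train/test difference. This is not MXNet's implementation and does not reproduce the issue. The function name `batchnorm_forward`, the momentum convention, and the data are assumptions made for the example. It only shows that the same input can be normalized differently in the two modes until the running averages have caught up with the batch statistics.

```python
import numpy as np

def batchnorm_forward(x, running_mean, running_var, training,
                      momentum=0.9, eps=1e-5):
    """Normalize x of shape (batch, features).

    training=True:  use this batch's statistics and update the running averages.
    training=False: use the stored running averages and leave them unchanged.
    """
    if training:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        new_mean = momentum * running_mean + (1 - momentum) * mean
        new_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var
        new_mean, new_var = running_mean, running_var
    y = (x - mean) / np.sqrt(var + eps)
    return y, new_mean, new_var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # activations far from N(0, 1)

running_mean = np.zeros(4)  # typical initial values for the running statistics
running_var = np.ones(4)

# One training step, then evaluate the SAME data in inference mode.
y_train, running_mean, running_var = batchnorm_forward(
    x, running_mean, running_var, training=True)
y_eval, _, _ = batchnorm_forward(x, running_mean, running_var, training=False)
train_eval_gap = float(np.abs(y_train - y_eval).max())
print("max |train - eval| after 1 update:", train_eval_gap)

# After many updates on unchanged activations the running averages converge.
for _ in range(200):
    _, running_mean, running_var = batchnorm_forward(
        x, running_mean, running_var, training=True)
y_eval_late, _, _ = batchnorm_forward(x, running_mean, running_var, training=False)
late_gap = float(np.abs(y_train - y_eval_late).max())
print("max |train - eval| after 201 updates:", late_gap)
```

In a real network the weights keep changing during training, so the activation statistics drift as well; whether that accounts for the plateau reported in this thread is not established here.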

@Pagey
Author

Pagey commented Jun 30, 2017

But could this difference explain such a discrepancy when the validation set is identical to the training set??

Interestingly, the effect seems exaggerated with the Adam optimizer and does not seem to appear with, e.g., AdaDelta (same code, with adadelta and lr 1e-7).

(With Adam it happens consistently with the current parameters.)

this is with adadelta:

"C:\Program Files\Python35\python.exe" C:/Users/Admin/Documents/MY/tryinpython230417/mxnet-master/example/image-classification/mvc_ff_module_orig_resenettest.py
started 07/01/17 01:09:31
mean train before: 0.00200443849818
mean val before: 0.00200668286666
(2048, 1, 128, 128)
(2048,)
(2048, 1, 128, 128)
(2048,)
[01:09:39] d:\program files (x86)\jenkins\workspace\mxnet\mxnet\src\operator./cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Train-accuracy=0.692383
INFO:root:Epoch[0] Time cost=21.791
INFO:root:Epoch[0] Validation-accuracy=0.522949
INFO:root:Epoch[1] Train-accuracy=0.764160
INFO:root:Epoch[1] Time cost=18.924
INFO:root:Epoch[1] Validation-accuracy=0.522949
INFO:root:Epoch[2] Train-accuracy=0.795410
INFO:root:Epoch[2] Time cost=18.850
INFO:root:Epoch[2] Validation-accuracy=0.522949
INFO:root:Epoch[3] Train-accuracy=0.811523
INFO:root:Epoch[3] Time cost=18.837
INFO:root:Epoch[3] Validation-accuracy=0.693848
INFO:root:Epoch[4] Train-accuracy=0.881348
INFO:root:Epoch[4] Time cost=18.799
INFO:root:Epoch[4] Validation-accuracy=0.739746
INFO:root:Epoch[5] Train-accuracy=0.909668
INFO:root:Epoch[5] Time cost=18.823
INFO:root:Epoch[5] Validation-accuracy=0.651855
INFO:root:Epoch[6] Train-accuracy=0.945801
INFO:root:Epoch[6] Time cost=18.908
INFO:root:Epoch[6] Validation-accuracy=0.513184
INFO:root:Epoch[7] Train-accuracy=0.964844
INFO:root:Epoch[7] Time cost=18.853
INFO:root:Epoch[7] Validation-accuracy=0.926270
INFO:root:Epoch[8] Train-accuracy=0.976562
INFO:root:Epoch[8] Time cost=18.902
INFO:root:Epoch[8] Validation-accuracy=0.976562
INFO:root:Epoch[9] Train-accuracy=0.997559
INFO:root:Epoch[9] Time cost=18.828
INFO:root:Epoch[9] Validation-accuracy=0.981445
INFO:root:Epoch[10] Train-accuracy=0.998535
INFO:root:Epoch[10] Time cost=18.883
INFO:root:Epoch[10] Validation-accuracy=0.999512
INFO:root:Epoch[11] Train-accuracy=1.000000
INFO:root:Epoch[11] Time cost=18.786
INFO:root:Epoch[11] Validation-accuracy=1.000000
INFO:root:Epoch[12] Train-accuracy=1.000000
INFO:root:Epoch[12] Time cost=18.866
INFO:root:Epoch[12] Validation-accuracy=1.000000

@Pagey
Author

Pagey commented Jul 9, 2017

Was anybody able to reproduce this? (i.e. stuck validation acc. on validation set identical to training set?)
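Editorial sketch for anyone trying to reproduce this: one way to test the BatchNorm hypothesis is to score the same iterator twice, once forwarding in training mode and once in inference mode. The helper `accuracy_in_mode` is hypothetical. It assumes a Module-like object exposing `forward(batch, is_train=...)` and `get_outputs()`, and an iterator yielding batches with `label` and `pad`, as in the script above. Forwarding with `is_train=True` may also update the BatchNorm moving statistics, so treat this as a diagnostic to run on a saved checkpoint rather than mid-training.

```python
import numpy as np

def accuracy_in_mode(module, data_iter, is_train):
    """Accuracy of `module` over `data_iter`, forwarding with the given is_train flag."""
    data_iter.reset()
    correct = 0
    total = 0
    for batch in data_iter:
        module.forward(batch, is_train=is_train)
        probs = module.get_outputs()[0].asnumpy()
        labels = batch.label[0].asnumpy()
        # Ignore samples that only pad the last batch up to the batch size.
        keep = probs.shape[0] - (batch.pad or 0)
        predicted = probs.argmax(axis=1)[:keep]
        correct += int((predicted == labels[:keep].astype(np.int64)).sum())
        total += keep
    return correct / total

# Intended use with the objects from the script above (needs MXNet and a fitted model):
# acc_train_mode = accuracy_in_mode(model, val, is_train=True)
# acc_eval_mode = accuracy_in_mode(model, val, is_train=False)
```

If the two numbers differ substantially on identical data, the gap comes from layers that switch behavior between modes rather than from the data pipeline.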

@szha
Member

szha commented Oct 8, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

@szha szha closed this as completed Oct 8, 2017