Program crash when running sparse model predict #8500
How large is your weight matrix? This is one of the mshadow-related legacy issues mentioned in #7319 (comment): https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270 It's on our roadmap, but we haven't had time to fix it recently.
I have no idea how to calculate the weight matrix size. Is it batch_size * num_features?
If you reduce batch_size to 1000 or 100, is the error still there? I suspect there's an integer overflow problem.
Yes, there is no error when the batch size is reduced to 200. Maybe there is a bug in the code.
The code only breaks when it calls this function: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270
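For reference, the two numbers in the error reported at the bottom of this thread are consistent with a 32-bit overflow. A minimal arithmetic sketch (plain Python, not MXNet code):

```python
# The failing TBlob.get_with_shape check compares element counts; if one side is
# computed with a 32-bit index type, a large batch wraps around 2**32.
num_elements = 7728668000            # first number in the error (new shape size)
wrapped = num_elements % (1 << 32)   # value after wrapping a 32-bit integer
print(wrapped)                       # 3433700704 -- the second number in the error
```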
OK, |
A batch size of 200 seems very small. You can increase the batch size to reduce the training time per epoch (you have to increase the learning rate accordingly). Module.load just loads the model symbol and model parameters. You can train with a large batch and predict with a small batch.
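A minimal sketch of that pattern; the checkpoint prefix 'lr_model', the epoch number, num_features, and eval_iter are placeholders, not names from the example:

```python
import mxnet as mx

# Load the saved symbol + parameters once, then bind with a smaller batch for prediction.
mod = mx.mod.Module.load('lr_model', 1,
                         data_names=['data'], label_names=['softmax_label'])
mod.bind(data_shapes=[('data', (200, num_features))], for_training=False)
predictions = mod.predict(eval_iter)   # eval_iter: a DataIter yielding batches of 200
```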
Is the learning rate related to the batch size? In the code, the gradient is divided by the batch size.
In fact, even when the batch size is increased to 200000, it still takes a long time to finish one epoch.
Usually you want to increase the learning rate if the batch size is increased. What's the density of your model and how many classes are there? If you have a GPU, you can change the ctx to mx.gpu(0) and run the same code without kvstore. sparse.dot is implemented on GPU.
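A sketch of what that change could look like in the training script; the symbol (model), iterators, and hyperparameters below are assumptions borrowed from the linear classification example, not its exact code:

```python
import mxnet as mx

# Same Module setup, but on GPU and without passing a kvstore for distributed training.
ctx = mx.gpu(0)                                   # instead of mx.cpu()
mod = mx.mod.Module(symbol=model, data_names=['data'],
                    label_names=['softmax_label'], context=ctx)
mod.fit(train_iter, eval_data=eval_iter,
        optimizer='sgd', optimizer_params={'learning_rate': 0.1},
        num_epoch=num_epoch)                      # no kvstore argument; default local kvstore
```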
It costs 1868.217s per epoch on one GPU, which seems very slow.
What was your legacy code based on? Is it the MXNet dense NDArray implementation?
The legacy code is implemented with MPI for LR and used in my product, not MXNet.
/search/ted/mxnet_new/example/../../python/mxnet/metric.py:1191: UserWarning: New metric mxnet.metric.AUC registered with name auc is overriding ex
and line_model.py is as below:
def linear_model(num_features, positive_cls_weight):
I just set 2 environment variables in MXNet.
linear_classification.py is also changed to call model.fit directly.
|
|
The CPU _backward_dot operator was improved by at least 3x in #8611.
The current LR example is broken due to the custom op. This is fixed in #8721.
1. After updating to the new code, the training speed is still slow.
You can either inject the profiler code into the training loop (sketched below),
or set the profiler environment variables. You can post the profiler output here.
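A sketch of what such injected profiler code could look like, using the mx.profiler calls available in MXNet around that time; the file name and placement are assumptions:

```python
import mxnet as mx

# Configure the profiler to record all events into profile_output.json,
# then turn it on around the batches you want to measure.
mx.profiler.profiler_set_config(mode='all', filename='profile_output.json')
mx.profiler.profiler_set_state('run')    # start collecting events

# ... run a few training batches here ...

mx.profiler.profiler_set_state('stop')   # flush events to profile_output.json
```

Alternatively, the profiler can presumably be enabled without code changes via the MXNET_PROFILER_AUTOSTART and MXNET_PROFILER_MODE environment variables, which would match the two environment variables mentioned above.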
I think the performance of
Only touch profile_output.json first, and when you run python linear_classification.py, the logs will be written to it.
I see a lot of gaps between batches. It could be thread scheduling or IO overhead. |
profile_output.txt |
There's a 2-second gap between batches. What metric are you using? Did you verify how much time computing the metric takes? What data iterator are you using? |
I used the LibSVM data iterator. I set batch_end_callback=None and re-ran the new code. It costs 1303.680s per epoch, while the older code cost 2038.416s, so it is much faster.
For this model you don't need to change the default value. The time spent on forward and backward is only a small fraction of the total time. I don't see the profiler block for the updater. What optimizer are you using? Is it a custom optimizer?
I think IO (LibSVMIter) is probably the bottleneck.
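One rough way to check that: time the iterator on its own, with no forward/backward at all. In this sketch the file path, feature count, and batch size are placeholders:

```python
import time
import mxnet as mx

# Iterate over the LibSVM file without doing any computation to see how much
# time the IO alone takes per epoch.
data_iter = mx.io.LibSVMIter(data_libsvm='./train.libsvm',
                             data_shape=(num_features,),
                             batch_size=2000)
start = time.time()
num_batches = 0
for batch in data_iter:
    num_batches += 1
print('iterated %d batches in %.3fs' % (num_batches, time.time() - start))
```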
profile (2).txt |
If I skip forward/backward/update/update_metric, the time is reduced from 1303.680s to 276.323s. Getting the batch data still costs a lot of time.
@liumilan I don't know why your profiler doesn't show the SGD optimizer at all. That's very strange. How did you compile MXNet? Did you set
@liumilan Just to follow up on this thread, are you still facing the issue?
@sandeep-krishnamurthy @nswamy @anirudh2290 Could you please close this issue due to lack of activity? @liumilan Please feel free to re-open if it was closed in error.
@mxnet-label-bot add [suggest-closed] |
I just ran linear_classification.py and saved the model. Then I need to predict data based on this model.
The prediction code is as below:
if __name__ == '__main__':
    args = parse_args()
The shell command is as below:
python predict.py --test-dir="./test41.xls" --iteration=1 --batch-size=2000
and then it reports:
include/mxnet/././tensor_blob.h:275: Check failed: this->shape_.Size() == shape.Size() (7728668000 vs. 3433700704) TBlob.get_with_shape: new and old shape do not match total elements
How can I fix it?
@eric-haibin-lin