This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

program crash when run sparse model predict #8500

Closed
liumilan opened this issue Nov 1, 2017 · 36 comments

@liumilan

liumilan commented Nov 1, 2017

I just ran linear_classification.py and saved the model. Then I need to predict data based on this model.
The prediction code is as below:
if __name__ == '__main__':
    args = parse_args()

    prefix = './dnn'

    devs = mx.cpu() if args.gpus is None else [
        mx.gpu(int(i)) for i in args.gpus.split(',')]

    model_load = mx.model.FeedForward.load(prefix, args.iteration, ctx=devs)

    test = mx.io.LibSVMIter(data_libsvm=args.test_dir, data_shape=(3864334,),
                            batch_size=args.batch_size, num_parts=1,
                            part_index=0,
                            num_epoch=1)

    [prob, data, label] = model_load.predict(test, return_data=True)
    data_len = len(prob)

    for i in range(0, data_len):
        print '%s\t%d\t%.6f' % ('1', label[i], prob[i])

The shell command is as below:
python predict.py --test-dir="./test41.xls" --iteration=1 --batch-size=2000

and then it reports:

include/mxnet/././tensor_blob.h:275: Check failed: this->shape_.Size() == shape.Size() (7728668000 vs. 3433700704) TBlob.get_with_shape: new and old shape do not match total elements

How can I fix it?

@eric-haibin-lin
Member

eric-haibin-lin commented Nov 2, 2017

How large is your weight matrix? This is one of the mshadow-related legacy issues mentioned in #7319 (comment).
The problem is that a 32-bit int is not sufficient to hold the number of elements in a large matrix.

https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270

It's on our roadmap, but we don't have time to fix it right now.

@liumilan
Author

liumilan commented Nov 2, 2017

I have no idea how to calculate the weight matrix size. Is it batch_size * num_features?
batch_size is 2000 and num_features is 3864334.
And how is the 3433700704 in the error log calculated?
In fact, during training the batch size and num_features are set the same as for prediction, and training runs fine.

@eric-haibin-lin
Member

If you reduce batch_size to 1000 or 100, is the error still there?

I suspect there's an integer overflow problem.
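For illustration, a quick check of the numbers in the error message (my own arithmetic, not from the thread) is consistent with a 32-bit overflow:

    # Sketch: check that the two sizes in the error message differ by exactly
    # one 32-bit wrap-around (values are taken from the error above).
    batch_size = 2000
    num_features = 3864334

    true_size = batch_size * num_features   # 7728668000, the "new" shape size
    wrapped = true_size % (2 ** 32)          # size after 32-bit truncation

    print(true_size)   # 7728668000
    print(wrapped)     # 3433700704, the "old" shape size in the check failure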

@liumilan
Author

liumilan commented Nov 3, 2017

Yes, there is no error when the batch size is reduced to 200. Maybe there is a bug in the code.
But why is it OK during training? Is the result correct if the batch size is set to 2000 during training?

@eric-haibin-lin
Member

The code only breaks when it calls this function: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tensor_blob.h#L270
It's just doing a sanity check to make sure the shapes match, but unfortunately 32 bits are not enough to perform the check, hence the failure.
It won't affect the quality of the model. You can either use a smaller batch size for now as a workaround, or remove that CHECK and build from source to predict with a large batch size.

@liumilan
Author

liumilan commented Nov 3, 2017

OK.
One more question:
right now the training data size is 98994453 and the batch size is 200;
it takes 2 hours to finish 1 epoch, which seems too slow.
Is it possible to speed up sparse training?
Other sparse LR training implementations should be much faster.

@eric-haibin-lin
Member

batch_size = 200 seems to be very small. You can increase the batch size to reduce the training time per epoch (you have to increase the learning rate accordingly).

Module.load just loads the model symbol and model parameters. You can train with a large data batch and predict with a small batch.
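For example, a minimal sketch of loading a saved Module checkpoint and predicting with a smaller batch (the prefix 'checkpoint', epoch 1, and the test file path are placeholders, not values from this thread):

    import mxnet as mx

    # Load the saved symbol and parameters; prefix/epoch are placeholders.
    mod = mx.mod.Module.load('checkpoint', 1, data_names=['data'],
                             label_names=['softmax_label'], context=mx.cpu())

    # Use a smaller batch size for prediction to stay under the 32-bit
    # element-count limit in the shape check.
    test = mx.io.LibSVMIter(data_libsvm='./test.libsvm', data_shape=(3864334,),
                            batch_size=200)

    mod.bind(data_shapes=test.provide_data, label_shapes=test.provide_label,
             for_training=False)
    prob = mod.predict(test)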

@liumilan
Author

liumilan commented Nov 4, 2017

Is the learning rate related to the batch size? In the code, the gradient is divided by the batch size.
With my data, if the batch size is increased to 20000 or 200000, the model result becomes bad.

@liumilan
Author

liumilan commented Nov 4, 2017

In fact, even when the batch size is increased to 200000, it still takes a long time to finish one epoch:
Time cost=2174.153

@eric-haibin-lin
Member

Usually you want to increase the learning rate if batch size is increased.

What's the density of your model and how many classes are there?

If you have a GPU, you can change the ctx to mx.gpu(0) and run the same code without kvstore. The sparse.dot is implemented on GPU.
If you have multiple machines, you can do distributed training.
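As a rough illustration only (a common heuristic, not something this thread prescribes), the learning rate is often scaled roughly linearly with the batch size:

    import mxnet as mx

    # Linear learning-rate scaling heuristic: if the batch size grows by a
    # factor k, grow the learning rate by roughly the same factor.
    # The base values below are placeholders.
    base_batch_size, base_lr = 200, 0.01
    new_batch_size = 80000
    new_lr = base_lr * new_batch_size / float(base_batch_size)

    sgd = mx.optimizer.SGD(learning_rate=new_lr,
                           rescale_grad=1.0 / new_batch_size)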

@liumilan
Author

It costs 1868.217s per epoch on one GPU, which seems very slow.
There are only 2 classes, 0 and 1. The batch size is set to 80000, the number of features is 3864334, and the training data size is 100 million.
The same training data runs much faster in my legacy code on one machine.

eric-haibin-lin self-assigned this Nov 11, 2017

@eric-haibin-lin
Member

What was your legacy code based on? Is it an MXNet dense NDArray implementation?
For CPU training, did you try setting OMP_NUM_THREADS to different values and see if that affects training time? What version of MXNet are you using? Is it built from source?
If you run your program in a terminal, do you see any warning messages?
What optimizer are you using?
Could you post your training code so we can look for any obvious place for improvement?

@liumilan
Author

The legacy code is implemented with MPI for LR and is used in my product; it is not based on MXNet.
The training code is a slightly modified version of your sparse examples, changed from the weighted loss to LogisticRegressionOutput.
The MXNet version is 0.12.0, and it is built from source.
The log is below.
The AUC metric was added by myself; it won't affect the speed, will it?

/search/ted/mxnet_new/example/../../python/mxnet/metric.py:1191: UserWarning: New metric mxnet.metric.AUC registered with name auc is overriding existing metric mxnet.metric.AUC
  class AUC(EvalMetric):
2017-11-08 16:47:05,419 Namespace(batch_size=80000, gpus=0, kvstore=None, num_epoch=100, optimizer='sgd')
2017-11-08 16:47:09,010 Already bound, ignoring bind()
/search/ted/mxnet_new/example/../../python/mxnet/module/base_module.py:464: UserWarning: Parameters already initialized and force_init=False. init_params call ignored.
  allow_missing=allow_missing, force_init=force_init)
2017-11-08 16:47:09,012 optimizer already initialized, ignoring...
/search/ted/mxnet_new/example/../../python/mxnet/metric.py:1191: UserWarning: New metric mxnet.metric.AUC registered with name auc is overriding existing metric mxnet.metric.AUC
  class AUC(EvalMetric):
2017-11-08 17:12:12,118 Namespace(batch_size=80000, gpus=0, kvstore=None, num_epoch=100, optimizer='sgd')
2017-11-08 17:12:15,516 Already bound, ignoring bind()
/search/ted/mxnet_new/example/../../python/mxnet/module/base_module.py:464: UserWarning: Parameters already initialized and force_init=False. init_params call ignored.
  allow_missing=allow_missing, force_init=force_init)

and line_model.py is as below:

def linear_model(num_features, positive_cls_weight):
    # data with csr storage type to enable feeding data with CSRNDArray
    x = mx.symbol.Variable("data", stype='csr')
    norm_init = mx.initializer.Normal(sigma=0.01)
    # weight with row_sparse storage type to enable sparse gradient updates
    # weight = mx.symbol.Variable("weight", shape=(num_features, 1),
    #                             init=norm_init, stype='row_sparse')
    weight = mx.symbol.Variable("weight", shape=(num_features, 1),
                                stype='row_sparse')
    bias = mx.symbol.Variable("bias", shape=(1,))
    dot = mx.symbol.sparse.dot(x, weight)
    pred = mx.symbol.broadcast_add(dot, bias)
    y = mx.symbol.Variable("softmax_label")
    # model = mx.sym.Custom(pred, y, op_type='weighted_softmax_ce_loss',
    #                       positive_cls_weight=positive_cls_weight, name='out')
    # return mx.sym.MakeLoss(model)
    net = mx.symbol.LogisticRegressionOutput(data=dot, name="softmax")
    return net


@liumilan
Author

I just set 2 environment variables for MXNet:
export MXNET_GPU_COPY_NTHREADS=4
export MXNET_CPU_WORKER_NTHREADS=20

@liumilan
Author

liumilan commented Nov 11, 2017

Linear_classification.py is also changed to call model.fit directly:

# model

# The positive class weight says how much more we should upweight the importance of
# positive instances in the objective function.
# This is used to combat the extreme class imbalance.
positive_class_weight = 2
model = linear_model(num_features, positive_class_weight)

# module
mod = mx.mod.Module(symbol=model, data_names=['data'],
                    label_names=['softmax_label'], context=devs)
mod.bind(data_shapes=train_data.provide_data, label_shapes=train_data.provide_label)
mod.init_params()
optim = mx.optimizer.create(optimizer, learning_rate=0.7, momentum=0.8,
                            rescale_grad=1.0/batch_size/num_worker)
mod.init_optimizer(optimizer=optim, kvstore=kv)
# use cross-entropy and AUC as the metrics
metric = mx.metric.create(['ce', 'auc'])

# get the sparse weight parameter
weight_index = mod._exec_group.param_names.index('weight')
weight_param = mod._exec_group.param_arrays[weight_index]
speedometer = mx.callback.Speedometer(batch_size, 100)

mod.fit(train_data=train_data,
        eval_data=eval_data,
        eval_metric=metric,
        optimizer=optim,
        kvstore=kvstore,
        num_epoch=args.num_epoch,
        # batch_end_callback=mx.callback.Speedometer(batch_size, 10000),
        batch_end_callback=None,
        epoch_end_callback=save_model())

@eric-haibin-lin
Member

  • io: Are you using LibSVMIter for IO?
  • metrics: it might affect speed depending on your implementation. Are you using numpy to implement it? (As a side note, contributing it back to MXNet is welcome!)
  • What commit id did you build MXNet from?
  • Do you mind building with USE_PROFILER=1 to enable the profiler, and setting the env var MXNET_PROFILER_AUTOSTART to get the profiling result for the first 20 batches of your training? That would be very helpful for diagnosing any performance bottleneck.
  • BTW, we observed some room for performance improvement in the backward pass of sparse.dot, and there's a PR for it: optimization for dot(csr.T, dense) = rsp #8611, which will potentially be helpful in your case.

@eric-haibin-lin
Member

The cpu _backward_dot operator was improved by at least 3x in #8611
Do you want to sync with master and run it again?

@ZiyueHuang
Member

Current LR example is broken due to the custom op. This is fixed in #8721.

@liumilan
Author

liumilan commented Nov 20, 2017

1. After updating to the new code, the training speed is still slow.
The batch size is also 80000 and the training data size is 100 million, the same as in the last test.
It costs 1977.781s per epoch on GPU and 2038.416s on CPU.
2. I don't think the metric affects the speed, because I skipped the metric during training.
3. How can I collect the profile log? I have built with profile=1 and set export MXNET_PROFILER_AUTOSTART=0, but no profile log is printed in the nohup logs.
Is there any other step that I missed?

@eric-haibin-lin
Member

You can either inject this code into the training loop:

    mx.profiler.profiler_set_config(mode='all', filename='profile_output.json')
    mx.profiler.profiler_set_state('run')

    # Code to be profiled goes here...

    mx.profiler.profiler_set_state('stop')

or set MXNET_PROFILER_AUTOSTART=1 to enable auto start for profiling.

You can post the xxx_profile.json file here for us to take a look at which part is the bottleneck.
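For example, a sketch of wrapping the Module training call shown earlier in this thread (it assumes MXNet was built with USE_PROFILER=1 and reuses mod, train_data, metric, optim and kvstore from the code above):

    mx.profiler.profiler_set_config(mode='all', filename='profile_output.json')
    mx.profiler.profiler_set_state('run')

    # profile a single epoch to keep the trace small
    mod.fit(train_data=train_data,
            eval_data=eval_data,
            eval_metric=metric,
            optimizer=optim,
            kvstore=kvstore,
            num_epoch=1)

    mx.profiler.profiler_set_state('stop')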

@eric-haibin-lin
Member

I think the performance of _backward_dot operator needs to be improved further. I'll keep you posted.

@liumilan
Author

I only touched profile_output.json first; when I run python linear_classification.py, logs are written to profile_output.json, but nothing more is written once the program starts training.

@liumilan
Author

profile_output.txt

@eric-haibin-lin
Member

I see a lot of gaps between batches. It could be thread scheduling or IO overhead.
I don't think export MXNET_CPU_WORKER_NTHREADS=20 will help. It is useful when you have parallel paths in the graph (for example, two operators that can be executed at the same time). Setting it to 1 or 2 is probably good enough; setting it too high will add some overhead.
MXNET_CPU_WORKER_NTHREADS controls how many operators are executed in parallel, while OMP_NUM_THREADS controls how many threads run in parallel within any particular operator.
Could you also run with MXNET_CPU_WORKER_NTHREADS=1 and see if there's any IO bottleneck?
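For illustration (my own sketch, not from this thread), the two variables are usually set before MXNet is imported, for example:

    import os

    # Usually set before importing mxnet so the engine and OpenMP pick them up.
    os.environ['MXNET_CPU_WORKER_NTHREADS'] = '1'  # operators run in parallel
    os.environ['OMP_NUM_THREADS'] = '8'            # threads inside one operator

    import mxnet as mx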

@liumilan
Author

liumilan commented Nov 21, 2017

profile_output.txt
This is the updated profile, taken when the first epoch finished. I will try MXNET_CPU_WORKER_NTHREADS=1 right now.

@liumilan
Author

profile_output.txt
Here is the profile with MXNET_CPU_WORKER_NTHREADS=1; it costs the same time as MXNET_CPU_WORKER_NTHREADS=20.

@eric-haibin-lin
Member

There's a 2-second gap between batches. What metric are you using? Did you verify how much time computing the metric takes? What data iterator are you using?

@liumilan
Author

liumilan commented Nov 24, 2017

I use the libsvm data iterator. I set batch_end_callback=None and re-ran the new code. It costs 1303.680s per epoch, while the older code cost 2038.416s, so it is much faster.
But it is still slower than my MPI LR code: with the same data size, that code only costs 1.6 hours to run 100 epochs.
Do you think MXNet can be made much faster?
profile.txt

@eric-haibin-lin
Member

For this model you don't need to change the default value of MXNET_CPU_WORKER_NTHREADS. This env var enables running multiple operators in parallel when the computation graph has divergent paths. I recommend setting it back to 1, which will also make the profiler result much easier to read.

The time spent on forward and backward is only a small fraction of the total time. I don't see the profiler block for the updater. What optimizer are you using? Is it a custom optimizer?

@eric-haibin-lin
Member

eric-haibin-lin commented Dec 1, 2017

I think IO (LibSVMIter) is probably the bottleneck.
Edit: I'm able to read 3GB of criteo libsvm test data within 4 seconds using LibSVMIter. If you simply skip forward/backward/update/update_metric, how much time is spent on IO? You can also use the DummyIter to see whether IO is the bottleneck and what the speed is without it.
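A rough sketch of both measurements (the file path, shapes and batch size are placeholders, and this DummyIter is my own stand-in modeled on the helper used in the MXNet sparse examples, not a library class):

    import time
    import mxnet as mx

    batch_size = 80000
    num_features = 3864334

    # 1) Pure-IO timing: iterate over the libsvm file with no forward/backward.
    data_iter = mx.io.LibSVMIter(data_libsvm='./train.libsvm',
                                 data_shape=(num_features,),
                                 batch_size=batch_size)
    start = time.time()
    nbatch = 0
    for batch in data_iter:
        nbatch += 1
    print('read %d batches in %.1f s' % (nbatch, time.time() - start))

    # 2) DummyIter: replay one cached batch so training time excludes real IO.
    class DummyIter(mx.io.DataIter):
        def __init__(self, real_iter, num_batches):
            super(DummyIter, self).__init__()
            self.the_batch = next(iter(real_iter))   # cache one real batch
            self.provide_data = real_iter.provide_data
            self.provide_label = real_iter.provide_label
            self.batch_size = batch_size
            self.num_batches = num_batches
            self.cur = 0

        def reset(self):
            self.cur = 0

        def next(self):
            if self.cur < self.num_batches:
                self.cur += 1
                return self.the_batch
            raise StopIteration

Passing a DummyIter built from a freshly created iterator to mod.fit in place of train_data shows how fast an epoch runs once IO is taken out of the picture.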

@liumilan
Author

liumilan commented Dec 6, 2017

profile (2).txt
This is the latest profile, and I have set the env value to 1.
The SGD optimizer is used, not a custom optimizer.
It is still slow, though it is faster than the last version.

@liumilan
Author

liumilan commented Dec 6, 2017

If I skip forward/backward/update/update_metric, the time is reduced from 1303.680s to 276.323s, so a lot of time is spent fetching batch data.
Could this be an issue with every IO iterator, not only LibSVMIter?

@eric-haibin-lin
Member

@liumilan I don't know why your profiler doesn't show the SGD optimizer at all. That's very strange. How did you compile MXNet? Did you set DEV=0, DEBUG=0, USE_PROFILER=1? Did you do a clean build when the build config was changed?

@lanking520
Member

@liumilan Just to follow up with this thread, are you still facing the issue?

@ankkhedia
Contributor

@sandeep-krishnamurthy @nswamy @anirudh2290 Could you please close this issue due to lack of activity?

@liumilan Please feel free to re-open in case closed in error

@zachgk
Contributor

zachgk commented Nov 15, 2018

@mxnet-label-bot add [suggest-closed]
