Sparse convolutional neural networks #4328
@wenwei202 Could you explain a bit further how to use your fork? Any example? I have convolution layers where 90% of the weights are zero. If I use your version of Caffe, will the computations automatically take advantage of this sparsity? If I use a dense matrix, will the computations be slower, or will it fall back to the normal way of computing? Thanks for sharing your work 👍
@jpiabrantes You can use conv_mode in each conv layer to indicate which method is used for the computation. Thanks
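For illustration, here is a minimal prototxt sketch of how such a per-layer selection might look. The placement of conv_mode inside convolution_param and the exact enum names (e.g. LOWERED_GEMM for the dense baseline, LOWERED_CSRMM for MKL sparse-dense multiplication) are assumptions that should be checked against the fork's caffe.proto and tutorial:

```
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    # Hypothetical per-layer selector; LOWERED_CSRMM needs an MKL build
    # (see the comment below), while a GEMM mode falls back to dense computation.
    conv_mode: LOWERED_CSRMM
  }
}
```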
I just tested on the LeNet network from the MNIST example. I was able to achieve the following sparse layers:
I used
@jpiabrantes In CPU mode, you need to use MKL. LOWERED_CSRMM is only implemented with MKL sparse BLAS, since sparse BLAS is not supported by OpenBLAS or ATLAS.
@wenwei202 I used the GPU mode.
@jpiabrantes It is normal to achieve very limited speedup on GPU even if you have sparsity higher than 90%, because GPUs are highly parallel and an irregular sparsity pattern hurts performance. I am working on structured sparsity to achieve speedup on GPU.
@wenwei202 I am not able to complete compilation; 'make runtest' fails.
@Rupeshd @wenwei202
To stabilize the sparsity during training, I zero out weights whose absolute values are smaller than 0.0001 after each weight update. So, the precision of RMSPropSolverTest may not be enough to pass the test. You can comment out the following code if you do not want to zero out (but it is recommended during training to stabilize the sparsity).
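In other words, after every weight update the solver applies the hard threshold

$$w_i \leftarrow \begin{cases} 0, & |w_i| < 10^{-4} \\ w_i, & \text{otherwise.} \end{cases}$$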
The only failed (crashed) test case is "TYPED_TEST(ConvolutionLayerTest, Test0DConvolution)" of https://github.com/wenwei202/caffe/blob/scnn/src/caffe/test/test_convolution_layer.cpp#L311. Hope this helps. -Wei
@wenwei202
@wenwei202
@zhaishengfu The implementation was abandoned. It can hardly achieve a good speedup unless the sparse weights are hardcoded in the source code, as the paper did. I didn't try hardcoding weights, but you are free to try if you are interested. What the paper did was convert each conv layer into three small layers. You can use this to generate the equivalent net prototxt and this to generate the corresponding decomposed caffemodel. But the code is deprecated.
@wenwei202 Thank you for your reply. But I don't understand what you mean by 'hardcoded'; I didn't see it described in the paper. As I understand it, you can get a speedup as long as your network is sparse and you implement the sparse-dense matrix multiplication methods described in the paper. Am I wrong?
@zhaishengfu Please refer to section 4 in the paper, e.g. "Therefore, the location of non-zero elements are known and can be encoded directly in the compiled multiplication code." The reproduction of that work was abandoned because of that tricky scheme. Our speedup is achieved by structured sparsity, which overcomes the irregular memory access pattern caused by the random distribution of sparse weights in memory. Hopefully, we can release our related paper soon.
@wenwei202 Thank you very much. Really looking forward to your paper. Can you let me know when you release your paper? (Or can you tell me its name?)
Hi @zhaishengfu @jpiabrantes @Rupeshd @pluskid @sergeyk, our paper related to this Caffe fork was just accepted by NIPS 2016. You are welcome to contribute, in case you still have an interest in sparse convolutional neural networks. [paper] [GitHub code]
@wenwei202 Thank you very much!! I will read it carefully!! I really appreciate your contribution to this fork.
@wenwei202 Hello, I have skimmed your paper and code. Is the code the same as your original code? I didn't see any difference (or maybe I should look more carefully).
@zhaishengfu Please use the scnn fork; I have updated the tutorial. Hope that helps.
@wenwei202 OK, indeed I have used your code already. I used all of the related parameters to generate my prototxt as follows. I see that you don't use tensor decomposition.
For the setting:
@hiyijian As the code shows, xdimen and ydimen represent the column and row dimensions of a block, respectively. For example, if you have A rows and ydimen is B, then you will have A/B groups along the rows, and within each group the regularization is applied.
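Concretely, with the weights partitioned into G blocks of size ydimen × xdimen, the regularization applied is the group Lasso penalty from the paper,

$$R(W) = \lambda_g \sum_{g=1}^{G} \lVert W^{(g)} \rVert_2 = \lambda_g \sum_{g=1}^{G} \sqrt{\sum_{i \in g} \big(w_i^{(g)}\big)^2},$$

which pushes all weights inside a block toward zero together.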
Thanks. Clear now.
@hiyijian Indeed, I also want to know the answer. In my training trials (my problem is regression, not classification), when the sparsity gets above about 60%, the accuracy drops noticeably. I think the configuration of xdimen and ydimen is related to your network and problem. Maybe you can set the configuration as the paper does (e.g. xdimen equal to the number of columns of your convolution kernel and ydimen equal to the number of rows of your convolution kernel).
Thank you @zhaishengfu
@zhaishengfu @hiyijian The settings of xdimen and ydimen depend on what kind of structured sparsity you want. For example, if a weight matrix with many all-zero columns is expected, then xdimen = 1 and ydimen = the number of rows (see the sketch below). For the trade-off between accuracy and sparsity, please train the network without SSL first to get the baseline, then train it with SSL, and finally fine-tune it without SSL. Make sure your training converges well at every phase.
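As a sketch of the column-sparsity case just described: for a conv layer whose lowered weight matrix has 128 rows (one per filter), xdimen = 1 and ydimen = 128 makes each column one block. The block_group_lasso spec shown here (its field names and nesting) is an assumption based on the fork's tutorial and should be verified against its caffe.proto:

```
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "conv2"
  top: "conv3"
  param {
    lr_mult: 1
    # Hypothetical block-group-Lasso spec: one block per column of the
    # 128-row lowered weight matrix (xdimen = block width, ydimen = block height).
    block_group_lasso {
      xdimen: 1
      ydimen: 128
      block_decay_mult: 1.0  # multiplier on the solver-level block_group_decay (names assumed)
    }
  }
  convolution_param {
    num_output: 128
    kernel_size: 3
    pad: 1
  }
}
```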
Thanks @wenwei202, that helps a lot. Would you make it more concrete: how do we put each of them into practice via the xdimen/ydimen controls?
Say we have a typical conv layer with nfilter * nchannel * nHeight * nWidth = 128 * 64 * 3 * 3. Did I do anything obviously stupid?
Thanks for the explanation. I get the intuition now and can relate it to the theory in the paper. However, I am not sure about the line in the code above. Regarding the issue, it comes up during make test; if I resolve it, I will post my fix here.
@srivastavag89 You are right, you should reduce along the three axes if it is a 3D filter. My error.
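For reference, in Caffe's lowered (im2col) representation, a 128 * 64 * 3 * 3 filter bank becomes a 128 x (64 * 3 * 3) = 128 x 576 weight matrix: each row is one filter flattened along the channel, height, and width axes, which is why the group norm for column sparsity is reduced along those three axes.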
@wenwei202 I would like to ask a few more questions, if you don't mind.
Baseline (GEMM): 1657.76 ms, and the sparsities of each case are Baseline: (0.430833, 0.655312, 0.659883). What interests me is that the inference speed changed a lot after fine-tuning even though there wasn't a dramatic difference in sparsity. From what I understand, fine-tuning is about regaining accuracy. Is it natural to get such a gain? Sorry for so many questions. I wish I could get your intuition before applying this to detection.
@HyunJaeLee2
@wenwei202 Thanks for your work.
@srivastavag89 Maybe the hyper-parameter of the structured sparsity regularization is too large, so that it only optimizes for sparsity?
Thanks for the reply. I will try taking a trained network and then retraining it, this time adding structured sparsity. Also, I will try the hyperparameters you suggested.
@srivastavag89 SSL is a general method; it should work for a variety of networks, not just those in the paper.
@wenwei202 I am trying to train SSL in CPU mode
FYI, the Structured Sparsity Learning (SSL) approach is now also implemented in TensorFlow (code). We also extend and advance SSL to Recurrent Neural Networks to reduce the hidden dimension of LSTMs, i.e., learning the number of hidden/cell states/neurons in RNNs. Missing details (e.g. the training method) are included in Section 3.2 here.
@wenwei202 Great job on this paper! It is very impressive! I wonder where I can find your prototxt for training AlexNet on ImageNet (the one shown in the paper)? For my experiments, I would like to make sure that I am compressing the network in the correct way (which should reproduce your results).
@Jarvistonychen It took a while to look through the logs, but I found the hyperparameter of 0.0005 for entry 5 in Table 4.
@wenwei202 Thanks a lot for spending time answering my question! Since there is only column sparsity in entry 5 of Table 4, I think 0.0005 is for kernel_shape_decay? If I also want to do breadth_decay, is 0.0005 also the right value to use?
@Jarvistonychen Yes, you may start from there.
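For reference, a minimal solver.prototxt sketch showing where such decay values would go. kernel_shape_decay appears in a solver snippet later in this thread, while breadth_decay and the 0.0005 values follow the discussion above; treat the exact field names and values as assumptions to verify against the fork:

```
# solver.prototxt (sketch)
net: "train_val_scnn.prototxt"   # hypothetical net definition
base_lr: 0.001                   # illustrative
weight_decay: 0.0005
kernel_shape_decay: 0.0005       # column-wise (shape-wise) group Lasso strength
breadth_decay: 0.0005            # row-wise (filter-wise) group Lasso strength
snapshot: 10000
```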
@wenwei202
Correct. More precisely:
@wenwei202
I1024 15:29:05.849658 10494 solver.cpp:231] Iteration 4420, loss = 0.494514
How do I calculate/know the percentage of sparsity in the model before and after training?
@ananddb90 During training, you will see some sparsity statistics. The sparsity is shown in the order of layers, and within each layer, in the order of weights and then biases. Basically, it prints sparsity for all parameter blobs in Caffe, including, for example, the parameters of a batch normalization layer. We usually care only about the sparsity of weights. The "Element Sparsity" is the percentage of zeros; the "Block Sparsity" is the percentage of all-zero blocks if you used block_group_lasso.
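Restating those two statistics as formulas:

$$\text{Element Sparsity} = \frac{\#\{\, w_i = 0 \,\}}{\#\{\, w_i \,\}}, \qquad \text{Block Sparsity} = \frac{\#\{\text{all-zero blocks}\}}{\#\{\text{blocks}\}}.$$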
@wenwei202 Also, I am using block_group_lasso but I am not getting any output in my log file.
Train_scnn.prototxt
solver.prototxt:
weight_decay: 0.0005
#kernel_shape_decay: 0.0
snapshot: 10000
iter_size: 2
@ananddb90 Here is how the sparsity is displayed. Reading the code may be the best way to understand it. Alternatively, you may use pycaffe to analyze the trained model.
@jpiabrantes
@wenwei202 I read your paper and watched your blog recently. I am also studying for a master's degree at Beihang University and very much admire you and your research results. I have a question that I hope you can help me with. In your SSL paper, you learn structured sparsity by setting breadth_decay, kernel_shape_decay, or block_group_decay, but as you said below, during SSL, zeros in a row or column can still go back to nonzero if they get a large update from the gradients of the cross entropy. Then, after fine-tuning the SSL network, the weights in a row, column, or block may not all be zero; there can be nonzero values anywhere, so there is no structured sparsity. Am I right?
Hello @Demohai, the answer differs for the different stages. In the learning stage of structured sparsity using group Lasso, zeros can go back to nonzero only when those weights are very important, since the group Lasso regularization enforces them toward zero; in the fine-tuning stage after group Lasso, we simply fix the zeros in all-zero groups/rows/columns and retrain the remaining weights, so the structured sparsity is kept. Thanks! :)
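One common way to realize this fixing step (stated here as an assumption about the implementation, not a quote from the code) is to record a binary mask M at the end of the group-Lasso stage, with M_i = 1 iff w_i belongs to a group that is not all-zero, and then apply

$$W \leftarrow M \odot W$$

after every fine-tuning update so the pruned groups stay at zero.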
Hello @wenwei202, I find that each time I run the program, some weight files for each layer appear under the Caffe root directory. What are they used for, and which part of the source code produces them?
Thanks a lot!!!
Hello @wenwei202, sorry to bother you again. When deploying the fine-tuned SSL network, we have several conv_mode options to choose from. I want to know what the lowered tensors and lowered feature maps are, as the following picture shows.
Is anyone interested in utilizing sparsity to accelerate DNNs?
I am working on the fork https://github.com/wenwei202/caffe/tree/scnn and currently, on average, achieve ~5x CPU and ~3x GPU layer-wise speedups of convolutional layers in AlexNet using off-the-shelf GEMM (after ~2% top-1 accuracy loss).
http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf