This paper was published in CVPR 2022. Our goal is to re-implement the proposed model and study its capabilities and limitations.
The paper proposes a model called 'DECORE', which compresses a deep network architecture by dropping channels or linear layers of low importance using a multi-agent reinforcement learning framework. Previous works have targeted the problem of compressing AI models, for example by pruning weights based on network statistics or by learning channel importance subject to compression constraints; reinforcement learning has also been used to search for optimized model architectures. However, all these methods suffer from high complexity due to iterative search and fine-tuning. This work proposes a model in which the architecture search and fine-tuning are independent, which speeds up the compression process.
Given a deep model, the DECORE framework assigns an agent to each layer of the model, as shown in Figure 1. The agent holds a vector of learnable state values $S_l$, one entry per channel, which defines its policy (equation 1): the probability of keeping channel $j$ of layer $l$ is $p_{l,j} = \frac{1}{1 + e^{-S_{l,j}}}$, and the agent's action is sampled as $a_{l,j} \sim \mathrm{Bernoulli}(p_{l,j})$, where $a_{l,j} = 1$ keeps the channel and $a_{l,j} = 0$ drops it.
Figure.1: (a) Agents are inserted after the convolution layers. (b) For networks that have parallel paths, such as ResNet, the same policy is shared between the parallel paths. (c) The sampled actions act as binary masks on the channel outputs.
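As an illustration of this policy (our own minimal sketch, not the authors' code; the class name ChannelAgent and its attributes are ours), an agent for a layer can be written as a small module that keeps a state vector S, converts it to keep-probabilities with a sigmoid, and samples a binary mask that is applied to the layer's output:

import torch
import torch.nn as nn

class ChannelAgent(nn.Module):
    """Holds one learnable state value per channel and samples a keep/drop mask."""
    def __init__(self, num_channels, init_state=9.6):
        super().__init__()
        # State representation S; initialized so that sigmoid(S) is close to 1
        # and all channels are kept at the start of training.
        self.S = nn.Parameter(torch.full((num_channels,), init_state))

    def forward(self, x):
        probs = torch.sigmoid(self.S)              # p_j = 1 / (1 + exp(-S_j))
        self.actions = torch.bernoulli(probs)      # a_j ~ Bernoulli(p_j), 1 = keep, 0 = drop
        # Log-probability of the sampled actions, needed later for REINFORCE
        self.log_probs = (self.actions * torch.log(probs + 1e-8)
                          + (1 - self.actions) * torch.log(1 - probs + 1e-8))
        # Broadcast the binary mask over the spatial dimensions of the feature map
        return x * self.actions.view(1, -1, 1, 1)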
The cost function is built from a compression reward and an accuracy reward. For layer $l$ the compression reward counts the dropped channels, $R_{C,l} = \sum_j (1 - a_{l,j})$, while the accuracy reward is $R_{acc} = 1$ for a correct prediction and $-\lambda$ for a wrong one, where $\lambda$ is the penalty; the reward of layer $l$ is their product, $R_l = R_{C,l} \cdot R_{acc}$. The objective is to maximize the expected reward $J = \mathbb{E}_{a \sim \pi}[R]$ over the sampled actions.
In equation 5, this expectation is optimized with the REINFORCE policy gradient algorithm: since the actions are discrete samples, the gradient of $J$ is estimated by weighting the log-probability of the sampled actions by the reward they obtained, $\nabla_{S} J \approx R \, \nabla_{S} \log \pi(a \mid S)$. Moreover, in equation 5 the authors label the state representation as a weight vector $w$; in our implementation we refer to the same quantity as $S$.
The rest of our implementation follows the same structure as the original work. First, the model is embedded with agents after each convolution layer; an agent takes an action based on the policy described by equation 1, and the taken action works as a mask which passes the output of a channel when the action is 1 and zeroes it out when the action is 0, so that dropped channels do not contribute to the following layers. A sketch of how the agents can be attached to the convolution layers is shown below.
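One simple way to embed the agents (again a sketch of our own, assuming the ChannelAgent module above and a VGG-style model whose convolutions live in model.features; the actual DECOR module may organize this differently) is to wrap every convolution together with an agent that masks its output:

import torch.nn as nn

def embed_agents(model):
    """Wrap each convolution layer of a VGG-style model with a channel agent."""
    agents = []
    conv_indices = [i for i, m in enumerate(model.features) if isinstance(m, nn.Conv2d)]
    for i in conv_indices:
        conv = model.features[i]
        agent = ChannelAgent(conv.out_channels)
        agents.append(agent)
        # Replace `conv` with `conv -> agent` so the sampled mask is applied in the forward pass
        model.features[i] = nn.Sequential(conv, agent)
    return model, agents

This only covers sequential architectures such as VGG16; for networks with parallel paths (Figure 1b) the same policy has to be shared across the branches.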
The original work tests the framework by compressing four different models, namely VGG16, DenseNet, GoogLeNet, and ResNet, using the CIFAR-10 and ImageNet datasets.
During training, they used the Adam optimizer with a learning rate of 0.01 and a batch size of 256, and the initial weights (state representations) were set to 9.6 so that every channel is kept with probability close to 1 at the start.
main.ipynb is a demo of how to deploy and use our implementation.
At first, we import a pretrained VGG16 model and fine-tune it on the CIFAR-10 dataset; the classification layer is replaced with a layer that has the same number of classes as CIFAR-10.
# Import a pre-trained VGG16 and replace the classification layer with a new one
# that has the number of classes in CIFAR-10
import torch.nn as nn
from torchvision import models

vgg16 = models.vgg16(weights='IMAGENET1K_V1')
input_lastLayer = vgg16.classifier[6].in_features
vgg16.classifier[6] = nn.Linear(input_lastLayer, 10)
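The fine-tuning itself is ordinary supervised training; a minimal sketch is given below (the transforms, learning rate, and batch size here are illustrative choices of ours, not the settings used in the paper):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Resize CIFAR-10 images to the VGG16 input size
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

vgg16 = vgg16.to(device)
optimizer = torch.optim.Adam(vgg16.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

vgg16.train()
for imgs, labels in train_loader:
    imgs, labels = imgs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(vgg16(imgs), labels)
    loss.backward()
    optimizer.step()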
After fine-tuning the target model (in this case VGG16), it is passed to the 'DECOR' module, which embeds the model with agents. For the optimizer, only the state representations $S$ of the agents are passed, so the weights of the target model itself are left untouched while the agents are trained.
from torch.optim import Adam

num_epochs = 1
lr = 0.01

# Embed the fine-tuned model with agents (model is the fine-tuned VGG16 from above)
net = DECOR(model)
net = net.to(device)

# Collect only the agents' state representations S_0, S_1, ... so that the
# target model's own weights are not updated during the search
id = 0
param = []
for n, p in net.named_parameters():
    if n.endswith(f".S_{id}") and p.requires_grad:
        param.append(p)
        id += 1

optimizer = Adam(param, lr=lr)
criterion = CustomLoss(-200)
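CustomLoss implements the REINFORCE objective described above. We do not reproduce our module verbatim here, but the following sketch shows the idea, assuming each agent in net.agents_list stores the actions and log_probs it sampled during the forward pass (as in the ChannelAgent sketch) and that the penalty is passed as a negative number (e.g. -200):

import torch.nn as nn

class CustomLoss(nn.Module):
    """REINFORCE-style loss: layer reward = (#dropped channels) * accuracy reward."""
    def __init__(self, penalty):
        super().__init__()
        self.penalty = penalty  # accuracy reward for a wrong prediction, e.g. -200

    def forward(self, agents, logits, labels):
        # Accuracy reward: +1 for a correct prediction, `penalty` otherwise
        correct = (logits.argmax(dim=1) == labels).float()
        r_acc = (correct + (1 - correct) * self.penalty).mean()

        loss = 0.0
        for agent in agents:
            r_comp = (1 - agent.actions).sum()   # number of channels this agent dropped
            reward = r_comp * r_acc              # layer reward R_l = R_C,l * R_acc
            # REINFORCE: minimize -R_l * log pi(a | S), i.e. gradient ascent on E[R]
            loss = loss - reward * agent.log_probs.sum()
        return loss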
The model can then be trained; at this stage only the state representations $S$ of the agents are updated:
import gc
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.mem_get_info())

n_total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (imgs, labels) in enumerate(train_loader):
        imgs = imgs.to(device)
        labels = labels.to(device)
        # Forward pass through the target model; the agents sample and apply their masks
        labels_hat = net.target_model(imgs)
        # REINFORCE loss computed from the agents' sampled actions and the predictions
        loss_value = criterion(net.agents_list, labels_hat, labels)
        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()
        if (i+1) % 250 == 0:
            print(f'epoch {epoch+1}/{num_epochs}, step: {i+1}/{n_total_step}: loss = {loss_value:.5f}')
    print()
Finally, the trained model can be retrieved as
target_model = net.target_model
where the agents can be removed and channels whose state representation gives a keep probability below 50% (i.e. sigmoid(S) < 0.5) can be pruned.
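A simple way to identify which channels to prune (a sketch assuming the ChannelAgent attributes used above and that the DECOR module exposes its agents as net.agents_list) is to threshold the keep probabilities:

import torch

def channels_to_prune(agents, threshold=0.5):
    """Return, per layer, the indices of channels whose keep probability falls below the threshold."""
    pruned = []
    for agent in agents:
        keep_prob = torch.sigmoid(agent.S.detach())
        pruned.append((keep_prob < threshold).nonzero(as_tuple=True)[0])
    return pruned

# Example: report how many channels would be dropped from each convolution layer
for layer_idx, idx in enumerate(channels_to_prune(net.agents_list)):
    print(f'layer {layer_idx}: prune {idx.numel()} channels')

Physically removing these channels (rebuilding smaller convolution and batch-norm layers) is not shown here.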
We could not reproduce the results in the paper because some implementation details remained unclear to us; our observations are discussed below.
Below are some results obtained by the original paper:
Table.1: Accuracy, number of parameters, and number of FLOPs for VGG16 models compressed using different penalty values (e.g. DECORE-500 uses a penalty of λ = 500).
Figure.2: The curves show the effect of removing channels with high weights (high importance) and low weights (low importance) on the accuracy of the model; they show that the model proposed by the original work can learn which channels in a deep model are important. Disclaimer: This figure is taken from the original paper.
Although the paper proposes a promising approach for compressing deep models, we could not verify it, either because of missing details or because of our own misinterpretation. During the training process we observed that the loss function barely changes, and hence the state representations (weights) do not change significantly. Our reasoning is that initializing the weights at 9.6 gives a probability of taking action 1 (keeping a channel) of $1/(1 + e^{-9.6}) \approx 0.9999$, so almost no channels are dropped, the compression reward stays close to zero, and the resulting policy gradients are too small to move the weights noticeably.
Manoj Alwani, Yang Wang, Vashisht Madhavan. DECORE: Deep Compression with Reinforcement Learning. CVPR 2022.
Mahmoud Alasmar ([email protected])