About the BP problem mentioned in the introduction #47
Hello~ I recently read your brilliant paper, but I am confused about the BP problem mentioned in the introduction:

"Moreover, this would also hinder the back-propagation for the prediction module, which needs to calculate the probability distribution of whether to keep the token even if it is finally eliminated."

My understanding is that the deleted tokens do not participate in subsequent attention calculations, so there is no information exchange, and they are also irrelevant to the loss calculation. Therefore, it seems that directly deleting these tokens during training would not affect the correct backpropagation of gradients. I am a bit confused about this statement in the paper and would appreciate it if you could clarify any misunderstanding.

Hi, thanks for your interest in our work. I think the core problem here is optimizing the prediction module. Directly deleting these tokens would be correct if we only wanted to fine-tune the ViT and improve its performance on incomplete token sets. Here we use a strategy similar to the policy gradient in RL: by keeping the gradients of the keep probabilities of the dropped tokens, we guide the prediction module to better explore possible sparsification policies.

Thanks a lot for your prompt and insightful response!
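For anyone puzzled by the same point: the idea is that during training, dropped tokens are masked out of the attention rather than physically deleted, so the keep probabilities produced by the prediction module stay in the autograd graph and the predictor still receives gradients. Below is a minimal PyTorch sketch of such a masked attention; it illustrates the mechanism described above, not the repository's exact implementation, and the function name and the `keep_prob` argument are hypothetical.

```python
import torch

def attention_with_keep_policy(q, k, v, keep_prob, eps=1e-6):
    """Self-attention where dropped tokens are masked out, not deleted.

    q, k, v:   (B, H, N, D) queries / keys / values
    keep_prob: (B, N) differentiable keep decisions in [0, 1], produced
               by the prediction module (e.g. via Gumbel-softmax); because
               they multiply the attention weights, gradients flow back to
               the predictor even for tokens that are ultimately dropped.
    """
    B, H, N, D = q.shape
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5          # (B, H, N, N)

    # Mask over the keys: every token may still attend to itself (the
    # diagonal), but attention to other tokens is scaled by their keep
    # probability, so dropped tokens exchange no information.
    mask = keep_prob[:, None, None, :]                     # (B, 1, 1, N)
    eye = torch.eye(N, device=q.device).view(1, 1, N, N)
    mask = mask + (1.0 - mask) * eye                       # (B, 1, N, N)

    # Numerically stable masked softmax: subtract the row max, multiply
    # the exponentiated scores by the (differentiable) mask, renormalize.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    weights = scores.exp() * mask
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)
    return weights @ v
```

At inference time no gradients are needed, so the soft mask can be replaced by hard 0/1 decisions and the dropped tokens can then be physically removed (e.g. gathering only the kept tokens with `torch.gather`), which is where the actual speed-up comes from.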