Pre-trained weights incompatible with backbone #65
Comments
@LucFrachon I had the same idea.
I wrote code to update the state dict, and it worked.
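A minimal sketch of that kind of state-dict update, assuming the key mismatches described in the issue body below; the `RENAMES` table here is illustrative, not the exact set of rules used:

```python
# Sketch: rename mismatched checkpoint keys before loading them into the model.
# The RENAMES entries are illustrative examples; the full rename/reshape rules
# follow from the incompatibility list in the issue body below.
import torch

RENAMES = {
    # provided checkpoint key -> key the encoder actually expects
    "downsample_layers.1.0.weight": "downsample_layers.1.0.ln.weight",
    "downsample_layers.1.0.bias": "downsample_layers.1.0.ln.bias",
}

ckpt = torch.load("convnextv2_base_1k_224.pt", map_location="cpu")  # illustrative filename
state = ckpt.get("model", ckpt)
fixed = {RENAMES.get(k, k): v for k, v in state.items()}
# model.load_state_dict(fixed, strict=False)
```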
This issue is closely connected to three others: #26, #33, and #47. You correctly observed the mismatch in layer naming between the model weights and the checkpoint weights, as well as the shape mismatch. I did try reshaping and renaming the weights, and it worked, but I would argue the model still won't reconstruct masked images as intended if used for that purpose.

If you look at the weights more closely, you will notice another issue: some weights are present in the model but not found in the checkpoint state dictionary. This is because the pretrained weights of all the models available on the main page of this repository contain only the encoder in the .pt files; the decoder was cut off. For that reason, renaming and reshaping alone is not a proper solution if you are using the model for any reconstruction task, which I'd argue is the main use case. Fine-tuning those weights still succeeds, because fine-tuning for any downstream task (classification, object detection, segmentation, etc.) constructs a new head, so the pretrained weights remain useful there.

The absence of the decoder head in the .pt files is the main cause of the visualization issues people often come across when trying to reconstruct masked images, i.e., when using the pretrained weights for a reconstruction task. These are the visualization issues I found that I'd say are covered by this comment: #48, #42. The visualization code itself is probably not wrong in any of those cases. I tried visualizing the reconstructions using the code from MAE (https://github.com/facebookresearch/mae/blob/main/demo/mae_visualize.ipynb), and all the images contained nothing but white noise in the masked areas. I also verified that the pretrained weights output the same noise as a randomly initialized, non-pretrained model. That makes sense given that the decoder weights are not present in the .pt files: the reconstruction was effectively done with randomly initialized decoder weights. So reconstruction won't succeed even if the weights are renamed and reshaped.

The solution is to train a decoder head. I pretrained a ConvNeXt-V2 Atto model myself and saved all the weights, both encoder and decoder. When I went back to visualization, the reconstruction code worked and the weight mismatches were no longer a problem: all the keys matched successfully. The visualization code, among other things, can be found on my GitHub profile in my forked and modified version of this repository (https://github.com/MarkoHaralovic/ConvNeXt-V2).
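You can confirm the missing decoder weights yourself by loading the released checkpoint with `strict=False` and inspecting `missing_keys`. A minimal sketch, assuming the FCMAE model definition from this repository; the import path, factory name, and filename are assumptions to adjust for your checkout:

```python
# Sketch: confirm the released .pt files contain only the encoder.
# Assumes models/fcmae.py from this repo (the FCMAE pretraining model
# requires MinkowskiEngine to be installed); names are illustrative.
import torch
from models.fcmae import convnextv2_atto

model = convnextv2_atto()
ckpt = torch.load("convnextv2_atto_1k_224_fcmae.pt", map_location="cpu")
state = ckpt.get("model", ckpt)

result = model.load_state_dict(state, strict=False)
# On the released checkpoints, the decoder-side keys all show up as missing,
# so any reconstruction is run with a randomly initialized decoder.
print(result.missing_keys)
```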
The weights for ConvNeXt-V2-Base, pretrained on INet1k, provided here, have several incompatibilities with the encoder architecture:

- The downsample layers have a `self.ln` element, which means the checkpoint keys should look like `downsample_layers.1.0.ln.bias`, whereas the weights provided have `downsample_layers.1.0.bias`.
- The `.pwconv<i>` layers have a `self.linear`, which the checkpoints don't have.
- Convolution weights are expected under `kernel` keys (e.g., `downsample_layers.1.1.kernel`), whereas the provided checkpoint has keys like `downsample_layers.1.1.weight`.
- In `.stages`: biases need to be `(1, c)` but are `(c,)`, and weights need to be `(p ** 2, c)` but are `(c, -1, p, p)`.

These can all be fixed with some code (a sketch follows below), but it would make life easier for everyone if you could upload the correct weights.
Thanks a lot!
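For completeness, here is a hedged sketch of the renames and reshapes implied by the list above. It only handles the depthwise case of the weight reshape (`(c, 1, p, p)` to `(p ** 2, c)`); the regexes, the filename, and the handling of other conv shapes are assumptions to verify against your model definition:

```python
# Sketch: remap the released dense checkpoint keys/shapes to the sparse
# encoder naming described above. Rules and heuristics are assumptions,
# not an official utility from this repo.
import re
import torch

ckpt = torch.load("convnextv2_base_1k_224.pt", map_location="cpu")  # illustrative filename
state = ckpt.get("model", ckpt)

remapped = {}
for k, v in state.items():
    # downsample_layers.<i>.0.{weight,bias} -> downsample_layers.<i>.0.ln.{weight,bias}
    k = re.sub(r"(downsample_layers\.\d+\.0)\.(weight|bias)$", r"\1.ln.\2", k)
    # pwconv<i>.{weight,bias} -> pwconv<i>.linear.{weight,bias}
    k = re.sub(r"(pwconv\d)\.(weight|bias)$", r"\1.linear.\2", k)
    if k.endswith(".weight") and v.ndim == 4 and v.shape[1] == 1:
        # depthwise conv: reshape (c, 1, p, p) -> (p ** 2, c), rename .weight -> .kernel
        c, _, p, _ = v.shape
        v = v.reshape(c, p * p).t().contiguous()
        k = k[: -len(".weight")] + ".kernel"
    elif k.startswith("stages") and k.endswith(".bias") and v.ndim == 1:
        # biases inside stages: (c,) -> (1, c)
        v = v.unsqueeze(0)
    remapped[k] = v

# model.load_state_dict(remapped, strict=False)
# Note: per the comment above, the decoder keys will still be missing.
```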