Pyramid VIT can work on a 512x512 image??? #2104

sparshgarg23 · 2024-03-03T05:22:12Z

sparshgarg23
Mar 3, 2024

So, I am currently playing with timm's vit model family and I was wondering do the VIT models have a constraint on the image size.For eg: SWIN and VIT can only work on inputs of dimensions 224x224,but what if we wanted to use these pre-trained models on images of larger resolution say 512 or 1024.In that case are there any models on timm that can be useful.

I came across something called multiscale long former(https://github.com/microsoft/vision-longformer).if i was to go for a cnn based approach I would easily choose efficient net /hrnet/cspnet,as these models have been known to give decent results on my own data.

However,since I am interested in learning more about transformers,I would like to know if there are any models in the visual transformer family which can accept high resolution images.

PS At a whim,I decided to use the Pyramid VIT and obtain the model summary with different image sizes varying from the standard 224x224 to 640x480(COCO image size).The model summary uptil the first pyramid vision stage transformer 45 is shown below for each image size 224x224,512x512,640x480.

For 224x224

 Conv2d-1           [-1, 32, 56, 56]           4,736
         LayerNorm-2           [-1, 56, 56, 32]              64
 OverlapPatchEmbed-3           [-1, 56, 56, 32]               0
         LayerNorm-4             [-1, 3136, 32]              64
            Linear-5             [-1, 3136, 32]           1,056
            Conv2d-6             [-1, 32, 7, 7]          65,568
         LayerNorm-7               [-1, 49, 32]              64
            Linear-8               [-1, 49, 64]           2,112
            Linear-9             [-1, 3136, 32]           1,056
          Dropout-10             [-1, 3136, 32]               0
        Attention-11             [-1, 3136, 32]               0
         Identity-12             [-1, 3136, 32]               0
        LayerNorm-13             [-1, 3136, 32]              64
           Linear-14            [-1, 3136, 256]           8,448
         Identity-15          [-1, 256, 56, 56]               0
           Conv2d-16          [-1, 256, 56, 56]           2,560
             GELU-17            [-1, 3136, 256]               0
          Dropout-18            [-1, 3136, 256]               0
           Linear-19             [-1, 3136, 32]           8,224
          Dropout-20             [-1, 3136, 32]               0
MlpWithDepthwiseConv-21             [-1, 3136, 32]               0
         Identity-22             [-1, 3136, 32]               0
            Block-23             [-1, 3136, 32]               0
        LayerNorm-24             [-1, 3136, 32]              64
           Linear-25             [-1, 3136, 32]           1,056
           Conv2d-26             [-1, 32, 7, 7]          65,568
        LayerNorm-27               [-1, 49, 32]              64
           Linear-28               [-1, 49, 64]           2,112
           Linear-29             [-1, 3136, 32]           1,056
          Dropout-30             [-1, 3136, 32]               0
        Attention-31             [-1, 3136, 32]               0
         Identity-32             [-1, 3136, 32]               0
        LayerNorm-33             [-1, 3136, 32]              64
           Linear-34            [-1, 3136, 256]           8,448
         Identity-35          [-1, 256, 56, 56]               0
           Conv2d-36          [-1, 256, 56, 56]           2,560
             GELU-37            [-1, 3136, 256]               0
          Dropout-38            [-1, 3136, 256]               0
           Linear-39             [-1, 3136, 32]           8,224
          Dropout-40             [-1, 3136, 32]               0
MlpWithDepthwiseConv-41             [-1, 3136, 32]               0
         Identity-42             [-1, 3136, 32]               0
            Block-43             [-1, 3136, 32]               0
        LayerNorm-44             [-1, 3136, 32]              64
PyramidVisionTransformerStage-45           [-1, 32, 56, 56]

For 512x512

       Conv2d-1         [-1, 32, 128, 128]           4,736
         LayerNorm-2         [-1, 128, 128, 32]              64
 OverlapPatchEmbed-3         [-1, 128, 128, 32]               0
         LayerNorm-4            [-1, 16384, 32]              64
            Linear-5            [-1, 16384, 32]           1,056
            Conv2d-6           [-1, 32, 16, 16]          65,568
         LayerNorm-7              [-1, 256, 32]              64
            Linear-8              [-1, 256, 64]           2,112
            Linear-9            [-1, 16384, 32]           1,056
          Dropout-10            [-1, 16384, 32]               0
        Attention-11            [-1, 16384, 32]               0
         Identity-12            [-1, 16384, 32]               0
        LayerNorm-13            [-1, 16384, 32]              64
           Linear-14           [-1, 16384, 256]           8,448
         Identity-15        [-1, 256, 128, 128]               0
           Conv2d-16        [-1, 256, 128, 128]           2,560
             GELU-17           [-1, 16384, 256]               0
          Dropout-18           [-1, 16384, 256]               0
           Linear-19            [-1, 16384, 32]           8,224
          Dropout-20            [-1, 16384, 32]               0
MlpWithDepthwiseConv-21            [-1, 16384, 32]               0
         Identity-22            [-1, 16384, 32]               0
            Block-23            [-1, 16384, 32]               0
        LayerNorm-24            [-1, 16384, 32]              64
           Linear-25            [-1, 16384, 32]           1,056
           Conv2d-26           [-1, 32, 16, 16]          65,568
        LayerNorm-27              [-1, 256, 32]              64
           Linear-28              [-1, 256, 64]           2,112
           Linear-29            [-1, 16384, 32]           1,056
          Dropout-30            [-1, 16384, 32]               0
        Attention-31            [-1, 16384, 32]               0
         Identity-32            [-1, 16384, 32]               0
        LayerNorm-33            [-1, 16384, 32]              64
           Linear-34           [-1, 16384, 256]           8,448
         Identity-35        [-1, 256, 128, 128]               0
           Conv2d-36        [-1, 256, 128, 128]           2,560
             GELU-37           [-1, 16384, 256]               0
          Dropout-38           [-1, 16384, 256]               0
           Linear-39            [-1, 16384, 32]           8,224
          Dropout-40            [-1, 16384, 32]               0
MlpWithDepthwiseConv-41            [-1, 16384, 32]               0
         Identity-42            [-1, 16384, 32]               0
            Block-43            [-1, 16384, 32]               0
        LayerNorm-44            [-1, 16384, 32]              64
PyramidVisionTransformerStage-45         [-1, 32, 128, 128]               0

for 640x480


 Conv2d-1         [-1, 32, 160, 120]           4,736
         LayerNorm-2         [-1, 160, 120, 32]              64
 OverlapPatchEmbed-3         [-1, 160, 120, 32]               0
         LayerNorm-4            [-1, 19200, 32]              64
            Linear-5            [-1, 19200, 32]           1,056
            Conv2d-6           [-1, 32, 20, 15]          65,568
         LayerNorm-7              [-1, 300, 32]              64
            Linear-8              [-1, 300, 64]           2,112
            Linear-9            [-1, 19200, 32]           1,056
          Dropout-10            [-1, 19200, 32]               0
        Attention-11            [-1, 19200, 32]               0
         Identity-12            [-1, 19200, 32]               0
        LayerNorm-13            [-1, 19200, 32]              64
           Linear-14           [-1, 19200, 256]           8,448
         Identity-15        [-1, 256, 160, 120]               0
           Conv2d-16        [-1, 256, 160, 120]           2,560
             GELU-17           [-1, 19200, 256]               0
          Dropout-18           [-1, 19200, 256]               0
           Linear-19            [-1, 19200, 32]           8,224
          Dropout-20            [-1, 19200, 32]               0
MlpWithDepthwiseConv-21            [-1, 19200, 32]               0
         Identity-22            [-1, 19200, 32]               0
            Block-23            [-1, 19200, 32]               0
        LayerNorm-24            [-1, 19200, 32]              64
           Linear-25            [-1, 19200, 32]           1,056
           Conv2d-26           [-1, 32, 20, 15]          65,568
        LayerNorm-27              [-1, 300, 32]              64
           Linear-28              [-1, 300, 64]           2,112
           Linear-29            [-1, 19200, 32]           1,056
          Dropout-30            [-1, 19200, 32]               0
        Attention-31            [-1, 19200, 32]               0
         Identity-32            [-1, 19200, 32]               0
        LayerNorm-33            [-1, 19200, 32]              64
           Linear-34           [-1, 19200, 256]           8,448
         Identity-35        [-1, 256, 160, 120]               0
           Conv2d-36        [-1, 256, 160, 120]           2,560
             GELU-37           [-1, 19200, 256]               0
          Dropout-38           [-1, 19200, 256]               0
           Linear-39            [-1, 19200, 32]           8,224
          Dropout-40            [-1, 19200, 32]               0
MlpWithDepthwiseConv-41            [-1, 19200, 32]               0
         Identity-42            [-1, 19200, 32]               0
            Block-43            [-1, 19200, 32]               0
        LayerNorm-44            [-1, 19200, 32]              64
PyramidVisionTransformerStage-45         [-1, 32, 160, 120]

And although the model is happy to accept any image i throw at it,I would be happy if some guidance can be provided as to why the model isn't complaining on accepting variable image sizes.

To summarize ,are there any visual transformer models provided in the timm library which are capable of accepting variable image sizes.I tried the same cases with SWIN ,and was able to confirm that models like SWIN and VIT can't accept images larger than 384.
Is that assumption correct.Secondly does PyramidVIT not have any such constraints on the image size,and finally,are there any plans on introducing multiscale long former in timm.

rwightman · 2024-03-05T01:38:45Z

rwightman
Mar 5, 2024
Maintainer

Most vit or similar models can take an img_size= arg on init that will change the original 'fixed' resolution to a different fixed resolution, meaning it resizes position embeddings once when you create and load pretrained weights, and then from there you can use with higher res.

A number of vit-like models that don't have fixed size position embeddings, or have extra code to resize on the fly can take any input size. Pyramid (pvt) is one of those, davit, and a number of others.

I added dynamic resizing to standard vits not long ago...

dynamic image resize & padding (accepts any image size as long as size is consistent per batch). Will pad with black borders on right/bottom edge if img size not a multiple of patch size: create_model('vit_base_patch16_224', pretrained=True, dynamic_img_size=True, dynamic_img_pad=True)
dynamic image resize. Accepts any image size that's a multiple of patch size: create_model('vit_base_patch16_224', pretrained=True, dynamic_img_size=True)
static resize. Resize model position embeddings once on init to a new size and will then accept only that size: create_model('vit_base_patch16_224', pretrained=True, img_size=512)

The third option works for a lot of models with fixed size. A tuple can usually be passed for non-square images, same can be done for patch_size or window_size on some models. timm has more support for non-square aspects than other impl I've seen as it's often skipped.

0 replies

sparshgarg23 · 2024-03-05T04:39:47Z

sparshgarg23
Mar 5, 2024
Author

Thanks ross,based on your answer,shall I assume that as of now VIT and SWIN aren't able to handle dynamic sizes.
Also any plans of introducing longformer to timm?

4 replies

rwightman Mar 5, 2024
Maintainer

timm vit can dynamic resize as per options 1 & 2. Swin can only do static resize but it's possible to resize image, patch and window size in the timm swin, non-square as well, which isn't possible elsewhere. Dynamic resize for swin would be a fair bit of fiddling and require caching multiple sets of position indices, but it is possible.

rwightman Mar 5, 2024
Maintainer

I don't feel longformer would be a big benefit, at least didnt look like it when I checked. The flash attention support via pytorch F.sdpa does make a pretty big impact for vit at higher resolution. Improves the scaling quite a bit for longer seq len.

sparshgarg23 Mar 5, 2024
Author

one last thing,so apart from longformer would you recommend any other architecture in transformer family for high resolution tasks like segmentation??

rwightman Mar 6, 2024
Maintainer

Hmm, that's a tough one, swin used to have better resolution scaling than vit, but if you try vit with F.sdpa in latest pytorch versions you'll see that the fused mem efficient / flash kernels actually bring vit comparable with swin as the resolution increases. Convolutional arch are still better as you crank up the res in terms of scaling, convnext can be a nice balance (still has some traits like vits, such as all LayerNorm). Maybe some of the hybrids that put conv blocks in early stages and transformers in later .. there are a number of those, CoAtNet, and some of the 'efficient' vit/former models...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pyramid VIT can work on a 512x512 image??? #2104

{{title}}

Replies: 2 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Pyramid VIT can work on a 512x512 image??? #2104

sparshgarg23 Mar 3, 2024

Replies: 2 comments · 4 replies

rwightman Mar 5, 2024 Maintainer

sparshgarg23 Mar 5, 2024 Author

rwightman Mar 5, 2024 Maintainer

rwightman Mar 5, 2024 Maintainer

sparshgarg23 Mar 5, 2024 Author

rwightman Mar 6, 2024 Maintainer

sparshgarg23
Mar 3, 2024

Replies: 2 comments 4 replies

rwightman
Mar 5, 2024
Maintainer

sparshgarg23
Mar 5, 2024
Author

rwightman Mar 5, 2024
Maintainer

rwightman Mar 5, 2024
Maintainer

sparshgarg23 Mar 5, 2024
Author

rwightman Mar 6, 2024
Maintainer