Pyramid VIT can work on a 512x512 image??? #2104
-
A number of vit-like models that don't have fixed size position embeddings, or have extra code to resize them on the fly, can take any input size. Pyramid (pvt) is one of those; so are davit and a number of others. I added dynamic resizing to standard vits not long ago...
The third option works for a lot of models with fixed sizes. A tuple can usually be passed for non-square images; the same can be done for patch_size or window_size on some models.
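The size arithmetic behind these constraints can be sketched in a few lines. This is a minimal illustration, not timm's actual code: the function names are made up here, and the defaults are the usual ViT-B/16 (patch 16) and Swin (patch 4, window 7, four stages each halving resolution) settings. A plain ViT's position-embedding table is tied to one patch grid unless it is resized; window attention additionally needs every stage's feature map to tile evenly into windows.

```python
# Sketch only: illustrates the size constraints, not timm internals.

def vit_patch_grid(img_size, patch_size=16):
    """Patch grid for a plain ViT; the pos-embed table must match this grid."""
    assert img_size % patch_size == 0, "image must be divisible by patch size"
    n = img_size // patch_size
    return n, n * n  # (grid side, number of patches)

def swin_windows_tile(img_size, patch_size=4, window=7, stages=4):
    """True if every Swin stage's feature map divides evenly into windows."""
    side = img_size // patch_size
    for _ in range(stages):
        if side % window != 0:
            return False
        side //= 2
    return True

# 224 -> 14x14 = 196 patches: matches a 224-pretrained pos-embed table.
print(vit_patch_grid(224))        # (14, 196)
# 512 -> 32x32 = 1024 patches: works only if pos embeds are resized.
print(vit_patch_grid(512))        # (32, 1024)
# Swin at 224: stage sides 56, 28, 14, 7 -- all divisible by 7.
print(swin_windows_tile(224))     # True
# Swin at 512: 128 is not divisible by 7, so the windows don't tile.
print(swin_windows_tile(512))     # False
```

This is also why a tuple for non-square inputs can work: each dimension just has to satisfy the same divisibility checks independently.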
-
Thanks Ross. Based on your answer, shall I assume that, as of now, ViT and Swin aren't able to handle dynamic sizes?
-
So, I am currently playing with timm's ViT model family and I was wondering: do the ViT models have a constraint on the image size? For example, Swin and ViT can only work on inputs of dimensions 224x224, but what if we wanted to use these pre-trained models on images of larger resolution, say 512 or 1024? In that case, are there any models in timm that can be useful?
I came across something called multiscale longformer (https://github.com/microsoft/vision-longformer). If I were to go for a CNN-based approach I would easily choose EfficientNet/HRNet/CSPNet, as these models have been known to give decent results on my own data.
However, since I am interested in learning more about transformers, I would like to know if there are any models in the vision transformer family which can accept high-resolution images.
PS: On a whim, I decided to use the Pyramid ViT and obtain the model summary with different image sizes, varying from the standard 224x224 up to 640x480 (COCO image size). The model summary up to the first Pyramid Vision Transformer stage is shown below for each image size: 224x224, 512x512, 640x480.
(Model summaries for 224x224, 512x512, and 640x480 inputs omitted.)
And although the model is happy to accept any image I throw at it, I would appreciate some guidance as to why the model isn't complaining about variable image sizes.
To summarize: are there any vision transformer models in the timm library which are capable of accepting variable image sizes? I tried the same cases with Swin, and was able to confirm that models like Swin and ViT can't accept images larger than 384.
Is that assumption correct? Secondly, does Pyramid ViT not have any such constraints on the image size? And finally, are there any plans to introduce multiscale longformer in timm?
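One observation from those summaries: PVT's patch embeddings are strided convolutions, so each stage's feature-map size is derived from the actual input rather than hard-coded, which is consistent with the model accepting whatever size is fed in. A small sketch of that arithmetic (the stage strides below are PVT's usual defaults of 4, 2, 2, 2; the function name is made up for illustration):

```python
# Sketch: spatial size of each PVT stage's feature map for a given input.
# Strided-conv patch embeds mean no fixed token count is baked in.

def pvt_stage_sizes(h, w, strides=(4, 2, 2, 2)):
    """Return (height, width) of the feature map after each stage."""
    sizes = []
    for s in strides:
        h, w = h // s, w // s
        sizes.append((h, w))
    return sizes

print(pvt_stage_sizes(224, 224))  # [(56, 56), (28, 28), (14, 14), (7, 7)]
print(pvt_stage_sizes(512, 512))  # [(128, 128), (64, 64), (32, 32), (16, 16)]
print(pvt_stage_sizes(640, 480))  # [(160, 120), (80, 60), (40, 30), (20, 15)]
```

The per-stage shapes change with the input, which matches what the model summaries above show for 224x224, 512x512, and 640x480.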