
ViTencoder input #15

Open
Followmeczx opened this issue Apr 18, 2024 · 4 comments

@Followmeczx commented Apr 18, 2024

I found that the front/back normal maps are also used, together with the image, as input to the encoder to generate the three-plane features. I want to know why. Does this improve the results?
[screenshot]
Reading the code, I found that after the three-plane feature maps are obtained, they are concatenated with the normal features.
[screenshot]
Could I instead feed only the image through ViTPose's pre-trained ViT encoder to get image features, then pass those through the three decoders to get the three-plane features and splice them with the normal features? Would that work?
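
To make the question concrete, here is a minimal PyTorch sketch of the pipeline I have in mind (all shapes and module choices are my assumptions, not the repo's actual code):

```python
import torch

# Assumed ViTPose-style encoder output: batch of 2, 192 patch tokens, dim 1024
img_tokens = torch.randn(2, 192, 1024)

# Three decoders, one per plane (a bare Linear stands in for the real decoder)
decoders = [torch.nn.Linear(1024, 256) for _ in range(3)]
plane_feats = [dec(img_tokens) for dec in decoders]  # each (2, 192, 256)

# Normal feature from a separate branch (shape assumed to match the planes)
normal_feat = torch.randn(2, 192, 256)

# The "splice": concatenate each plane feature with the normal feature
fused = [torch.cat([p, normal_feat], dim=-1) for p in plane_feats]  # each (2, 192, 512)
```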

@River-Zhang (Owner)

Yes, it can improve the performance a little. Since normal maps of the input image are easy to acquire (using a pre-trained normal estimation model), we also feed the front/back normal images to the encoder. However, if you want to input only a single image, the whole model would have to be retrained.
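
As a rough sketch of why retraining would be needed: if the front/back normals are stacked with the RGB image along the channel axis (the exact layout here is my assumption), the encoder's first layer is sized for that wider input and cannot simply accept a 3-channel image:

```python
import torch

rgb      = torch.randn(1, 3, 512, 512)  # input image
normal_F = torch.randn(1, 3, 512, 512)  # front normal map (from a pre-trained estimator)
normal_B = torch.randn(1, 3, 512, 512)  # back normal map

# 9-channel encoder input; the patch-embedding/conv weights are trained for
# 9 input channels, so dropping the normals changes the layer shape.
encoder_input = torch.cat([rgb, normal_F, normal_B], dim=1)  # (1, 9, 512, 512)
```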

@Followmeczx (Author)

I have one more question. Since I use the ViTPose pre-trained model, the input image resolution is (256, 192) and the final feature dim is 1024, so the encoder produces an output of size 192x1024. I would like to use your method afterwards, but your ViT model produces a 1024x256 output. Do I need to change the image resolution to (512, 512) and the feature dim to 256?
[screenshot]
Or could I simply change image_size and dim to 1024? I'm not sure whether this would affect the later stages. It also seems that ICON expects a (512, 512) resolution.
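
For what it's worth, both output shapes are consistent with a standard 16x16 patch embedding in the two models (my assumption), and the two 1024s mean different things (embedding dim vs. token count):

```python
def vit_tokens(h, w, patch=16):
    """Number of patch tokens a ViT produces for an h x w input."""
    return (h // patch) * (w // patch)

print(vit_tokens(256, 192))  # 192  -> ViTPose output is (192 tokens, dim 1024)
print(vit_tokens(512, 512))  # 1024 -> your model's output is (1024 tokens, dim 256)
```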

@River-Zhang (Owner)

If you just want to use our model for inference, you can simply input the image and the script will automatically resize it to (512, 512). However, if you want to use it for training, you will need to change the parameters and retrain the model. I'm not sure what you mean by saying you used the ViTPose pre-trained model.
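
A minimal sketch of the resize step (assuming plain bilinear resizing; the actual script may crop or pad as well, and "input.png" is a placeholder path):

```python
from PIL import Image

img = Image.open("input.png").convert("RGB")
img = img.resize((512, 512), Image.BILINEAR)  # match the model's expected (512, 512) input
```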

@Followmeczx (Author) commented Apr 23, 2024

I have one more question.
[screenshot]
I can't seem to find the code in PIFuDataset that shows how the sample points, labels, and calib are obtained.
What information does the calib parameter contain? Is it the rotation and translation from the extrinsic parameters plus the focal length and principal point from the intrinsic parameters?
I am using the Human3.6M dataset, and I found that its camera parameters include only the intrinsic focal length and principal point. Can you give me some advice?
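
For reference, here is my current understanding of a PIFu-style calib, assuming it is the usual 4x4 matrix formed as intrinsic @ extrinsic so that query points can be projected in one multiply (all numeric values below are placeholders, not Human3.6M values):

```python
import numpy as np

# Extrinsic: world -> camera, from rotation R and translation t
R, t = np.eye(3), np.zeros(3)
extrinsic = np.eye(4)
extrinsic[:3, :3] = R
extrinsic[:3, 3] = t

# Intrinsic: focal lengths and principal point (the only parameters Human3.6M gives me)
fx, fy, cx, cy = 1000.0, 1000.0, 256.0, 256.0
intrinsic = np.eye(4)
intrinsic[0, 0], intrinsic[1, 1] = fx, fy
intrinsic[0, 2], intrinsic[1, 2] = cx, cy

# One matrix that maps world-space sample points to image coordinates
calib = intrinsic @ extrinsic

pts = np.random.rand(8000, 3)                              # sampled query points
pts_h = np.concatenate([pts, np.ones((8000, 1))], axis=1)  # homogeneous coords
proj = (calib @ pts_h.T).T
xy = proj[:, :2] / proj[:, 2:3]                            # perspective divide
```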
