
ViTencoder input #15

Open
Followmeczx opened this issue Apr 18, 2024 · 4 comments

@Followmeczx commented Apr 18, 2024

I found that the front/back normal maps are also used, together with the image, as input to the encoder to generate the three-plane features. I want to know why. Does this improve the results?
[screenshot]
Reading the code, I found that after the three-plane feature maps are obtained, they are concatenated with the normal features.
[screenshot]
Could I instead feed only the image through ViTPose's pre-trained ViT encoder to get image features, then pass those through the three decoders to get the three-plane features and splice them with the normal features? Would that work?
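
To make the question concrete, here is a minimal PyTorch sketch of the pipeline I have in mind (all shapes and module choices are my assumptions, not the repo's actual code):

```python
import torch

# Assumed ViTPose-style encoder output: batch of 2, 192 patch tokens, dim 1024
img_tokens = torch.randn(2, 192, 1024)

# Three decoders, one per plane (a bare Linear stands in for the real decoder)
decoders = [torch.nn.Linear(1024, 256) for _ in range(3)]
plane_feats = [dec(img_tokens) for dec in decoders]  # each (2, 192, 256)

# Normal feature from a separate branch (shape assumed to match the planes)
normal_feat = torch.randn(2, 192, 256)

# The "splice": concatenate each plane feature with the normal feature
fused = [torch.cat([p, normal_feat], dim=-1) for p in plane_feats]  # each (2, 192, 512)
```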

@River-Zhang (Owner)

Yes, it can improve the performance a little. Since normal maps of the input image are easy to acquire (using a pre-trained normal estimation model), we also feed the front/back normal images to the encoder. However, if you want to input only a single image, the whole model would have to be retrained.
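
As a rough sketch of why retraining would be needed: if the front/back normals are stacked with the RGB image along the channel axis (the exact layout here is my assumption), the encoder's first layer is sized for that wider input and cannot simply accept a 3-channel image:

```python
import torch

rgb      = torch.randn(1, 3, 512, 512)  # input image
normal_F = torch.randn(1, 3, 512, 512)  # front normal map (from a pre-trained estimator)
normal_B = torch.randn(1, 3, 512, 512)  # back normal map

# 9-channel encoder input; the patch-embedding/conv weights are trained for
# 9 input channels, so dropping the normals changes the layer shape.
encoder_input = torch.cat([rgb, normal_F, normal_B], dim=1)  # (1, 9, 512, 512)
```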

@Followmeczx (Author)

I have one more question. Since I use the ViTPose pre-trained model, the input image resolution is (256, 192) and the final feature dim is 1024, so the encoder produces an output of size 192x1024. I would like to use your method afterwards, but your ViT model produces a 1024x256 output. Do I need to change the image resolution to (512, 512) and the feature dim to 256?
[screenshot]
Or could I simply change image_size and dim to 1024? I'm not sure whether this would affect the later stages. It also seems that ICON expects a (512, 512) resolution.
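
For what it's worth, both output shapes are consistent with a standard 16x16 patch embedding in the two models (my assumption), and the two 1024s mean different things (embedding dim vs. token count):

```python
def vit_tokens(h, w, patch=16):
    """Number of patch tokens a ViT produces for an h x w input."""
    return (h // patch) * (w // patch)

print(vit_tokens(256, 192))  # 192  -> ViTPose output is (192 tokens, dim 1024)
print(vit_tokens(512, 512))  # 1024 -> your model's output is (1024 tokens, dim 256)
```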

@River-Zhang (Owner)

If you just want to use our model for inference, you can simply input the image and the script will automatically resize it to (512, 512). However, if you want to use it for training, you will need to change the parameters and retrain the model. I'm not sure what you mean by saying you used the ViTPose pre-trained model.
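
A minimal sketch of the resize step (assuming plain bilinear resizing; the actual script may crop or pad as well, and "input.png" is a placeholder path):

```python
from PIL import Image

img = Image.open("input.png").convert("RGB")
img = img.resize((512, 512), Image.BILINEAR)  # match the model's expected (512, 512) input
```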

@Followmeczx (Author) commented Apr 23, 2024

I have one more question.
[screenshot]
I can't seem to find the code in PIFuDataset that shows how the sample points, labels, and calib are obtained.
What information does the calib parameter contain? Is it the rotation and translation from the extrinsic parameters plus the focal length and principal point from the intrinsic parameters?
I am using the Human3.6M dataset, and I found that its camera parameters include only the intrinsic focal length and principal point. Can you give me some advice?
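
For reference, here is my current understanding of a PIFu-style calib, assuming it is the usual 4x4 matrix formed as intrinsic @ extrinsic so that query points can be projected in one multiply (all numeric values below are placeholders, not Human3.6M values):

```python
import numpy as np

# Extrinsic: world -> camera, from rotation R and translation t
R, t = np.eye(3), np.zeros(3)
extrinsic = np.eye(4)
extrinsic[:3, :3] = R
extrinsic[:3, 3] = t

# Intrinsic: focal lengths and principal point (the only parameters Human3.6M gives me)
fx, fy, cx, cy = 1000.0, 1000.0, 256.0, 256.0
intrinsic = np.eye(4)
intrinsic[0, 0], intrinsic[1, 1] = fx, fy
intrinsic[0, 2], intrinsic[1, 2] = cx, cy

# One matrix that maps world-space sample points to image coordinates
calib = intrinsic @ extrinsic

pts = np.random.rand(8000, 3)                              # sampled query points
pts_h = np.concatenate([pts, np.ones((8000, 1))], axis=1)  # homogeneous coords
proj = (calib @ pts_h.T).T
xy = proj[:, :2] / proj[:, 2:3]                            # perspective divide
```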
