train the PTv3-object model with dinov2 #8

Open
TSlus opened this issue Dec 7, 2024 · 3 comments

@TSlus
TSlus commented Dec 7, 2024

Hi, thanks for your amazing work! I have two questions:

  1. In your paper you say, "We first pre-train 3D backbone PTv3-object on 3D large-scale data Objaverse, distilling visual features from FeatUp-DINOv2." Do you provide this part of the training code (including how to generate the DINO features)? I want to train on a custom dataset instead of using the pre-trained model directly.
  2. Training the light-weight MLPs to distill 2D masks into scale-conditioned grouping requires SAM results as supervision. But it seems you did not use them when training on the Objaverse data (knight). Is that right?

Thanks for your reply.

@yhyang-myron
Member

Hi, thanks for your interest in our work.

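  1. The code is like this:

import torch
import torch.nn.functional as F

# __init__: frozen FeatUp-DINOv2 image encoder used as the 2D teacher
self.Encoder_2d = torch.hub.load("mhamilton723/FeatUp", 'dinov2').cuda().eval()

# forward: per-point features from the 3D backbone
point = self.backbone(pcd_dict)
point_feat = point.feat
frames_per_mesh = input_dict["frames_per_mesh"]
imgs = input_dict["imgs"]
B = imgs.shape[0]  # batch size
V = imgs.shape[1]  # views per batch element
link = input_dict["link"]  # point-to-pixel correspondences, indexed below as link[:, k, v]

# The 2D features are only targets, so no gradients are needed here
with torch.no_grad():
    img_feat_list = []
    for i in range(B):
        img_f = self.Encoder_2d(imgs[i])
        img_f = F.interpolate(img_f, size=(512, 512))
        img_feat_list.append(img_f)
    # Stack once, after all batch elements have been encoded
    img_feat = torch.stack(img_feat_list, dim=0)

# Sum the pixel features each point receives across views and count the contributing views
frame_pcd_feat = torch.zeros_like(point_feat)
frame_pcd_feat_flag = torch.zeros((point_feat.shape[0], 1)).cuda()
for v in range(V):
    # link[:, 0, v]: batch index; link[:, 2, v] and link[:, 1, v]: pixel row and column
    f = img_feat[link[:, 0, v], v, :, link[:, 2, v], link[:, 1, v]]
    # link[:, 3, v]: per-view flag that zeroes the feature where the point has no valid pixel
    f *= link[:, 3, v].unsqueeze(dim=1).float()
    frame_pcd_feat += f
    frame_pcd_feat_flag += link[:, 3, v].unsqueeze(dim=1)

# Average over the contributing views; points seen in no view are excluded from the loss
frame_pcd_feat_mask = (frame_pcd_feat_flag > 0).squeeze(-1)
frame_pcd_feat[frame_pcd_feat_mask] = frame_pcd_feat[frame_pcd_feat_mask] / frame_pcd_feat_flag[frame_pcd_feat_mask]

# Regress the 3D point features onto the averaged 2D features
loss = F.mse_loss(point_feat[frame_pcd_feat_mask], frame_pcd_feat[frame_pcd_feat_mask])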
  2. We use SAM results in the MLPs training stage.
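
For anyone who wants to check the shapes without the backbone or the FeatUp encoder, here is a minimal standalone sketch of the same masked multi-view averaging and MSE loss on random tensors. The function name `distill_loss`, the toy sizes, and the assumed `link` layout of `(N, 4, V)` with `[batch index, column, row, visible flag]` are illustrative guesses read off the indexing in the snippet above, not part of the released code.

```python
import torch
import torch.nn.functional as F


def distill_loss(point_feat, img_feat, link):
    """Average 2D features over visible views, then MSE against 3D features.

    point_feat: (N, C) per-point features (stand-in for the backbone output).
    img_feat:   (B, V, C, H, W) per-view feature maps (stand-in for the teacher).
    link:       (N, 4, V) long tensor, assumed layout per point and view:
                [batch index, pixel column, pixel row, visible flag].
    """
    N, C = point_feat.shape
    V = img_feat.shape[1]
    target = torch.zeros(N, C, dtype=point_feat.dtype)
    count = torch.zeros(N, 1, dtype=point_feat.dtype)
    for v in range(V):
        b, col, row = link[:, 0, v], link[:, 1, v], link[:, 2, v]
        vis = link[:, 3, v].unsqueeze(1).to(point_feat.dtype)
        view_feat = img_feat[:, v]        # (B, C, H, W)
        f = view_feat[b, :, row, col]     # one pixel feature per point: (N, C)
        target += f * vis
        count += vis
    mask = count.squeeze(1) > 0           # points seen in at least one view
    target[mask] = target[mask] / count[mask]
    loss = F.mse_loss(point_feat[mask], target[mask])
    return loss, target, mask


# Toy data: 1 batch element, 2 views, 4 channels, 8x8 feature maps, 5 points.
torch.manual_seed(0)
B, V, C, H, W, N = 1, 2, 4, 8, 8, 5
img_feat = torch.randn(B, V, C, H, W)
point_feat = torch.randn(N, C, requires_grad=True)
link = torch.zeros(N, 4, V, dtype=torch.long)
link[:, 1] = torch.randint(0, W, (N, V))  # pixel columns
link[:, 2] = torch.randint(0, H, (N, V))  # pixel rows
link[:, 3] = 1                            # visible everywhere by default
link[0, 3, :] = 0                         # point 0 is visible in no view
link[1, 3, 1] = 0                         # point 1 is visible only in view 0
loss, target, mask = distill_loss(point_feat, img_feat, link)
loss.backward()
```

In this toy run, point 0 is masked out and so receives zero gradient, which mirrors how the original snippet only supervises points that land on at least one image.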

@htuann2712

Could you release the full dataset class code for this PTv3-object distillation stage?
I am planning to implement your method, but I am a bit confused about "link" in the code you provided above.

Many thanks for your reply.

@yhyang-myron
Member

SAMPart3D_pretrain_dataset.txt
Hi, this is the dataset we used.
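
The attachment is not reproduced in this thread. As a rough illustration only, a `link` tensor with the layout the training snippet seems to index (`[batch index, column, row, visible flag]` per point and view) could be built by projecting the points into each view with a pinhole camera. Everything in this sketch is an assumption: the function name `build_link`, the intrinsics `K`, the `world_to_cam` matrices, and the lack of any occlusion test are hypothetical and may differ from what the attached dataset class does.

```python
import torch


def build_link(points, K, world_to_cam, H, W, batch_index=0):
    """Project points into V views and record point-to-pixel correspondences.

    points:       (N, 3) world-space coordinates.
    K:            (V, 3, 3) pinhole intrinsics (assumed).
    world_to_cam: (V, 4, 4) extrinsics (assumed).
    Returns a (N, 4, V) long tensor: [batch index, column, row, visible flag].
    Note: this checks only the image bounds and depth sign, not occlusion.
    """
    N, V = points.shape[0], K.shape[0]
    link = torch.zeros(N, 4, V, dtype=torch.long)
    homo = torch.cat([points, torch.ones(N, 1, dtype=points.dtype)], dim=1)
    for v in range(V):
        cam = (homo @ world_to_cam[v].T)[:, :3]   # camera-space xyz, (N, 3)
        z = cam[:, 2]
        pix = cam @ K[v].T                        # homogeneous pixel coords
        safe_z = z.clamp(min=1e-6)                # avoid dividing by <= 0
        col = torch.round(pix[:, 0] / safe_z).long()
        row = torch.round(pix[:, 1] / safe_z).long()
        valid = (z > 0) & (col >= 0) & (col < W) & (row >= 0) & (row < H)
        link[:, 0, v] = batch_index
        # Invalid entries point at pixel (0, 0) so later gathers stay in bounds;
        # the visible flag of 0 removes their contribution.
        link[:, 1, v] = torch.where(valid, col, torch.zeros_like(col))
        link[:, 2, v] = torch.where(valid, row, torch.zeros_like(row))
        link[:, 3, v] = valid.long()
    return link


# One identity camera looking down +z, focal length 10, principal point (4, 4).
K = torch.tensor([[[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]]])
world_to_cam = torch.eye(4).unsqueeze(0)
points = torch.tensor([
    [0.0, 0.0, 1.0],    # projects to the image centre
    [0.1, -0.2, 1.0],   # projects inside the image
    [0.0, 0.0, -1.0],   # behind the camera
    [1.0, 0.0, 1.0],    # projects outside the 8x8 image
])
link = build_link(points, K, world_to_cam, H=8, W=8)
```

A real pipeline would also need a depth or visibility test so that points hidden behind other geometry are not linked to pixels that show a different surface.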
