
The MeshTransformer does not generate coherent results #18

Closed · Kurokabe opened this issue Dec 16, 2023 · 15 comments

@Kurokabe
Contributor

I have trained the MeshTransformer on 200 different meshes from the chair category of ShapeNet, after decimation and filtering for meshes with fewer than 400 vertices and faces. The MeshTransformer reached a loss very close to 0.
image
But when I call the generate method of the MeshTransformer, I get very bad results.
From left to right: ground truth, autoencoder output, and the MeshTransformer's generated mesh with temperatures of 0, 0.1, 0.7, and 1. This is with meshgpt-pytorch version 0.3.3.
image
Note: the MeshTransformer was not conditioned on text or anything else, so the output is not supposed to look exactly like the sofa, but it barely looks like a chair. We can guess at the backrest and the legs, but that's it.

Initially I thought that there might have been an error with the KV cache so here are the results with cache_kv=False:
image
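
For reference, this is roughly how I'm generating (a minimal sketch; transformer is the trained MeshTransformer, checkpoint loading omitted):

import torch

temperatures = [0.0, 0.1, 0.7, 1.0]

with torch.no_grad():
    # default path, kv cache enabled
    outputs = [transformer.generate(temperature = t) for t in temperatures]

    # same sweep with the kv cache disabled, to rule out a caching bug
    outputs_no_cache = [
        transformer.generate(temperature = t, cache_kv = False)
        for t in temperatures
    ]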

And here are the results with meshgpt-pytorch version 0.2.11:
image

When I trained on a single chair with a version before 0.2.11, the generate method was able to create a coherent chair (from left to right: ground truth, autoencoder output, meshtransformer.generate()):

comparisons

Why are the generated results so bad even though the transformer loss was very low?

I have uploaded the autoencoder and MeshTransformer checkpoints (version 0.3.3), as well as 10 data samples, here: https://file.io/nNsfTyHX4aFB

Also, a quick question: why rewrite the transformer from scratch instead of using the HuggingFace GPT-2 transformer?

@fire

This comment was marked as outdated.

@Kurokabe
Contributor Author

What is your autoencoder loss?

Around 0.35
image

The autoencoder is able to reconstruct the input accurately, so I don't understand why the MeshTransformer is not able to create a coherent chair. Do you also get similar results?

@fire

fire commented Dec 16, 2023

I am currently on an old version of the codebase, so I don't know. Something probably broke.

@MarcusLoppe
Contributor

MarcusLoppe commented Dec 16, 2023

I have trained the MeshTransformer on 200 different meshes from the chair category of ShapeNet, after decimation and filtering for meshes with fewer than 400 vertices and faces. The MeshTransformer reached a loss very close to 0.

Why are the generated results so bad even though the transformer loss was very low?

I have uploaded the autoencoder and MeshTransformer checkpoints (version 0.3.3), as well as 10 data samples, here: https://file.io/nNsfTyHX4aFB

The issue is this: not enough data.

In the PolyGen and MeshGPT papers they stress that they didn't have enough training data, using only 28,000 mesh models.
They needed to augment those with, let's say, 20 augmentations each, which means they trained on 560,000 mesh models.
But since it seems like you are not using the texts, you can try feeding the transformer a prompt of 10-30 connected faces of a model and see what happens (like in the paper); it should act as an autocomplete (see the sketch below).
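
Something like this (a rough sketch; assumes autoencoder.tokenize as in the readme and that generate accepts a prompt of codes; the exact shape of the codes may differ between versions):

# prompt the transformer with the codes of the first ~20 faces of a known
# mesh, so it acts as an autocomplete
# `vertices` / `faces` are the tensors of one training mesh
codes = autoencoder.tokenize(vertices = vertices, faces = faces)

tokens_per_face = transformer.num_quantizers * 3   # 3 vertices per face, one code per quantizer per vertex
prompt = codes.flatten()[:20 * tokens_per_face]    # keep the first 20 faces

completed = transformer.generate(prompt = prompt.unsqueeze(0))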

The loss of the transformer should be below 0.0001 for successful generations.

Here is a rough idea of the amount of data you should use:
20 augmentations × 100 duplicates of each augmentation × 200 models = 400,000 examples per dataset.

I recommend that you create a trainer that uses epochs instead of steps (or take a look at the one in my fork), since printing out 400k steps will slow down the training; a bare-bones version is sketched below.
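
Something in this spirit (a minimal sketch, not the actual trainer in my fork; assumes a dataset yielding 'vertices' / 'faces' tensors and that the transformer's forward returns the loss, as in the readme; padding/collation of variable-sized meshes is omitted):

from torch.optim import Adam
from torch.utils.data import DataLoader

optimizer = Adam(transformer.parameters(), lr = 1e-4)
loader = DataLoader(dataset, batch_size = 64, shuffle = True)

num_epochs = 30
for epoch in range(num_epochs):
    total_loss = 0.0
    for batch in loader:
        loss = transformer(vertices = batch['vertices'], faces = batch['faces'])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    # one line per epoch instead of one line per step
    print(f'epoch {epoch + 1}/{num_epochs} avg loss: {total_loss / len(loader):.6f}')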

Then train on this for a day or two and use a large batch size (less than 64) to promote generalization.

In the paper they used 28,000 3D models. Let's say they generated 10 augmentations per model and then used 10 duplicates of each, since it's more effective to train with a big batch size of 64; with a small number of models per dataset, training is less effective and you waste the parallelism of GPUs.
This means: 10 × 10 × 28,000 = 2,800,000 examples.
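
For the augmentations themselves, simple random rotations and scales per copy are enough (a minimal sketch, assuming vertices is an (N, 3) float tensor; faces are unchanged since only vertex positions move):

import math
import torch

def augment(vertices: torch.Tensor) -> torch.Tensor:
    # random rotation around the up (y) axis
    theta = torch.rand(1).item() * 2 * math.pi
    cos, sin = math.cos(theta), math.sin(theta)
    rot_y = torch.tensor([[ cos, 0., sin],
                          [ 0.,  1., 0. ],
                          [-sin, 0., cos]])
    # random uniform scale in [0.9, 1.1]
    scale = 0.9 + 0.2 * torch.rand(1).item()
    return (vertices @ rot_y.T) * scale

# e.g. 20 augmentations per model, each then duplicated 100 times
augments = [augment(vertices) for _ in range(20)]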

I want to stress this:
Overfitting on 1 model = super easy.
Training a model to be general enough for many different models = very hard.

Also, a quick question: why rewrite the transformer from scratch instead of using the HuggingFace GPT-2 transformer?

GPT-2 is quite old and there have been many improvements since, so it's not a great choice anymore.

@MarcusLoppe
Contributor

MarcusLoppe commented Dec 17, 2023

I have trained the MeshTransformer on 200 different meshes from the chair category of ShapeNet, after decimation and filtering for meshes with fewer than 400 vertices and faces. The MeshTransformer reached a loss very close to 0!

I did some testing using 10 3D chair meshes, applying augmentation so each chair got 3 variations.
Then I duplicated each variation 500 times, so the total dataset size is 3,000.
I encoded the meshes using text as well, but with just the same word 'chair' for all of them; still, this shows that text conditioning works.

After 22 minutes of training the autoencoder (0.24 loss) and 2.5 hours of training the transformer (0.0048 loss), I got the result below.
To generate complete models, I think a loss of about 0.0001 is needed.
I trained the transformer with different learning rates, but in total there were 30 epochs, i.e. 30 × 3,000 = 90,000 steps.

Training is very slow towards the end; I might need to increase the transformer's dim to 1024.

Epoch 5/10: 100%|██████████| 1500/1500 [06:24<00:00,  3.91it/s, loss=0.00494]
Epoch 5 average loss: 0.004852373525189857
Epoch 6/10: 100%|██████████| 1500/1500 [06:25<00:00,  3.89it/s, loss=0.00417]
Epoch 6 average loss: 0.004819516897356759
Epoch 7/10: 100%|██████████| 1500/1500 [06:24<00:00,  3.90it/s, loss=0.00501]
Epoch 7 average loss: 0.004833068791311234
Epoch 8/10: 100%|██████████| 1500/1500 [06:22<00:00,  3.93it/s, loss=0.005]  
Epoch 8 average loss: 0.004832435622811318

I provided the text "chair" and looped the generation over temperature values from 0 to 1.0 in steps of 0.1.
image
image

@fire

fire commented Dec 17, 2023

It appears to be using the text to choose from the learned mesh instances of the chair class?

@lucidrains
Owner

lucidrains commented Dec 17, 2023

@MarcusLoppe @fire next get some tables and chairs and see if it can learn to generate two separate classes! also, this isn't documented, but in order to do better text binding (once you scale up with more text variety), you can use classifier free guidance, a technique employed in a lot of denoising diffusion models (SD included): .generate(cond_scale = 3)
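
rough usage sketch:

# cond_scale > 1 amplifies the difference between text-conditioned and
# unconditioned logits, strengthening adherence to the text
meshes = transformer.generate(texts = ['chair'], cond_scale = 3.)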

@MarcusLoppe
Contributor

@MarcusLoppe @fire next get some tables and chairs and see if it can learn to generate two separate classes! also, this isn't documented, but in order to do better text binding (once you scale up with more text variety), you can use classifier free guidance, a technique employed in a lot of denoising diffusion models (SD included): .generate(cond_scale = 3)

Success! :)
I let it run during the night and got the result below. I think I might modify the encoder parameters; I used dim 768 for the transformer, which gives 236M parameters. I'll try decreasing and increasing the encoder's parameter count.

Autoencoder training: 6 h, 210 epochs, 756,000 steps: 0.9 loss
Transformer training: 3 h, 20 epochs: 0.00496 loss

Variations: 3, 100 examples each = 3,600 examples/steps per epoch


num_examples: 100
filtered/chair
103b75dfd146976563ed57e35c972b4b vertices 285 faces 171
112cee32461c31d1d84b8ba651dfb8ac vertices 360 faces 272
11347c7e8bc5881775907ca70d2973a4 vertices 208 faces 160

filtered/sofa
10f2a1cbaee4101896e12b33feac8da2 vertices 152 faces 100
126ed5982cdd56243b02598625ec1bf7 vertices 270 faces 212
12aec536f7d558f9342398ca9dc32672 vertices 244 faces 184

filtered/table
10bb44a54a12a74e4719088c8e42c6ab vertices 240 faces 152
10e6398274554867fdf2e93846e20960 vertices 216 faces 152
119a538325398df617b2b37d6988a89b vertices 192 faces 120

filtered/vessel
1b1cf4f2cc24a2a2a5895e3729304f68 vertices 548 faces 228
29c5c9924a3e1e2367585a906cb87a62 vertices 130 faces 156
4ac3edea6f7b3521cd71f832bc14be6f vertices 178 faces 166

Chosen models count for each category:
chair: 3
sofa: 3
table: 3
vessel: 3
Total number of chosen models: 12
Got 3600 data samples

image

When changing cond_scale it gave me this error:


File c:\Users\Username\AppData\Local\Programs\Python\Python311\Lib\site-packages\meshgpt_pytorch\meshgpt_pytorch.py:1048, in MeshTransformer.generate(self, prompt, batch_size, filter_logits_fn, filter_kwargs, temperature, return_codes, texts, text_embeds, cond_scale, cache_kv)
   1043 for i in tqdm(range(curr_length, self.max_seq_len)):
   1044     # v1([q1] [q2] [q1] [q2] [q1] [q2]) v2([eos| q1] [q2] [q1] [q2] [q1] [q2]) -> 0 1 2 3 4 5 6 7 8 9 10 11 12 -> v1(F F F F F F) v2(T F F F F F) v3(T F F F F F)
   1046     can_eos = i != 0 and divisible_by(i, self.num_quantizers * 3)  # only allow for eos to be decoded at the end of each face, defined as 3 vertices with D residual VQ codes
-> 1048     logits, new_cache = self.forward_on_codes(
   1049         codes,
   1050         cache = cache,
   1051         text_embeds = text_embeds,
   1052         return_loss = False,
   1053         return_cache = True,
   1054         append_eos = False,
   1055         cond_scale = cond_scale
   1056     )
   1058     if can_cache:
   1059         cache = new_cache

File c:\Users\Username\AppData\Local\Programs\Python\Python311\Lib\site-packages\classifier_free_guidance_pytorch\classifier_free_guidance_pytorch.py:146, in classifier_free_guidance.<locals>.inner(self, cond_scale, rescale_phi, *args, **kwargs)
    143     return logits
    145 null_logits = fn_maybe_with_text(self, *args, **kwargs_with_cond_dropout)
--> 146 scaled_logits = null_logits + (logits - null_logits) * cond_scale
    148 if rescale_phi <= 0:
    149     return scaled_logits

TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

@lucidrains
Owner

@MarcusLoppe i'll fix the classifier free guidance mid next week, there's another issue with it (caching is very tricky)

@lucidrains
Owner

@MarcusLoppe fixed your current issue for now, but inference will still be super slow. will need a couple days to support kv cache correctly for conditional scaling

congratulations on training these results! enjoy the rest of your sunday

@MarcusLoppe
Contributor

@MarcusLoppe @fire next get some tables and chairs and see if it can learn to generate two separate classes! also, this isn't documented, but in order to do better text binding (once you scale up with more text variety), you can use classifier free guidance, a technique employed in a lot of denoising diffusion models (SD included): .generate(cond_scale = 3)

Not much better with cond_scale = 3 :/
image

@whaohan

whaohan commented Dec 18, 2023

Hi,

I also trained the model on around 4k decimated chair meshes with fewer than 800 faces, as suggested by the paper, and I got very similar results to @Kurokabe's.

image

Here is the training loss of my autoencoder and GPT:

image

image

@lucidrains
Owner

@whaohan at first glance, your autoencoder loss is way too high

@lucidrains
Owner

@MarcusLoppe @fire next get some tables and chairs and see if it can learn to generate two separate classes! also, this isn't documented, but in order to do better text binding (once you scale up with more text variety), you can use classifier free guidance, a technique employed in a lot of denoising diffusion models (SD included): .generate(cond_scale = 3)

Not much better with cond_scale = 3 :/ image

this hyperparameter doesn't actually improve results, it just gives better alignment to the text description (if the model is not following it)

@fire

fire commented Dec 18, 2023

The transformer loss needs to be near 0.01 or 0.001, and the autoencoder loss can be around 0.25-0.35 or lower.
