Gene mapping between user-provided scRNA and genes from SCimilarity during vae training #10

Open · humengying0907 opened this issue Nov 8, 2024 · 6 comments

@humengying0907

When using the pre-trained weights from SCimilarity, how does VAE_train.py account for gene differences between the user-provided adata and SCimilarity? What if a gene is present in the user's data but not in SCimilarity? Or what if a gene is present in SCimilarity but not in the user's data?

There is indeed a num_genes parameter in VAE_train.py that controls the dimension of the VAE so that it fits the user-provided scRNA data, but I don't see it exerting any control over gene order. When trying to reproduce the VAE training step with this command:

CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/tabula_muris/all.h5ad' --num_genes 18996 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/my_VAE' --max_steps 200000 --max_minutes 600

I got these loss reports, which hover around 0.04:

step 0 loss 0.21746787428855896
step 1000 loss 0.04769279062747955
step 2000 loss 0.048065099865198135
step 3000 loss 0.04667588323354721
step 4000 loss 0.045960813760757446

Could you please provide some clarification or a possible solution? Thank you so much!

@EperLuo (Owner) commented Nov 8, 2024

Hi! The num_genes parameter controls the shape of the first layer of the encoder and the last layer of the decoder, which you can see in VAE_model.py, line 52. This means the first layer of the encoder does not use the weights from SCimilarity (nor does the last layer of the decoder), and its size matches the user's input gene set.
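
For readers who want to see what that kind of partial weight transfer looks like, here is a minimal PyTorch sketch with made-up layer sizes and names (not the actual VAE_model.py code): only the shape-compatible inner layers inherit the pretrained weights, while the gene-facing layers keep their fresh initialization, so gene order relative to SCimilarity never enters those layers.

```python
import torch
import torch.nn as nn

# Toy stand-ins, not the real VAE_model.py architecture: a "pretrained" encoder built for a
# larger reference gene set (28000 here, purely illustrative) and a new encoder whose first
# layer is sized to the user's 18996 genes.
pretrained_encoder = nn.Sequential(nn.Linear(28000, 1024), nn.ReLU(), nn.Linear(1024, 128))
new_encoder = nn.Sequential(nn.Linear(18996, 1024), nn.ReLU(), nn.Linear(1024, 128))

pretrained_state = pretrained_encoder.state_dict()
new_state = new_encoder.state_dict()

# Copy only parameters whose shapes match; the gene-facing first layer is skipped and keeps
# its random initialization.
compatible = {k: v for k, v in pretrained_state.items()
              if k in new_state and v.shape == new_state[k].shape}
new_state.update(compatible)
new_encoder.load_state_dict(new_state)

print(sorted(compatible))  # only the inner layer ('2.bias', '2.weight') transfers
```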

The loss seems to drop normally. Have you tried finishing the training process and checking the reconstruction result?

@humengying0907 (Author)

Thanks. Yes, I have tried both the pre-trained model from SCimilarity and a VAE without pre-trained weights; my VAE loss stays around 0.04, even near the end of training:

step 194000 loss 0.042236313223838806
step 195000 loss 0.03814302757382393
step 196000 loss 0.043102916330099106
step 197000 loss 0.04345700144767761
step 198000 loss 0.042990997433662415
step 199000 loss 0.039488837122917175

Is this actually expected for the WOT data?

Additionally, what classifier is considered "good enough"? For the WOT data, the training accuracy at the end of training is 0.164, which is far from perfect. Is this expected as well? What train_acc should we expect in general?

| metric | value |
| --- | --- |
| grad_norm | 0.544 |
| param_norm | 101 |
| samples | 1.28e+07 |
| step | 9.99e+04 |
| train_acc@1 | 0.164 |
| train_acc@1_q0 | 0 |
| train_acc@1_q1 | 0 |
| train_loss | 2.42 |
| train_loss_q0 | 3.6 |
| train_loss_q1 | 2.59 |

Thank you so much!

@EperLuo (Owner) commented Nov 9, 2024

The training loss looks normal; it likely indicates that the model has converged. If you still have concerns, you can use the trained model to reconstruct the data and check whether the reconstruction matches the training data.
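
If it helps, a quick reconstruction check could look roughly like the sketch below; the h5ad path and the `vae(x)` call are placeholders rather than the exact interface of VAE_model.py.

```python
import numpy as np
import scanpy as sc
import torch

# Placeholder path; `vae` is assumed to be the trained model loaded from your checkpoint.
adata = sc.read_h5ad("all.h5ad")
X = adata.X[:1000]  # a subset of cells is enough for a sanity check
x = torch.tensor(X.toarray() if hasattr(X, "toarray") else X, dtype=torch.float32)

vae.eval()
with torch.no_grad():
    recon = vae(x)  # assumed to return the reconstructed expression matrix

# Per-cell Pearson correlation between input and reconstruction
corrs = [np.corrcoef(a, b)[0, 1] for a, b in zip(x.numpy(), recon.numpy())]
print("mean per-cell correlation:", np.nanmean(corrs))
```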

As for the classifier, since the training samples are noised, the training accuracy can be very low, which is in line with expectations. An accuracy of 0.164 should be normal, but I'm sorry that I can't give you a specific target. I think it's good enough as long as it is clearly higher than random chance.

@humengying0907 (Author)

Thank you! I have another general question about the VAE model. Technically, it is just an autoencoder, not a VAE, since we only optimize the reconstruction loss and not the KL divergence. Is there any advantage to using an autoencoder over a VAE, given that we don't rely heavily on the pre-trained weights from SCimilarity?

Additionally, given that we are using the entire dataset for training, how can we avoid overfitting?

Thank you!

@EperLuo (Owner) commented Nov 13, 2024

For the first question: since the diffusion model has no strict requirements on the distribution of its input training data, there is no need to constrain the features in the latent space with a KL loss or other variational loss, and the original distribution of the encoder output works just fine. Although we didn't run an experiment to prove it, I do think an extra loss in the latent space would make the reconstruction less accurate.
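
To make the distinction concrete, the model here only optimizes the reconstruction term, whereas a standard VAE would add a KL penalty on the latent distribution. Schematically (assuming an MSE reconstruction term; this is not the repository's actual training code):

```python
import torch
import torch.nn.functional as F

def ae_loss(x, recon):
    """Reconstruction only, no constraint on the latent space (what is effectively optimized here)."""
    return F.mse_loss(recon, x)

def vae_loss(x, recon, mu, logvar, beta=1.0):
    """Standard VAE objective: reconstruction + beta * KL(q(z|x) || N(0, I))."""
    recon_term = F.mse_loss(recon, x)
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + beta * kl_term
```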

As for the overfitting problem: if the goal is to augment existing data, overfitting might not be a serious issue. But for out-of-distribution data generation, a held-out validation set is indeed needed to justify the model (which we didn't do, since we didn't observe overfitting). You can split off part of the dataset (kept entirely separate from the rest) to validate the model.
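
A simple way to hold out such a validation split from an .h5ad file could look like the sketch below (the paths and the 10% ratio are placeholders):

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("all.h5ad")  # placeholder path

rng = np.random.default_rng(0)
idx = rng.permutation(adata.n_obs)
n_val = int(0.1 * adata.n_obs)  # hold out 10% of cells

adata_val = adata[idx[:n_val]].copy()    # kept entirely separate for evaluation
adata_train = adata[idx[n_val:]].copy()  # used for VAE / diffusion training

adata_train.write_h5ad("train.h5ad")
adata_val.write_h5ad("val.h5ad")
```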

@humengying0907 (Author)

Thank you so much!
