Gene mapping between user-provided scRNA and genes from SCimilarity during vae training #10
Hi! The `num_genes` parameter controls the shape of the first layer of the encoder and the last layer of the decoder, which you can see in `VAE_model.py` line 52. This means the first layer of the encoder does not use the weights from SCimilarity (nor does the last layer of the decoder), and its size matches the user's input gene set. The loss seems to drop normally; have you tried finishing the training process and checking the reconstruction result?
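To illustrate the point above, here is a minimal, hypothetical sketch (not the actual scDiffusion architecture; the layer widths are made up) of how a `num_genes` argument can size the encoder's first layer and the decoder's last layer, so those two layers depend only on the user's gene count:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: only the input/output layer sizes follow num_genes.
class SimpleAutoencoder(nn.Module):
    def __init__(self, num_genes, hidden_dim=1024, latent_dim=128):
        super().__init__()
        # First encoder layer is shaped by the user's gene count, so it
        # cannot reuse pre-trained weights trained on a different gene set.
        self.encoder = nn.Sequential(
            nn.Linear(num_genes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Last decoder layer mirrors the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_genes),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = SimpleAutoencoder(num_genes=18996)
x = torch.randn(4, 18996)          # 4 cells x 18996 genes
recon = model(x)
assert recon.shape == x.shape      # reconstruction matches input shape
```

Any inner layers (here `hidden_dim`/`latent_dim`) keep a fixed size, which is where pre-trained weights could still be loaded.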
Thanks. Yes, I have tried both the pre-trained model from SCimilarity and a VAE without pre-trained weights; my VAE loss is always around 0.04, even near the end of training (step 194000, loss 0.042236313223838806). Is this actually expected for the WOT data? Additionally, what classifier is considered "good enough"? For the WOT data, the training accuracy at the end of training is 0.164 (grad_norm 0.544), which is far from perfect. Is this expected as well? What train_acc should we expect in general? Thank you so much!
The training loss looks normal. This might indicate that the model has converged. If you still have concerns, you can use the trained model to reconstruct the data and check whether the reconstruction matches the training data. As for the classifier, since the training samples are noised, the training accuracy can be very low, which is in line with expectations. An accuracy of 0.164 should be normal, but I'm sorry I can't give you a specific expectation. I think it's good enough as long as it is clearly higher than random choice.
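A quick way to run the suggested reconstruction check is to compare the original matrix against the reconstructed one, e.g. with per-cell Pearson correlation and MSE. This sketch uses fake data for illustration; in practice `recon` would come from passing the training matrix through the trained encoder/decoder:

```python
import numpy as np

# Fake data standing in for the real matrices (cells x genes).
rng = np.random.default_rng(0)
x = rng.random((100, 50)).astype(np.float32)
recon = x + rng.normal(scale=0.01, size=x.shape).astype(np.float32)

def per_cell_correlation(a, b):
    """Pearson correlation between matching rows of a and b."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return num / den

corr = per_cell_correlation(x, recon)
mse = np.mean((x - recon) ** 2)
print(f"mean per-cell correlation: {corr.mean():.3f}, MSE: {mse:.5f}")
```

If the correlations are high and the MSE is close to the converged training loss, the autoencoder is reconstructing faithfully.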
Thank you! I have another general question about the VAE model. Technically, it is just an autoencoder, not a VAE, since we only optimize the reconstruction loss, not a KL divergence. Is there any advantage to using an autoencoder over a VAE, given that we don't heavily rely on the pre-trained weights from SCimilarity? Additionally, given that we use the entire dataset for training, how can we avoid overfitting? Thank you!
For the first question: since the diffusion model has no strict requirements on the distribution of its input training data, there is no need to constrain the latent features with a KL loss or another variational loss; the natural distribution of the encoder output is just fine. Though we didn't run an experiment to prove it, I do think an extra loss in the latent space would make reconstruction less effective. As for overfitting, if the goal is to augment existing data, overfitting might not be a serious problem. But for out-of-distribution data generation, a validation set is indeed needed to verify it (which we didn't do, since we didn't observe overfitting). You can split off part of the dataset (completely disjoint from the rest) to validate the model.
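A minimal sketch of such a disjoint held-out split (the cell count and the 10% ratio are arbitrary choices; with an `AnnData` object you would then index `adata[train_idx]` / `adata[val_idx]`):

```python
import numpy as np

# Shuffle cell indices once, then carve off a disjoint validation set.
rng = np.random.default_rng(42)
n_cells = 10000
idx = rng.permutation(n_cells)
n_val = int(0.1 * n_cells)
val_idx, train_idx = idx[:n_val], idx[n_val:]

# The two index sets share no cells.
assert len(set(train_idx) & set(val_idx)) == 0
```

Tracking reconstruction loss on `val_idx` during training makes it easy to see whether the gap to the training loss grows, i.e. whether overfitting appears.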
Thank you so much! |
When using the pre-trained weights from SCimilarity, how does `VAE_train.py` account for different genes between the user-provided adata and SCimilarity? What if a gene is present in the user's data but not in SCimilarity? Or what if a gene is present in SCimilarity but not in the user's data? There is indeed a `num_genes` parameter in `VAE_train.py` that controls the dimension of the VAE so that it fits the user-provided scRNA data, but I don't see it having any control over the gene order.

When trying to reproduce the VAE training step with this command:

```
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/tabula_muris/all.h5ad' --num_genes 18996 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/my_VAE' --max_steps 200000 --max_minutes 600
```
I got these loss reports, which stay around 0.04:

```
step 0 loss 0.21746787428855896
step 1000 loss 0.04769279062747955
step 2000 loss 0.048065099865198135
step 3000 loss 0.04667588323354721
step 4000 loss 0.045960813760757446
```
Could you please provide some clarification or possible solution to this? Thank you so much!
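For reference, a common generic pattern for mapping a user gene set onto a fixed reference gene order uses a reindex: genes missing from the user's data are zero-filled, and genes absent from the reference are dropped. This is a sketch of how such a mapping could be done in general, not what `VAE_train.py` actually does; all gene names below are made up:

```python
import numpy as np
import pandas as pd

reference_genes = ["GeneA", "GeneB", "GeneC", "GeneD"]   # e.g. a reference model's gene order
user_genes = ["GeneC", "GeneA", "GeneX"]                 # e.g. the user's adata.var_names
expr = pd.DataFrame(np.ones((2, 3)), columns=user_genes) # 2 cells x 3 genes

# Reorder columns to the reference; absent genes become 0, extras are dropped.
aligned = expr.reindex(columns=reference_genes, fill_value=0.0)
print(list(aligned.columns))      # reference order
print(aligned.iloc[0].tolist())   # GeneB and GeneD zero-filled, GeneX dropped
```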