Gene mapping between user-provided scRNA and genes from SCimilarity during vae training #10

Open · humengying0907 opened this issue Nov 8, 2024 · 6 comments

@humengying0907

When using the pre-trained weights from SCimilarity, how does VAE_train.py account for gene differences between the user-provided adata and SCimilarity? What if a gene is present in the user's data but not in SCimilarity? Or what if a gene is present in SCimilarity but not in the user's data?

There is indeed a num_genes parameter in VAE_train.py that controls the dimension of the VAE so that it fits the user-provided scRNA data, but I don't see it exerting any control over gene order. When trying to reproduce the VAE training step with this command:

CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/tabula_muris/all.h5ad' --num_genes 18996 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/my_VAE' --max_steps 200000 --max_minutes 600

I got these loss reports, which hover around 0.04:

step 0 loss 0.21746787428855896
step 1000 loss 0.04769279062747955
step 2000 loss 0.048065099865198135
step 3000 loss 0.04667588323354721
step 4000 loss 0.045960813760757446

Could you please provide some clarification or a possible solution? Thank you so much!

@EperLuo (Owner) commented Nov 8, 2024

Hi! The num_genes parameter controls the shape of the first layer of the encoder and the last layer of the decoder, which you can see in VAE_model.py, line 52. This means the first layer of the encoder does not use the weights from SCimilarity (nor does the last layer of the decoder), and its size matches the user's input gene set.
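
For readers who want to see what that kind of partial weight transfer looks like, here is a minimal PyTorch sketch with made-up layer sizes and names (not the actual VAE_model.py code): only the shape-compatible inner layers inherit the pretrained weights, while the gene-facing layers keep their fresh initialization, so gene order relative to SCimilarity never enters those layers.

```python
import torch
import torch.nn as nn

# Toy stand-ins, not the real VAE_model.py architecture: a "pretrained" encoder built for a
# larger reference gene set (28000 here, purely illustrative) and a new encoder whose first
# layer is sized to the user's 18996 genes.
pretrained_encoder = nn.Sequential(nn.Linear(28000, 1024), nn.ReLU(), nn.Linear(1024, 128))
new_encoder = nn.Sequential(nn.Linear(18996, 1024), nn.ReLU(), nn.Linear(1024, 128))

pretrained_state = pretrained_encoder.state_dict()
new_state = new_encoder.state_dict()

# Copy only parameters whose shapes match; the gene-facing first layer is skipped and keeps
# its random initialization.
compatible = {k: v for k, v in pretrained_state.items()
              if k in new_state and v.shape == new_state[k].shape}
new_state.update(compatible)
new_encoder.load_state_dict(new_state)

print(sorted(compatible))  # only the inner layer ('2.bias', '2.weight') transfers
```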

The loss seems to drop normally. Have you tried finishing the training process and checking the reconstruction result?

@humengying0907 (Author)

Thanks. Yes, I have tried both the pre-trained model from SCimilarity and a VAE without pre-trained weights; my VAE loss stays around 0.04, even near the end of training:

step 194000 loss 0.042236313223838806
step 195000 loss 0.03814302757382393
step 196000 loss 0.043102916330099106
step 197000 loss 0.04345700144767761
step 198000 loss 0.042990997433662415
step 199000 loss 0.039488837122917175

Is this actually expected for the WOT data?

Additionally, what classifier is considered "good enough"? For the WOT data, the training accuracy at the end of training is 0.164, which is far from perfect. Is this expected as well? What train_acc should we expect in general?

| metric | value |
| --- | --- |
| grad_norm | 0.544 |
| param_norm | 101 |
| samples | 1.28e+07 |
| step | 9.99e+04 |
| train_acc@1 | 0.164 |
| train_acc@1_q0 | 0 |
| train_acc@1_q1 | 0 |
| train_loss | 2.42 |
| train_loss_q0 | 3.6 |
| train_loss_q1 | 2.59 |

Thank you so much!

@EperLuo (Owner) commented Nov 9, 2024

The training loss looks normal; it likely indicates that the model has converged. If you still have concerns, you can use the trained model to reconstruct the data and check whether the reconstruction matches the training data.
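
If it helps, a quick reconstruction check could look roughly like the sketch below; the h5ad path and the `vae(x)` call are placeholders rather than the exact interface of VAE_model.py.

```python
import numpy as np
import scanpy as sc
import torch

# Placeholder path; `vae` is assumed to be the trained model loaded from your checkpoint.
adata = sc.read_h5ad("all.h5ad")
X = adata.X[:1000]  # a subset of cells is enough for a sanity check
x = torch.tensor(X.toarray() if hasattr(X, "toarray") else X, dtype=torch.float32)

vae.eval()
with torch.no_grad():
    recon = vae(x)  # assumed to return the reconstructed expression matrix

# Per-cell Pearson correlation between input and reconstruction
corrs = [np.corrcoef(a, b)[0, 1] for a, b in zip(x.numpy(), recon.numpy())]
print("mean per-cell correlation:", np.nanmean(corrs))
```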

As for the classifier, since the training samples are noised, the training accuracy can be very low, which is in line with expectations. An accuracy of 0.164 should be normal, but I'm sorry that I can't give you a specific target. I think it's good enough as long as it is clearly higher than random chance.

@humengying0907 (Author)

Thank you! I have another general question about the VAE model. Technically, it is just an autoencoder, not a VAE, since we only optimize the reconstruction loss and not the KL divergence. Is there any advantage to using an autoencoder over a VAE, given that we don't rely heavily on the pre-trained weights from SCimilarity?

Additionally, given that we are using the entire dataset for training, how can we avoid overfitting?

Thank you!

@EperLuo (Owner) commented Nov 13, 2024

For the first question: since the diffusion model has no strict requirements on the distribution of its input training data, there is no need to constrain the features in the latent space with a KL loss or other variational loss, and the original distribution of the encoder output works just fine. Although we didn't run an experiment to prove it, I do think an extra loss in the latent space would make the reconstruction less accurate.
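
To make the distinction concrete, the model here only optimizes the reconstruction term, whereas a standard VAE would add a KL penalty on the latent distribution. Schematically (assuming an MSE reconstruction term; this is not the repository's actual training code):

```python
import torch
import torch.nn.functional as F

def ae_loss(x, recon):
    """Reconstruction only, no constraint on the latent space (what is effectively optimized here)."""
    return F.mse_loss(recon, x)

def vae_loss(x, recon, mu, logvar, beta=1.0):
    """Standard VAE objective: reconstruction + beta * KL(q(z|x) || N(0, I))."""
    recon_term = F.mse_loss(recon, x)
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + beta * kl_term
```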

As for the overfitting problem: if the goal is to augment existing data, overfitting might not be a serious issue. But for out-of-distribution data generation, a held-out validation set is indeed needed to justify the model (which we didn't do, since we didn't observe overfitting). You can split off part of the dataset (kept entirely separate from the rest) to validate the model.
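
A simple way to hold out such a validation split from an .h5ad file could look like the sketch below (the paths and the 10% ratio are placeholders):

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("all.h5ad")  # placeholder path

rng = np.random.default_rng(0)
idx = rng.permutation(adata.n_obs)
n_val = int(0.1 * adata.n_obs)  # hold out 10% of cells

adata_val = adata[idx[:n_val]].copy()    # kept entirely separate for evaluation
adata_train = adata[idx[n_val:]].copy()  # used for VAE / diffusion training

adata_train.write_h5ad("train.h5ad")
adata_val.write_h5ad("val.h5ad")
```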

@humengying0907 (Author)

Thank you so much!
