[Label projection] scANVI task sees test cells during training #771

Closed
mxposed opened this issue Jan 7, 2023 · 6 comments
Assignees
Labels
invalid This doesn't seem right

Comments

@mxposed
Collaborator

mxposed commented Jan 7, 2023

In scArches+scANVI, the dataset is split into train and test sets, and only the train set is used to train the scVI/scANVI model: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L102

In contrast, the scANVI method trains on all cells: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L68

If nobody objects, I will make the scANVI method train only on the train part of the data, and then get the latent representation and predictions for the test part.
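
The proposal above can be sketched as follows. This is a toy illustration only, not the openproblems or scvi-tools code: a nearest-centroid classifier on one-dimensional values stands in for scVI/scANVI, and the names `fit_on_train` and `predict` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_on_train(values, labels, is_train):
    """Compute one centroid per label, using training cells only."""
    by_label = defaultdict(list)
    for value, label, train in zip(values, labels, is_train):
        if train:  # test cells contribute neither expression nor labels
            by_label[label].append(value)
    return {label: mean(vals) for label, vals in by_label.items()}


def predict(values, centroids):
    """Assign each value the label of its nearest centroid."""
    return [min(centroids, key=lambda lab: abs(v - centroids[lab])) for v in values]


# Toy data (assumed): four training cells followed by two test cells.
values = [0.9, 1.1, 4.9, 5.1, 1.0, 5.0]
labels = ["A", "A", "B", "B", "A", "B"]
is_train = [True, True, True, True, False, False]

centroids = fit_on_train(values, labels, is_train)
test_values = [v for v, train in zip(values, is_train) if not train]
test_pred = predict(test_values, centroids)
print(test_pred)  # ['A', 'B']
```

Here the fitted model never touches the test cells until prediction time.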

cc @LuckyMD @adamgayoso

mxposed added the invalid (This doesn't seem right) and method labels on Jan 7, 2023
mxposed self-assigned this on Jan 7, 2023

@adamgayoso
Contributor

scANVI actually doesn't see the test set labels. The way this is implemented, it's a semi-supervised method.

See:

scanvi_labels[~adata.obs["is_train"].to_numpy()] = "Unknown"

model = scvi.model.SCANVI.from_scvi_model(scvi_model, unlabeled_category="Unknown")
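
To make the masking step concrete, here is a minimal sketch in plain Python. It mirrors the boolean-mask assignment in the first quoted line using lists instead of NumPy arrays; `mask_test_labels` is a hypothetical helper, not part of scvi-tools or openproblems.

```python
UNLABELED = "Unknown"  # same placeholder category as in the quoted snippet


def mask_test_labels(labels, is_train, unlabeled=UNLABELED):
    """Return a copy of `labels` with every non-training cell set to `unlabeled`."""
    if len(labels) != len(is_train):
        raise ValueError("labels and is_train must have the same length")
    return [label if train else unlabeled for label, train in zip(labels, is_train)]


# Toy data (assumed): two training cells and two test cells.
labels = ["B cell", "T cell", "B cell", "NK cell"]
is_train = [True, True, False, False]

masked = mask_test_labels(labels, is_train)
print(masked)  # ['B cell', 'T cell', 'Unknown', 'Unknown']
```

The test cells stay in the dataset, so their expression is still available to the model; only their labels are hidden.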

@mxposed
Collaborator Author

mxposed commented Jan 7, 2023

I agree, it doesn't see the labels, but it sees the cells. Does my concern make sense?

@mxposed
Collaborator Author

mxposed commented Jan 9, 2023

I guess there are two modes of operation for the label projection task:

  1. The model can be pre-trained without test data, and then applied to test data;
  2. The model cannot be pre-trained without test data, and needs to be trained with train+test data.

I assumed this task covered only the first type, but it doesn't have to. However, I'd like to make a clear distinction between the two types, and I think the current scANVI method implementation falls into the second category.
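
One way to make the distinction explicit is a small check on which cells a method used while fitting. This is only a sketch under assumptions: `task_mode` is a hypothetical helper, and it supposes each method can report the indices of the cells whose expression and labels it used.

```python
def task_mode(expression_cells, label_cells, is_train):
    """Return 1 or 2, following the numbering in the list above.

    expression_cells: indices of cells whose expression was used for fitting
    label_cells: indices of cells whose labels were used for fitting
    is_train: one boolean per cell, True for training cells
    """
    test_cells = {i for i, train in enumerate(is_train) if not train}
    if test_cells & set(label_cells):
        # Neither mode allows test labels to be used during fitting.
        raise ValueError("test labels were used during training")
    return 2 if test_cells & set(expression_cells) else 1


is_train = [True, True, False, False]
print(task_mode({0, 1}, {0, 1}, is_train))        # 1
print(task_mode({0, 1, 2, 3}, {0, 1}, is_train))  # 2
```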

@adamgayoso
Contributor

I agree we should separate these things.

For example, de novo integration with Scanorama on all data, followed by training a classifier on the training embeddings, would fall into (2).
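
A toy version of that pipeline might look like the sketch below. It does not use Scanorama: per-batch mean centering is an assumed stand-in for integration, a nearest-centroid classifier is an assumed stand-in for the classifier, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def embed_all_cells(values, batches):
    """Toy integration: subtract each batch's mean, computed over ALL its cells."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_mean[batch] for value, batch in zip(values, batches)]


def fit_centroids(embedding, labels, is_train):
    """Fit one centroid per label on the embeddings of training cells only."""
    by_label = defaultdict(list)
    for emb, label, train in zip(embedding, labels, is_train):
        if train:
            by_label[label].append(emb)
    return {label: mean(embs) for label, embs in by_label.items()}


def predict(embedding, centroids):
    """Assign each embedded cell the label of its nearest centroid."""
    return [min(centroids, key=lambda lab: abs(e - centroids[lab])) for e in embedding]


# Toy data (assumed): batch "a" is the training batch, batch "b" is the test batch.
values = [1.0, 3.0, 11.0, 13.0]
batches = ["a", "a", "b", "b"]
is_train = [True, True, False, False]
labels = ["low", "high", "Unknown", "Unknown"]  # test labels are masked

embedding = embed_all_cells(values, batches)          # uses test expression
centroids = fit_centroids(embedding, labels, is_train)  # uses train labels only
test_pred = predict(embedding[2:], centroids)
print(test_pred)  # ['low', 'high']
```

The embedding step needs the test cells' expression to remove the batch offset, which is what places this pipeline in category (2) even though no test label is used.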

@scottgigante-immunai
Collaborator

As far as I'm concerned, the models have to be able to use the expression data of the test cells to predict their labels. Whether you use that data in a semi-supervised or purely supervised way is up to you -- I can't imagine a setting where you wouldn't be able to do it semi-supervised.

@LuckyMD
Collaborator

LuckyMD commented Feb 1, 2023

Agree with @scottgigante-immunai; I actually think this is fine. The benchmark evaluates how these tools would be used from the user's perspective. When using scANVI alone to annotate new data, you need to train on both train and test cells. The only alternative I can think of is a forward pass using the most proximal batch, which is not a realistic way to operate just to avoid seeing the test batch in training.

4 participants