[Label projection] scANVI task sees test cells during training #771

mxposed · 2023-01-07T00:10:29Z

In the scArches+scANVI method, the dataset is split into train/test, and only the train part is used to train the scVI/scANVI model: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L102

In contrast, the scANVI method trains on all cells: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L68

If nobody objects, I will make the scANVI method train only on the train part of the data, and then get the latent representation and predict labels for the test part.

cc @LuckyMD @adamgayoso

adamgayoso · 2023-01-07T00:30:54Z

scANVI actually doesn't see the test set labels. The way this is implemented, it's a semi-supervised method: the test cells are included in training, but their labels are masked.

see:

openproblems/openproblems/tasks/label_projection/methods/scvi_tools.py

Line 66 in 3d8964a

scanvi_labels[~adata.obs["is_train"].to_numpy()] = "Unknown"

openproblems/openproblems/tasks/label_projection/methods/scvi_tools.py

Line 81 in 3d8964a

model = scvi.model.SCANVI.from_scvi_model(scvi_model, unlabeled_category="Unknown")
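To make the masking step concrete, here is a minimal sketch in plain Python. It is not the openproblems code: the lists `labels` and `is_train` and the cell type names are hypothetical stand-ins for the `adata.obs` columns, and a list comprehension replaces the NumPy boolean-mask assignment from line 66. It only illustrates that every cell stays in the input while the test labels are hidden.

```python
# Toy illustration only: plain lists stand in for adata.obs columns.
UNLABELED = "Unknown"

labels = ["T cell", "B cell", "T cell", "NK cell"]  # hypothetical cell types
is_train = [True, True, False, False]               # hypothetical train/test split

# Pure-Python equivalent of the boolean-mask assignment on line 66:
# test cells keep their place in the dataset, but their labels are hidden.
scanvi_labels = [
    label if train else UNLABELED
    for label, train in zip(labels, is_train)
]

n_cells_seen = len(scanvi_labels)
n_labels_seen = sum(label != UNLABELED for label in scanvi_labels)

print(scanvi_labels)
print(f"cells seen: {n_cells_seen}, labels seen: {n_labels_seen}")
```

In this toy setup the model input still contains all four cells, but only the two training labels.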

mxposed · 2023-01-07T00:55:27Z

I agree, it doesn't see the labels, but it sees the cells. Does my concern make sense?

mxposed · 2023-01-09T21:24:14Z

There are two modes of operation for the label projection task, I guess:

1. The model can be pre-trained without test data, and then applied to test data;
2. The model cannot be pre-trained without test data, and needs to be trained with train+test data.

I assumed this task was covering only the 1st type, but it doesn't have to. However, I'd like to make a clear distinction between the two types, and I think that the current scANVI method implementation falls into the second category.
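The difference between the two modes can be sketched with a toy example. This is not scANVI or scArches: it assumes a one-dimensional "expression" value per cell, a nearest-centroid classifier, and a simple self-training loop as a stand-in for semi-supervised training. All function names and data below are hypothetical.

```python
from statistics import mean

UNKNOWN = "Unknown"


def fit_centroids(values, labels):
    """Return {label: mean value}, ignoring cells whose label is masked."""
    groups = {}
    for value, label in zip(values, labels):
        if label != UNKNOWN:
            groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


def predict(centroids, values):
    """Assign each value to the label of the nearest centroid."""
    return [
        min(centroids, key=lambda label: abs(centroids[label] - v))
        for v in values
    ]


def mode_1(train_x, train_y, test_x):
    # Mode 1: fit on train cells only, then apply to unseen test cells.
    return predict(fit_centroids(train_x, train_y), test_x)


def mode_2(train_x, train_y, test_x, n_rounds=5):
    # Mode 2: test cells take part in fitting, with their labels masked.
    x = train_x + test_x
    y = train_y + [UNKNOWN] * len(test_x)
    centroids = fit_centroids(x, y)
    for _ in range(n_rounds):
        pseudo = predict(centroids, x)
        # Keep true labels for train cells; use pseudo-labels for test cells.
        filled = [t if t != UNKNOWN else p for t, p in zip(y, pseudo)]
        centroids = fit_centroids(x, filled)
    return predict(centroids, test_x)


train_x = [0.0, 1.0, 4.0, 5.0]
train_y = ["A", "A", "B", "B"]
test_x = [2.0, 3.0, 7.0, 8.0]  # hypothetical shifted test batch

pred_1 = mode_1(train_x, train_y, test_x)
pred_2 = mode_2(train_x, train_y, test_x)
print(pred_1)
print(pred_2)
```

On this toy data the two modes disagree on the second test cell, which is why it seems worth stating explicitly which mode a method uses. Neither mode sees the test labels.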

adamgayoso · 2023-01-09T21:34:26Z

I agree we should separate these things.

For example, de novo integration with scanorama on all data and then training a classifier on training embeddings would fall into (2) in this case.
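A rough sketch of that pattern, with a loud caveat: Scanorama is not used here. Per-batch mean centering stands in for the integration step, and a nearest-centroid rule stands in for the classifier. All names and values are hypothetical. The point is only that the embedding is computed from all cells, while the classifier sees training labels only.

```python
from statistics import mean


def integrate(values, batches):
    """Stand-in for de novo integration: center each batch on its own mean.

    This needs the expression of every cell, test cells included.
    """
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]


def fit_centroids(embeddings, labels):
    """Mean embedding per label, computed on training cells only."""
    groups = {}
    for e, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(e)
    return {label: mean(vals) for label, vals in groups.items()}


def predict(centroids, embeddings):
    """Assign each embedding to the label of the nearest centroid."""
    return [
        min(centroids, key=lambda label: abs(centroids[label] - e))
        for e in embeddings
    ]


values = [0.0, 1.0, 4.0, 5.0, 2.0, 3.0, 6.0, 7.0]  # toy 1-D expression
batches = ["train"] * 4 + ["test"] * 4
train_labels = ["A", "A", "B", "B"]

# Step 1: integrate all cells; no labels are involved.
embedding = integrate(values, batches)
train_emb, test_emb = embedding[:4], embedding[4:]

# Step 2: train the classifier on training embeddings only.
centroids = fit_centroids(train_emb, train_labels)
test_pred = predict(centroids, test_emb)
print(test_pred)
```

Because step 1 already uses the test cells, this workflow cannot be pre-trained without them, which matches category (2).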

scottgigante-immunai · 2023-02-01T20:09:47Z

As far as I'm concerned the models have to be able to use the expression data for the test cells to predict their labels. Whether you use these for semi-supervised or purely supervised is up to you -- I can't imagine a setting where you wouldn't be able to do it semi-supervised.

LuckyMD · 2023-02-01T22:44:46Z

Agree with @scottgigante-immunai, I actually think this is fine... the benchmark is an evaluation of how these tools would be used from the user perspective. When using scANVI alone for new data annotation, you need to train on both train+test to annotate a new dataset... the only alternative I can think of is using a forward pass on the most proximal batch, which is an unrealistic modus operandi just to avoid seeing the test batch in training.

mxposed added the invalid and method labels Jan 7, 2023

mxposed self-assigned this Jan 7, 2023

scottgigante-immunai closed this as completed Mar 24, 2023

kthorner mentioned this issue Feb 26, 2024

Widespread inflated metrics for label projection due to leakage openproblems-bio/openproblems-v2#386

Open
