Simple framework for transferring labels to a scRNAseq dataset from our favorite scRNAseq atlas.
NOTE: XGBoost variants are currently deprecated.
We call this tool the "labelator". The purpose of a "labelator" is to easily classify cell types for out-of-sample "Test" or "Query" data.
Our general approach is to "compress" the raw count data and generate a probability for each label category. We will do this in two ways:

- naive mode: making no assumptions or attempts to account for confounding variables like "batch", "noise" (e.g. doublets, mt/rb contamination), or "library_size".
- transfer mode: i.e. `scarches` or `scvi-tools`. Basically, we need to fit these confounding variables for the out-of-sample data using the `scarches` surgery approach (a minimal sketch follows this list).
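To make transfer mode concrete, here is a minimal sketch of the scArches-style surgery workflow as exposed by `scvi-tools`; the file paths and training settings are placeholders, not our actual configuration.

```python
import anndata as ad
import scvi

# Query AnnData and a directory holding a pretrained reference scVI model;
# both paths are placeholders.
query_adata = ad.read_h5ad("query.h5ad")

# Align the query var space to the reference model's genes.
scvi.model.SCVI.prepare_query_anndata(query_adata, "reference_scvi_dir")

# scArches-style surgery: load the reference weights frozen and add new
# batch-specific parameters for the out-of-sample data.
query_model = scvi.model.SCVI.load_query_data(query_adata, "reference_scvi_dir")

# Fine-tune only the newly added parameters.
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})
latent = query_model.get_latent_representation()
```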
One of the crucial decisions is how to load our scRNAseq data into a PyTorch model. We prefer the `AnnData` ("annotated data") format. Several implementations of such a dataloader are available: from `scvi-tools`, `scarches`, and `anndata` itself. The `scvi-tools` loader is the most complex, but we have started there to enable leveraging the rest of `scvi-tools`. To state our confirmation bias: we like `scvi` models, so we are starting there.
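As a sketch of what the `scvi-tools` route looks like in practice (field keys and paths below are placeholders), registration via `setup_anndata` is what wires the `AnnData` into its internal dataloader:

```python
import anndata as ad
import scvi

adata = ad.read_h5ad("xylena_train.h5ad")  # placeholder path

# Register fields with scvi-tools; this drives its internal AnnDataLoader,
# which batches tensors out of the AnnData during training.
scvi.model.SCVI.setup_anndata(adata, batch_key="batch", labels_key="cell_type")
model = scvi.model.SCVI(adata)
model.train(max_epochs=100, train_size=0.9)  # lightning handles the val split
```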
We will validate potential models and calibrate them with simple expectations using a typical "Train"/"Validate" and "Test"/"Probe" approach.
- "Train": data samples on which the model being tested is trained. The
torch lightning
framework used byscvi-tools
semi-automatically will "validate" to test out-of-sample prediction fidelity during training. - "Test": held-out samples to test the fidelity of the model.
- "Query": data generated externally,which is probing the fidelity of the model to general scRNAseq data.
All models will be trained on the n=3000 most highly variable genes from Xylena's scRNAseq data.
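Gene selection itself happens upstream of the labelator, but for reference, a sketch of how such a 3k-gene subset could be produced with `scanpy` (path is a placeholder):

```python
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("xylena_train.h5ad")  # placeholder path

# Keep the n=3000 most highly variable genes; the seurat_v3 flavor works
# directly on raw counts, which the scvi models also expect.
sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor="seurat_v3")
adata = adata[:, adata.var["highly_variable"]].copy()
```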
We can labelate either in a single end-to-end way or in two steps.
In two steps (sketched after this list):

- encode: embed the scRNAseq counts into a dimensionally reduced representation:
  - a latent sub-space of a variational model, e.g.
    - scVI (a Variational Auto-Encoder, VAE, model)
    - scANVI (a conditional VAE which also predicts "cell_type")
  - a linear embedding, i.e.
    - PCA (naive linear encoding)
- categorize: predict a probability for each category:
  - a linear classifier (e.g. multinomial Logistic Regression) (not implemented)
  - an MLP (multi-layer perceptron) non-linear classifier
  - boosted trees (XGBoost)
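A sketch of the two-step path with an scVI encoder and an MLP categorizer (paths, keys, and hyperparameters are placeholders; the actual pipeline may differ):

```python
import anndata as ad
import scvi
from sklearn.neural_network import MLPClassifier

adata = ad.read_h5ad("xylena_train_3k.h5ad")  # placeholder path/keys

# Step 1 -- encode: embed raw counts into the scVI latent sub-space.
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
vae = scvi.model.SCVI(adata, n_latent=20)
vae.train(max_epochs=100)
z = vae.get_latent_representation()

# Step 2 -- categorize: fit a non-linear classifier on the latent space.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
clf.fit(z, adata.obs["cell_type"].to_numpy())
probs = clf.predict_proba(z)  # probability of each label category
```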
We will also try some end-to-end approaches for comparison, where a single model takes us from raw counts to category probabilities (see the scANVI sketch after this list):

- naive inference
  - boosted trees (e.g. XGBoost)
  - an MLP classifier
- transfer learning
  - scVI/scANVI `scarches` "surgery"
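For the end-to-end variational route, scANVI can be built from a pretrained scVI model and predicts "cell_type" directly; this sketch reuses the `vae` from the two-step example above (hyperparameters are illustrative):

```python
import scvi

# Reusing the trained scVI model (`vae`) from the two-step sketch above.
scanvi = scvi.model.SCANVI.from_scvi_model(
    vae, unlabeled_category="Unknown", labels_key="cell_type"
)
scanvi.train(max_epochs=20, n_samples_per_label=100)
labels = scanvi.predict()          # hard cell_type calls
probs = scanvi.predict(soft=True)  # per-category probabilities
```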
Models will be trained on the "Train" set from Xylena's "clean" data. Validation on a subset of the training data will ensure that overfitting is not a problem.
Extending the `scarches` models and training classes seems to be the most straightforward. `scvi-tools` employs `lightning`, which may be good to leverage eventually, but is overbuilt for the current state.

- scVI
  - batch/sample/depth params vs. "neutered" (no covariates; see the sketch below)
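A sketch of the two configurations, assuming `adata` from above (the batch key is a placeholder):

```python
import scvi

# Batch-aware variant: "batch" is registered and corrected for explicitly.
adata_batch_eq = adata.copy()
scvi.model.SCVI.setup_anndata(adata_batch_eq, batch_key="batch")
batch_eq_vae = scvi.model.SCVI(adata_batch_eq)

# "Neutered" variant: no covariates registered, so batch/depth effects are
# absorbed implicitly as inferred fudge factors.
adata_naive = adata.copy()
scvi.model.SCVI.setup_anndata(adata_naive)
naive_vae = scvi.model.SCVI(adata_naive)
```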
There are several gotchas to anticipate:

- features: currently we are locked into the 3k genes we are testing with. Handling subsets and supersets is TBC.
- batch: in principle, each "embedding" or decoder part of the model should be able to measure a "batch-correction" parameter explicitly. In `scVI` this is explicitly learned; in naive inference mode it should just be an inferred fudge factor.
- noise: including or not including `doublet`, `mito`, or `ribo` metrics (a QC sketch follows this list).
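A sketch of computing those noise metrics with `scanpy` (the gene-name prefixes assume human-style naming and the path is a placeholder):

```python
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("query.h5ad")  # placeholder path

# Flag mito/ribo genes and compute per-cell QC metrics.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo"], inplace=True)

# Optional doublet scoring (requires the scrublet package):
# sc.external.pp.scrublet(adata)
```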
The training and query should be split up, with different CLIs for each modality. The query CLI can be leveraged for both the validation "Test" and any additional "Query" probes.
Training might be split into prep and train, with prep creating the base `scvi` VAE models for the other model flavors to use (a CLI skeleton is sketched below). Note that the `scanvi` variants require both 'batch_eq' and non-batch-corrected variants.
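A hypothetical skeleton for such a split CLI; the subcommand and flag names are illustrative, not the actual interface:

```python
# labelator.py -- hypothetical CLI skeleton; names are illustrative only.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="labelator")
    sub = parser.add_subparsers(dest="command", required=True)

    prep = sub.add_parser("prep", help="train base scvi VAEs (batch_eq and naive)")
    prep.add_argument("--adata", required=True)

    train = sub.add_parser("train", help="train a model flavor on the Train set")
    train.add_argument("--model", choices=["scvi", "scanvi", "mlp"], required=True)

    query = sub.add_parser("query", help="run held-out Test or external Query probes")
    query.add_argument("--model-dir", required=True)
    query.add_argument("--adata", required=True)

    args = parser.parse_args()
    print(f"dispatching: {args.command}")  # dispatch to real implementations here

if __name__ == "__main__":
    main()
```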
Wrinkles:

- the `scvi` family of models share common model arguments, e.g.:

```python
model_args = {"n_layers": 2, "n_latent": 20, "dispersion": "gene-label"}
```