labelator

A simple framework for transferring labels to an scRNAseq dataset from our favorite scRNAseq atlas.

NOTE: XGBoost variants are currently deprecated.

overview.

We call this tool the "labelator". The purpose of a "labelator" is to easily classify cell types for out-of-sample "Test" or "Query" data.

Our general approach is to "compress" the raw count data and generate a probability for each label category. We will do this in two ways:

naive mode. We make no assumptions about, and no attempt to account for, confounding variables like "batch", "noise" (e.g. doublets, mt/rb contamination), or "library_size".
transfer mode. i.e. scarches or scvi-tools. Here we need to fit these confounding variables for the out-of-sample data using the scarches surgery approach.

some details

dataloaders

One of the crucial decisions is how to load our scRNAseq data into a pytorch model. We prefer the AnnData ("annotated data") format. Several implementations of such a dataloader are available: from scvi-tools, scarches, and anndata itself. The scvi-tools implementation is the most complex, but we start there so that we can leverage the rest of scvi-tools. To state our confirmation bias: we like scvi models.

We will validate potential models and calibrate them with simple expectations using a typical "Train"/"Validate" and "Test"/"Probe" approach.

Data Definitions:

"Train": data samples on which the model under test is trained. The PyTorch Lightning framework used by scvi-tools will semi-automatically "validate" during training to test out-of-sample prediction fidelity.
"Test": held-out samples to test the fidelity of the model.
"Query": externally generated data, used to probe the fidelity of the model on general scRNAseq data.
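
As a minimal sketch of the "Train"/"Test" hold-out defined above (not the labelator's actual split logic), the following partitions a list of IDs reproducibly. The function name `split_train_test`, the 20% test fraction, and the choice of splitting unit (cell or donor sample) are assumptions for illustration.

```python
import random


def split_train_test(sample_ids, test_fraction=0.2, seed=0):
    """Hold out a fraction of IDs as the "Test" set; the rest is "Train"."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    ids = sorted(set(sample_ids))      # deterministic starting order
    rng = random.Random(seed)          # local RNG keeps the split reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]  # (train, test)


# Hypothetical sample IDs.
samples = [f"sample_{i}" for i in range(10)]
train_ids, test_ids = split_train_test(samples, test_fraction=0.2, seed=42)
print(len(train_ids), len(test_ids))
```

"Query" data never enters this split: it comes from outside the atlas and is only ever passed through an already trained model.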

scVI data modeling specifics:

All models will be trained on the n=3000 most highly variable genes from Xylena's scRNAseq data.
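
To illustrate what picking the most highly variable genes involves, here is a simplified sketch that ranks genes by the variance of their log1p counts. It is a stand-in, not the procedure that produced the 3000-gene list: dedicated tools typically use dispersion-based or variance-stabilized rankings. The function name and the toy data are hypothetical.

```python
import math
from statistics import pvariance


def top_variable_genes(counts, gene_names, n_top=3000):
    """Return the n_top gene names with the highest variance of log1p counts.

    counts: list of cells, each a list of raw counts aligned with gene_names.
    """
    ranked = []
    for j, gene in enumerate(gene_names):
        values = [math.log1p(cell[j]) for cell in counts]
        ranked.append((pvariance(values), gene))
    # Highest variance first; ties broken by gene name for determinism.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in ranked[:n_top]]


# Toy data: 3 cells x 4 genes (hypothetical gene names).
counts = [
    [0, 5, 1, 10],
    [0, 50, 1, 12],
    [0, 0, 1, 11],
]
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
hvg = top_variable_genes(counts, genes, n_top=2)
print(hvg)
```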

Models:

We can labelate either in a single end-to-end step or in two steps.

2 step: encode + categorize

In two steps:

encode: Embed the scRNAseq counts into a dimensionally reduced representation:
- a latent sub-space of a variational Model e.g.
  - scVI (a Variational Auto Encoder, VAE model)
  - scANVI (a conditional VAE which also predicts "cell_type")
- linear embedding. i.e.
  - PCA (naive linear encoding)
categorize: predict a probability for each category
- ~~Linear classifier (e.g. multinomial Logistic Regression)~~ (not implemented)
- MLP (multi layer perceptron) non-linear classifier
- boosted trees (XGBoost)
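
The two steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the labelator implementation: the encoder is a fixed linear projection standing in for PCA loadings or a VAE encoder, the classifier is a single linear layer plus softmax standing in for the MLP, and all weights and label names are hypothetical.

```python
import math


def encode(counts, components):
    """Project log1p counts onto fixed linear components (PCA-like encoding)."""
    x = [math.log1p(c) for c in counts]
    return [sum(w * xi for w, xi in zip(comp, x)) for comp in components]


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorize(embedding, weights, biases, labels):
    """Linear head + softmax: one probability per label category."""
    logits = [
        sum(w * z for w, z in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(labels, softmax(logits)))


# Hypothetical toy parameters: 4 genes -> 2 latent dims -> 3 categories.
components = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
weights = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]
labels = ["neuron", "astrocyte", "unknown"]

z = encode([10, 8, 0, 1], components)           # step 1: encode
probs = categorize(z, weights, biases, labels)  # step 2: categorize
print(max(probs, key=probs.get))
```

Keeping the two steps as separate functions mirrors the design choice in this section: any encoder (scVI, scANVI, PCA) can be paired with any classifier as long as they agree on the embedding dimension.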

end-to-end

We will also try some end-to-end approaches for comparison. Here a single model takes us from raw counts to category probabilities.

naive inference
- ~~boosted trees (e.g. xgboost)~~
- MLP classifier
transfer learning
- scVI/scANVI scarches "surgery"

training & validation

Models will be trained on the "train" set from Xylena's "clean" data. Validating on a held-out subset of the training data lets us check that the models are not overfitting.
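
One common way to use a validation subset against overfitting is early stopping: stop training once the validation loss has not improved for a fixed number of epochs. The sketch below shows only that bookkeeping. Whether labelator training uses early stopping, and the `patience` value, are assumptions here.

```python
def best_epoch_with_early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # validation loss has stalled
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: improvement stops after epoch 2.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.79, 0.85]
best, stopped = best_epoch_with_early_stopping(val_losses, patience=3)
print(best, stopped)
```

A training loss that keeps falling while the validation loss rises, as in the toy sequence above, is the usual signature of overfitting.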

Extending the scarches models and training classes seems to be the most straightforward approach. scvi-tools employs lightning, which may be good to leverage eventually but is overbuilt for the current state of the project.

scVI
- batch/sample/depth params vs. neutered
scANVI

Caveats

There are several gotchas to anticipate:

features. Currently we are locked into the 3k genes we are testing with. Handling subsets and supersets is TBC.
batch. In principle each "embedding" or decode part of the model should be able to measure a "batch-correction" parameter explicitly. In scVI this is explicitly learned. However in naive inference mode it should just be an inferred fudge factor.
noise. Whether or not to include doublet, mito, or ribo metrics.
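
The features and noise caveats can be made concrete with a small sketch. It is one possible approach, not a decision the project has made: zero-filling genes missing from the query is an assumption, as are the human gene-symbol prefixes ("MT-" for mitochondrial, "RPS"/"RPL" for ribosomal) and both function names.

```python
def noise_fractions(counts, gene_names):
    """Per-cell fraction of counts from mitochondrial and ribosomal genes.

    Assumes human gene symbols: "MT-" marks mitochondrial genes and
    "RPS"/"RPL" mark ribosomal protein genes.
    """
    total = sum(counts)
    if total == 0:
        return {"pct_mt": 0.0, "pct_rb": 0.0}
    mt = sum(c for c, g in zip(counts, gene_names) if g.startswith("MT-"))
    rb = sum(c for c, g in zip(counts, gene_names) if g.startswith(("RPS", "RPL")))
    return {"pct_mt": mt / total, "pct_rb": rb / total}


def align_to_training_genes(counts, gene_names, training_genes):
    """Reorder one cell's counts to the training gene order.

    Genes missing from the query are zero-filled; extra query genes are
    dropped. Returns the aligned counts and the list of missing genes.
    """
    lookup = dict(zip(gene_names, counts))
    aligned = [lookup.get(g, 0) for g in training_genes]
    missing = [g for g in training_genes if g not in lookup]
    return aligned, missing


# Toy query cell: a superset in some genes, a subset in others.
query_genes = ["MT-CO1", "RPS6", "GFAP", "SNAP25"]
query_counts = [5, 15, 30, 50]
training_genes = ["SNAP25", "GFAP", "AQP4"]

noise = noise_fractions(query_counts, query_genes)  # computed on all genes first
aligned, missing = align_to_training_genes(query_counts, query_genes, training_genes)
print(noise, aligned, missing)
```

Note the order of operations: noise metrics are computed on the full gene set before alignment, because mitochondrial and ribosomal genes may not be among the training genes.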

future changes

Training and query should be split up, with a different CLI for each modality. The query CLI can be leveraged for both the validation "Test" and any additional "Query" probes.

Training might be split into prep and train, with prep creating the base scvi VAE models for the other model flavors to use. Note that the scanvi variants require both 'batch_eq' and non-batch-corrected variants.

wrinkles:

the scvi family of

SEA-AD

model_args = { "n_layers": 2, "n_latent": 20, "dispersion": "gene-label" }

