This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

work in progress
jkobject committed Sep 9, 2024
1 parent ed4e674 commit a0294a1
Showing 6 changed files with 1,842 additions and 24 deletions.
136 changes: 113 additions & 23 deletions docs/index.md
@@ -1,14 +1,15 @@

# scPRINT: Large Cell Model for scRNAseq data

[![codecov](https://codecov.io/gh/jkobject/scPRINT/branch/main/graph/badge.svg?token=GRnnData_token_here)](https://codecov.io/gh/jkobject/scPRINT)
[![CI](https://github.com/jkobject/scPRINT/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scPRINT/actions/workflows/main.yml)
[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
[![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/391909874.svg)]()
[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)

![logo](logo.png)

@@ -23,39 +24,122 @@ scPRINT can be used to perform the following analyses:
- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset

[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
[Read the manuscript!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT. Have a look at some of my [X-plainers](https://twitter.com/jkobject).

![figure1](figure1.png)

## Table of Contents

- [scPRINT: Large Cell Model for scRNAseq data](#scprint-large-cell-model-for-scrnaseq-data)
- [Table of Contents](#table-of-contents)
- [Install `scPRINT`](#install-scprint)
- [lamin.ai](#laminai)
- [install](#install)
- [pytorch and GPUs](#pytorch-and-gpus)
- [dev install](#dev-install)
- [Usage](#usage)
- [scPRINT's basic commands](#scprints-basic-commands)
- [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
- [Simple tests:](#simple-tests)
- [FAQ](#faq)
- [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
- [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
- [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
- [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
- [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
- [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
- [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
- [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
- [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
- [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
- [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
- [Documentation](#documentation)
- [Model Weights](#model-weights)
- [Development](#development)
- [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)


## Install `scPRINT`

For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10.
For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10. Its installation takes on average 10 minutes.

If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.

```python
conda create -n "[whatever]" python==3.10
### lamin.ai

To use scPRINT, you need to use lamin.ai. It is required to load biological information such as genes, cell types, organisms, etc.

To do so, connect with Google or GitHub to [lamin.ai](https://lamin.ai/login), then be sure to log in before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).

### install

To start you will need to do:

```bash
conda create -n <env-name> python==3.10 #scprint might work with python >3.10, but it is not tested
#one of
pip install scprint # OR
pip install scprint[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
pip install scprint[flash] && pip install -e "git+https://github.com/triton-lang/triton.git@legacy-backend#egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
pip install "scprint[dev]" # for the dev dependencies (building etc..) OR
pip install "scprint[flash]" # to use flashattention2 with triton: only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
#OR pip install "scprint[dev,flash]" (the quotes keep shells such as zsh from expanding the brackets)

lamin login <email> --key <API-key>
lamin init --storage <folder-name-where-lamin-data-will-be-stored> --schema bionty
```
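
Before running anything, it can help to confirm that the environment matches what the steps above describe. The following is a minimal sketch and not part of scPRINT: `check_environment` is a hypothetical helper name, and the module names `scprint`, `scdataloader` and `lamindb` are assumptions about what the install provides. It relies only on the standard library.

```python
import importlib.util
import sys


def check_environment(modules=("scprint", "scdataloader", "lamindb")):
    """Report the running Python version and which modules can be found.

    Hypothetical helper for illustration. It does not import the modules,
    it only asks the import system whether they are installed.
    """
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "modules": {},
    }
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report["modules"][name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    report = check_environment()
    print("Python", report["python"], "(the README reports testing with 3.10)")
    for name, found in report["modules"].items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing module here usually means the `pip install` step ran in a different conda environment than the one currently active.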

If you are starting with lamin and had to do a `lamin init`, you will also need to populate your ontologies. This is because scPRINT uses ontologies to define its cell types, diseases, sexes, ethnicities, etc.

You can do it manually or with our function:

```python
from scdataloader.utils import populate_my_ontology

# populate everything (recommended; can take 2-10 minutes)
populate_my_ontology()

# or populate only the minimum scPRINT needs to run some inferences
# (denoising, GRN inference)
populate_my_ontology(
    organisms=["NCBITaxon:10090", "NCBITaxon:9606"],  # mouse, human
    sex=["PATO:0000384", "PATO:0000383"],  # male, female
    celltypes=None,
    ethnicities=None,
    assays=None,
    tissues=None,
    diseases=None,
    dev_stages=None,
)
```

We make use of some additional packages we developed alongside scPRINT.

Please refer to their documentation for more information:

- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
- [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
- [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.

### lamin.ai
### pytorch and GPUs

scPRINT can run on machines without GPUs, but it will be slow. It is highly recommended to use a GPU for inference.

Once you have a GPU and have installed the required drivers, you might need to install a specific version of pytorch that is compatible with your drivers. For example, nvidia 550 drivers will lead to an nvidia toolkit 11.7 or 11.8, which might mean you need to re-install a different flavor of pytorch for things to work. In my case, on linux, I used the command:
`pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118`

I was able to test it with the nvidia toolkit versions 11.7, 11.8, and 12.2.
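
The `--index-url` in the command above encodes the CUDA toolkit version as a `cuXYZ` suffix. As a rough sketch, the helper below builds that URL from a version string. `pytorch_index_url` is a hypothetical name, and the pattern is only inferred from the `cu118` URL quoted above. PyTorch publishes wheels for some CUDA versions only, so check that the resulting index exists before relying on it.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Build a PyTorch wheel index URL for a CUDA toolkit version.

    Follows the pattern of the cu118 URL quoted in the README. Assumption:
    wheels are only published for some CUDA versions, so the URL returned
    here is a candidate to check, not a guarantee.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '11.8', got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


print(pytorch_index_url("11.8"))
```

The matching `torch`, `torchvision` and `torchaudio` versions still have to be chosen by hand, as in the command above.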

⚠️ if you want to use the scDataloader's multi-dataset mode or if you want to preprocess datasets and other functions of the model, you will need to use lamin.ai.
### dev install

In that case, connect with google or github to [lamin.ai](https://lamin.ai/login), then be sure to connect before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
If you want to use the latest version of scPRINT and work on the code yourself, use `git clone` and `pip install -e` instead of `pip install`.

```bash
git clone https://github.com/jkobject/scPRINT
git clone https://github.com/jkobject/scDataLoader
git clone https://github.com/cantinilab/GRnnData
git clone https://github.com/jkobject/benGRN
pip install -e "scPRINT[dev]"
pip install -e "scDataLoader[dev]"
pip install -e "GRnnData[dev]"
pip install -e "benGRN[dev]"
```

## Usage

@@ -88,7 +172,7 @@ $ scprint fit/train/predict/test/denoise/embed/gninfer --config config/[medium|l

Find out more about the commands by running `scprint --help` or `scprint [command] --help`.

more examples of using the command line are available in the [docs](./docs/usage.md).
More examples of using the command line are available in the [docs](usage.md).
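
When the command line is driven from a script, it can be convenient to assemble the call programmatically. This is a minimal sketch that reads the usage line above as a list of alternative sub-commands. `build_scprint_command` is a hypothetical helper, `path/to/config.yml` is a placeholder, and only the sub-command names and the `--config` flag come from this README.

```python
import shlex

# Sub-commands as listed in the usage line of the README.
SUBCOMMANDS = ("fit", "train", "predict", "test", "denoise", "embed", "gninfer")


def build_scprint_command(subcommand, config_path):
    """Return the argument list for one scprint CLI call.

    Hypothetical helper: only the sub-command names and the --config flag
    come from the README; any further options are left out on purpose.
    """
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown sub-command: {subcommand!r}")
    return ["scprint", subcommand, "--config", str(config_path)]


# "path/to/config.yml" is a placeholder, not a file shipped with scPRINT.
cmd = build_scprint_command("denoise", "path/to/config.yml")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the config path contains spaces.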

### Notes on GPU/CPU usage with triton

@@ -102,6 +186,10 @@ model = scPrint.load_from_checkpoint(
transformer="normal")
```

### Simple tests:

An installation of scPRINT and a simple test of the denoiser are performed on each commit to the main branch with a [Github action](https://github.com/jkobject/scPRINT/actions) and [pytest workflow](https://github.com/jkobject/scPRINT/blob/main/.github/workflows/main.yml). They also provide an expected runtime for the installation and run of scPRINT.

We now explore the different usages of scPRINT:

## FAQ
@@ -110,27 +198,27 @@ We now explore the different usages of scPRINT:

-> Refer to the gene network inference section in [this notebook](./notebooks/cancer_usecase.ipynb#).

-> More examples in this notebook [notebooks/assessments/bench_omni.ipynb](../notebooks/bench_omni.ipynb).
-> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_omni.ipynb).

### I want to generate cell embeddings and cell label predictions from scRNAseq data:

-> Refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb#).

### I want to denoising my scRNAseq dataset:
### I want to denoise my scRNAseq dataset:

-> Refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).

-> More example in our benchmark notebook [notebooks/assessments/bench_denoising.ipynb](../notebooks/bench_denoising.ipynb).
-> More examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_denoising.ipynb).

### I want to generate an atlas-level embedding

-> Refer to the notebook [figures/nice_umap.ipynb](../figures/nice_umap.ipynb).
-> Refer to the notebook [nice_umap.ipynb](https://github.com/jkobject/scPRINT/blob/main/figures/nice_umap.ipynb).

### I need to generate gene tokens using pLLMs

To run scPRINT, you can optionally define the gene tokens using protein language model embeddings of genes. To do so, provide scPRINT with the path to a parquet file containing the precomputed embeddings for each gene name, via "precpt_gene_emb".

-> To generate this file please refer to the notebook [notebooks/generate_gene_embeddings.ipynb](../notebooks/generate_gene_embeddings.ipynb).
-> To generate this file please refer to the notebook [generate_gene_embeddings](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb).

### I want to pre-train scPRINT from scratch on my own data

@@ -163,7 +251,7 @@ model = scPrint.load_from_checkpoint(
)
```

You can also recreate the gene embedding file through [this notebook](notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.
You can also recreate the gene embedding file through [this notebook](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.

The file itself is also available on [hugging face](https://huggingface.co/jkobject/scPRINT/tree/main).

@@ -177,21 +265,23 @@ Model weights are available on [hugging face](https://huggingface.co/jkobject/sc

## Development

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
Read the [CONTRIBUTING.md](https://github.com/jkobject/scPRINT/blob/main/CONTRIBUTING.md) file.

Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to know more about how pre-training was performed and its behavior.

The reported code coverage is not accurate, as I am using the command line interface for now. More than 50% of the code is covered by my current unit tests.

Acknowledgement:

- [python template](https://github.com/rochacbruno/python-project-template)
- [laminDB](https://lamin.ai/)
- [lightning](https://lightning.ai/)

## Work in progress:
## Work in progress (PR welcomed):

1. remove the triton dependencies
2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
4. improve classifier to output uncertainties and topK predictions when unsure
5.
5. setup latest lamindb version

Awesome Large Cell Model created by Jeremie Kalfon.