Using scDeepSort with GitHub code
- Name each test data file in the format `species_TissueNumber_data.csv`. For example, `human_Pancreas11_data.csv` is a data file containing 11 human pancreas cells.
- Pre-process the test single-cell transcriptomics CSV file in two steps. First, revise the gene symbols according to the NCBI Gene database (updated on Jan. 10, 2020); unmatched and duplicated genes are removed. Then normalize the data with the default `LogNormalize` method in `Seurat` (R package), as detailed in `pre-process.R`. In the final test data, each column represents a cell and each row represents a gene, as shown below.

  | | Cell 1 | Cell 2 | Cell 3 | ... |
  | --- | --- | --- | --- | --- |
  | Gene 1 | 0 | 2.4 | 5.0 | ... |
  | Gene 2 | 0.8 | 1.1 | 4.3 | ... |
  | Gene 3 | 1.8 | 0 | 0 | ... |
  | ... | ... | ... | ... | ... |

- Place all test data under the `test` directory: human datasets under `./test/human` and mouse datasets under `./test/mouse`.
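The naming, normalization, and placement rules above can be sketched in Python. This is only an illustration, not the authors' `pre-process.R`: the helper names (`parse_test_name`, `log_normalize`, `expected_location`) are hypothetical, the regular expression assumes that a tissue name does not end in a digit, and the scale factor of 10,000 is Seurat's documented default for `LogNormalize`. Gene-symbol revision against the NCBI Gene database is not covered here.

```python
import math
import re
from pathlib import Path

# species_TissueNumber_data.csv, e.g. human_Pancreas11_data.csv.
# Assumption: the tissue name itself does not end in a digit.
NAME_RE = re.compile(r"^(human|mouse)_(.+?)(\d+)_data\.csv$")


def parse_test_name(filename):
    """Split a test file name into (species, tissue, number of cells)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected test file name: {filename!r}")
    species, tissue, n_cells = match.groups()
    return species, tissue, int(n_cells)


def log_normalize(counts, scale_factor=10000.0):
    """LogNormalize a genes x cells matrix given as a list of rows.

    Each count is divided by its cell (column) total, multiplied by
    scale_factor, and passed through log1p (natural log of 1 + x).
    """
    n_cells = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [math.log1p(value / totals[j] * scale_factor) if totals[j] else 0.0
         for j, value in enumerate(row)]
        for row in counts
    ]


def expected_location(filename, root="test"):
    """Return where the file should live: <root>/<species>/<filename>."""
    species, _, _ = parse_test_name(filename)
    return Path(root) / species / filename


# Made-up counts for 2 genes (rows) x 2 cells (columns).
raw = [[1, 0], [1, 2]]
normalized = log_normalize(raw)

print(parse_test_name("human_Pancreas11_data.csv"))
print(expected_location("mouse_Testis199_data.csv").as_posix())
```

Checking a file name with `parse_test_name` before running the prediction script may catch naming mistakes early, since the species, tissue, and cell count are passed again on the command line.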
Use `--evaluate` to reproduce the results shown in our paper. For example, to evaluate the data `mouse_Testis199_data.csv`, execute the following command:
python predict.py --species mouse --tissue Testis --test_dataset 199 --gpu -1 --evaluate --filetype gz --unsure_rate 2
- `--species`: The species of the cells, `human` or `mouse`.
- `--tissue`: The tissue of the cells. See the wiki page.
- `--test_dataset`: The number of cells in the test data.
- `--gpu`: The GPU to use, `0` for GPU and `-1` for CPU.
- `--filetype`: The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
- `--unsure_rate`: The threshold that defines the unsure type; the default is 2. Set it to 0 to exclude the unsure type.
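When many datasets have to be processed, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options listed above; the function `build_predict_command` is a hypothetical helper, not part of the repository, and its `mode` argument selects between `--evaluate` and the `--test` flag used for your own datasets.

```python
import shlex


def build_predict_command(species, tissue, n_cells, mode="evaluate",
                          gpu=-1, filetype="csv", unsure_rate=2):
    """Build the predict.py argument list from the documented options."""
    if species not in ("human", "mouse"):
        raise ValueError("species must be 'human' or 'mouse'")
    if mode not in ("evaluate", "test"):
        raise ValueError("mode must be 'evaluate' or 'test'")
    if filetype not in ("csv", "gz"):
        raise ValueError("filetype must be 'csv' or 'gz'")
    return [
        "python", "predict.py",
        "--species", species,
        "--tissue", tissue,
        "--test_dataset", str(n_cells),
        "--gpu", str(gpu),
        f"--{mode}",
        "--filetype", filetype,
        "--unsure_rate", str(unsure_rate),
    ]


# Command for the mouse testis dataset with 199 cells.
cmd = build_predict_command("mouse", "Testis", 199, filetype="gz")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.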
Output: the output file, named `species_Tissue_Number.csv`, is written to the automatically generated `result` directory. It contains four columns: the cell ID, the original cell type, the predicted main type, and the predicted subtype if applicable.
Note: to evaluate all testing datasets in our paper, please download them from the release page.
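Because the evaluation output holds both the original and the predicted main type, a simple agreement rate can be computed from it. The sketch below makes several assumptions: the columns are read by position (second and third), a header row is present (set `has_header=False` otherwise), and the column names and rows in the sample are made up for illustration. It is not the metric reported in the paper, and cells predicted as the unsure type simply count as mismatches here.

```python
import csv
import io


def evaluation_accuracy(lines, has_header=True):
    """Fraction of rows whose 2nd column (original) equals the 3rd (predicted)."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row
    total = correct = 0
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed rows
        total += 1
        if row[1].strip() == row[2].strip():
            correct += 1
    return correct / total if total else float("nan")


# Made-up sample standing in for a file in the result directory.
sample = io.StringIO(
    "cell_id,original_type,predicted_main_type,predicted_subtype\n"
    "C1,T cell,T cell,\n"
    "C2,B cell,T cell,\n"
    "C3,Macrophage,Macrophage,\n"
)
accuracy = evaluation_accuracy(sample)
print(f"{accuracy:.3f}")
```

For a real result file, open it with `open(path, newline="")` and pass the file object in place of `sample`.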
Use `--test` to test your own datasets. For example, to test the data `human_Pancreas11_data.csv`, execute the following command:
python predict.py --species human --tissue Pancreas --test_dataset 11 --gpu -1 --test --filetype csv --unsure_rate 2
- `--species`: The species of the cells, `human` or `mouse`.
- `--tissue`: The tissue of the cells. See the wiki page.
- `--test_dataset`: The number of cells in the test data.
- `--gpu`: The GPU to use, `0` for GPU and `-1` for CPU.
- `--filetype`: The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
- `--unsure_rate`: The threshold that defines the unsure type; the default is 2. Set it to 0 to exclude the unsure type.
Output: the output file, named `species_Tissue_Number.csv`, is written to the automatically generated `result` directory. It contains three columns: the cell ID, the predicted main type, and the predicted subtype if applicable.
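A quick way to inspect a prediction file is to tally the predicted main types. This sketch reads the second column by position and assumes a header row; the column names and rows in the sample are made up for illustration, and `count_predicted_types` is a hypothetical helper rather than part of the repository.

```python
import csv
import io
from collections import Counter


def count_predicted_types(lines, has_header=True):
    """Count how often each predicted main type (2nd column) occurs."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row
    return Counter(row[1].strip() for row in reader if len(row) >= 2)


# Made-up sample standing in for a file in the result directory.
sample = io.StringIO(
    "cell_id,predicted_main_type,predicted_subtype\n"
    "C1,Acinar cell,\n"
    "C2,Beta cell,\n"
    "C3,Acinar cell,\n"
)
counts = count_predicted_types(sample)
for cell_type, n in counts.most_common():
    print(cell_type, n)
```

A large share of cells falling into the unsure type may be a hint to revisit the `--unsure_rate` setting.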
To train your own model, prepare two files: a data file as described above and a cell annotation file, placed under the `./train` directory following the example files there. Then execute a command such as one of the following:
python train.py --species human --tissue Adipose --gpu -1 --filetype gz
python train.py --species mouse --tissue Muscle --gpu -1 --filetype gz
- `--species`: The species of the cells, `human` or `mouse`.
- `--tissue`: The tissue of the cells.
- `--gpu`: The GPU to use, `0` for GPU and `-1` for CPU.
- `--filetype`: The format of the data file, `csv` for `.csv` files and `gz` for `.gz` files. See `pre-process.R`.
Output: the trained model is saved under the `pretrained` directory. It can then be used to test new datasets from the same tissue with `predict.py`, as described above.
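The train-then-test workflow can be sketched as a pair of commands that share the same species and tissue. The helper `train_then_predict` is hypothetical, the cell count of 100 in the usage line is a placeholder, and the sketch only prints the commands instead of running them.

```python
import shlex


def train_then_predict(species, tissue, n_cells, gpu=-1,
                       train_filetype="gz", test_filetype="csv"):
    """Return the train.py and predict.py argument lists for one tissue."""
    train = [
        "python", "train.py",
        "--species", species,
        "--tissue", tissue,
        "--gpu", str(gpu),
        "--filetype", train_filetype,
    ]
    predict = [
        "python", "predict.py",
        "--species", species,
        "--tissue", tissue,
        "--test_dataset", str(n_cells),
        "--gpu", str(gpu),
        "--test",
        "--filetype", test_filetype,
        "--unsure_rate", "2",
    ]
    return [train, predict]


# Placeholder: a human adipose test set with 100 cells.
plan = train_then_predict("human", "Adipose", 100)
for cmd in plan:
    print(shlex.join(cmd))
```

To execute the plan, each list can be passed to `subprocess.run(cmd, check=True)` in order, so that prediction is skipped if training fails.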