diff --git a/README.md b/README.md
index 5e64652..58560b1 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/sctagger/README.html)
 
 # scTagger
-scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments to achieve the information of both datasets.
+scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments to enable relating at the cell level gene expression (from short-reads) and RNA splicing (from the long-reads).
 
 ## Installation
 
@@ -23,7 +23,7 @@ scTagger has a single python script containing different functions to match long
 
 The whole pipeline contains three steps that you can run each part separately:
 
-#### Extract long-reads segment
+#### *1) Extract long-reads segment*
 
 The first step of the scTagger pipeline is to extract a segment where the probability of seeing a barcode is more than in other places.
 To run this step, you can use the following command.
@@ -37,13 +37,13 @@ To run this step, you can use the following command.
 * `-g`: Space separated of the ranges of where SR adapter should be found on the LR's (Optional, Default: Detect from data)
 * `-z`: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with \".gz\")
 * `-t`: Number of threads (Optional, Default: 1)
-* `-sa`: Short-read adapter (Optional, Default: "CTACACGACGCTCTTCCGATCT")
+* `-sa`: Short-read adapter (Optional, Default: `CTACACGACGCTCTTCCGATCT`)
 * `--num-bp-afte`: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
 * `-o`: Path to output file
 * `-p`: Path to plot file (Optional, Default: No plotting)
 
 **Inputs**
-* A list of fastQ files of long reads
+* A list of FASTQ files of long-reads
 
 **Outputs**
 * A Tsv file:
@@ -51,9 +51,9 @@ To run this step, you can use the following command.
     * Second column is the best edit distance with the short-read adapter
     * Third column is the starting point of long-read that matches with the adapter
     * Fourth column is the long-read segment that find.
-* A plot of optimal alignment locations of the short read adapter to the long reads.
+* A plot of optimal alignment locations of the short read adapter to the long-reads.
 
-#### Extract short-reads barcodes
+#### *2) Extract short-reads barcodes*
 
 The second step is to extract the top short-reads barcodes that cover most of the reads.
 
@@ -78,8 +78,35 @@ The second step is to extract the top short-reads barcodes that cover most of th
     * Second column is the number of appearances of the barcode
 * A cumulative plot of SR coverage with batches of 1,000 barcodes
 
-#### Match long-reads segment with short-reads barcode
-The last step is to match long read segments with selected barcodes from short reads
+#### *Alt. 2) Extract short-reads barcodes directly from long-reads*
+
+This is an alternative to the second step which avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes from the long-reads directly.
+This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments.
+The barcodes are sorted by frequency and the most frequent barcodes are kept using the same strategy as the `extract_sr_bc` module.
+
+```
+./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"
+```
+
+**Arguments**
+* `-i`: Input TSV file containing the long-read segments generated by the `extract_lr_bc` step
+* `-o`: Path to output file.
+* `-wl`: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both txt.gz files and .txt files.
+* `--thresh`: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
+* `--step-size`: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
+* `--max-barcode-cnt`: Max number of barcodes to keep (Optional, Default: 25000)
+
+**Input**
+* The output file of the `extract_lr_bc` step
+* 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)
+
+**Output**
+* A TSV file
+    * First column is barcodes
+    * Second column is the number of appearances of the barcode
+
+#### *3) Match long-reads segment with short-reads barcodes*
+The last step is to match long-read segments with selected barcodes from short reads
 ```
 ./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
 ```
@@ -96,15 +123,15 @@ The last step is to match long read segments with selected barcodes from short r
 
 
 **Inputs**
-* Use the output of extracting long read segment and selecting top barcodes part as the inputs of this section
+* Use the output of extracting long-read segment and selecting top barcodes part as the inputs of this section
 
 **Outputs**
 * A TSV file
     * First column is the read id
     * Second column is the minimum edit distance
-    * Third column is the number of short reads barcodes that match with the long read
+    * Third column is the number of short reads barcodes that match with the long-read
     * Fourth column is the long-read segment, and the Fifth column is a list of all short-read barcodes with minimum edit distance
-* A bar plot that shows the number of long reads by the minimum edit distance of their match barcode
+* A bar plot that shows the number of long-reads by the minimum edit distance of their match barcode
 
 ## Citing scTaggger
 scTagger was first accepted to RECOMB-seq 2022 and is now published by iScience:
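
The `Alt. 2)` section added by this diff describes building a barcode whitelist from exact matches on the long-read segments and then keeping the most frequent barcodes in steps. The following is a minimal sketch of that idea, not the code in `scTagger.py`. The function names `count_exact_matches` and `select_top_barcodes` are hypothetical, the barcode length of 16 is exposed as a parameter `k`, and treating `--thresh` as a fraction of all matched reads is an assumption, since the README text does not say which total the threshold is measured against.

```python
from collections import Counter


def count_exact_matches(segments, whitelist, k=16):
    """Count, per whitelist barcode, the segments containing it as an exact k-mer.

    Hypothetical helper: each segment contributes at most once per barcode.
    """
    counts = Counter()
    for seg in segments:
        seen = set()
        for i in range(len(seg) - k + 1):
            kmer = seg[i:i + k]
            if kmer in whitelist and kmer not in seen:
                seen.add(kmer)
                counts[kmer] += 1
    return counts


def select_top_barcodes(counts, thresh=0.005, step_size=1000, max_barcode_cnt=25000):
    """Keep the most frequent barcodes, one batch of `step_size` at a time.

    Assumption: a batch is kept only while its summed count is at least
    `thresh` times the total count over all barcodes.
    """
    # Sort by descending count; ties are broken by barcode for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(counts.values())
    kept = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        if total == 0 or sum(c for _, c in batch) / total < thresh:
            break
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            kept = kept[:max_barcode_cnt]
            break
    return kept


if __name__ == "__main__":
    # Toy data: two whitelist barcodes and four long-read segments.
    whitelist = {"A" * 16, "C" * 16}
    segments = ["GG" + "A" * 16 + "T", "A" * 16, "C" * 16 + "G", "TTTT"]
    counts = count_exact_matches(segments, whitelist)
    for barcode, n in select_top_barcodes(counts, thresh=0.4, step_size=1):
        print(barcode, n, sep="\t")
```

Processing barcodes in batches rather than one by one mirrors the `--step-size` description: the stopping decision is made on the summed count of a whole batch, so a single low-count barcode inside an otherwise well-supported batch does not end the selection early.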