Merge pull request #10 from vpc-ccg/br_list_from_lr
Updated README to doc extract_sr_bc_from_lr
baraaorabi authored Jan 20, 2023
2 parents a8f8d80 + 1e056b6 commit 1e009cd
Showing 1 changed file with 38 additions and 11 deletions.
49 changes: 38 additions & 11 deletions README.md
@@ -1,7 +1,7 @@
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/sctagger/README.html)

# scTagger
scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments to achieve the information of both datasets.
scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments so that gene expression (from the short-reads) and RNA splicing (from the long-reads) can be related at the cell level.

## Installation

@@ -23,7 +23,7 @@ scTagger has a single python script containing different functions to match long

The pipeline consists of three steps, each of which can be run separately:

#### Extract long-reads segment
#### *1) Extract long-reads segment*
The first step of the scTagger pipeline is to extract, from each long-read, the segment where a barcode is most likely to be found.
To run this step, you can use the following command.

@@ -37,23 +37,23 @@ To run this step, you can use the following command.
* `-g`: Space-separated list of ranges where the SR adapter should be found on the LRs (Optional, Default: Detect from data)
* `-z`: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
* `-t`: Number of threads (Optional, Default: 1)
* `-sa`: Short-read adapter (Optional, Default: "CTACACGACGCTCTTCCGATCT")
* `-sa`: Short-read adapter (Optional, Default: `CTACACGACGCTCTTCCGATCT`)
* `--num-bp-afte`: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
* `-o`: Path to output file
* `-p`: Path to plot file (Optional, Default: No plotting)

**Inputs**
* A list of fastQ files of long reads
* A list of FASTQ files of long-reads

**Outputs**
* A TSV file:
* First column is read-id
* Second column is the best edit distance with the short-read adapter
* Third column is the starting point of long-read that matches with the adapter
* Fourth column is the long-read segment that was found.
* A plot of optimal alignment locations of the short read adapter to the long reads.
* A plot of optimal alignment locations of the short read adapter to the long-reads.
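
The four-column TSV described above can be read back for inspection with a few lines of Python. This is a minimal sketch, not part of scTagger: the `LrSegment` and `read_lr_segments` names are hypothetical, and it assumes the file has no header line and that the second and third columns are always integers.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class LrSegment(NamedTuple):
    """One row of the step-1 output (hypothetical helper type)."""
    read_id: str
    edit_distance: int  # best edit distance to the short-read adapter
    start: int          # start of the adapter match on the long-read
    segment: str        # extracted long-read segment


def read_lr_segments(handle: TextIO) -> Iterator[LrSegment]:
    """Yield one LrSegment per well-formed row of a tab-separated file.

    Assumption: rows with fewer than four fields are skipped rather
    than treated as errors.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue
        yield LrSegment(row[0], int(row[1]), int(row[2]), row[3])
```

Passing an open file handle (or an `io.StringIO` for testing) to `read_lr_segments` gives an iterator of typed rows, which is convenient for filtering segments by edit distance before the later steps.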

#### Extract short-reads barcodes
#### *2) Extract short-reads barcodes*

The second step is to extract the top short-reads barcodes that cover most of the reads.

@@ -78,8 +78,35 @@ The second step is to extract the top short-reads barcodes that cover most of th
* Second column is the number of appearances of the barcode
* A cumulative plot of SR coverage with batches of 1,000 barcodes

#### Match long-reads segment with short-reads barcode
The last step is to match long read segments with selected barcodes from short reads
#### *Alt. 2) Extract short-reads barcodes directly from long-reads*

This is an alternative to the second step that avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes directly from the long-reads.
This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments.
The barcodes are sorted by frequency and the most frequent barcodes are kept using the same strategy as the `extract_sr_bc` module.

```
./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"
```

**Arguments**
* `-i`: Input TSV file of long-read segments generated by the `extract_lr_bc` step
* `-o`: Path to output file.
* `-wl`: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
* `--thresh`: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
* `--step-size`: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
* `--max-barcode-cnt`: Max number of barcodes to keep (Optional, Default: 25000)
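
The interaction of `--thresh`, `--step-size` and `--max-barcode-cnt` can be illustrated with a short sketch. This is one plausible reading of the argument descriptions above, not scTagger's actual code: the function name `select_top_barcodes` is hypothetical, and the sketch assumes that the threshold is a fraction of the total barcode count and that selection stops at the first batch falling below it.

```python
from typing import Dict, List, Tuple


def select_top_barcodes(
    counts: Dict[str, int],
    thresh: float = 0.005,
    step_size: int = 1000,
    max_barcode_cnt: int = 25000,
) -> List[Tuple[str, int]]:
    """Keep the most frequent barcodes, batch by batch.

    Assumption: a batch of `step_size` barcodes is kept only if its
    summed count is at least `thresh` times the total count.
    """
    # Sort by descending count; break ties by barcode for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(counts.values())
    kept: List[Tuple[str, int]] = []
    if total == 0:
        return kept
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        if sum(count for _, count in batch) / total < thresh:
            break  # this batch adds too little coverage; stop here
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            break
    return kept[:max_barcode_cnt]
```

Under this reading, a larger `step_size` makes the stopping rule coarser, while `max_barcode_cnt` acts as a hard cap regardless of coverage.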

**Input**
* The output file of the `extract_lr_bc` step
* 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)

**Output**
* A TSV file
* First column is barcodes
* Second column is the number of appearances of the barcode
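
The exact-match counting that produces these two columns can be sketched as a sliding window over each long-read segment. This is an illustration only, not the `extract_sr_bc_from_lr` implementation: the function name `count_whitelist_hits` is hypothetical, and the sketch assumes 16-base barcodes, forward-strand matches only, and that each segment contributes at most one count per barcode.

```python
from collections import Counter
from typing import Iterable, Set


def count_whitelist_hits(
    segments: Iterable[str],
    whitelist: Set[str],
    barcode_len: int = 16,
) -> Counter:
    """Count segments containing an exact copy of each whitelisted barcode."""
    counts: Counter = Counter()
    for segment in segments:
        hits = set()
        # Slide a window of barcode_len bases across the segment.
        for i in range(len(segment) - barcode_len + 1):
            kmer = segment[i:i + barcode_len]
            if kmer in whitelist:
                hits.add(kmer)
        # Assumption: a barcode is counted once per segment.
        counts.update(hits)
    return counts
```

Calling `counts.most_common()` on the result gives the barcodes sorted by frequency, which is the ordering the selection step relies on.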

#### *3) Match long-reads segment with short-reads barcodes*
The last step is to match long-read segments with the selected barcodes from the short-reads.
```
./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
```
@@ -96,15 +123,15 @@ The last step is to match long read segments with selected barcodes from short r


**Inputs**
* Use the output of extracting long read segment and selecting top barcodes part as the inputs of this section
* The outputs of the long-read segment extraction step and the top barcode selection step are the inputs of this step

**Outputs**
* A TSV file
* First column is the read id
* Second column is the minimum edit distance
* Third column is the number of short reads barcodes that match with the long read
* Third column is the number of short reads barcodes that match with the long-read
* Fourth column is the long-read segment, and the Fifth column is a list of all short-read barcodes with minimum edit distance
* A bar plot that shows the number of long reads by the minimum edit distance of their match barcode
* A bar plot that shows the number of long-reads by the minimum edit distance of their match barcode
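
To make the "minimum edit distance" and "list of all short-read barcodes with minimum edit distance" columns concrete, here is a brute-force sketch. It is not how `match_trie` works internally (the subcommand name suggests a trie-based search); the function names are hypothetical, and the sketch assumes that a barcode may align to any substring of the long-read segment.

```python
from typing import Iterable, List, Tuple


def infix_edit_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # Row 0 is all zeros: an alignment may start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,             # pattern base unmatched
                curr[j - 1] + 1,         # extra base in the text
                prev[j - 1] + (p != t),  # match or substitution
            )
        prev = curr
    # An alignment may also end anywhere in the text.
    return min(prev)


def best_barcodes(segment: str, barcodes: Iterable[str]) -> Tuple[int, List[str]]:
    """Return the minimum distance and every barcode that achieves it."""
    distances = {bc: infix_edit_distance(bc, segment) for bc in barcodes}
    best = min(distances.values())
    return best, sorted(bc for bc, d in distances.items() if d == best)
```

The length of the returned list corresponds to the third output column (number of matching barcodes). Comparing every barcode with every segment this way is quadratic in practice and only meant to show what is being computed.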

## Citing scTagger
scTagger was first accepted to RECOMB-seq 2022 and is now published by iScience:
