-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: first version in the new repository (#1)
- Loading branch information
Showing
26 changed files
with
1,479 additions
and
3 deletions.
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
name: Unit Tests | ||
|
||
on: [ push, pull_request ] | ||
|
||
jobs: | ||
unit-tests: | ||
runs-on: ubuntu-latest | ||
strategy: | ||
matrix: | ||
suite: [ neodisambiguate ] | ||
java-version: [ 1.8, 11, 17, 21 ] | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- uses: actions/setup-java@v1 | ||
with: | ||
java-version: ${{ matrix.java-version }} | ||
- name: Cache for Scala Dependencies | ||
uses: actions/cache@v2 | ||
with: | ||
path: | | ||
~/.mill/download | ||
~/.m2/repository | ||
~/.cache/coursier | ||
key: ${{ runner.os }}-java-mill-${{ matrix.java-version }}-${{ hashFiles('**/build.sc') }} | ||
restore-keys: ${{ runner.os }}-java-mill- | ||
- name: Compile Scala Code | ||
run: | | ||
./mill --no-server clean | ||
./mill --no-server --disable-ticker ${{ matrix.suite }}.compile | ||
- name: Test Scala Code | ||
run: | | ||
./mill --no-server --disable-ticker ${{ matrix.suite }}.test | ||
- name: Create Code Coverage Report | ||
if: matrix.java-version == '11' | ||
run: | | ||
./mill --no-server --disable-ticker ${{ matrix.suite }}.scoverage.xmlReport | ||
# TODO: build an HTML report and upload as an artifact |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,13 @@ | ||
.bsp/* | ||
.DS_Store | ||
.java-version | ||
.idea/* | ||
.metals/* | ||
.vscode/* | ||
*.class | ||
*.log | ||
|
||
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml | ||
**/*.iml | ||
bin/* | ||
hs_err_pid* | ||
out/* | ||
target/* |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
0.11.10 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# neodisambiguate | ||
|
||
<!-- [![Install with bioconda](https://img.shields.io/badge/Install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) --> | ||
<!-- [![Anaconda Version](https://anaconda.org/bioconda/neodisambiguate/badges/version.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) --> | ||
[![Unit Tests](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml/badge.svg?branch=main)](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml) | ||
[![Java Version](https://img.shields.io/badge/java-8,11,17,21-c22d40.svg)](https://github.com/AdoptOpenJDK/homebrew-openjdk) | ||
[![Language](https://img.shields.io/badge/language-scala-c22d40.svg)](https://www.scala-lang.org/) | ||
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/clintval/neodisambiguate/blob/master/LICENSE) | ||
|
||
Disambiguate reads that were mapped to multiple references. | ||
|
||
![Torres del Paine](.github/img/cover.jpg) | ||
|
||
Install with the Conda or Mamba package manager after setting your [Bioconda channels](https://bioconda.github.io/user/install.html#set-up-channels): | ||
|
||
```text | ||
❯ conda install neodisambiguate | ||
``` | ||
|
||
### Introduction | ||
|
||
Disambiguation of aligned reads is performed per-template and all information across primary, secondary, and supplementary alignments is used as evidence. | ||
Alignment disambiguation is commonly required when analyzing sequencing data from transduction, transfection, transgenic, or xenographic (including patient derived xenograft) experiments. | ||
This tool works by comparing various alignment scores between a template that has been aligned to many references in order to determine which reference is the most likely source. | ||
|
||
All templates which are positively assigned to a single source reference are written to a reference-specific output BAM file. | ||
Any templates with ambiguous reference assignment are written to an ambiguous input-specific output BAM file. | ||
Only BAMs produced from the Burrows-Wheeler Aligner (bwa) or STAR are currently supported. | ||
|
||
Input BAMs of arbitrary sort order are accepted, however, an internal sort to queryname will be performed unless the BAM is already in queryname sort order. | ||
All output BAM files will be written in the same sort order as the input BAM files. | ||
Although paired-end reads will give the most discriminatory power for disambiguation of short-read sequencing data, this tool accepts paired, single-end (fragment), and mixed pairing input data. | ||
|
||
### Features | ||
|
||
- Accepts SAM/BAM sources of any sort order | ||
- Will disambiguate an arbitrary number of BAMs, all aligned to different references | ||
- Writes the ambiguous alignments to a separate directory | ||
- Extensible implementation which supports alternative disambiguation strategies | ||
- Benchmarks show high accuracy: [Click Here](benchmarks/disambiguate.md) | ||
|
||
### Command Line Usage | ||
|
||
```bash | ||
❯ neodisambiguate -i infile1.bam infile2.bam -p out/disambiguated | ||
``` | ||
|
||
### Example Usage | ||
|
||
To disambiguate templates for sample `dna00001` that are aligned to human (A) and mouse (B): | ||
|
||
```bash | ||
❯ neodisambiguate -i dna00001.A.bam dna00001.B.bam -p out/dna00001 -n hg38 mm10 | ||
``` | ||
|
||
```console | ||
❯ tree out/ | ||
out/ | ||
├── ambiguous-alignments/ | ||
│ ├── dna00001.A.ambiguous.bai | ||
│ ├── dna00001.A.ambiguous.bam | ||
│ ├── dna00001.B.ambiguous.bai | ||
│ └── dna00001.B.ambiguous.bam | ||
├── dna00001.hg38.bai | ||
├── dna00001.hg38.bam | ||
├── dna00001.mm10.bai | ||
└── dna00001.mm10.bam | ||
``` | ||
|
||
### Local Installation | ||
|
||
Bootstrap compilation and build the executable with: | ||
|
||
```bash | ||
./mill neodisambiguate.executable | ||
./bin/neodisambiguate --help | ||
``` | ||
|
||
### Prior Art | ||
|
||
This project was inspired by AstraZeneca's `disambiguate`: | ||
|
||
- https://github.com/AstraZeneca-NGS/disambiguate |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
insilico/ | ||
output/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
### Disambiguation Micro-Benchmark | ||
|
||
### Introduction | ||
|
||
As a preliminary test, we will attempt to disambiguate simulated paired-end reads from a homologous gene (`ABL1`) across human and murine species. | ||
|
||
| Assembly | Species | Coordinates | | ||
| --- | --- | --- | | ||
| `hg38` | _Homo sapiens_ | [`chr9:130713881-130887675`][hs38DH-reference] | | ||
| `mm10` | _Mus musculus_ | [`chr2:31688354-31807093`][mm10-reference] | | ||
| `rn6` | _Rattus norvegicus_ | [`chr3:10041820-10145076`][rn6-reference] | | ||
> Location of the `ABL1` gene in three model species. | ||
[hs38DH-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1342972 | ||
[mm10-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1332211 | ||
[rn6-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1584969 | ||
|
||
### Dependencies | ||
|
||
- [`bwa`](https://bioconda.github.io/recipes/bwa/README.html) | ||
- [`dwgsim`](https://anaconda.org/bioconda/dwgsim) | ||
- [`parallel`](https://anaconda.org/conda-forge/parallel) | ||
- [`picard`](https://bioconda.github.io/recipes/picard/README.html) | ||
|
||
Install with: | ||
|
||
```bash | ||
mamba env create -f environment.yaml | ||
``` | ||
|
||
### Simulation of Training Data | ||
|
||
We will simulate 1,000 paired-end reads with uniform coverage over the target gene in all three species. | ||
For the sake of tracking read pairs, we use the `-P` flag and prefix each read name with the source assembly ID. | ||
|
||
```bash | ||
❯ mkdir -p insilico | ||
❯ parallel \ | ||
'dwgsim -N 1000 -c 0 -z 42 -x <( echo {2} ) -P {1} \ | ||
/pipeline/reference-data/references/{1}/{1}.fa \ | ||
insilico/{1}' \ | ||
::: hs38DH mm10 rn6 \ | ||
:::+ 'chr9 130713881 130887675' 'chr2 31688354 31807093' 'chr3 10041820 10145076' | ||
``` | ||
|
||
### Digitally Mixing All Training Data | ||
|
||
We concatenate all reads into a single FASTQ pair and then convert that file into an umapped BAM (uBAM) file. | ||
|
||
```bash | ||
❯ cat insilico/*read1*.fastq.gz > insilico/all.read1.fastq.gz | ||
❯ cat insilico/*read2*.fastq.gz > insilico/all.read2.fastq.gz | ||
❯ picard FastqToSam \ | ||
F1=insilico/all.read1.fastq.gz \ | ||
F2=insilico/all.read2.fastq.gz \ | ||
OUTPUT=insilico/all.unmapped.bam \ | ||
SAMPLE_NAME=all | ||
``` | ||
|
||
### Alignment | ||
|
||
We align all read pairs to all three reference genomes independently. | ||
Although verbose, we follow the [GATK best practice][gatk-reference] alignment guide so that we ensure the sequence dictionaries of the output BAMs are correct. | ||
|
||
[gatk-reference]: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165 | ||
|
||
```bash | ||
❯ parallel \ | ||
'picard SamToFastq \ | ||
INPUT=insilico/all.unmapped.bam \ | ||
INTERLEAVE=true \ | ||
FASTQ=/dev/stdout \ | ||
| bwa mem -t 8 -p \ | ||
/pipeline/reference-data/references/{}/{}.fa \ | ||
/dev/stdin \ | ||
| picard MergeBamAlignment \ | ||
ALIGNED=/dev/stdin \ | ||
UNMAPPED=insilico/all.unmapped.bam \ | ||
OUTPUT=insilico/all.{}.mapped.bam \ | ||
REFERENCE_SEQUENCE=/pipeline/reference-data/references/{}/{}.fa' \ | ||
::: hs38DH mm10 rn6 | ||
``` | ||
|
||
### Disambiguation | ||
|
||
We are now ready to disambiguate our aligned templates! | ||
|
||
```bash | ||
❯ neodisambiguate -i insilico/all.*.mapped.bam -o insilico/disambiguated | ||
``` | ||
|
||
### Validation | ||
|
||
Let's see how we did: | ||
|
||
```bash | ||
❯ \ls -1 insilico/disambiguated* | ||
insilico/disambiguated.hs38DH.bam | ||
insilico/disambiguated.mm10.bam | ||
insilico/disambiguated.rn6.bam | ||
``` | ||
|
||
```bash | ||
❯ parallel -k \ | ||
'echo Distribution of known alignments disambiguated into: {} \ | ||
&& picard ViewSam \ | ||
INPUT=insilico/disambiguated.{}.bam \ | ||
ALIGNMENT_STATUS=All \ | ||
PF_STATUS=All \ | ||
2>/dev/null \ | ||
| grep -v "@" \ | ||
| tr "_" "\t" \ | ||
| cut -f1 \ | ||
| sort \ | ||
| uniq -c \ | ||
&& echo' \ | ||
::: hs38DH mm10 rn6 | ||
|
||
Distribution of known alignments disambiguated into: hs38DH | ||
1898 hs38DH | ||
|
||
Distribution of known alignments disambiguated into: mm10 | ||
1906 mm10 | ||
6 rn6 | ||
|
||
Distribution of known alignments disambiguated into: rn6 | ||
2 mm10 | ||
1914 rn6 | ||
``` | ||
|
||
### Conclusion | ||
|
||
Because these are alignments, and not read pairs or templates, we cannot directly calculate performance metrics such as sensitivity or specificty of this method. | ||
We hope to improve these benchmarks. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
name: neodisambiguate_benchmarks | ||
channels: | ||
- bioconda | ||
- conda-forge | ||
- defaults | ||
dependencies: | ||
- bwa | ||
- dwgsim | ||
- parallel | ||
- picard |
Oops, something went wrong.