feat: first version in the new repository (#1)

clintval · Jul 26, 2024 · a0c8913 · a0c8913
1 parent 8ba3a85
commit a0c8913
Show file tree

Hide file tree

Showing 26 changed files with 1,479 additions and 3 deletions.
diff --git a/.github/img/cover.jpg b/.github/img/cover.jpg
diff --git a/.github/workflows/unit-tests.yml b/.github/workflows/unit-tests.yml
@@ -0,0 +1,37 @@
+name: Unit Tests
+
+on: [ push, pull_request ]
+
+jobs:
+  unit-tests:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        suite: [ neodisambiguate ]
+        java-version: [ 1.8, 11, 17, 21 ]
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java-version }}
+      - name: Cache for Scala Dependencies
+        uses: actions/cache@v2
+        with:
+          path: |
+            ~/.mill/download
+            ~/.m2/repository
+            ~/.cache/coursier
+          key: ${{ runner.os }}-java-mill-${{ matrix.java-version }}-${{ hashFiles('**/build.sc') }}
+          restore-keys: ${{ runner.os }}-java-mill-
+      - name: Compile Scala Code
+        run: |
+          ./mill --no-server clean
+          ./mill --no-server --disable-ticker ${{ matrix.suite }}.compile
+      - name: Test Scala Code
+        run: |
+          ./mill --no-server --disable-ticker ${{ matrix.suite }}.test
+      - name: Create Code Coverage Report
+        if: matrix.java-version == '11'
+        run: |
+          ./mill --no-server --disable-ticker ${{ matrix.suite }}.scoverage.xmlReport
+      # TODO: build an HTML report and upload as an artifact
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,13 @@
+.bsp/*
+.DS_Store
+.java-version
+.idea/*
+.metals/*
+.vscode/*
 *.class
 *.log
-
-# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
+**/*.iml
+bin/*
 hs_err_pid*
+out/*
+target/*
diff --git a/.mill-version b/.mill-version
@@ -0,0 +1 @@
+0.11.10
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2024 Clint Valentine
+Copyright © 2019, 2024 Clint Valentine
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -0,0 +1,83 @@
+# neodisambiguate
+
+<!-- [![Install with bioconda](https://img.shields.io/badge/Install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) -->
+<!-- [![Anaconda Version](https://anaconda.org/bioconda/neodisambiguate/badges/version.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) -->
+[![Unit Tests](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml/badge.svg?branch=main)](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml)
+[![Java Version](https://img.shields.io/badge/java-8,11,17,21-c22d40.svg)](https://github.com/AdoptOpenJDK/homebrew-openjdk)
+[![Language](https://img.shields.io/badge/language-scala-c22d40.svg)](https://www.scala-lang.org/)
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/clintval/neodisambiguate/blob/master/LICENSE)
+
+Disambiguate reads that were mapped to multiple references.
+
+![Torres del Paine](.github/img/cover.jpg)
+
+Install with the Conda or Mamba package manager after setting your [Bioconda channels](https://bioconda.github.io/user/install.html#set-up-channels):
+
+```text
+❯ conda install neodisambiguate
+```
+
+### Introduction
+
+Disambiguation of aligned reads is performed per-template and all information across primary, secondary, and supplementary alignments is used as evidence.
+Alignment disambiguation is commonly required when analyzing sequencing data from transduction, transfection, transgenic, or xenographic (including patient derived xenograft) experiments.
+This tool works by comparing various alignment scores between a template that has been aligned to many references in order to determine which reference is the most likely source.
+
+All templates which are positively assigned to a single source reference are written to a reference-specific output BAM file.
+Any templates with ambiguous reference assignment are written to an ambiguous input-specific output BAM file.
+Only BAMs produced from the Burrows-Wheeler Aligner (bwa) or STAR are currently supported.
+
+Input BAMs of arbitrary sort order are accepted, however, an internal sort to queryname will be performed unless the BAM is already in queryname sort order.
+All output BAM files will be written in the same sort order as the input BAM files.
+Although paired-end reads will give the most discriminatory power for disambiguation of short-read sequencing data, this tool accepts paired, single-end (fragment), and mixed pairing input data.
+
+### Features
+
+- Accepts SAM/BAM sources of any sort order
+- Will disambiguate an arbitrary number of BAMs, all aligned to different references
+- Writes the ambiguous alignments to a separate directory
+- Extensible implementation which supports alternative disambiguation strategies
+- Benchmarks show high accuracy: [Click Here](benchmarks/disambiguate.md)
+
+### Command Line Usage
+
+```bash
+❯ neodisambiguate -i infile1.bam infile2.bam -p out/disambiguated
+```
+
+### Example Usage
+
+To disambiguate templates for sample `dna00001` that are aligned to human (A) and mouse (B):
+
+```bash
+❯ neodisambiguate -i dna00001.A.bam dna00001.B.bam -p out/dna00001 -n hg38 mm10
+```
+
+```console
+❯ tree out/
+  out/
+  ├── ambiguous-alignments/
+  │  ├── dna00001.A.ambiguous.bai
+  │  ├── dna00001.A.ambiguous.bam
+  │  ├── dna00001.B.ambiguous.bai
+  │  └── dna00001.B.ambiguous.bam
+  ├── dna00001.hg38.bai
+  ├── dna00001.hg38.bam
+  ├── dna00001.mm10.bai
+  └── dna00001.mm10.bam
+```
+
+### Local Installation
+
+Bootstrap compilation and build the executable with:
+
+```bash
+./mill neodisambiguate.executable
+./bin/neodisambiguate --help
+```
+
+### Prior Art
+
+This project was inspired by AstraZeneca's `disambiguate`:
+
+- https://github.com/AstraZeneca-NGS/disambiguate
diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore
@@ -0,0 +1,2 @@
+insilico/
+output/
diff --git a/benchmarks/disambiguate.md b/benchmarks/disambiguate.md
@@ -0,0 +1,134 @@
+### Disambiguation Micro-Benchmark
+
+### Introduction
+
+As a preliminary test, we will attempt to disambiguate simulated paired-end reads from a homologous gene (`ABL1`) across human and murine species.
+
+| Assembly   | Species             | Coordinates                                    |
+| ---        | ---                 | ---                                            |
+| `hg38`     | _Homo sapiens_      | [`chr9:130713881-130887675`][hs38DH-reference] |
+| `mm10`     | _Mus musculus_      | [`chr2:31688354-31807093`][mm10-reference]     |
+| `rn6`      | _Rattus norvegicus_ | [`chr3:10041820-10145076`][rn6-reference]      |
+> Location of the `ABL1` gene in three model species.
+
+[hs38DH-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1342972
+[mm10-reference]:   https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1332211
+[rn6-reference]:    https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1584969
+
+### Dependencies
+
+- [`bwa`](https://bioconda.github.io/recipes/bwa/README.html)
+- [`dwgsim`](https://anaconda.org/bioconda/dwgsim)
+- [`parallel`](https://anaconda.org/conda-forge/parallel)
+- [`picard`](https://bioconda.github.io/recipes/picard/README.html)
+
+Install with:
+
+```bash
+mamba env create -f environment.yaml
+```
+
+### Simulation of Training Data
+
+We will simulate 1,000 paired-end reads with uniform coverage over the target gene in all three species.
+For the sake of tracking read pairs, we use the `-P` flag and prefix each read name with the source assembly ID.
+
+```bash
+❯ mkdir -p insilico
+❯ parallel \
+    'dwgsim -N 1000 -c 0 -z 42 -x <( echo {2} ) -P {1} \
+         /pipeline/reference-data/references/{1}/{1}.fa \
+         insilico/{1}' \
+    ::: hs38DH mm10 rn6 \
+    :::+ 'chr9 130713881 130887675' 'chr2 31688354 31807093' 'chr3 10041820 10145076'
+```
+
+### Digitally Mixing All Training Data
+
+We concatenate all reads into a single FASTQ pair and then convert that file into an umapped BAM (uBAM) file.
+
+```bash
+❯ cat insilico/*read1*.fastq.gz > insilico/all.read1.fastq.gz
+❯ cat insilico/*read2*.fastq.gz > insilico/all.read2.fastq.gz
+❯ picard FastqToSam \
+    F1=insilico/all.read1.fastq.gz \
+    F2=insilico/all.read2.fastq.gz  \
+    OUTPUT=insilico/all.unmapped.bam \
+    SAMPLE_NAME=all 
+```
+
+### Alignment
+
+We align all read pairs to all three reference genomes independently.
+Although verbose, we follow the [GATK best practice][gatk-reference] alignment guide so that we ensure the sequence dictionaries of the output BAMs are correct.
+
+[gatk-reference]: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165
+
+```bash
+❯ parallel \
+    'picard SamToFastq \
+         INPUT=insilico/all.unmapped.bam \
+         INTERLEAVE=true \
+         FASTQ=/dev/stdout \
+     | bwa mem -t 8 -p \
+         /pipeline/reference-data/references/{}/{}.fa \
+         /dev/stdin \
+     | picard MergeBamAlignment \
+         ALIGNED=/dev/stdin \
+         UNMAPPED=insilico/all.unmapped.bam \
+         OUTPUT=insilico/all.{}.mapped.bam \
+         REFERENCE_SEQUENCE=/pipeline/reference-data/references/{}/{}.fa' \
+  ::: hs38DH mm10 rn6
+```
+
+### Disambiguation
+
+We are now ready to disambiguate our aligned templates!
+
+```bash
+❯ neodisambiguate -i insilico/all.*.mapped.bam -o insilico/disambiguated
+```
+
+### Validation
+
+Let's see how we did:
+
+```bash
+❯ \ls -1 insilico/disambiguated*
+insilico/disambiguated.hs38DH.bam
+insilico/disambiguated.mm10.bam
+insilico/disambiguated.rn6.bam
+```
+
+```bash
+❯ parallel -k \
+    'echo Distribution of known alignments disambiguated into: {} \
+     && picard ViewSam \
+          INPUT=insilico/disambiguated.{}.bam \
+          ALIGNMENT_STATUS=All \
+          PF_STATUS=All \
+        2>/dev/null \
+        | grep -v "@" \
+        | tr "_" "\t" \
+        | cut -f1 \
+        | sort \
+        | uniq -c \
+     && echo' \
+    ::: hs38DH mm10 rn6
+
+Distribution of known alignments disambiguated into: hs38DH
+1898 hs38DH
+
+Distribution of known alignments disambiguated into: mm10
+1906 mm10
+   6 rn6
+
+Distribution of known alignments disambiguated into: rn6
+   2 mm10
+1914 rn6
+```
+
+### Conclusion
+
+Because these are alignments, and not read pairs or templates, we cannot directly calculate performance metrics such as sensitivity or specificty of this method.
+We hope to improve these benchmarks.
diff --git a/benchmarks/environment.yaml b/benchmarks/environment.yaml
@@ -0,0 +1,10 @@
+name: neodisambiguate_benchmarks
+channels:
+  - bioconda
+  - conda-forge
+  - defaults
+dependencies:
+  - bwa
+  - dwgsim
+  - parallel
+  - picard