Skip to content

Commit

Permalink
feat: first version in the new repository (#1)
Browse files Browse the repository at this point in the history
  • Loading branch information
clintval authored Jul 26, 2024
1 parent 8ba3a85 commit a0c8913
Show file tree
Hide file tree
Showing 26 changed files with 1,479 additions and 3 deletions.
Binary file added .github/img/cover.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
37 changes: 37 additions & 0 deletions .github/workflows/unit-tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
name: Unit Tests

on: [ push, pull_request ]

jobs:
unit-tests:
runs-on: ubuntu-latest
strategy:
matrix:
suite: [ neodisambiguate ]
java-version: [ 1.8, 11, 17, 21 ]
steps:
- uses: actions/checkout@v2
- uses: actions/setup-java@v1
with:
java-version: ${{ matrix.java-version }}
- name: Cache for Scala Dependencies
uses: actions/cache@v2
with:
path: |
~/.mill/download
~/.m2/repository
~/.cache/coursier
key: ${{ runner.os }}-java-mill-${{ matrix.java-version }}-${{ hashFiles('**/build.sc') }}
restore-keys: ${{ runner.os }}-java-mill-
- name: Compile Scala Code
run: |
./mill --no-server clean
./mill --no-server --disable-ticker ${{ matrix.suite }}.compile
- name: Test Scala Code
run: |
./mill --no-server --disable-ticker ${{ matrix.suite }}.test
- name: Create Code Coverage Report
if: matrix.java-version == '11'
run: |
./mill --no-server --disable-ticker ${{ matrix.suite }}.scoverage.xmlReport
# TODO: build an HTML report and upload as an artifact
12 changes: 10 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,13 @@
.bsp/*
.DS_Store
.java-version
.idea/*
.metals/*
.vscode/*
*.class
*.log

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
**/*.iml
bin/*
hs_err_pid*
out/*
target/*
1 change: 1 addition & 0 deletions .mill-version
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
0.11.10
2 changes: 1 addition & 1 deletion LICENSE
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2024 Clint Valentine
Copyright © 2019, 2024 Clint Valentine

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
Expand Down
83 changes: 83 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# neodisambiguate

<!-- [![Install with bioconda](https://img.shields.io/badge/Install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) -->
<!-- [![Anaconda Version](https://anaconda.org/bioconda/neodisambiguate/badges/version.svg)](http://bioconda.github.io/recipes/neodisambiguate/README.html) -->
[![Unit Tests](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml/badge.svg?branch=main)](https://github.com/clintval/neodisambiguate/actions/workflows/unit-tests.yml)
[![Java Version](https://img.shields.io/badge/java-8,11,17,21-c22d40.svg)](https://github.com/AdoptOpenJDK/homebrew-openjdk)
[![Language](https://img.shields.io/badge/language-scala-c22d40.svg)](https://www.scala-lang.org/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/clintval/neodisambiguate/blob/master/LICENSE)

Disambiguate reads that were mapped to multiple references.

![Torres del Paine](.github/img/cover.jpg)

Install with the Conda or Mamba package manager after setting your [Bioconda channels](https://bioconda.github.io/user/install.html#set-up-channels):

```text
❯ conda install neodisambiguate
```

### Introduction

Disambiguation of aligned reads is performed per-template and all information across primary, secondary, and supplementary alignments is used as evidence.
Alignment disambiguation is commonly required when analyzing sequencing data from transduction, transfection, transgenic, or xenographic (including patient derived xenograft) experiments.
This tool works by comparing various alignment scores between a template that has been aligned to many references in order to determine which reference is the most likely source.

All templates which are positively assigned to a single source reference are written to a reference-specific output BAM file.
Any templates with ambiguous reference assignment are written to an ambiguous input-specific output BAM file.
Only BAMs produced from the Burrows-Wheeler Aligner (bwa) or STAR are currently supported.

Input BAMs of arbitrary sort order are accepted, however, an internal sort to queryname will be performed unless the BAM is already in queryname sort order.
All output BAM files will be written in the same sort order as the input BAM files.
Although paired-end reads will give the most discriminatory power for disambiguation of short-read sequencing data, this tool accepts paired, single-end (fragment), and mixed pairing input data.

### Features

- Accepts SAM/BAM sources of any sort order
- Will disambiguate an arbitrary number of BAMs, all aligned to different references
- Writes the ambiguous alignments to a separate directory
- Extensible implementation which supports alternative disambiguation strategies
- Benchmarks show high accuracy: [Click Here](benchmarks/disambiguate.md)

### Command Line Usage

```bash
❯ neodisambiguate -i infile1.bam infile2.bam -p out/disambiguated
```

### Example Usage

To disambiguate templates for sample `dna00001` that are aligned to human (A) and mouse (B):

```bash
❯ neodisambiguate -i dna00001.A.bam dna00001.B.bam -p out/dna00001 -n hg38 mm10
```

```console
tree out/
out/
├── ambiguous-alignments/
│ ├── dna00001.A.ambiguous.bai
│ ├── dna00001.A.ambiguous.bam
│ ├── dna00001.B.ambiguous.bai
│ └── dna00001.B.ambiguous.bam
├── dna00001.hg38.bai
├── dna00001.hg38.bam
├── dna00001.mm10.bai
└── dna00001.mm10.bam
```

### Local Installation

Bootstrap compilation and build the executable with:

```bash
./mill neodisambiguate.executable
./bin/neodisambiguate --help
```

### Prior Art

This project was inspired by AstraZeneca's `disambiguate`:

- https://github.com/AstraZeneca-NGS/disambiguate
2 changes: 2 additions & 0 deletions benchmarks/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
insilico/
output/
134 changes: 134 additions & 0 deletions benchmarks/disambiguate.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
### Disambiguation Micro-Benchmark

### Introduction

As a preliminary test, we will attempt to disambiguate simulated paired-end reads from a homologous gene (`ABL1`) across human and murine species.

| Assembly | Species | Coordinates |
| --- | --- | --- |
| `hg38` | _Homo sapiens_ | [`chr9:130713881-130887675`][hs38DH-reference] |
| `mm10` | _Mus musculus_ | [`chr2:31688354-31807093`][mm10-reference] |
| `rn6` | _Rattus norvegicus_ | [`chr3:10041820-10145076`][rn6-reference] |
> Location of the `ABL1` gene in three model species.
[hs38DH-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1342972
[mm10-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1332211
[rn6-reference]: https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=1584969

### Dependencies

- [`bwa`](https://bioconda.github.io/recipes/bwa/README.html)
- [`dwgsim`](https://anaconda.org/bioconda/dwgsim)
- [`parallel`](https://anaconda.org/conda-forge/parallel)
- [`picard`](https://bioconda.github.io/recipes/picard/README.html)

Install with:

```bash
mamba env create -f environment.yaml
```

### Simulation of Training Data

We will simulate 1,000 paired-end reads with uniform coverage over the target gene in all three species.
For the sake of tracking read pairs, we use the `-P` flag and prefix each read name with the source assembly ID.

```bash
❯ mkdir -p insilico
❯ parallel \
'dwgsim -N 1000 -c 0 -z 42 -x <( echo {2} ) -P {1} \
/pipeline/reference-data/references/{1}/{1}.fa \
insilico/{1}' \
::: hs38DH mm10 rn6 \
:::+ 'chr9 130713881 130887675' 'chr2 31688354 31807093' 'chr3 10041820 10145076'
```

### Digitally Mixing All Training Data

We concatenate all reads into a single FASTQ pair and then convert that file into an umapped BAM (uBAM) file.

```bash
❯ cat insilico/*read1*.fastq.gz > insilico/all.read1.fastq.gz
❯ cat insilico/*read2*.fastq.gz > insilico/all.read2.fastq.gz
❯ picard FastqToSam \
F1=insilico/all.read1.fastq.gz \
F2=insilico/all.read2.fastq.gz \
OUTPUT=insilico/all.unmapped.bam \
SAMPLE_NAME=all
```

### Alignment

We align all read pairs to all three reference genomes independently.
Although verbose, we follow the [GATK best practice][gatk-reference] alignment guide so that we ensure the sequence dictionaries of the output BAMs are correct.

[gatk-reference]: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165

```bash
❯ parallel \
'picard SamToFastq \
INPUT=insilico/all.unmapped.bam \
INTERLEAVE=true \
FASTQ=/dev/stdout \
| bwa mem -t 8 -p \
/pipeline/reference-data/references/{}/{}.fa \
/dev/stdin \
| picard MergeBamAlignment \
ALIGNED=/dev/stdin \
UNMAPPED=insilico/all.unmapped.bam \
OUTPUT=insilico/all.{}.mapped.bam \
REFERENCE_SEQUENCE=/pipeline/reference-data/references/{}/{}.fa' \
::: hs38DH mm10 rn6
```

### Disambiguation

We are now ready to disambiguate our aligned templates!

```bash
❯ neodisambiguate -i insilico/all.*.mapped.bam -o insilico/disambiguated
```

### Validation

Let's see how we did:

```bash
\ls -1 insilico/disambiguated*
insilico/disambiguated.hs38DH.bam
insilico/disambiguated.mm10.bam
insilico/disambiguated.rn6.bam
```

```bash
❯ parallel -k \
'echo Distribution of known alignments disambiguated into: {} \
&& picard ViewSam \
INPUT=insilico/disambiguated.{}.bam \
ALIGNMENT_STATUS=All \
PF_STATUS=All \
2>/dev/null \
| grep -v "@" \
| tr "_" "\t" \
| cut -f1 \
| sort \
| uniq -c \
&& echo' \
::: hs38DH mm10 rn6

Distribution of known alignments disambiguated into: hs38DH
1898 hs38DH

Distribution of known alignments disambiguated into: mm10
1906 mm10
6 rn6

Distribution of known alignments disambiguated into: rn6
2 mm10
1914 rn6
```

### Conclusion

Because these are alignments, and not read pairs or templates, we cannot directly calculate performance metrics such as sensitivity or specificty of this method.
We hope to improve these benchmarks.
10 changes: 10 additions & 0 deletions benchmarks/environment.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
name: neodisambiguate_benchmarks
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- bwa
- dwgsim
- parallel
- picard
Loading

0 comments on commit a0c8913

Please sign in to comment.