[![CI](https://github.com/gbouras13/dnaapler/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/dnaapler/actions/workflows/ci.yaml)
[![codecov](https://codecov.io/gh/gbouras13/dnaapler/branch/main/graph/badge.svg?token=4B1T2PGM9V)](https://codecov.io/gh/gbouras13/dnaapler)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/550095292.svg)](https://zenodo.org/doi/10.5281/zenodo.10039420)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/dnaapler/badges/version.svg)](https://anaconda.org/bioconda/dnaapler)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/dnaapler)](https://img.shields.io/conda/dn/bioconda/dnaapler)
[![PyPI version](https://badge.fury.io/py/dnaapler.svg)](https://badge.fury.io/py/dnaapler)
[![Downloads](https://static.pepy.tech/badge/dnaapler)](https://pepy.tech/project/dnaapler)
# dnaapler
Dnaapler is a simple tool that reorients complete circular microbial genomes.
## Quick Start
```
# creates conda environment with dnaapler
conda create -n dnaapler_env dnaapler
# activates conda environment
conda activate dnaapler_env
# runs dnaapler all
dnaapler all -i input_mixed_contigs.fasta -o output_directory_path -p my_bacteria_name -t 8
# runs dnaapler chromosome
dnaapler chromosome -i input_chromosome.fasta -o output_directory_path -p my_bacteria_name -t 8
```
## Table of Contents
- [dnaapler](#dnaapler)
  - [Quick Start](#quick-start)
  - [Table of Contents](#table-of-contents)
  - [Description](#description)
  - [Documentation](#documentation)
  - [Commands](#commands)
  - [Installation](#installation)
    - [Conda](#conda)
    - [Pip](#pip)
  - [Usage](#usage)
  - [Example Usage](#example-usage)
  - [Databases](#databases)
  - [Motivation](#motivation)
  - [Acknowledgements](#acknowledgements)
## Description
<p align="center">
<img src="paper/Dnaapler_figure.png" alt="Dnaapler Figure">
</p>
`dnaapler` is a simple python program that takes a single nucleotide input sequence (in FASTA format), finds the desired start gene using `blastx` against an amino acid sequence database, checks that the start codon of this gene is found, and if so, then reorients the chromosome to begin with this gene on the forward strand.
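The reorientation step described above can be illustrated with a minimal sketch. This is not `dnaapler`'s actual code: the function names, the 0-based coordinate convention and the `strand` argument are assumptions made for illustration only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def reorient(seq: str, gene_start: int, strand: str) -> str:
    """Rotate a circular sequence so it begins with the start gene.

    gene_start is the 0-based forward-strand index of the first base of the
    start codon (hypothetical convention). For a reverse-strand gene this is
    the highest forward-strand coordinate covered by the gene.
    """
    if strand == "+":
        # Forward strand: move everything before the gene to the end.
        return seq[gene_start:] + seq[:gene_start]
    # Reverse strand: rotate so the start codon base is last, then reverse
    # complement so the gene reads forward from position 0.
    rotated = seq[gene_start + 1:] + seq[:gene_start + 1]
    return reverse_complement(rotated)


# Forward-strand gene starting at index 2
print(reorient("AAATGCCC", 2, "+"))
# Reverse-strand gene whose start codon begins at forward index 4
print(reorient("GGCATTT", 4, "-"))
```

In both cases the returned sequence begins with `ATG`, and no bases are gained or lost, which is what makes the rotation safe for a circular molecule.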
It was originally designed to replicate the reorientation functionality of [Unicycler](https://github.com/rrwick/Unicycler/blob/main/unicycler/gene_data/repA.fasta) with dnaA, but for long-read first assembled chromosomes. We have extended it to work with plasmids (`dnaapler plasmid`) and phages (`dnaapler phage`), or with any input FASTA using `dnaapler custom`, `dnaapler mystery` or `dnaapler nearest`.
For bacterial chromosomes, `dnaapler chromosome` should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as [Trycycler](https://github.com/rrwick/Trycycler/wiki), [Dragonflye](https://github.com/rpetit3/dragonflye) or my own pipeline [hybracter](https://github.com/gbouras13/hybracter).
You can also reorient multiple bacterial chromosomes/plasmids/phages at once using the `dnaapler bulk` subcommand.
If your input FASTA is mixed (e.g. has chromosome and plasmids), you can also use `dnaapler all`, with the option to ignore some contigs with the `--ignore` parameter.
## Documentation
The full documentation for `dnaapler` can be found [here](https://dnaapler.readthedocs.io).
## Commands
* `dnaapler all`: Reorients 1 or more contigs to begin with any of dnaA, terL, repA.
  - Practically, this should be the most useful command for most users.
* `dnaapler chromosome`: Reorients your sequence to begin with the dnaA chromosomal replication initiator gene
* `dnaapler plasmid`: Reorients your sequence to begin with the repA plasmid replication initiation gene
* `dnaapler phage`: Reorients your sequence to begin with the terL large terminase subunit gene
* `dnaapler custom`: Reorients your sequence to begin with a custom amino acid FASTA format gene that you specify
* `dnaapler mystery`: Reorients your sequence to begin with a random CDS
* `dnaapler largest`: Reorients your sequence to begin with the largest CDS
* `dnaapler nearest`: Reorients your sequence to begin with the first CDS (nearest to the start). Designed for fixing sequences where a CDS spans the breakpoint.
* `dnaapler bulk`: Reorients multiple contigs to begin with the desired start gene - either dnaA, terL, repA or a custom gene.
## Installation
`dnaapler` requires only BLAST v2.10 or higher as an external dependency.
Installation from conda is recommended as this will install BLAST automatically.
### Conda
`dnaapler` is available on bioconda.
```
conda install -c bioconda dnaapler
```
### Pip
You can also install `dnaapler` with pip.
```
pip install dnaapler
```
You will need to install BLAST v2.10 or higher separately.
e.g.
```
# quote the version spec so the shell does not treat ">" as a redirect
conda install -c bioconda "blast>=2.10"
```
## Usage
```
Usage: dnaapler [OPTIONS] COMMAND [ARGS]...
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
Commands:
all Reorients contigs to begin with any of dnaA, repA...
bulk Reorients multiple genomes to begin with the same gene
chromosome Reorients your genome to begin with the dnaA chromosomal...
citation Print the citation(s) for this tool
custom Reorients your genome with a custom database
largest Reorients your genome the begin with the largest CDS as...
mystery Reorients your genome with a random CDS
nearest Reorients your genome the begin with the first CDS as...
phage Reorients your genome to begin with the terL large...
plasmid Reorients your genome to begin with the repA replication...
```
```
Usage: dnaapler all [OPTIONS]
Reorients contigs to begin with any of dnaA, repA or terL
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in FASTA format [required]
-o, --output PATH Output directory [default: output.dnaapler]
-t, --threads INTEGER Number of threads to use with BLAST [default: 1]
-p, --prefix TEXT Prefix for output files [default: dnaapler]
-f, --force Force overwrites the output directory
-e, --evalue TEXT e value for blastx [default: 1e-10]
--ignore PATH Text file listing contigs (one per row) that are to
be ignored
-a, --autocomplete TEXT Choose an option to autocomplete reorientation if
BLAST based approach fails. Must be one of: none,
mystery, largest, or nearest [default: none]
--seed_value INTEGER Random seed to ensure reproducibility. [default:
13]
```
The reoriented output FASTA will be `{prefix}_reoriented.fasta` in the specified output directory.
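If you want to pick up the output in a downstream script, a minimal sketch could look like the following. The helper names are hypothetical, and only the `{prefix}_reoriented.fasta` naming comes from this README.

```python
from pathlib import Path


def reoriented_fasta_path(output_dir, prefix):
    """Build the expected path of the reoriented FASTA for a given run."""
    return Path(output_dir) / f"{prefix}_reoriented.fasta"


def fasta_headers(path):
    """Return the header lines (without '>') of a FASTA file, in order."""
    with open(path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


# Defaults taken from the usage text: output.dnaapler and dnaapler
print(reoriented_fasta_path("output.dnaapler", "dnaapler").name)
```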
## Example Usage
```
dnaapler all -i input.fasta -o output_directory_path -p my_genome_name --ignore list_of_contigs_to_ignore.txt
```
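The file passed to `--ignore` is a plain text file with one contig per row. A minimal sketch for writing one from a script is below. The function name and the example contig names are hypothetical, and exactly how names are matched against your FASTA headers is not covered here, so check the documentation.

```python
from pathlib import Path


def write_ignore_list(contig_names, path):
    """Write one contig name per line, skipping blank entries."""
    names = [name.strip() for name in contig_names if name.strip()]
    Path(path).write_text("".join(f"{name}\n" for name in names))
    return names


# Hypothetical contig names to skip during reorientation
write_ignore_list(["contig_3", "contig_7"], "list_of_contigs_to_ignore.txt")
```

The resulting file can then be supplied as `--ignore list_of_contigs_to_ignore.txt`.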
```
dnaapler chromosome -i input.fasta -o output_directory_path -p my_bacteria_name -t 8
```
```
dnaapler phage -i input.fasta -o output_directory_path -p my_phage_name -t 8
```
```
dnaapler plasmid -i input.fasta -o output_directory_path -p my_plasmid_name -t 8
```
```
dnaapler custom -i input.fasta -o output_directory_path -p my_genome_name -t 8 -c my_custom_database_file
```
```
dnaapler mystery -i input.fasta -o output_directory_path -p my_genome_name
```
```
dnaapler nearest -i input.fasta -o output_directory_path -p my_genome_name
```
```
dnaapler largest -i input.fasta -o output_directory_path -p my_genome_name
```
```
# to reorient multiple bacterial chromosomes
dnaapler bulk -i input_file_with_multiple_chromosomes.fasta -m chromosome -o output_directory_path -p my_genome_name
```
## Databases
`dnaapler chromosome` uses 584 proteins downloaded from Swissprot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA. All hits from the query were also filtered to ensure "GN=dnaA" was included in the header of the FASTA entry.
`dnaapler plasmid` uses the repA database curated by Ryan Wick in [Unicycler](https://github.com/rrwick/Unicycler/blob/main/unicycler/gene_data/repA.fasta).
`dnaapler phage` uses a terL database curated using [PHROGs](https://phrogs.lmge.uca.fr). All the AA sequences of the 55 PHROGs annotated as 'large terminase subunit' were downloaded, combined and deduplicated using [seqkit](https://github.com/shenwei356/seqkit) `seqkit rmdup -s -o terL.faa phrog_terL.faa`.
`dnaapler all` uses all three databases combined into one.
`dnaapler custom` uses a custom amino acid FASTA format file that you specify using `-c`.
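If you are building your own database for `dnaapler custom`, the two curation steps mentioned above (keeping only entries whose header contains a given token such as "GN=dnaA", and removing duplicate sequences) can be approximated in plain Python. This is a sketch with hypothetical function names, not the procedure actually used to build the bundled databases. The deduplication only approximates `seqkit rmdup -s` by keeping the first record for each exact sequence.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_headers_containing(records, token="GN=dnaA"):
    """Keep only records whose header contains the given token."""
    return [(h, s) for h, s in records if token in h]


def drop_duplicate_sequences(records):
    """Keep the first record for each distinct sequence (exact match)."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique
```

Filtering before deduplicating means a duplicate sequence is only dropped when an earlier kept record already carries it.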
The matching is strict: it requires a strong BLASTx match (default e-value 1E-10), and the first amino acid of the BLASTx hit gene must be Methionine, Valine or Leucine, the amino acids encoded by the 3 most commonly used start codons in bacteria/phages.
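The two conditions above can be written as a small predicate. This is a sketch only: the function name, its arguments and the use of `<=` for the e-value comparison are assumptions, not `dnaapler`'s implementation.

```python
# One-letter codes for Methionine, Valine and Leucine
START_RESIDUES = {"M", "V", "L"}


def passes_strict_match(evalue: float, hit_protein: str,
                        max_evalue: float = 1e-10) -> bool:
    """Return True if a hit meets both criteria described in the README.

    evalue      -- e-value reported for the BLASTx hit
    hit_protein -- translated amino acid sequence of the hit gene (assumed
                   to start at the candidate start codon)
    max_evalue  -- threshold; 1e-10 is the documented default
    """
    return (
        evalue <= max_evalue
        and bool(hit_protein)
        and hit_protein[0].upper() in START_RESIDUES
    )


print(passes_strict_match(1e-50, "MSLSLWQQ"))  # strong hit starting with M
print(passes_strict_match(1e-5, "MSLSLWQQ"))   # e-value too weak
```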
For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.
If you try `dnaapler` on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using `dnaapler custom`.
After this [issue](https://github.com/gbouras13/dnaapler/issues/1), `dnaapler mystery` was added. It predicts all ORFs in the input using [pyrodigal](https://github.com/althonos/pyrodigal), then picks a random gene to re-orient your sequence with.
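The difference between the `mystery`, `largest` and `nearest` strategies can be sketched as follows, assuming the CDS coordinates have already been predicted (for example with pyrodigal, which is not called here). The tuple layout and function name are hypothetical. The default seed of 13 mirrors the `--seed_value` default shown in the usage text.

```python
import random


def pick_cds(cds_list, mode, seed=13):
    """Choose a CDS to reorient on.

    cds_list -- list of (start, end, strand) tuples with 0-based start and
                exclusive end (hypothetical representation)
    mode     -- 'mystery' (random), 'largest' or 'nearest'
    seed     -- random seed, only used for 'mystery'
    """
    if not cds_list:
        raise ValueError("no CDS predicted")
    if mode == "mystery":
        # A seeded generator makes the random choice reproducible.
        return random.Random(seed).choice(cds_list)
    if mode == "largest":
        return max(cds_list, key=lambda cds: cds[1] - cds[0])
    if mode == "nearest":
        # The CDS closest to the current start of the sequence.
        return min(cds_list, key=lambda cds: cds[0])
    raise ValueError(f"unknown mode: {mode}")


genes = [(120, 480, "+"), (900, 2400, "-"), (30, 90, "+")]
print(pick_cds(genes, "largest"))
print(pick_cds(genes, "nearest"))
```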
## Motivation
1. I couldn't get [Circlator](https://sanger-pathogens.github.io/circlator/) to work and it is no longer supported.
2. [berokka](https://github.com/tseemann/berokka) doesn't orient chromosomes to begin with dnaA.
3. After reading Ryan Wick's masterful bacterial genome assembly [tutorial](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki), I realised that it is probably optimal to run 2 polishing steps, once before and once after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" long read bacterial assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if not marked as "circular" by Flye), polishing after a re-orientation would be likely to circularise the chromosome. A bit like Ryan's [rotate_circular_gfa.py](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/blob/main/scripts/rotate_circular_gfa.py) script, without the requirement of strict circularity.
4. While researching MGEs in _S. aureus_ whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to my pipeline to automate the reorientation.
5. It's probably good to have all your sequences start at the same location for synteny analyses.
## Acknowledgements
Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Thanks also to [Michael Hall](https://github.com/mbhall88), from whose repository [tbpore](https://github.com/mbhall88/tbpore) we adapted a lot of scaffolding code, because he writes really nice code.