BISER (🦪🔮; Brisk Inference of Segmental duplication Evolutionary stRucture) is a fast tool for detecting and decomposing segmental duplications (SDs) in a single genome or multiple genomes. BISER is SEDEF's successor.
BISER needs Python 3.7+ and Samtools to run.
To install BISER, just run:
pip install biser
If you wish to build BISER from source, you will also need the Codon programming language with the Seq plugin. To install BISER from source, run:
pip install git+https://github.com/0xTCG/biser.git
See Dockerfile for detailed instructions to build BISER from source.
To find SDs in a single genome, just run:
biser -o <output> -t <threads> <genome.fa>
BISER will also produce a file called output.elem that will contain the elementary SD decomposition of the found SDs.
All genomes should be indexed beforehand with samtools faidx genome.fa.
⚠️ : BISER requires soft-masked or hard-masked genome assemblies for optimal performance. Check for the presence of lowercase bases in your genome; if you have them, you are good to go.
⚠️ : If you experience crashes on Linux machines (especially in cluster environments), try setting --gc-heap 1G (or higher).
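The lowercase-base check mentioned above can be sketched in a few lines of Python. This is only an illustrative helper, not part of BISER: the function name masking_summary and the example sequence are hypothetical, and counting N bases as a proxy for hard-masking is an assumption.

```python
def masking_summary(fasta_lines):
    """Count total, lowercase (soft-masked) and N (assumed hard-masked) bases."""
    total = soft = hard = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip empty lines and FASTA headers.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        soft += sum(1 for base in line if base.islower())
        hard += line.count("N")
    return {"total": total, "soft_masked": soft, "hard_masked": hard}


# Hypothetical in-memory FASTA; for a real file, pass an open file handle instead.
example = [">chr1", "ACGTacgtNNNN", "acgtACGT"]
summary = masking_summary(example)
print(summary)
```

A non-zero soft_masked count suggests the assembly is soft-masked. For a real genome, the same function can be called as masking_summary(open("genome.fa")).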
To find SDs in multiple genomes, just run:
biser -o <output> -t <threads> <genome1.fa> <genome2.fa> ...
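For scripted runs, the indexing and search steps described above can be assembled from Python. The sketch below only builds and prints the command lines using the -o and -t flags shown above; the helper name biser_commands and the file names are hypothetical. To actually execute them, each list could be passed to subprocess.run(cmd, check=True).

```python
import shlex


def biser_commands(genomes, output, threads=1):
    """Return (indexing commands, BISER command) as argument lists."""
    # Every genome must be indexed with samtools faidx beforehand.
    index_cmds = [["samtools", "faidx", genome] for genome in genomes]
    # One BISER invocation handles one or several genomes.
    biser_cmd = ["biser", "-o", output, "-t", str(threads), *genomes]
    return index_cmds, biser_cmd


# Hypothetical file names, for illustration only.
index_cmds, biser_cmd = biser_commands(
    ["genome1.fa", "genome2.fa"], "sds.bed", threads=8
)
for cmd in index_cmds + [biser_cmd]:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```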
Usage: biser [-h] [--temp TEMP] [--threads THREADS] --output OUTPUT [--hard]
[--keep-contigs] [--keep-temp] [--no-decomposition]
genomes [genomes ...]
Positional arguments:
genomes Indexed genomes in FASTA format.
Optional arguments:
-h, --help show this help message and exit
--temp TEMP, -T TEMP Temporary directory location
--threads THREADS, -t THREADS
Number of threads
--output OUTPUT, -o OUTPUT
Output file name.
--hard, -H Are input genomes already hard-masked?
--keep-contigs Do not ignore contigs, unplaced sequences, alternate
alleles, patch chromosomes and mitochondrion sequences
(i.e., chrM and chromosomes whose name contains
underscore). Enable this when running BISER on
scaffolds and custom assemblies.
--keep-temp, -k Keep temporary directory after the execution. Useful
for debugging.
--resume RESUME Resume the previously interrupted run (that was run
with --keep-temp; needs the temp directory for
resume).
--no-decomposition Skip SD decomposition step.
--max-error MAX_ERROR
Maximum SD error (large gaps included).
--max-edit-error MAX_EDIT_ERROR
Maximum SD edit error (large gaps NOT included).
--max-chromosome-size MAX_CHROMOSOME_SIZE
Maximum chromosome size.
--kmer-size KMER_SIZE
Search k-mer size.
--winnow-size WINNOW_SIZE
Search winnow size.
--version, -v show program's version number and exit
--gc-heap GC_HEAP Set GC_INITIAL_HEAP_SIZE.
The output follows the BEDPE file format.
The first six (6) fields are the standard BEDPE fields describing the coordinates of SD mates:
- chr1, start1 and end1
- chr2, start2 and end2

Both intervals are semi-open and 0-indexed.
Other fields are as follows:
Field | Description |
---|---|
reference | Reference genome names of the first and the second mate, separated by :. |
score | Total alignment error (0-100%): the number of mismatches and indels divided by the total alignment span. |
strand1 | Strand (+ or -) of the first SD mate. |
strand2 | Strand (+ or -) of the second SD mate. |
max_len | Length of the longer mate. |
aln_len | Alignment span (mate length with gaps included). |
cigar | CIGAR string that describes the alignment. |
optional | Optional fields in the format NAME=VALUE;... . Currently contains the mismatch rate (starts with X=) and the gap rate (starts with ID=). |
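As an illustration of the field layout above, here is a minimal Python sketch that parses one output line. It assumes the columns are tab-separated and appear in the order listed in the table, and that score is a plain number; the SDRecord type, the parse_sd_line helper and the example line (including the hg19 reference names and all values) are hypothetical.

```python
from typing import Dict, NamedTuple


class SDRecord(NamedTuple):
    chr1: str
    start1: int
    end1: int
    chr2: str
    start2: int
    end2: int
    reference: str
    score: float
    strand1: str
    strand2: str
    max_len: int
    aln_len: int
    cigar: str
    optional: Dict[str, str]


def parse_sd_line(line: str) -> SDRecord:
    """Parse one SD line (assumed tab-separated, columns in table order)."""
    f = line.rstrip("\n").split("\t")
    # The optional column holds NAME=VALUE pairs separated by semicolons.
    optional = {}
    if len(f) > 13:
        for item in f[13].split(";"):
            if "=" in item:
                name, value = item.split("=", 1)
                optional[name] = value
    return SDRecord(
        f[0], int(f[1]), int(f[2]),
        f[3], int(f[4]), int(f[5]),
        f[6], float(f[7].rstrip("%")), f[8], f[9],
        int(f[10]), int(f[11]), f[12], optional,
    )


# Made-up example line, not real BISER output.
line = "chr1\t100\t1100\tchr2\t5000\t6000\thg19:hg19\t2.5\t+\t-\t1000\t1010\t1000M\tX=1.5;ID=1.0"
rec = parse_sd_line(line)
# Intervals are semi-open and 0-indexed, so the length is end - start.
print(rec.chr1, rec.end1 - rec.start1, rec.reference.split(":"), rec.optional)
```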
In addition to the BEDPE output, BISER may also write a decomposition file (with the .elem extension).
This file contains the list of core SD regions in the analyzed reference genomes.
The format of the decomposition file is as follows:
Field | Description |
---|---|
reference | Reference genome name. |
start | Start position of the core region (0-indexed). |
end | End position of the core region. |
id | Core region ID. Note that many regions share the same core ID because core regions are duplicated across the genome(s). |
len | Length of the core region. |
score | Core region score (internal use only). |
strand | Strand (+ or -) of the core region. |
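Because many regions share the same core ID, a natural first step when reading the decomposition file is to group regions by that ID. The following Python sketch does this under the assumption that the file is tab-separated with the columns in the order listed above; the helper name group_core_regions and the example lines are hypothetical.

```python
from collections import defaultdict


def group_core_regions(lines):
    """Group core regions by core ID (assumed tab-separated, table order)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        reference, start, end, core_id, length, score, strand = line.split("\t")[:7]
        groups[core_id].append((reference, int(start), int(end), strand))
    return dict(groups)


# Made-up example lines, not real BISER output.
example = [
    "hg19\t100\t300\t7\t200\t0\t+",
    "hg19\t9000\t9200\t7\t200\t0\t-",
    "hg19\t500\t650\t8\t150\t0\t+",
]
groups = group_core_regions(example)
for core_id, regions in groups.items():
    print(core_id, len(regions), regions)
```

Each key is a core ID and each value lists the copies of that core region, which makes it easy to see how often a core is duplicated across the analyzed genome(s).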
BISER was published in Algorithms for Molecular Biology and was presented at WABI 2021.
Please cite:
Išerić, H., Alkan, C., Hach, F. et al. Fast characterization of segmental duplication structure in multiple genome assemblies. Algorithms Mol Biol 17, 4 (2022). https://doi.org/10.1186/s13015-022-00210-2
BibTeX entry:
@article{ivseric2022fast,
title={Fast characterization of segmental duplication structure in multiple genome assemblies},
author={I{\v{s}}eri{\'c}, Hamza and Alkan, Can and Hach, Faraz and Numanagi{\'c}, Ibrahim},
journal={Algorithms for Molecular Biology},
volume={17},
number={1},
pages={1--15},
year={2022},
publisher={Springer}
}
Paper simulations are available in the paper directory.
- BISER v1.4 (Mar 2023):
  - Change of alignment refinement heuristics (should be faster now)
    - Note: SDs generated with v1.4 might be slightly different than those generated by the earlier version
  - Switch to Codon
  - Minor bugfixes
Please reach out to Ibrahim Numanagić.