All notable changes to this project will be documented in this file. This project adheres to Semantic Versioning.
- A new filter for discarding contigs that result in too many ambiguous variant calls (see #361).
- A new filter for discarding very long reference targets composed of tandem arrays of spaced repeats that evade the
--max-diff
filter (see #366).
- Corrected a bug in the VCF reader that choked on filters not supported internally by kevlar (see #359).
- Reformulated the probability of an observed k-mer abundance conditioned on a 0/0 genotype (see #363).
- The homopolymer filter was made slightly less aggressive (see #365).
- Corrected a bug with in reporting multiple optimal calls due to floating point comparison (see #365).
- A new Snakemake workflow for preprocessing BAM inputs for analysis with kevlar (see #305, #355).
- A new Snakemake workflow for kevlar's standard processing procedure (see #306, #355).
- New
unband
module to merge augmented Fastq files produced with a k-mer banding strategy (see #316). - New
varfilter
module to filter out preliminary variant calls overlapping with problematic/unwanted loci or features (see #318, #342, #354). - New dependency:
intervaltree
package (see #318). - A new
sandbox
directory with convenience scripts for development and analysis (see #335). - A new
--min-like-score
filter for thesimlike
module (see #343). - A new
--drop-outliers
filter for thesimlike
module (see #350).
- Added a new flag to print to the terminal (stderr) and a logfile simultaneously (see #308).
- The functionality of the previous
filter
module is now split between the newunband
module and a reimplementation of thefilter
module (see #316). - Added a "fast mode" to the
simlike
module, prematurely halting computations for calls already marked for filtering (see #328). - Added a filter for problematic short indels adjacent to homopolymers (see #336, #338, #339).
- Implemented new filters in the
simlike
module based on thresholds and k-mer abundances: theControlAbundance
filter for predictions with too many high-abundance parent/control k-mers spanning the variant, and theCaseAbundance
filter for predictions with too many consecutive proband/child k-mers spanning the variant (see #327, #339).
- Corrected a bug that reported the reference target sequence instead of the assembled contig sequence in the
CONTIG
attribute of indel calls in the VCF (see #304). - Corrected a bug that called adjacent substitutions as independent SNVs rather than an aggregate MNV (see #332).
- The
effcount
,dump
, andsimplex
modules have been dropped (see #308, #316). - Internal handling of interesting read mate sequences has been dropped (see #353).
- Updated
setup.py
so that the README markdown is included in the long description attribute for rendering on PyPI (see commit 9f51024898). - Removed direct calls to fixures that are no longer supported by pytest (see commit dab6418b9f).
- Updated the Makefile so that
kevlar/tests/__init__.py
is not included when running the test suite. Now compatible with pytest>=4.0.0 (see commit 965bd0da48).
- The
kevlar count
operation now supports masks and 8-, 4-, or 1-bit counters (see #277 and #291). - A Jupyter notebook and supporting code and data for evaluating kevlar's performance on a simulated data set (see #271).
- New flags for filtering gDNA cutouts or calls from specified sequences (see #285).
- New filter that discards any contig/gDNA alignment with more than 4 mismatches (see #288).
- A new feature that generates a Nodetable containing only variant-spanning k-mers to support re-counting k-mers and computing likelihood scores in low memory (see #289, #292, #302).
- A new
ProgressIndicator
class that provides gradually less frequent updates over time (see #299).
- Ported augfastx handling from
kevlar.seqio
module to a new Cython module (see #279). - Dynamic error model for likelihood calculations is now an configurable option (see #286).
- Cleaned up overlap-related code with a new
ReadPair
class (see #283). - Updated
kevlar assemble
,kevlar localize
, andkevlar call
to accept streams of partitioned reads; previously, only reads for a single partition were permitted (see #294). - Overhauled the
kevlar localize
command to compute seed locations for all seeds in all partitions with a single BWA call, massively improving efficiency (see #294 and #301). - Updated the variant calling procedure to discard alignment blocks less than
ksize
in length (see #303).
- Minor bug with .gml output due to a change in the networkx package (see #278).
- Buggy home-grown greedy assembler dropped (see #279). Some parts of the overlap code retained and refactored (see #283).
- Refined handling of mate read alignments (see #247, #250, #251, #255, and #263).
- Multithreading is now supported natively in
kevlar alac
(see #249 and unmergedfeed-thread
branch). - A limited-scope VCF reader (see #256).
- Script for computing likelihood scores is now a first-class kevlar citizen as
kevlar simlike
(see #259). - New
kevlar dist
subcommand for computing average and standard deviation of k-mer abundances for likelihood calculations (see #264). - Paired-end awareness for
kevlar dump
(see #265). - New
LikelihoodFail
filter for variant calls with a negative likelihood score (see #266).
- Much cleaner and more concise calling code from implementing "center aligned" gap alignment strategy, facilitated by new CIGAR tokenizer (see #242).
- Improved default behavior for
maxdiff
and splitting seed hits into distinct bins (see #246).
- Incorrect handling of VCF
FILTER
field resolved (see #235). - A bug causing some calls to be erroneously filtered (see #237).
- A minor bug with banded mode in
kevlar novel
, various other minor fixes, and code that should have been removed previously (#239).
- New
kevlar gentrio
command for a more realistic similation of trios for testing and evaluation (#171). - New filter for
kevlar alac
for discarding partitions with a small number of interesting k-mers (#189). - New
kevlar split
subcommand for splitting a partitioned augfastq file into N chunks (see #206). - New
-p/--part-id
flag inkevlar alac
for processing a single partition in a partitioned augfastq file (see #206). - New reader/parser for parititioned augfastx files (see #206).
- New strategy for discriminating between variants and off-target calls using pairing information (see #210).
- New optional "fallback" assembly strategy: if fermi-lite fails, try our homegrown greedy assembly algorithm (see #214 and #219).
- New parameter for excluding SNV calls too near to the end of a contig (see #222).
- Replaced
pep8
withpycodestyle
for enforcing code style in development (see #167). - The
--refr
argument of thekevlar dump
command is now optional, and when no reference is explicitly specifiedkevlar dump
acts primarily as a BAM to Fastq converter (see #170). - Split the functionality of the
count
subcommand: simple single-sample k-mer counting was kept incount
with a much simplified interface, while the memory efficient multi-sample "masked counting" strategy was split out to a new subcommandeffcount
(see #185). - Replaced
kevlar reaugment
with a more generalizablekevlar augment
subcommand (see #188). - Replaced
--ksize
with--seed-size
inkevlar localize
so thatkevlar alac
can now support different values for k-mers and localizing seeds/anchors (see #198). - Improved variant sorting, scoring, and reporting strategy (see #199).
- The augmented Fastx format now permits annotation of 1 or more mate sequences (see #210).
- Split
vcf.py
andvarmap.py
modules off from thecall.py
module (see #229).
- Incorrect file names in the quick start documentation page (see 9f6bec06d4).
- The
kevlar alac
procedure now accepts a stream of read partitions (instead of a stream of reads) at the Python API level, and correctly handles a single partition-labeled sequence file at the CLI level (see #165). - CIGARs that begin with I blocks (alternate allele contig is longer than reference locus) are now handled properly (see #191).
- Bug with how
kevlar alac
handles "no reference match" scenarios resolved (see #192). - Bug with
kevlar count
when reading from multiple input files (see #202). - Can now call SNVs near INDELs (see #229).
- The JCA assembly mode is no longer supported (see #231).
This release includes many new features, some refactoring of the core codebase, and the first end-to-end analysis workflow implemented in a single command. Details are included below.
- Abundances reported by
kevlar filter
now correctly show re-computed proband k-mer abundances, not pre-filtering abundances (see #111). - The
kevlar localize
andkevlar call
procedures now handle multiple assembled contigs, calling variants from the best reference match for each contig (see #124, #126, and #147).
- New abundance screen now a part of
kevlar novel
. If any k-mer in a read is below some abundance threshold, the entire read is discarded (see #106). - Better error reporting and handling of various issues with assembly, localization, and alignment (see #113, #114).
- Support for VCF output (see #130 and #144), including "windows" with all k-mers containing the reference allele (RW) and alternate allele (VW) to facilitate distinguishing inherited mutations from novel mutations (see #144 and #152).
- New subcommands
alac
: assembles, localizes, aligns, and calls variants on a single partition basissimplex
: invokes the entire simplex analysis workflow
- The
kevlar filter
procedure now handles both contamination and reference matches under a single "mask" interface (see #103). - Explicitly dropped support for Python 2.7. Now supports only Python >=3.5 (see #125).
- Main methods for each core subcommand are now implemented as minimal wrappers around generator functions, to facilitate composing different steps of the workflow or invoking them from third-party Python code (see #95, #126, #133, #148, #149, #150, #159, #161).
- The home-grown greedy assembly implementation has been replaced by calls to the
fermi-lite
library, which is now bundled with kevlar (see #156). - The default behavior of
kevlar partition
is now to output a single stream of reads. Writing each partition to a distinct file is still supported with the--split
option.
- The
kevlar collect
command and associated tests. Its functionality has now been fully distributed to other subcommands.- Read filtering to
kevlar filter
- Junction count contig assembly to
kevlar filter
as an optional mode
- Read filtering to
- New subcommands
partition
: group reads by shared interesting k-merslocalize
: determine an assembled contig's location in the reference genomecall
: align assembled contigs to reference and call variant
- Documentation suite in
docs/
, hosted at https://kevlar.readthedocs.io - New third-party dependency
ksw2
for computing alignments. Wrapped with Cython, which is a new development-time dependency (but not install or run time). - The
pandas
package is now a dependency, andpysam
andnetworkx
are now hard dependencies (rather than conditional).
- Bug with assembly when the order of a read pair was swapped and they had the opposite orientation (see #85).
- Command-line interface with 8 subcommands
dump
: discard reads that match reference completelycount
: compute k-mer abundances for all samplesnovel
: identify "interesting" (potentially novel) k-mersfilter
: re-compute k-mer abundances, discard false positives and contaminationassemble
: assemble reads for a single variantcollect
: collect and filter (legacy)mutate
: simulate variants on a genomereaugment
: re-attach interesting k-mer annotations to reads
- Extensive test suite
- Continuous integration configuration