Updates
in progress
- Reference context sequence comparison is now deprecated, and the sequence similarity calculation is improved by also checking the similarity of the lexicographically minimum rotations. details
- Symbolic variants (<DEL>, <INV>, <DUP>) can now be resolved for sequence comparison when a --reference is provided. The function for resolving the sequences is largely similar to this discussion
- Symbolic variants can now match to resolved variants, even with --pctseq 0, with or without the new sequence resolving procedure.
- Symbolic variant sub-types are ignored, e.g. <DUP:TANDEM> == <DUP>
- --sizemax now defaults to -1, meaning all variants ≥ --sizemin / --sizefilt are compared
- Redundant variants which are collapsed into kept variants (a.k.a. removed variants) are now more clearly labeled (--removed-output instead of --collapsed-output)
- Fixed 'Unknown error' caused by unset TMPDIR (#229 and #245)
- Fixes to minor record keeping bugs in refine/ga4gh better ensure all variants are counted/preserved
- BND variants are now compared by bench (details)
- Cleaner outputs by not writing matching annotations (e.g. PctSeqSimilarity) that are None
- Major refactor of Truvari package API for easy reuse of SV comparison functions (details)
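The rotation check in the first item above can be illustrated with a minimal sketch. This is not Truvari's implementation: the helper names are hypothetical, the brute-force rotation search is chosen for readability, and difflib's ratio stands in for whatever similarity measure Truvari actually uses.

```python
from difflib import SequenceMatcher


def min_rotation(seq: str) -> str:
    """Return the lexicographically smallest rotation of seq (brute force)."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not Truvari's metric."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def rotation_aware_similarity(a: str, b: str) -> float:
    """Report the better of the direct and the minimum-rotation comparison."""
    direct = similarity(a, b)
    rotated = similarity(min_rotation(a), min_rotation(b))
    return max(direct, rotated)


if __name__ == "__main__":
    # Two representations of the same tandem-repeat unit, shifted by 2bp.
    print(rotation_aware_similarity("ACGT", "GTAC"))
```

Two sequences that are rotations of each other share the same minimum rotation, so they score as identical even when the direct comparison is penalized by the shift.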
September 9, 2024
- bench
  - Correctly filtering ALT=* alleles (details) and monomorphic reference, including test coverage this time
- stratify
  - Default behavior is to count variants within (#221)
- collapse
  - Faster sub-chunking operations by dropping use of pyintervaltree
- anno chunks
  - New command for identifying windows with a high number of SVs (details)
July 31, 2024
- refine & stratify
  - Fixed variant and bed boundary overlapping issue
- general
March 28, 2024
- collapse
  - Fewer comparisons needed per-chunk on average
  - Fixed --chain functionality (details)
  - Fixed --gt consolidation of format fields
- bench
  - Faster result at the cost of less complete annotations with --short flag
- refine
  - Assures variants are sequence resolved before incorporating into consensus
  - bench --passonly --sizemax parameters are used when building consensus for a region. Useful for refine --use-original-vcfs
  - When a refined region has more than 5k variants, it is skipped and a warning of the region is written to the log
  - Flag --use-region-coords now expands --region coordinates by 100bp (phab --buffer default) to allow variants to harmonize out of regions.
- general
  - Dynamic bed/vcf parsing tries to choose faster of streaming/fetching variants
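The region expansion described for --use-region-coords can be sketched as follows. The function name is hypothetical, and applying the buffer to both sides while clamping the start at 0 is an assumption rather than confirmed Truvari behavior.

```python
from typing import Tuple


def expand_region(start: int, end: int, buffer: int = 100) -> Tuple[int, int]:
    """Widen a region by `buffer` bp on each side, never going below 0."""
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    return max(0, start - buffer), end + buffer


if __name__ == "__main__":
    print(expand_region(1000, 2000))  # region grown by the 100bp default
    print(expand_region(50, 120))     # start clamped at the contig origin
```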
February 6, 2024
- collapse
  - Faster handling of genotype data for --gt and --keep common
- general
  - Fix to bed end position bug for including variants (details)
  - Fix to Dockerfile
- refine
  - Changes to --recount that accompany the fix to bed end positions.
- New command ga4gh to convert Truvari results into GA4GH truth/query VCFs with intermediate tags
January 12, 2024
- collapse
  - New parameter --gt disallows intra-sample events to collapse (details)
  - New parameter --intra for consolidating SAMPLE information during intra-sample collapsing (details)
  - Preserve phasing information when available
  - Faster O(n-1) algorithm instead of O(n^2)
  - Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons
  - Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently.
- phab
  - Correct sample handling with --bSamples --cSamples parameters
  - Faster generation of consensus sequence
  - Resolved 'overlapping' variant issue causing variants to be dropped
  - New poa approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate.
- bench
  - New, easier MatchId field to track which baseline/comparison variants match up details
  - entry_is_present method now considers partial missing variants (e.g. ./1) as present
  - Removed the 'weighted' metrics from summary.json
- consistency
  - Fixed issue with counting duplicate records
  - Added flag to optionally ignore duplicate records
- anno svinfo now overwrites existing SVLEN/SVTYPE info fields
- general
  - Reduced fn matches for unroll sequence similarity by reporting maximum of multiple manipulations of variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more fp matches.
  - Bump pysam version
  - Fixed bug in unroll sequence similarity that sometimes rolled from the wrong end
  - Fixed bug for handling of None in ALT field
  - truvari.compress_index_vcf forces overwriting of tabix index to prevent annoying crashes
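The change to entry_is_present for partially missing genotypes can be illustrated with a small sketch. It assumes genotypes arrive as pysam-style tuples in which a missing allele is None, and that "present" means at least one called allele is non-reference. The helper gt_is_present is hypothetical and is not Truvari's function.

```python
from typing import Optional, Sequence


def gt_is_present(gt: Optional[Sequence[Optional[int]]]) -> bool:
    """True when at least one called allele in the genotype is non-reference."""
    if gt is None:
        return False
    # Missing alleles (None) are skipped rather than disqualifying the call.
    return any(allele is not None and allele > 0 for allele in gt)


if __name__ == "__main__":
    for gt in [(None, 1), (0, 1), (0, 0), (None, None)]:
        print(gt, gt_is_present(gt))
```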
August 7, 2023
- bench
- phab
  - ~2x faster via reduced IO from operating in stages instead of per-region
  - Removed most external calls (e.g. samtools doesn't need to be in the environment anymore)
  - new --align wfa allows much faster (but slightly less accurate) variant harmonization
  - increased determinism of results details
- refine
  - Faster bed file intersection of --includebed and --regions
  - Refine pre-flight check
  - Correct refine.regions.txt end position from IntervalTree correction
  - Better refine region selection with --use-original
  - --use-includebed switched to --use-region-coords so that default behavior is to prefer the includebed's coordinates
  - --use-original-vcfs to use the original pre-bench VCFs
  - refine.variant_summary.json is cleaned of uninformative metrics
- stratify
  - parallel parsing of truvari directory to make processing ~4x faster
- msa2vcf Fixed REPL decomposition bug to now preserve haplotypes
- anno grpaf - expanded annotation info fields
- anno density - new parameter --stepsize for sliding windows
- collapse
  - New optional --median-info fields #146
- Minor updates
  - Fix some anno threading on macOS #154
  - Monomorphic/multiallelic check fix in bench
  - PHAB_WRITE_MAFFT environment variable to facilitate updating functional test answer key
  - Slightly slimmer docker container
March 13, 2023
As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user-friendly, we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools, and a couple of existing tools now work differently. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0.
- New refine command for refining benchmarking results. Details
- bench
  - Unroll is now the default sequence comparison approach.
  - New --pick parameter to control the number of matches a variant can participate in details
  - The summary.txt is now named summary.json
  - Outputs parameters to params.json
  - Output VCFs are sorted, compressed, and indexed
  - Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. tp-call.vcf.gz is now tp-comp.vcf.gz)
  - Renamed --pctsim parameter to --pctseq
  - Fixed bug where FP/FN weren't getting the correct, highest scoring match reported
  - Fixed bug where INFO/Multi wasn't being properly applied
  - Fixed bug where variants spanning exactly one --includebed region were erroneously being counted.
  - Removed parameters: --giabreport, --gtcomp, --multimatch, --use-lev, --prog, --unroll
- collapse
  - Renamed --pctsim parameter to --pctseq
  - Runtime reduction by ~40% with short-circuiting during Matcher.build_match
  - Better output sorting which may allow pipelines to be a little faster.
- vcf2df
  - More granular sizebins for [0,50) including better handling of SNPs
  - --multisample is removed. Now automatically adds all samples with --format
  - key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges
- anno
  - Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use --input but a positional argument.
  - Added addid
- consistency
  - Slight speed improvement
  - Better json output format
- segment
  - Added --passonly flag
  - Changed UI, including writing to stdout by default
  - Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN
- API
  - Began a focused effort on improving re-usability of Truvari code.
  - Entry point to run benchmarking programmatically with Bench object.
  - Better development version tracking. details
  - Improved developer documentation. See readthedocs
- general
  - msa2vcf now left-trims and decomposes variants into indels
  - Functional tests reorganization
  - Fix for off-by-one errors when using pyintervaltree. See ticket
  - Removed progressbar and Levenshtein dependencies as they are no longer used.
August 27, 2022
- bench
  - --dup-to-ins flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks
  - New --unroll sequence comparison method for bench and collapse (details)
- Major anno trf refactor (TODO write docs) including:
  - annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT)
  - allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations
- New anno grpaf annotates vcf with allele frequency info for groups of samples
- New phab for variant harmonization (details)
- backend
  - truvari.entry_size returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1)
  - New key utility for truvari.build_anno_trees
- general
  - Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4
  - Nice colors in some --help with rich
- divide
  - output shards are now more easily sorted (i.e. ls divide_result/*.vcf.gz will return the shards in the order they were made)
  - compression/indexing of sub-VCFs in separate threads, reducing runtime
- user issues
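The size rule noted for truvari.entry_size can be sketched on plain REF/ALT strings. This is a simplified illustration, not the library function: the helper name allele_size is hypothetical, and the sketch ignores SVLEN and symbolic alleles, which the real function may also consult.

```python
def allele_size(ref: str, alt: str) -> int:
    """Size of a sequence-resolved variant from its REF and ALT strings."""
    if len(ref) == len(alt):
        # Substitutions (SNPs, MNPs) have no length difference,
        # so the event length itself is reported.
        return len(ref)
    return abs(len(ref) - len(alt))


if __name__ == "__main__":
    print(allele_size("A", "T"))     # SNP
    print(allele_size("A", "ACGT"))  # insertion
    print(allele_size("ACGT", "A"))  # deletion
```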
July 7, 2022
- Improved performance of consistency (see #127)
- Added optional json output of consistency report
- Allow GT to be missing, which is allowed by VCF format specification
- TRF now uses truvari.entry_variant_type instead of trying to use pysam.VariantRecord.info["SVLEN"] directly which allows greater flexibility.
- vcf2df now parses fields with Number=\d (e.g. 2+), which is a valid description
- truvari.seqsim is now case insensitive (see #128)
- Collapse option to skip consolidation of genotype information so kept alleles are unaltered
- truvari anno dpcnt --present will only count the depths of non ./. variants
- New collapse annotation NumConsolidate records how many FORMATs were consolidated
- Official conda support
May 25, 2022
- New utilities vcf_ranges and make_temp_filename
- New annotations dpcnt and lcr
- Fixed a bug in truvari collapse --keep that prevented the maxqual or common options from working
- Increased determinism for truvari collapse so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF
- New truvari bench --extend functionality. See discussion for details
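The tie-breaking order described for truvari collapse can be expressed as a sort key. The sketch below is illustrative only: the Variant tuple is hypothetical, and measuring allele length as the longer of REF and ALT is an assumption.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    pos: int
    ref: str
    alt: str


def collapse_order(variants: List[Variant]) -> List[Variant]:
    """Order by position, then longer allele first, then REF alphabetically."""
    def key(v: Variant):
        allele_len = max(len(v.ref), len(v.alt))  # assumed length measure
        return (v.pos, -allele_len, v.ref)
    return sorted(variants, key=key)


if __name__ == "__main__":
    calls = [
        Variant(100, "C", "CTTT"),
        Variant(100, "A", "ACGT"),
        Variant(100, "A", "ACGTACGT"),
        Variant(50, "G", "GA"),
    ]
    for v in collapse_order(calls):
        print(v)
```

Because every field in the key is taken from the record itself, the result does not depend on the order in which the variants were read.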
Apr 1, 2022
- Removed truvari.copy_entry in favor of pysam.VariantRecord.translate, a 10x faster operation
- Faster truvari collapse (@c8b319b)
- When building MatchResult between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly.
- Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353
- New truvari anno density for identifying regions with 'sparse' and 'dense' overlapping SVs (details)
- Better bench genotype reporting with summary.txt having a gt_matrix of Base GT x Comp GT for all Base calls' best, TP match.
- New truvari anno bpovl for intersecting against tab-delimited files (details)
- New truvari divide command to split VCFs into independent parts (details)
- Replaced --buffer parameter with --minhaplen for slightly better matching specificity
- Bugfix - truvari anno trf no longer duplicates entries spanning multiple parallelization regions
- Bugfix - collapse MatchId/CollapseId annotation wasn't working
MatchId/CollapseId annotation wasn't working - Bugfixes - from wwliao (@4dd9968 @ef2cfb3)
- Bugfixes - Issues #107, #108
Dec 22, 2021
- bench now annotates FPs by working a little differently. See bench for details.
- Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved INS (details)
- Match objects are now usable via the SDK. See #94 for an example of using Truvari programmatically
- file_zipper VCF iteration strategy (GenomeTree -> RegionVCFIterator) that improves speed, particularly when using --includebed
- collapse refactored to use Match object and for prettier code, cleaner output.
- anno remap now optionally adds INFO field of the location of the top N hits.
- An experimental tool truvari segment added to help SV association analysis.
- vcf2df now supports pulling FORMAT fields from multiple samples.
- vcf2df now adds ('_ref', '_alt'), or ('_ref', '_het', '_hom') for INFO,Number=[R|G] fields, respectively.
- Improved documentation, including http://truvari.readthedocs.io/ for developers.
- Increasing/diversifying test coverage exposed minor bugs which were fixed.
- bench --no-ref --cSample bug fixes.
- Minor usability feature implemented in help_unknown_cmd.
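The vcf2df column naming for Number=R and Number=G INFO fields can be sketched as below. The helper and the field names AD and GC are hypothetical, and the sketch assumes biallelic records, so that R carries two values and G carries three.

```python
from typing import Dict, Sequence

# Suffixes per VCF Number type, as listed in the release note above.
SUFFIXES = {"R": ("_ref", "_alt"), "G": ("_ref", "_het", "_hom")}


def expand_info_field(name: str, number: str, values: Sequence) -> Dict[str, object]:
    """Split a multi-valued INFO field into one suffixed column per value."""
    suffixes = SUFFIXES[number]  # raises KeyError for other Number types
    if len(values) != len(suffixes):
        raise ValueError(f"{name}: expected {len(suffixes)} values, got {len(values)}")
    return {name + suffix: value for suffix, value in zip(suffixes, values)}


if __name__ == "__main__":
    print(expand_info_field("AD", "R", (12, 7)))
    print(expand_info_field("GC", "G", (5, 3, 1)))
```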
Sep 15, 2021
As Truvari's adoption and functionality grow, we decided to spend time working on the sustainability and performance of the tool. Multiple Actions for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version.
- Working on speed improvements
- Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (--use-lev).
- truvari bench summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. --no-ref has better functionality. --giabreport is different.
- Added --keep common to truvari collapse, which allows one to choose to keep the allele with the highest MAC.
- truvari collapse --hap wasn't working correctly. The assumptions about the calls being phased weren't being properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best match was chosen. The latter is a reporting problem, but the former had an effect on the results with ~3% of collapsed calls being mis-collapsed.
- truvari anno trf is now faster and simpler in its approach and what's reported, and hopefully more useful.
- truvari anno grm has min_size and regions arguments added.
- truv2df has become truvari vcf2df where the default is vcf conversion with options to run on a truvari bench output directory. It also allows a specific sample to be parsed with --format and better Number=A handling.
- NeighId added to truvari anno numneigh, which works like bedtools cluster.
- The method af_calc now makes MAC/AC.
- Added 'partial' to truvari anno remap.
- Added truvari anno svinfo.
- Removed truvari stats as truvari vcf2df is better and began building community-driven summaries.
- Ubiquitous single version.
- Added a Dockerfile and instructions for making a Truvari docker container.
- Code and repository cleaning.
- Github actions for automated pylint, testing, and releases to pypi.
- Preserving per-version documentation from the wiki in docs/.
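The gt_concordance metric mentioned above can be sketched as a simple ratio. The input shape (pairs of baseline and comparison genotype tuples with integer alleles) and the choice to ignore phase when comparing are assumptions made for illustration. The function returns a fraction, which can be multiplied by 100 for a percent.

```python
from typing import Iterable, Optional, Tuple

Genotype = Tuple[int, ...]


def gt_concordance(tp_pairs: Iterable[Tuple[Genotype, Genotype]]) -> Optional[float]:
    """Fraction of TP-comp calls whose genotype agrees with the matched base call."""
    pairs = list(tp_pairs)
    if not pairs:
        return None  # undefined when there are no TP calls
    # Sorting the alleles makes the comparison phase-insensitive (assumed).
    concordant = sum(1 for base, comp in pairs if sorted(base) == sorted(comp))
    return concordant / len(pairs)


if __name__ == "__main__":
    matches = [((0, 1), (1, 0)), ((1, 1), (0, 1)), ((0, 1), (0, 1)), ((1, 1), (1, 1))]
    print(gt_concordance(matches))
```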
Jan 27, 2021
We've expanded and improved Truvari's annotations. We've added an SV "collapsing" tool. And we've added a way to turn VCFs into pandas DataFrames easily for downstream analysis/QC.
May 14, 2020
After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis.
Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (truvari bench), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the WIKI.
We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation.
September 25th, 2019
Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), we're now only supporting Python 3.
Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation below. This will enable us to refactor the code for easier maintenance and reusability.
Finally, we now automatically report genotype comparisons in the summary stats.