Truth_set information for benchmarking #27

poddarharsh15 opened this issue Nov 21, 2023 · 22 comments

@poddarharsh15

Hi,
I'm currently working on benchmarking VCF files generated from HG002 data (a test run with just one sample) for SV calling (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Any ideas on how to effectively benchmark my results, and against which truth set? One point of confusion: can I use the truth sets from SV_0.6/ to benchmark the VCF files (aligned to GRCh38) generated by the SV caller tools? I am using Truvari and SVanalyzer for benchmarking.
Thank you.

@jzook
Contributor

jzook commented Nov 21, 2023

Hi - good question. You generally can only use v0.6 on GRCh37. For GRCh38, we have a published GIAB benchmark for challenging medically relevant genes that includes ~200 SVs (https://rdcu.be/cGwVA). We also have a preliminary draft benchmark from the HG002 T2Tv0.9 assembly, which includes more challenging SVs, but we haven't evaluated it much yet, so we recommend curating some of the FPs and FNs: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. We are working on a new HG002-T2T assembly-based SV benchmark that we will evaluate in the coming months.

@poddarharsh15
Author

Hello again, and thank you for answering. I was curious to know whether I can benchmark my .vcf files against the v1.0 T2T-HG002 assembly that was recently published on the GIAB website.

@nate-d-olson

Hi, the VCF at the provided link contains the changes made between v0.9 and v1.0 of the HG002 Q100 assembly, so it is not suitable for benchmarking. We have a new draft benchmark using v1.0 of the assembly that we expect to post to the ftp site this week.

@poddarharsh15
Author

Hi, thank you for clarifying my previous doubt. However, I'm now more confused. To help you help me, I've attached the results from my latest benchmark using this truth set: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. Could you please review the results? I seem to be encountering a high number of false negatives, and I'm wondering if there might be errors in my approach.
delly_summary.json
Manta_summary.json
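
As a side note, the headline numbers in these attachments (truvari bench summary.json files) can be pulled out with a one-liner; this is just a sketch, assuming jq is installed and the files follow truvari's standard summary.json layout:

        # Hypothetical: print the headline metrics from truvari bench summary.json files
        jq '{precision, recall, f1}' delly_summary.json Manta_summary.json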

@nate-d-olson

I took a quick look at your results. Can you provide the truvari command and the specific files on the ftp site that you used for benchmarking, to help with interpreting the results? Thanks!

@poddarharsh15
Author

poddarharsh15 commented Nov 28, 2023

Hello,
Please find the command line and ftp link below.

truvari bench -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz -c delly_sorted.vcf.gz --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed -o delly_cmrg

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz

@jzook
Contributor

jzook commented Nov 28, 2023

Hi @poddarharsh15 - thanks for testing out this draft benchmark. We've not evaluated this extensively yet, but I do expect standard short-read-based methods to have much lower recall for this benchmark, because it includes many more challenging SVs, many of them in long tandem repeat regions. You may also want to take a look at the FPs to see if it makes sense to change the matching parameters in truvari to make them more lenient.
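
For illustration, more lenient matching might look like the sketch below; the specific thresholds are assumptions rather than GIAB recommendations, and the flag names assume truvari 4.x (older releases call the sequence-similarity flag --pctsim):

        # Hypothetical example of relaxed truvari bench matching thresholds.
        # Defaults are roughly --refdist 500 --pctsize 0.7 --pctseq 0.7;
        # lowering them makes matching more lenient (--pctseq 0 disables
        # sequence comparison entirely).
        truvari bench \
            -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz \
            -c delly_sorted.vcf.gz \
            --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed \
            --refdist 1000 \
            --pctsize 0.5 \
            --pctseq 0 \
            -o delly_lenient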

@poddarharsh15
Author

poddarharsh15 commented Nov 30, 2023

Hi @jzook,
I've benchmarked some data on the CMRG genes using truvari. The VCF file is available here: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz The BAM files are aligned to GRCh38. Could you please provide insights into these results?
Thank you.
delly.json
Manta.json

@jzook
Contributor

jzook commented Dec 4, 2023

Hi @poddarharsh15 - The CMRG SVs include many in tandem repeats and some in segmental duplications, so I expect the high FN rate reflects this. I recommend you examine some of the FPs and FNs in IGV with your bam and vcf alongside the CMRG vcf and a long read bam file like
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam
and assembly bam files like
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/dipcall_output/
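
As a sketch of that curation step (assuming bcftools is installed and the truvari bench output directory is named delly_cmrg, a hypothetical name based on the earlier command), the false-negative records can be dumped to a small region list for stepping through in IGV:

        # Hypothetical sketch: list FN regions (1-based VCF coordinates) for IGV review.
        bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\n' \
            delly_cmrg/fn.vcf.gz > fn_regions.tsv
        # Curate a random subset rather than every record:
        shuf -n 20 fn_regions.tsv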

@poddarharsh15
Author

poddarharsh15 commented Jan 8, 2024

Hello @jzook,
Thank you for your previous clarification. I have another question regarding benchmarking. This time, I am working with NA12878/HG001 data with 20X-30X coverage files.
I downloaded multiple fastq files, aligned them to the GRCh37 reference genome using bwa-mem, and then used several pipelines to generate my .vcf files. For benchmarking, I'm using truvari bench with the HG002 truth set file. However, I'm consistently seeing very low recall in my results.
Could you please provide any advice or suggestions to improve this?
Please find a log file attached to this message for reference.

log.txt

@nate-d-olson

Hi @poddarharsh15, HG001 and HG002 are two separate individuals/genomes. You will want to use VCFs generated from HG002 fastqs/bams when using an HG002 benchmark. Here is a link to HG002 fastqs comparable to the ones you used in your HG001 analysis: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/. Let us know if you have any further questions.

Best!
Nate

@poddarharsh15
Author

Thank you for your response. I was a bit confused about the truth files. Since I am benchmarking pipelines designed specifically for structural variants (SVs), and not for small indels and SNVs, my understanding is that I can only use HG002 data for this purpose. Is my understanding correct?

@nate-d-olson

That is correct. We currently only have SV benchmarks for HG002.

@poddarharsh15
Author

Thank you so much.

@poddarharsh15
Author

Hello, good afternoon.
I am currently working on benchmarking insertions (INS) and deletions (DEL) using the HG002 truth set files. However, I am running into difficulties when attempting to benchmark deletions and insertions separately.
Could you kindly advise whether it is advisable to benchmark them separately to achieve better precision and recall? I have attached the log file output generated by truvari bench for your reference. Thanks in advance.

Log files from the dysgu pipeline
All_svs.txt
DEL_log.txt
INS_log.txt

Log files from the Manta pipeline

All_sv.txt
DEL.txt
INS.txt

@nate-d-olson

Hi,
You will want to benchmark insertions and deletions together, then parse the truvari output to get separate performance metrics for insertions and deletions. You might also want to try Truvari refine (https://github.com/ACEnglish/truvari/wiki/refine), which helps account for differences in variant representation and provides more accurate benchmarking results. Truvari refine is compute intensive (see ACEnglish/truvari#182 and ACEnglish/truvari#175), so it can run for a while or hang. Another option is hap-eval from Sentieon (https://github.com/Sentieon/hap-eval), which seems to handle comparing complex SVs with different representations better. We also find it helpful to manually curate (view in IGV along with bams from multiple technologies) a random subset of FPs and FNs, to see whether a discrepancy is an error in the query or truth callset or an artifact of the benchmarking method.

Hope this helps.
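
For example, per-type metrics can be derived by counting records by SVTYPE in the per-record VCFs that truvari bench writes (tp-base, fn, tp-comp, fp). The following is a minimal sketch, assuming bcftools is available and a truvari 4.x output directory with the hypothetical name bench_out (in truvari 3.x, tp-comp.vcf.gz is named tp-call.vcf.gz):

        # Hypothetical sketch: per-SVTYPE recall/precision from a truvari bench output dir.
        BENCH=bench_out   # hypothetical truvari bench -o directory
        count() { bcftools view -H -i "INFO/SVTYPE=\"$2\"" "$BENCH/$1" | wc -l; }
        for TYPE in DEL INS; do
            tp_base=$(count tp-base.vcf.gz "$TYPE")
            fn=$(count fn.vcf.gz "$TYPE")
            tp_comp=$(count tp-comp.vcf.gz "$TYPE")
            fp=$(count fp.vcf.gz "$TYPE")
            echo "$TYPE recall=$(echo "scale=4; $tp_base/($tp_base+$fn)" | bc)" \
                 "precision=$(echo "scale=4; $tp_comp/($tp_comp+$fp)" | bc)"
        done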

@poddarharsh15
Author

Hello,
Thank you for your response. As I am relatively new to these studies, I am a bit confused about what you mean by separate performance metrics. Does this involve separating deletions and insertions (using bcftools) from the base and call files and then benchmarking them individually?
For your reference, I have attached a table generated using several SV callers. I am also using truvari refine to improve the benchmark results on HG002.

Any insights you can provide would be greatly appreciated.
bench_GIABv0.6.ods

@poddarharsh15
Author

Hi @nate-d-olson,
I am conducting a new benchmarking test using the CMRG truth set with several SV callers and aligners. I have a query regarding how SVTYPE is described in the .vcf files of the CMRG truth set. Is it necessary to use any special parameters with benchmarking tools such as Truvari and SV-bench?
Thank you for your assistance.

##INFO=<ID=REPTYPE,Number=1,Type=String,Description="Type of SV, with designation of uniqueness of new or deleted sequence:SIMPLEDEL=Deletion of at least some unique sequence, SIMPLEINS=Insertion of at least some unique sequence, CONTRAC=Contraction, or deletion of sequence entirely similar to remaining sequence, DUP=Duplication, or insertion of sequence entirely similar to pre-existing sequence, INV=Inversion, SUBSINS=Insertion of new sequence with alteration of some pre-existing sequence, SUBSDEL=Deletion of sequence with alteration of some remaining sequence">

CMRG_1

@nate-d-olson

@poddarharsh15 I have not used SV-bench. I don't use any specific truvari arguments or parameters for the SVTYPE annotation, but here is the truvari bench command I use.

        truvari bench \
            -b {benchmark variants} \
            -c {comparison vcf} \
            --includebed {benchmark regions} \
            --dup-to-ins \
            --pick ac \
            -o {output_dir} 

When comparing the benchmark set to phased variant callsets, truvari refine does a nice job comparing complex variants with different representations. Note that this step can be compute intensive and slow for whole-genome benchmark sets. It will run on unphased vcfs, but that is compute-intensive and not advised, as it is unclear whether it correctly accounts for differences in variant representation.

        truvari refine \
            --recount \
            --use-region-coords \
            --use-original-vcfs \
            --reference {ref} \
            --regions {candidate refine bed (generated by truvari bench)} \
            {output_dir}

We are working on a new assembly-based SV benchmark set. Are you willing to share any of the HG002 SV callsets you have generated for us to use as part of our internal evaluations of the new benchmark set?

@poddarharsh15
Author

Thank you @nate-d-olson for your response.
I used a similar command line for Truvari to the one you mentioned, except that I did not run Truvari refine, due to the extended runtime and multiple MAFFT errors. Could you please clarify what you mean by HG002 SV callsets? Are you referring to the .vcf files generated from the SV callers that I used for my study?

@nate-d-olson

Sorry, I was unclear. Yes, I mean the vcfs generated by the different SV-callers you have used. Feel free to email me at [email protected] to take this offline. Best! Nate

@poddarharsh15
Author

poddarharsh15 commented May 29, 2024

Thank you for clarifying.
I would be happy to share the .vcf files, but I need to consult with my supervisor first. I will get back to you as soon as possible.
For further information: I ran the SV callers on lower-coverage WGS data (150bp and 250bp reads), so there will be around 42 vcf files :(
