Truth_set information for benchmarking #27

poddarharsh15 opened this issue Nov 21, 2023 · 22 comments

@poddarharsh15

Hi,
I'm currently working on benchmarking VCF files generated from HG002 data (a test run with just one sample) for SV calling (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Any ideas on how to effectively benchmark my results, and against which truth set? One point of confusion: can I use the truth sets from SV_0.6/ to benchmark the VCF files (aligned to GRCh38) generated by the SV caller tools? I am using Truvari and SVanalyzer for benchmarking.
Thank you.

@jzook
Contributor

jzook commented Nov 21, 2023

Hi - good question. You generally can only use v0.6 on GRCh37. For GRCh38, we have a published GIAB benchmark for challenging medically relevant genes that includes ~200 SVs (https://rdcu.be/cGwVA). We also have a preliminary draft benchmark from the HG002 T2Tv0.9 assembly, which includes more challenging SVs, but we haven't evaluated it much yet, so we recommend curating some of the FPs and FNs: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. We are working on a new HG002-T2T assembly-based SV benchmark that we will evaluate in the coming months.

@poddarharsh15
Author

Hello again, and thank you for answering. I was curious to know whether I can benchmark my .vcf files against the v1.0 T2T-HG002 assembly that was recently published on the GIAB website.

@nate-d-olson

Hi, the VCF at the provided link contains the changes made between v0.9 and v1.0 of the HG002 Q100 assembly, so it is not suitable for benchmarking. We have a new draft benchmark using v1.0 of the assembly that we expect to post to the ftp site this week.

@poddarharsh15
Author

Hi, thank you for clarifying my previous doubt. However, I'm now more confused. To help you help me, I've attached the results from my latest benchmark using this truth set: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. Could you please review the results? I seem to be encountering a high number of false negatives, and I'm wondering if there might be errors in my approach.
delly_summary.json
Manta_summary.json
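
As a side note, the headline numbers in these attachments (truvari bench summary.json files) can be pulled out with a one-liner; this is just a sketch, assuming jq is installed and the files follow truvari's standard summary.json layout:

        # Hypothetical: print the headline metrics from truvari bench summary.json files
        jq '{precision, recall, f1}' delly_summary.json Manta_summary.json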

@nate-d-olson

I took a quick look at your results. Can you provide the truvari command and the specific files on the ftp site that you used for benchmarking, to help with interpreting the results? Thanks!

@poddarharsh15
Author

poddarharsh15 commented Nov 28, 2023

Hello,
Please find the command line and ftp link below.

truvari bench -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz -c delly_sorted.vcf.gz --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed -o delly_cmrg

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz

@jzook
Contributor

jzook commented Nov 28, 2023

Hi @poddarharsh15 - thanks for testing out this draft benchmark. We've not evaluated this extensively yet, but I do expect standard short-read-based methods to have much lower recall for this benchmark, because it includes many more challenging SVs, many of them in long tandem repeat regions. You may also want to take a look at the FPs to see if it makes sense to change the matching parameters in truvari to make them more lenient.
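
For illustration, more lenient matching might look like the sketch below; the specific thresholds are assumptions rather than GIAB recommendations, and the flag names assume truvari 4.x (older releases call the sequence-similarity flag --pctsim):

        # Hypothetical example of relaxed truvari bench matching thresholds.
        # Defaults are roughly --refdist 500 --pctsize 0.7 --pctseq 0.7;
        # lowering them makes matching more lenient (--pctseq 0 disables
        # sequence comparison entirely).
        truvari bench \
            -b GRCh38_HG002-T2TQ100-V0.9_stvar.vcf.gz \
            -c delly_sorted.vcf.gz \
            --includebed GRCh38_HG002-T2TQ100-V0.9_stvar.benchmark.bed \
            --refdist 1000 \
            --pctsize 0.5 \
            --pctseq 0 \
            -o delly_lenient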

@poddarharsh15
Author

poddarharsh15 commented Nov 30, 2023

Hi @jzook,
I've benchmarked some data on the CMRG genes using truvari. The VCF file is available here: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz The BAM files are aligned to GRCh38. Could you please provide insights into these results?
Thank you.
delly.json
Manta.json

@jzook
Contributor

jzook commented Dec 4, 2023

Hi @poddarharsh15 - The CMRG SVs include many in tandem repeats and some in segmental duplications, so I expect the high FN rate reflects this. I recommend you examine some of the FPs and FNs in IGV with your bam and vcf alongside the CMRG vcf and a long read bam file like
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam
and assembly bam files like
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/dipcall_output/
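
As a sketch of that curation step (assuming bcftools is installed and the truvari bench output directory is named delly_cmrg, a hypothetical name based on the earlier command), the false-negative records can be dumped to a small region list for stepping through in IGV:

        # Hypothetical sketch: list FN regions (1-based VCF coordinates) for IGV review.
        bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\n' \
            delly_cmrg/fn.vcf.gz > fn_regions.tsv
        # Curate a random subset rather than every record:
        shuf -n 20 fn_regions.tsv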

@poddarharsh15
Author

poddarharsh15 commented Jan 8, 2024

Hello @jzook,
Thank you for your previous clarification. I have another question regarding benchmarking. This time, I am working with NA12878/HG001 data with 20X-30X coverage files.
I downloaded multiple fastq files, aligned them to the GRCh37 reference genome using bwa-mem, and then used several pipelines to generate my .vcf files. For benchmarking, I'm using truvari bench with the HG002 truth set file. However, I'm consistently seeing very low recall in my results.
Could you please provide any advice or suggestions to improve this?
Please find a log file attached to this message for reference.

log.txt

@nate-d-olson

Hi @poddarharsh15, HG001 and HG002 are two separate individuals/genomes. You will want to use VCFs generated from HG002 fastqs/bams when using an HG002 benchmark. Here is a link to HG002 fastqs comparable to the ones you used in your HG001 analysis: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/. Let us know if you have any further questions.

Best!
Nate

@poddarharsh15
Author

Thank you for your response. I was a bit confused about the truth files. Since I am benchmarking pipelines designed specifically for structural variants (SVs), and not for small indels and SNVs, my understanding is that I can only use HG002 data for this purpose. Is my understanding correct?

@nate-d-olson

That is correct. We currently only have SV benchmarks for HG002.

@poddarharsh15
Author

Thank you so much.

@poddarharsh15
Author

Hello, good afternoon.
I am currently working on benchmarking insertions (INS) and deletions (DEL) using the HG002 truth set files. However, I am running into difficulties when attempting to benchmark deletions and insertions separately.
Could you kindly advise whether it is advisable to benchmark them separately to achieve better precision and recall? I have attached the log file output generated by truvari bench for your reference. Thanks in advance.

Log files from the dysgu pipeline
All_svs.txt
DEL_log.txt
INS_log.txt

Log files from the Manta pipeline

All_sv.txt
DEL.txt
INS.txt

@nate-d-olson

Hi,
You will want to benchmark insertions and deletions together, then parse the truvari output to get separate performance metrics for insertions and deletions. You might also want to try Truvari refine (https://github.com/ACEnglish/truvari/wiki/refine), which helps account for differences in variant representation and provides more accurate benchmarking results. Truvari refine is compute intensive (see ACEnglish/truvari#182 and ACEnglish/truvari#175), so it can run for a while or hang. Another option is hap-eval from Sentieon (https://github.com/Sentieon/hap-eval), which seems to handle comparing complex SVs with different representations better. We also find it helpful to manually curate (view in IGV along with bams from multiple technologies) a random subset of FPs and FNs, to see whether a discrepancy is an error in the query or truth callset or an artifact of the benchmarking method.

Hope this helps.
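
For example, per-type metrics can be derived by counting records by SVTYPE in the per-record VCFs that truvari bench writes (tp-base, fn, tp-comp, fp). The following is a minimal sketch, assuming bcftools is available and a truvari 4.x output directory with the hypothetical name bench_out (in truvari 3.x, tp-comp.vcf.gz is named tp-call.vcf.gz):

        # Hypothetical sketch: per-SVTYPE recall/precision from a truvari bench output dir.
        BENCH=bench_out   # hypothetical truvari bench -o directory
        count() { bcftools view -H -i "INFO/SVTYPE=\"$2\"" "$BENCH/$1" | wc -l; }
        for TYPE in DEL INS; do
            tp_base=$(count tp-base.vcf.gz "$TYPE")
            fn=$(count fn.vcf.gz "$TYPE")
            tp_comp=$(count tp-comp.vcf.gz "$TYPE")
            fp=$(count fp.vcf.gz "$TYPE")
            echo "$TYPE recall=$(echo "scale=4; $tp_base/($tp_base+$fn)" | bc)" \
                 "precision=$(echo "scale=4; $tp_comp/($tp_comp+$fp)" | bc)"
        done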

@poddarharsh15
Author

Hello,
Thank you for your response. As I am relatively new to these studies, I am a bit confused about what you mean by separate performance metrics. Does this involve separating deletions and insertions (using bcftools) from the base and call files and then benchmarking them individually?
For your reference, I have attached a table generated using several SV callers. I am also using truvari refine to improve the benchmark results on HG002.

Any insights you can provide would be greatly appreciated.
bench_GIABv0.6.ods

@poddarharsh15
Author

Hi @nate-d-olson,
I am conducting a new benchmarking test using the CMRG truth set with several SV callers and aligners. I have a query regarding how SVTYPE is described in the .vcf files of the CMRG truth set. Is it necessary to use any special parameters with benchmarking tools such as Truvari and SV-bench?
Thank you for your assistance.

##INFO=<ID=REPTYPE,Number=1,Type=String,Description="Type of SV, with designation of uniqueness of new or deleted sequence:SIMPLEDEL=Deletion of at least some unique sequence, SIMPLEINS=Insertion of at least some unique sequence, CONTRAC=Contraction, or deletion of sequence entirely similar to remaining sequence, DUP=Duplication, or insertion of sequence entirely similar to pre-existing sequence, INV=Inversion, SUBSINS=Insertion of new sequence with alteration of some pre-existing sequence, SUBSDEL=Deletion of sequence with alteration of some remaining sequence">

CMRG_1

@nate-d-olson

@poddarharsh15 I have not used SV-bench. I don't use any specific truvari arguments or parameters for the SVTYPE annotation, but here is the truvari bench command I use.

        truvari bench \
            -b {benchmark variants} \
            -c {comparison vcf} \
            --includebed {benchmark regions} \
            --dup-to-ins \
            --pick ac \
            -o {output_dir} 

When comparing the benchmark set to phased variant callsets, truvari refine does a nice job comparing complex variants with different representations. Note that this step can be compute intensive and slow for whole-genome benchmark sets. It will run on unphased vcfs, but that is compute-intensive and not advised, as it is unclear whether it correctly accounts for differences in variant representation.

        truvari refine \
            --recount \
            --use-region-coords \
            --use-original-vcfs \
            --reference {ref} \
            --regions {candidate refine bed (generated by truvari bench)} \
            {output_dir}

We are working on a new assembly-based SV benchmark set. Are you willing to share any of the HG002 SV callsets you have generated for us to use as part of our internal evaluations of the new benchmark set?

@poddarharsh15
Author

Thank you @nate-d-olson for your response.
I used a similar command line for Truvari to the one you mentioned, except that I did not run Truvari refine, due to the extended runtime and multiple MAFFT errors. Could you please clarify what you mean by HG002 SV callsets? Are you referring to the .vcf files generated from the SV callers that I used for my study?

@nate-d-olson

Sorry, I was unclear. Yes, I mean the vcfs generated by the different SV-callers you have used. Feel free to email me at [email protected] to take this offline. Best! Nate

@poddarharsh15
Author

poddarharsh15 commented May 29, 2024

Thank you for clarifying.
I would be happy to share the .vcf files, but I need to consult with my supervisor first. I will get back to you as soon as possible.
For further information: I ran the SV callers on lower-coverage WGS data (150bp and 250bp reads), so there will be around 42 vcf files :(
