Truth_set information for benchmarking #27
Hi - good question. You generally can only use v0.6 on GRCh37. For GRCh38, we have a published GIAB benchmark for challenging medically relevant genes that includes ~200 SVs (https://rdcu.be/cGwVA). We also have a preliminary draft benchmark from the HG002 T2Tv0.9 assembly, which includes more challenging SVs, but we haven't evaluated it much yet, so we recommend curating some of the FPs and FNs: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. We are working on a new HG002-T2T assembly-based SV benchmark that we will evaluate in the coming months.
Hello again, and thank you for answering. I was curious to know whether I can benchmark .vcf files against the v1.0 T2T-HG002 assembly that was recently published on the GIAB website.
Hi, The vcf in the provided link contains the changes that were made between v0.9 of the HG002 Q100 assembly and v1.0, so it is not suitable for benchmarking. We have a new draft benchmark using v1.0 of the assembly that we expect to post on the ftp site this week.
Hi, Thank you for clarifying my previous doubt. However, I'm now more confused. To assist you in helping me, I've attached the results from my latest benchmark using this truth set: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/. Could you please review the results? I seem to be encountering a high number of false negatives, and I'm wondering if there might be errors in my approach.
I took a quick look at your results. Can you provide the truvari command and the specific files on the ftp site you used for benchmarking, to help with interpreting the results? Thanks!
Hello, |
Hi @poddarharsh15 - thanks for testing out this draft benchmark. We've not evaluated this extensively yet, but I do expect that standard short-read-based methods will have much lower recall for this benchmark, because it includes many more challenging SVs, many of them in long tandem repeat regions. You also may want to take a look at the FPs to see if it makes sense to change the matching parameters in truvari to make them more lenient.
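As a rough sketch of what loosening the matching could look like, using truvari v4 option names (the thresholds below are illustrative starting points, not recommended values):

```bash
# More lenient truvari matching (illustrative values; truvari v4 option
# names, where --pctseq replaced the older --pctsim):
#   --pctseq 0.5   minimum sequence similarity (default 0.7)
#   --pctsize 0.5  minimum size similarity (default 0.7)
#   --refdist 1000 maximum reference distance between matches (default 500)
truvari bench \
    -b draft_benchmark.vcf.gz \
    -c my_calls.vcf.gz \
    --includebed draft_benchmark.bed \
    --pctseq 0.5 \
    --pctsize 0.5 \
    --refdist 1000 \
    -o truvari_lenient_out/
```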
delly.json |
Hi @poddarharsh15 - The CMRG SVs include many in tandem repeats and some in segmental duplications, so I expect the high FN rate reflects this. I recommend you examine some of the FPs and FNs in IGV with your bam and vcf alongside the CMRG vcf and a long read bam file like |
Hello @jzook, |
Hi @poddarharsh15, HG001 and HG002 are two separate individuals/genomes. You will want to use vcfs generated from HG002 fastqs/bams when using an HG002 benchmark. Here is a link where you can find fastqs for HG002 comparable to the ones you used in your HG001 analysis: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/. Let us know if you have any further questions. Best!
Thank you for your response. I had a bit of confusion regarding the truth files. As I am benchmarking pipelines specifically designed for Structural Variations (SVs) and not for small INDELs and SNVs, my understanding is that I can only use HG002 data for this purpose. Is my understanding correct? |
That is correct. We currently only have SV benchmarks for HG002. |
Thank you so much. |
Hello, good afternoon. Attached are log files from the dysgu pipeline and log files from the Manta pipeline.
Hi, Hope this helps.
Hello, Any insights you can provide would be greatly appreciated.
Hi @nate-d-olson,
@poddarharsh15 I have not used sv-bench. I don't use any specific truvari arguments or parameters for the SVTYPE annotation, but here is the truvari bench command I use.
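A representative invocation of that form, with placeholder file paths (the exact files and any additional flags from the original command are not shown here):

```bash
# Representative truvari bench run; all paths are placeholders.
# Both vcfs should be bgzipped and tabix-indexed.
#   -b            benchmark ("truth") SV vcf
#   -c            comparison callset from your SV caller
#   --includebed  benchmark regions to restrict the comparison to
#   --passonly    only consider calls with FILTER PASS in both vcfs
truvari bench \
    -b HG002_SV_benchmark.vcf.gz \
    -c my_sv_calls.vcf.gz \
    --includebed HG002_SV_benchmark.bed \
    --passonly \
    --sizemin 50 \
    -o truvari_bench_out/
```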
When comparing the benchmark set to phased variant callsets, truvari refine does a nice job comparing complex variants with different representations. Note this step can be compute-intensive and slow for whole-genome benchmark sets. It will run on unphased vcfs, but this is not advised, as it is unclear whether it correctly accounts for differences in variant representations.
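As a minimal sketch (assuming truvari v4's refine subcommand, which reprocesses an existing bench output directory; the flags and paths below are illustrative):

```bash
# Refine an existing truvari bench result directory (illustrative).
# --use-original-vcfs re-evaluates regions with the unfiltered input vcfs;
# --reference supplies the fasta needed for realignment-based comparison.
truvari refine \
    --reference GRCh38.fa \
    --use-original-vcfs \
    truvari_bench_out/
```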
We are working on a new assembly-based SV benchmark set. Are you willing to share any of the HG002 SV callsets you have generated for us to use as part of our internal evaluations of the new benchmark set?
Thank you @nate-d-olson for your response.
Sorry, I was unclear. Yes, I mean the vcfs generated by the different SV callers you have used. Feel free to email me at [email protected] to take this offline. Best! Nate
Thank you for clarifying.
Hi,
I'm currently working on benchmarking VCF files generated from HG002 data (a test run with just one sample) for SV calling (Manta, Lumpy, GRIDSS, nf-core/sarek) against a truth set. I aligned the BAM files to GRCh38. Any ideas on how to effectively benchmark my results, and against which truth set? One point of confusion: can I use the truth sets from SV_0.6/ to benchmark the vcf files (aligned to GRCh38) generated by the SV caller tools? I am using truvari and SVanalyzer for benchmarking.
Thank you.