Mutect2: drop in SNV specificity/precision between v.4.1.8.1 and 4.1.9.0 #7921
From what I have heard from Dmitriy, this loss in Mutect2 precision between v.4.1.8.1 and v.4.1.9.0 seems to be perfectly reproducible using default parameters on the HCC1395 somatic benchmark gold standard from the Sequencing Quality Control Phase II Consortium. The drop in performance seems to persist in the current Mutect2 version. If true, this could have had dramatic effects on Mutect2 clinical variant calling quality from v.4.1.9.0 until now. It would appear that isolating the root cause of this drop in performance is highly important for maintaining the clinical validity of Mutect2. Quite a number of Mutect2 code details have changed between v.4.1.8.1 and v.4.1.9.0, for instance this commit concerning quality filtering, or this one concerning soft-clipped bases. Perhaps the fact that the drop in precision affects WES but not WGS data may point to the culprit? Also, may I ask whether the GATK team is running real-world regression tests on gold-standard variant callsets between version releases, and whether these tests passed between v.4.1.8.1 and v.4.1.9.0? |
Glad to see we're not the only ones noticing this. We have been using 4.0.11.x for quite a while with success. During a recent revalidation effort, we have been working with 4.2.x versions and missing quite a few calls we consider "truth", while making these calls with other variant callers. We have since verified that v4.1.8.0 performs acceptably, so our data supports the issue having been introduced in the 4.1.9.0 update. |
Same here, @jhl667 - good to get additional confirmation. May I ask whether you used the same gold-standard call set or another one, and whether yours was WGS or WES? I’m linking @droazen here as well since he acted as the release manager of the affected releases. I think it would be good to get clarity soon, since probably tens of thousands of clinical samples have been processed with the affected versions in the last two years, and a reduction in precision of 10-20% (absolute) may have been relevant for clinical decisions in at least some of these cases. Also, I know how hard it is to avoid such things in what is essentially research software (and such a great one to boot), so this is not about blaming anyone but about fixing the root cause as soon as possible. If it turns out that the issue only affects this particular reference call set and no patients (for some arcane reason), then all the better in my view. |
@schelhorn We have a large clinical "truth" set we utilize during workflow validations. We also utilize spike-in samples from SeraCare and perform dilutions using a couple of the common Coriell cell lines. We noticed the Mutect2 calling inconsistency while validating a small targeted panel. |
To clarify our experimental setup: for the benchmark we are using the latest release (v.1.2) of the somatic "ground truth" of the HCC1395 cell line from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/. It contains ~40k SNVs and ~2k INDELs in ~2.4 Gb of high-confidence regions. Therefore, even in the WES analysis scenario, the SNV counts should be high enough to draw reliable conclusions when comparing performance between different callers and releases. |
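For reproducibility, here is a sketch of fetching and sanity-checking the truth set (the file names below are placeholders; the exact v1.2 names are listed in the release directory):

```bash
base=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest
# Placeholder file names for the v1.2 SNV/INDEL truth sets and high-confidence regions
wget "${base}/sSNV_truth_v1.2.vcf.gz" "${base}/sINDEL_truth_v1.2.vcf.gz" "${base}/High-Confidence_Regions_v1.2.bed"

# Rough sanity check of record counts (~40k SNVs and ~2k INDELs expected)
zcat sSNV_truth_v1.2.vcf.gz   | grep -vc '^#'
zcat sINDEL_truth_v1.2.vcf.gz | grep -vc '^#'
```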
This issue is very important to me as well. When I deal with CAP data (high-depth capture cfDNA data), GATK 4.1.0.8's performance is much better than GATK 4.2... |
Thanks for the report! I am contacting the lead developer of Mutect2, who will reply here. |
Hi @droazen and @davidbenjamin, could you please give us an estimate about when you will be able to look at this issue in more detail to identify the root cause? Thank you. |
@davidbenjamin is away at the moment, but should be able to look into this issue when he returns next week |
Hi @droazen, is there perhaps already any news on the matter? It’s been a month, and we are wondering about when this may be looked into. Thanks! |
@schelhorn @davidbenjamin has been out for most of this past month, but he's promised to look into this report as soon as he can find the time. I anticipate an initial reply from him next week. Thanks for your patience. I'll add that much of our recent development work on the somatic variation front has gone into a new ML-based tool we're calling "Mutect3" (but which may end up with a different name once it's released). This new tool is already giving much better results than Mutect2. |
@ddrichel @jhl667 @schelhorn @xiucz I have a theory that, depending on the answers to the following questions, I may pursue further. |
Hi David,
In the minimal example, we do not use a PON. In the production pipeline, we use the exome PON file from gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz, as recommended here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-. Some more data from the production pipeline, just in case it could help.
The results: We also conducted experiments with a subset of the WES FD sample (FD_1, 1/3 of the full ~100x FD dataset):
The results (please note that the labeling conventions are now different compared to WGS experiments, apologies for the inconvenience):
This was found the following way.
Find new FPs present in the raw output of v.4.1.8.1, again matching by CHROM, POS, REF, ALT:
For the 8 new FPs that were detected but filtered in v.4.1.8.1 (all of which are SNVs), find the FILTER classification: Hope this helps, Dmitriy |
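A minimal sketch of that matching step (not the exact commands from the original analysis; file names are placeholders):

```bash
# Keys (CHROM, POS, REF, ALT) of the new false positives from the v4.1.9.0 run
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' new_fps_v4.1.9.0.vcf.gz > fp_keys.tsv

# Look the same keys up in the unfiltered v4.1.8.1 output and summarize their FILTER values
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\n' unfiltered_v4.1.8.1.vcf.gz \
  | awk 'NR==FNR { keys[$1 FS $2 FS $3 FS $4]; next }
         ($1 FS $2 FS $3 FS $4) in keys' fp_keys.tsv - \
  | cut -f5 | sort | uniq -c
```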
Thank you @ddrichel! That's really helpful and seems to support my suspicion. The most relevant changes to Mutect2 between versions 4.1.8.1 and 4.1.9.0 are #6520, #6696, and #6821, which have in common the property that they are bug fixes restoring sensitivity in edge cases. I don't think that any of these are regressions because the job of Mutect2 is to output any candidate variant. Filtering is the responsibility of FilterMutectCalls. Because our panel of normals is generated by running Mutect2, a panel of normals generated before 4.1.9.0 is blind to artifacts that occur in these edge cases. I suspect that re-generating our panels of normals using a more recent release of Mutect2 will solve the problem. I'm going to need to see how long it will take to gather the samples to re-generate our hg38 panel of normals (it might not take long at all) and triage that against the timetable for Mutect3, which does away with the panel of normals entirely. |
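For anyone who wants to try that before an updated official panel is published, here is a sketch of the documented three-step PON workflow run with a recent GATK release (BAM files, interval list, and sample count are placeholders; arguments may vary slightly between GATK versions):

```bash
# 1) Run Mutect2 in tumor-only mode on each normal sample
gatk Mutect2 -R ref.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R ref.fasta -I normal2.bam --max-mnp-distance 0 -O normal2.vcf.gz

# 2) Consolidate the per-normal calls into a GenomicsDB workspace
gatk GenomicsDBImport -R ref.fasta -L intervals.interval_list \
  --genomicsdb-workspace-path pon_db \
  -V normal1.vcf.gz -V normal2.vcf.gz

# 3) Create the panel of normals from the workspace
gatk CreateSomaticPanelOfNormals -R ref.fasta -V gendb://pon_db -O pon.vcf.gz
```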
Excellent, thank you @davidbenjamin and @ddrichel! |
Hi, my apologies for not answering these questions sooner. We are utilizing a panel of normals that was created in house from about 40 or so samples. The reference assembly is still hg19. In 4.1.9.0, we noticed multiple high-confidence variants that were no longer being called using equivalent parameter sets. So in our case, we are dealing with a false-negative problem. |
Hi @davidbenjamin, may I ask whether there has been any progress on addressing the bug and/or generating the new PoN? Thank you. |
Sorry for not letting this go, but this is a severe bug with repercussions on clinical diagnostics that has neither been communicated to other users nor fixed for more than three months. @droazen, is there anything else that we could do to hasten this along? Is there any workaround other than downgrading Mutect2 to the version from more than two years ago? |
I am going to be able to identify samples for a new panel of normals this week, after which generating and validating the panel will take another few days. |
Excellent news, thank you @davidbenjamin! |
Happy New Year, @davidbenjamin! Coming back to your comment from October, did you perhaps have the chance to look at the new panel of normals and whether it solves the issue? It has been almost exactly half a year since we reported the bug, and given the severity of the issue regarding clinical applications it would be important to get a better understanding of whether the updated PON fixes it. Thanks a bunch! |
@schelhorn We also develop clinical workflows, and came to the conclusion we would hard-stop at 4.1.8.0 and wait on Mutect3 to be ready for primetime. My only comment is that it would be strange to me if a PON update solved this, considering we use a custom in-house PON... |
@jhl667, yep, I understand. Still, this issue is too big to be left alone. I think the Broad has to act here, since patients' lives are at stake and we haven't received any actionable response for half a year. Mutect2 has a good reputation, and the Broad profits from that, but it is also medical software and should be treated as such. Mutect2 will continue to be used for some time, and this has to be fixed. @droazen, is there anything else you people can do? Is the wider GATK dev team aware of the issue? Is this something I should escalate to someone else so that you get support from your management to fix it? We have pharma collabs with the Broad in place; I could try doing it that way, or via the genomics/pharma community. I'm not trying to be threatening or anything, just thinking out loud about how we can help you get the resources to solve this. |
@schelhorn Sorry for the long wait on this issue, which we do take seriously! Unfortunately as mentioned above we have extremely limited resources for Mutect2 maintenance at the moment due to our focus on Mutect3, which we see as the long-term solution to this and other issues. You should definitely continue to run 4.1.8.0 until this is resolved, if at all possible. Having said that, we are generating a new M2 PON now, and should know soon whether it resolves this issue as @davidbenjamin hopes. Stay tuned for an update on the outcome of this evaluation within the next week or two. |
@droazen +1 for being affected by this issue in production. As this is in production (same as @schelhorn, with big pharma, which has very strict security requirements), and as 4.1.8.0 contains critical security vulnerabilities that were mitigated in subsequent releases, we are in a serious pickle here. @jhl667, how did you conclude that 4.1.8.0 performs better than newer versions? What we see is that it emits more variants, but after filtering and intersecting with other callers (e.g., Strelka), we get more variants and a "better" result (we can't really define "better" - it's merely an observation) with 4.2.4.1. |
@ury the performance evaluation is based on the HCC1395 benchmark, as described in my original report (see above). Seeing more variants in newer versions is in line with our analysis, but the new variants are likely to be mostly false positives. |
@ddrichel so in our case it seems to be the opposite: 4.1.8.0 produces many more variants than 4.2.4.1 in the raw output, but after filtering and intersection it's the other way around. |
Update: @davidbenjamin has generated a new panel of normals, and is in the process of running some analysis on it. |
Okay, I have a new panel for hg38 here: gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz It has all the variants of the old panel, plus more that arose in more recent versions of Mutect2. It is also generally somewhat more conservative, with a greater bias toward precision than the previous one. This panel is intended to be used at your own risk. I can vouch that it doesn't wreck the results of our own validations but I do not have time to vet it thoroughly enough to put it in the best practices google bucket. Likewise, I cannot promise that it will improve specificity in any particular set of samples. Within several months I hope we are all running the next version of Mutect and never need to see a panel of normals again. |
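For those who want to evaluate it, a sketch of pulling the panel and plugging it into the benchmark command used above (assuming gsutil is available and that an index is published next to the VCF):

```bash
# Copy the 2023 hg38 panel (and its index, if present) from the public bucket
gsutil cp "gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz*" .

# Re-run calling with the new panel in place of the old best-practices PON
gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed \
  --input $inbam_t --input $inbam_n --reference $ref \
  --panel-of-normals mutect2-hg38-pon.vcf.gz
```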
@davidbenjamin @droazen unfortunately the new PON does not make up for the precision loss introduced in v4.1.9.0.
Any chance to get this issue fixed? With Mutect3 not being available and v4.1.8.1 being affected by the log4j vulnerability, it is quite regrettable to be stuck with inferior precision. Extended methods, code, and data to reproduce the issue are here: |
Thanks a lot, @ddrichel. Given that the current Mutect2 release is still broken on both tumor-normal and tumor-only WES data, and a downgrade is not possible on production systems due to the log4j vulnerability: is there any path forward for users who care about both accuracy and security, @davidbenjamin and @droazen? I fear waiting for Mutect3 isn't an option, since even when it is finished there won't be independent benchmarks available for it for quite a while. Also, I suspect that (like any other software product) the new version will have bugs, too, until it has matured in production. Therefore I'd suggest that identifying, understanding, and fixing the bug in the current Mutect2 release would be the wisest path forward - do you agree? |
@schelhorn @ddrichel Thanks for evaluating the new PON, and sorry that it didn't resolve the precision issue! I agree that in an ideal world where grant-imposed deadlines didn't exist, and we had unlimited developer resources, fixing the issue in M2 would be the best path forward. Since we unfortunately don't live in that world, and are unlikely to have developer bandwidth to work on this issue in the near future, let me suggest an alternate path: Since you are satisfied with the output of 4.1.8.1, and are only prevented from running that version by the log4j issue, I think your best option for now is to run a build of 4.1.8.1 with the log4j vulnerability patched out. This is very simple to create, and just involves changing the log4j version in our build file and rebuilding GATK. If you'd like to pursue this option, we'd be happy to create such a build for you, or provide instructions for creating it yourself if you prefer. |
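In case it helps others in the same position, a rough outline of such a patched build (the exact way the log4j version is declared in build.gradle may differ at the 4.1.8.1 tag, so treat this as a sketch rather than official instructions):

```bash
git clone https://github.com/broadinstitute/gatk.git
cd gatk
git checkout 4.1.8.1
# Edit build.gradle to bump the org.apache.logging.log4j dependencies to a fixed
# release (2.17.1 or later addresses CVE-2021-44228 and its follow-ups), then rebuild:
./gradlew bundle    # produces a gatk-<version>.zip under build/
```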
Thanks, @droazen. While I understand the effects of the funding landscape on academic resources, it seems to me this is a full capitulation of the GATK developer team given a serious bug, especially in light of the fact that the team seems to have enough resources to continue working on Mutect3. Mutect2 has been one of the best-performing variant callers of the last years and is a major reason for the Broad's good reputation in the oncology bioinformatics field. GATK and Mutect2 are used by hundreds of institutions in clinical practice, affecting thousands of real patients' lives. Almost all of these institutions are likely to use clinical WES assays for cost reasons and will thus have been directly affected by this issue for the last three years. Also, almost all of these institutions will never learn of this bug, since they likely trusted the developers to have proper functional regression tests in place. If this is indeed the best the Broad can do as an institution, then I will take your offer of providing a build of Mutect2 4.1.8.1 with the log4j vulnerability patched out - thank you. The one thing that I am asking for in addition (for the sake of the overall oncology bioinformatics community), however, is that you make a best effort to notify organizations (universities, hospitals, and biotechs/pharmaceuticals that you know are using Mutect2) and best-practice workflow owners (Nextflow, Snakemake, WDL, CWL etc. that include Mutect2) of the forced downgrade. Also, I think it makes sense to include a very prominent warning in the Mutect2 READMEs and the GATK best-practices documentation and guides. I know that this is work, too, but with success comes responsibility, and I can only hope that providing proper warnings uses less developer bandwidth than applying binary search to find out which of these 10 commits between 4.1.8.1 and 4.1.9.0 that touch variant filtering (see below) broke your flagship product enough to abandon it. (For anyone looking at this issue later, these are the commits I think are most likely to be related to this issue, and which I would propose to systematically leave out of the 4.1.9.0 build to test whether variant calling specificity is restored; assuming the 10 commits are independent and leaving each out in turn produces a working build, this would mean producing 10 Mutect2 builds for functional regression testing (the latter of which @ddrichel could do if we receive the 10 builds from the GATK team)): |
@droazen @schelhorn Let me try to identify the problematic commit with git bisect. I tried to build commit-specific versions of GATK, seems to work. |
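In case the setup is useful to others, the bisection can be driven roughly like this (a sketch; the evaluation step stands in for the Mutect2 + FilterMutectCalls + som.py benchmark described above):

```bash
cd gatk
git bisect start
git bisect bad  4.1.9.0    # first release showing the precision drop
git bisect good 4.1.8.1    # last known-good release
# For each commit git bisect checks out:
./gradlew bundle           # build that commit
# ...run the HCC1395 benchmark with the resulting build...
git bisect good            # or `git bisect bad`, depending on the measured precision
# When the culprit commit is reported:
git bisect reset
```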
Done, the problematic commit was a304725. Below are the results with the versions suggested by git bisect. As stated in pull request #6821, the change was evaluated with the DREAM3 benchmark and made sense at the time. To be fair, in the HCC1395 benchmark there is still a gain in recall (<2%, see the first figure in my original report above), which I assume to be mostly due to this commit, but the recall/precision tradeoff looks different now with the better benchmark callset. I have not yet figured out what exactly the code in the commit does; maybe we can fine-tune the changes, @davidbenjamin? |
@schelhorn / @ddrichel: I've tagged a patched version of 4.1.8.1 with the log4j fix applied: https://github.com/broadinstitute/gatk/releases/tag/4.1.8.1.log4jfix A build of this version is available here: https://github.com/broadinstitute/gatk/releases/download/4.1.8.1/gatk-4.1.8.1.log4jfix.zip You may also build it yourself by running I've forwarded your In the meantime, if you wish to propose a patch to |
Thanks, @droazen, for the patched 4.1.8.1; the effort is appreciated. FYI I ran an experiment by setting In v4.4.0.0, for SNVs, this results in a ~2% loss of recall and a striking ~13% gain in precision (from 0.8286 to 0.9609), a ~4% improvement over v4.1.8.1. |
@ddrichel It is regrettable that the tradeoff between precision and recall wasn't beneficial in your data, but in other data it has been. I can't call it a bug because mathematically speaking it is an improvement. At worst it can be considered an arbitrary movement along the ROC curve. Regardless, we have always developed Mutect2 and its successor under the principle that theoretical soundness is the best long-term policy. That is not to deny or disregard that some changes harm performance on some data. There is a chance that adjusting the |
@davidbenjamin the benchmark data is still not ours; it is the current somatic "gold standard" HCC1395 from https://www.nature.com/articles/s41587-021-00993-6, based on data from multiple WGS sequencing runs with combined 1,500x coverage, etc. Apart from that, we are on the same page here. Now that the change leading to the apparent differences in performance has been found, and holds up to scrutiny so far (thanks for looking into it), it is a more real possibility than ever that this is not a bug but a case of Mutect2 outperforming the "gold standard". Let's keep the issue open for now; I am still investigating and would like to have a place to report my progress. |
Hey @davidbenjamin, just a heads-up that we have generated some additional insights on this issue that relate to the effects of sample prep on false-positive rates under the changed error model. In short, it seems as if the error model is now much more (perhaps overly?) sensitive to sample-prep noise typical of real-world clinical sequencing data, leading to a 2-3x increase in false positives in such WES (and, presumably, other high-coverage) samples. This may have been an undesired side effect of overfitting the error model to the synthetic (and thus much cleaner) DREAM3 benchmark two years ago. We are presenting the results today at ISMB in Lyon, the world bioinformatics conference. Let me know if any of you guys are participating so that we can have a chat over a glass of wine, if you like! |
Dear @davidbenjamin and @droazen, please find here the poster that we presented this Monday at ISMB 2023 in Lyon. It sheds some additional light on this issue and explains why we believe that the fine-tuning of the error model introduced in Mutect2 4.1.9.0 led to overcalling of variants in samples with increased (but not pathologic) DNA degradation. Presentation of the poster led to several interesting discussions with other users of Mutect2 at the conference. Given these results, I would propose that you expose the parameter you newly set in commit a304725. Would that be an acceptable compromise? |
@schelhorn Exposing this parameter (perhaps as an |
@schelhorn @droazen That sounds extremely reasonable to me! |
@ldgauthier has volunteered to open a PR to expose this parameter -- should be part of the next GATK release |
We noticed a strong drop in precision for SNVs (~10% for tumor-normal mode, ~20% in tumor-only mode) between releases 4.1.7.0 and 4.2.6.0.
With more testing using the HCC1395 somatic benchmark (https://pubmed.ncbi.nlm.nih.gov/34504347/) and sequencing data provided by the Somatic Mutation Working Group (Fudan University WES tumor-normal data set, 2 x 100x coverage), the drop in performance can be traced to changes between 4.1.8.1 and 4.1.9.0. Here are the performance metrics for selected GATK releases:
The calling was done with essentially default parameters:
```bash
tools/gatk-${version}/gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed --interval-padding 0 --input $inbam_t --input $inbam_n --reference $ref
tools/gatk-${version}/gatk FilterMutectCalls --output ${outvcf%.vcf}_filtered.vcf --variant $outvcf --intervals $wesbed --reference $ref --stats ${outvcf}.stats --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0
```
som.py was used for calculating performance metrics.
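For completeness, a som.py invocation of the kind used for the metrics (a sketch assuming the Illumina hap.py toolkit; truth and region file names are placeholders, and the region flag should be checked against `som.py --help` for your hap.py version):

```bash
# Compare filtered Mutect2 calls against the SEQC2 truth set, restricted to the
# high-confidence regions; -o sets the prefix of the output metrics tables.
som.py truth_sSNV_sINDEL_v1.2.vcf.gz "${outvcf%.vcf}_filtered.vcf" \
  -r "$ref" -R High-Confidence_Regions_v1.2.bed -o "som_gatk_${version}"
```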
Curiously, we do not observe such a substantial drop in precision in WGS data, in either tumor-only or tumor-normal mode.
In the following, our "v04" corresponds to GATK 4.1.7.0 and our "v05" corresponds to GATK 4.2.6.0:
Tumor-normal:
Tumor-only:
In my opinion, the small gains in recall between 4.1.8.1 and 4.1.9.0 do not justify the drop in precision. This, together with the fact that WES data is affected while WGS data is not, suggests that this might be a bug rather than a feature.
Any suggestions on how to get the WES precision back to v4.1.8.1 levels are appreciated.
Thanks in advance,
Dmitriy