-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[E::sam_parse1] CIGAR and query sequence are of different length #245
Comments
I may also mention that the speed difference between the three versions on this 2*250bp dataset is marginal. Pasted below. I am starting to believe the v0.7.1 time measures of the NAM sorting is wrong. Commit eb7ef72 (installed as newest instructions with cmake, i.e.
v0.8.0 (cmake)
v0.7.1
|
Perhaps related, I also get a segmentation fault error in bcftools mpileup for SIM250 dataset (simulated 2*250bp reads)
command:
Interestingly, On the simulated 2*250bp reads, the time difference in alignments are a lot larger almost 1.4x speedup: v0.7.1
Commit eb7ef72
|
Final note, the cigar error is also present in the BIO150 dataset. Run as:
|
I was able to reproduce this. Both the warning message ("The alignment path of one pair ...") and the inconsistent CIGAR come from some changes made in SSW. I updated our copy of SSW as part of PR #242. Let me see what can be done about this. Perhaps I should just go back to the older SSW version. |
Okay great. I guess SSW is still used in the rescue alignment step. Didn't think of that. |
Yes, it’s still used there. I can probably replace it with ksw2 as well. |
Should be fixed now, see my comment in #242. |
Okay I will start a benchmark of commit edeb390 |
The EXMAPLE COMMAND BCFTOOLS
The four strobealign bam files are found at ERROR:
the error happens after 30s, 1m20s, 1m30s, and 14m in the four files it it is of any help. |
I want to make sure I test this with the same bcftools version as you. Did you just use |
I was able to reproduce the problem with bcftools 1.14, 1.15 and 1.16. |
Ah, this is samtools/bcftools#1805, fixed in bcftools 1.17. I was able to reproduce the problem with a SAM file that looks (shortened) like this:
The description in the issue isn’t clear about what the actual problem is, so I can only guess that the long deletion has something to do with it. Having a CIGAR end with deletions doesn’t make much sense so maybe we should fix that on our side as well. (It does not appear to be the direct cause, though, since there are more alignments with deletions at the end preceding the above records.) It is a result of the NAM being longer on the reference than it should be. Perhaps this is another of these "edge effects", but for the 3' end this time? |
Yes, probably. At least it matches my intuition. So we have the STR edge effects both in 5' and 3' ends. Since alignments seem to always be left-shifted, this results in an internal deletion in the 5' case. I am wondering why it does not become a softclipp in the 3', and I am in general confused about the cigar string |
I updated to use bcftools v1.17.
Below are the results I currently have comparing v0.7.1, v0.8.0, and commit edeb390 (called main in the table below). Sorry for the current formatting. Main takaways:
I want to wait until the simulated data is in hand first before making strong conclusions, since the biological indel calling analysis may not be 100% accurate. I simply calculated the intersection (
Edit: |
Well, lo and behold, I updated the table above with the SNV and indel calling on the SIM250 data. INDEL calling results with main are so much better than previous versions of strobealign as well as other aligners (see fig from paper below, >10%!) that I am skeptical. I did not change anything in the analysis (same datasets) but previous results were run with bcftools 1.15 and main is now using v1.17 due to above described bug. I looked at the changelog of v1.16 and v1.17 and nothing suggests a change in calling algorithm in I do not have any insights at the moment. As you have been woking on the new extension level alignments @marcelm, do you have an idea of what could explain these results? Would better softclipping decisions improve the results this drastically? Variants in STR regions? (the simulations and biodata are using hg38, not chm13) |
As for runtime, here are some metrics. Speedgain on simulated data is substantial(!) for edeb390 but it doesn't seem translate as well to biological data.
|
I also added SIM150 results to the table above. The recall on this dataset has gone up by nearly 15 percentage points.. (Previous mplieup that error I reported was likely an uppmax diskIO related as a restart finished without errors) |
"bcftools call" rarely changes. It's mostly mpileup where improvements occur, which adds the tags used by call. It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at samtools/bcftools#1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering. We also reversed a slow down that had accidentally incorporated some experimental code with a major performance cost. But that's just speed of running bcftools rather than results changing I think. It's also possible some other changes had the unintented (but apparently good!) effect of improving calling. [The primary author doesn't usually run large truth set comparisons to assess the impact of code changes, so wouldn't spot such things.] Edit: there's also the open PR of samtools/bcftools#1679 which hugely improves indel calling further (between 10-30% fewer indel FP and 5-10% fewer indel FN), but it's been rejected in favour of a major rewrite I think. This ongoing rewrite is visible in 1.17 as |
Thanks you for the feedback @jkbonfield ! I am currently rerunning all the bcftools steps (mpileup, call, sort, index, isec) on all strobealign versions as well as BWA MEM. Will update once I have those results. |
So I reran bwa mem as well as strobealign v0.7.1, 'v0.8.0' (which is really commit 8e53e41) and 'main' (which is really commit edeb390). This rerun triggers all the bcftools steps using v1.17. Results below. In summary: commit edeb390 is so much better at indel recall (15 and 17 percent points for 150PE and 250PE) than previous versions that it is nearly unbelievable. I verified that all files are updated looking at dates etc. I hope I have not made a silly mistake elsewhere. I tried comparing cigar strings but realised that my pipeline discards the SAM files once sorted bams are created, so that prevents a quick comparison of differing lines on my end. @marcelm, do you have any ideas why we see such huge improvements in indel recall, or if not, could you look a bit into this? (one thing I saw changed was that we trim away
|
I don’t know why this would differ so much either. As a first thing, I’d try to open the BAM files in IGV and look at some regions with differently called indels. I can do this next week. |
One idea: the TP and FP calls are simply derived as However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted one time for each overlapping indel call with isec. Maybe One thing that lead me to this suspicion is that the BIO results are as follows, suggesting lower precision in indel calls. Although in BIO data ground truth is even more difficult to decipher because some larger indels are present in ground truth which should probably be filtered out (they may create spurious TP).
|
I took a look at the true positive predictions for indels (produced with Below I have pasted the first 50 calls classified as true positives by Rows with stars indicate calls that are not found by bwa mem and strobealign 0.8.0. All these calls match perfectly a simulated variant in the truth set (pasted furthest down). (Also, I added results for bwa mem in the table above. No significant change with bcftools v1.17) STROBEALIGN MAIN
BWA MEM
STROBEALIGN V0.8.0
SIMULATED VARIANTS
|
Since I’m bad at reading raw CSVs, here are the result tables formatted in a nicer way. SNVs
Indels
I need to reproduce your results on a small portion of the data (I’ll probably use a single chromosome and SIM150). First I’ll test whether |
I don't know for sure, but I could guess. What I can say is that I only filtered the GIAB truth call dataset into SNP and Indels using Edit: I realized that this response does not answer your question :) It only states that there is more uncertainty in the TP and FP estimates in the biological datasets. I would assume that the TP, hence Recall, would go up also for BIO, but it drops.. Btw, I am not sure I follow your boldfaced system in the table. |
Just commenting on this until I know more:
Sorry, forgot to write a sentence about that: The bold numbers are those that are somehow "interesting" in that they are different compared to the version preceding it. |
One difference between v0.8.0 and
So v0.8.0 left-aligns indels, but On chrEBV, 15 indels are called from the v0.8.0 BAM, but 20 from the (I keep getting confused by the "main" name because it isn’t main but actually the
Assuming the above observation holds not only on chrEBV, I think that is the issue! |
With #251, where I added a fix to ensure that indels are left-aligned, the alignments for the above example look as they did in v0.8.0. |
Great that you found the(/a) cause for the drastic difference! I am however a bit surprised. In the 50 first indel calls I pasted above for bwa mem main and v0.8.0, there were a lot of calls uniquely detected by main that did not seem to overlap other calls and where also correct calls (8/50 which is roughly 15% - the increase in recall). Was this just a very unlikely coincidence? Did you have a look at them? |
Let me know if you want me to rerun a new commit as 'main' on our SIM and BIO datasets to check if it goes back to previous performance. |
I manually looked at the first variant (chr1:535531 A→AGTGAA) that you marked as "not found by BWA". The alignments covering that variant look fine for all programs (visual inspection in IGV), and the variant is called for all tools. What is different is the way the variant is represented. In the simulation VCF, the variants is written like this:
In the BWA-MEM, strobealign v0.7.1 and strobealign v0.8.0 VCF result files, the variant looks like this in the resulting VCF:
In the strobealign "main" (wfa) results file, it is represented like this:
It’s the same variant in all cases. I think this can be taken care of by running the VCFs through |
Excellent that you tracked down the cause. It looked to good to be true :) I will integrate |
I integrated Since a lot has happened I suggest to simply rerun the benchmark for v0.7.1, v0.8.0, v0.9.0, and commit 8311650
|
I have the results for the benchmark on benchmark for v0.7.1, v0.8.0, v0.9.0, and commit 8311650 (called strobealign_wfa2). Perhaps we should close this one and start a new issue? Anyways, below I have pasted the new results. Some significant remarks:
(runtime is pasted at bottom) Edit: I also checked BWA MEMs results and it typically has higher recall than strobealign and much worse precision (see added lines below, so maybe strobealign_v0.9.0 is 'moving in the right direction' (assuming bwa mem does a good job)?
Points on runtime
|
Re-posting the accuracy table so it’s possibly easier to digest:
|
I evaluated commit b5305a0 which essentially is v0.9.0 with M as cigar strings instead of =/X, which we implemented after discovering that bcftools mpilup behaves differently for M vs =/X as poster here. Results for b5305a0 are given in table below. With M cigar strings:
So it turns out the cigars were responsible for most of the decrease in precision. Overall I think we can close this issue.
|
Ok, then let’s close this. This issue has become a potpourri a various not fully related things anyway. If anything reappears, it’s probably best to open another issue. |
Hi @marcelm,
I am currently benchmarking v0.7.1 vs 0.8.0 vs commit eb7ef72 for the SNP/indel calling and get the below error
Here is the run command
Full output of the strobealign job below. Also note the warnings (that I have not seen before).
The text was updated successfully, but these errors were encountered: