Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools mpileup behaves differently for cigar M vs =/X #1597

Closed
marcelm opened this issue Mar 31, 2023 · 1 comment · Fixed by #1607 or samtools/bcftools#1914
Closed

bcftools mpileup behaves differently for cigar M vs =/X #1597

marcelm opened this issue Mar 31, 2023 · 1 comment · Fixed by #1607 or samtools/bcftools#1914
Assignees

Comments

@marcelm
Copy link

marcelm commented Mar 31, 2023

I’m working on the read mapper strobealign. While investigating some false positive SNVs that were called on strobealign’s output but not on BWA-MEM’s, I noticed that it matters whether the CIGAR field in the SAM output uses the = and X operations or M to encode alignments. (Strobealign uses =/X.)

Here is a small example. The rightmost SNV at position 96305 is the relevant one. The bottommost read has CIGAR 96=1X40=1X11=1X, and when I change this to 150M, the SNV disappears.
igv

Corresponding BAM file: cigar-m-vs-eqx.gz (please change extension to .bam). I was not able to make this any shorter: As soon as I remove any read, the SNV disappears.

The reference is GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.

These commands will reproduce the problem:

bcftools mpileup --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna cigar-m-vs-eqx.bam | grep 96305
samtools view -h cigar-m-vs-eqx.bam | sed 's|96=1X40=1X11=1X|150M|' | samtools view --write-index -o m.bam
bcftools mpileup --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna m.bam | grep 96305

Output is

chr20   96305   .       C       A,<*>   0       .       DP=1;I16=0,0,0,1,0,0,40,1600,0,0,60,3600,0,0,0,0;QS=0,1,0;SGB=-0.379885;MQ0F=0  PL      40,3,0,40,3,40

vs

chr20   96305   .       C       <*>     0       .       DP=1;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0  PL      0,0,0

I don’t have an opinion on which of the above is correct, but I would assume the intention is for it to be consistent.

@whitwham
Copy link
Contributor

whitwham commented Apr 12, 2023

So here is something interesting.

The vcf output for the CIGARs 150M, 150= and 150X give the same result.
Also 96=1X40=1X11=1X and 96M1X40M1X11M1X and 149=1M give the same results as each other.

@jkbonfield noticed that 96=1X40=1X11=1X and 150M give the same result if BAQ is disabled (-B).

I need to dig deeper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants