Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNPs should be proportional to graph height, not just linear scale #2838

Closed
gringer opened this issue Mar 21, 2022 · 13 comments · Fixed by #2839
Closed

SNPs should be proportional to graph height, not just linear scale #2838

gringer opened this issue Mar 21, 2022 · 13 comments · Fixed by #2839

Comments

@gringer
Copy link

gringer commented Mar 21, 2022

I like that SNPs are represented on a linear scale even on log plots (because, for example, it's really confusing to see a SNP at 50% frequency that is almost at the top of a graph), but the total graph heights should match. This does not seem to be the case with the current view:

// note: the snps are drawn on linear scale even if the data is drawn in log
// scape hence the two different scales being used

Image to demonstrate the issue, above is linear scale, below is log scale; the 'A' variant should occupy 46% of the vertical height, but it only occupies about 33% of the height in the log scale:

Screen Shot 2022-03-22 at 11 26 21

@cmdcolin
Copy link
Collaborator

thanks for reporting this, I think you're right about this. there was a little fancy programming that didnt have the right result here but i think coding it more straightforwardly as a linear drawing that is percentage of the score at that position will work

@cmdcolin
Copy link
Collaborator

This is on the main branch now if you want to test out! jbrowse create --nightly newinstance

@gringer
Copy link
Author

gringer commented Mar 23, 2022

Yep, this seems like it's working well for me so far. I've noticed that the chrM-specific remapping of the reads has changed the variant frequency, which I'll probably need to investigate at a later date.

Screen Shot 2022-03-23 at 12 39 20

@cmdcolin
Copy link
Collaborator

Hmm...I tried out the sample data you posted in #2833 but it seemed to be the same at that position (chrM:9347) in both v1.6.6 and main branch. It would definitely be concerning if there is a different in the frequencies. Could it be a different file?

@gringer
Copy link
Author

gringer commented Mar 23, 2022

Yeah, the first image is from a different file, which was a whole-genome mapping. I subset the reads to chrM using samtools sort and samtools fastq, then remapped them to just chrM (so I could reduce the reference genome size), assuming that the mapping would be the same... but it wasn't.

I've just now updated the BAM file to be a filtered BAM file with a modified header, without remapping, and it's recovered the apparent heteroplasmy:

Screen Shot 2022-03-23 at 19 57 44

@gringer
Copy link
Author

gringer commented Mar 23, 2022

Okay, I think I see what's happening here: JBrowse2 is including secondary alignments when calculating total coverage, but ignoring them when calculating variant frequency (I recall mention of something like this when looking at the code). This will be happening because there are other genomic regions that contain a portion of the mitochondrial genome, and it's essentially random whether the mitochondrial genome or that other location is considered the primary alignment.

Here's what reads within this region look with Tablet; you can see about half the reads mapped to the region have '???'s (because there's no asssociated sequence in the BAM file):

Screen Shot 2022-03-23 at 20 28 54

@cmdcolin
Copy link
Collaborator

The default set of flags that are filtered for both Pileup and SNPCoverage is "1540" which should retain secondary alignments https://broadinstitute.github.io/picard/explain-flags.html If interested feel free to make an issue for this. Also got inspired by the tablet screenshot and made a per-base viewer here! #2847

@gringer
Copy link
Author

gringer commented Mar 24, 2022

Yes, but the problem is that bases aren't known for these secondary reads. The default output for minimap2 is to set the sequence and quality fields to '*' when the sequence for a secondary read is present elsewhere in the BAM file, and the CIGAR sequence only reports if there is no change in sequence length at that location. In this case I had filtered to only include chrM mappings, so those sequences aren't present anywhere in the BAM file: there's no way to get access to the base values for that read at the target location. Here's an example showing the text representation, with a few secondary reads highlighted:

Screen Shot 2022-03-24 at 13 06 57

@gringer
Copy link
Author

gringer commented Mar 24, 2022

Here's a plot showing the problem visually. These plots are showing identical reads, but in the top case it includes secondary reads with '*' as the sequence, whereas in the bottom case the reads have been remapped specifically to chrM and all have sequence in the BAM files:

Screen Shot 2022-03-24 at 14 34 30

@cmdcolin
Copy link
Collaborator

If the sequence is not available (on the secondary reads) and there is no MD tag, there is probably no way that it could call the SNPs on the secondary reads. JBrowse either uses MD tags or compares the reference sequence with the read sequence to get SNPs! There might be some option with either calmd or the aligner to get these back though

Thanks for all the detailed feedback also!

@cmdcolin
Copy link
Collaborator

(I have seen in some cases the sequence is preserved on secondary alignments in other BAM files, but wouldn't know the options to get that)

@gringer
Copy link
Author

gringer commented Mar 24, 2022

This is an approach decided on by the mapper (in this case, minimap2); preserving sequence for secondary alignments is basically a #WONTFIX for minimap2, despite the obvious benefit of it in alignment visualisation tools:

lh3/minimap2#458 (comment)

@cmdcolin
Copy link
Collaborator

cmdcolin commented Sep 11, 2022

@gringer I made a little program to try to add SEQ and QUAL fields to minimap2 secondary alignments in BAM/CRAM as this problem always bugged me. might be of interest :)

https://github.com/cmdcolin/secondary_rewriter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants