Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chromosome label inconsistencies in --cytosine_report #88

Closed
jeffbhasin opened this issue Oct 2, 2019 · 2 comments
Closed

Chromosome label inconsistencies in --cytosine_report #88

jeffbhasin opened this issue Oct 2, 2019 · 2 comments

Comments

@jeffbhasin
Copy link

Hello Authors,
We have found that running MethylDackel for certain genomes (hg38 is one example) seems to produce inconsistencies with the chromosomes reported in the --cytosine_report option. We don't observe this behavior when using data aligned to hg19.

We attempted to sort the cytosine reports for each sample by chromosome and position, and then join the data into a matrix by rows. In this process, we checked if the chromosome names matched, and found that they did not. We then made a count table for the frequency of each chromosome name appearing in the first column of each cytosine report for each sample and found that some chromosome names never appeared in some samples and some chromosomes had different counts in different samples. This is in spite of each cytosine report file from MethylDackel having the same number of rows.

We then called methylation using bismark_methylation_extractor and produced cytosine_report format files using that tool. In this case, we find that chromosome names and positions match perfectly, after sorting, among all samples. As a result, we suspect MethylDackel is mislabeling or mis-assigning chromosome names in the cytosine_report format for hg38 data.

We have reproduced this issue using a set of publicly available RRBS data to enable the authors to re-create our exact observations on the same dataset, if necessary. The dataset we used were the 3 RRBS samples from GEO accession GSE55159. We aligned the data to hg38 (from UCSC) using Bismark v0.22.1 (using Bowtie2 v2.3.5) and called methylation using MethylDackel 0.4.0 (using HTSlib version 1.9). Some examples of the issue in this dataset are chr15 (inconsistent counts of 1819980, 1779406, and 1819980 occurrences among the 3 samples in the cytosine_reports) and chrUn_KI270589v1 (occurs 0, 5912, and 2664 across the 3 samples). Some chr are not effected such as chrX (2645418 occurrences in all 3 samples). All 3 files have the same number of total rows (61959486). In the bismark_methylation_extractor output, computing the same frequencies, we find all chromosome names have the same frequency in all samples.

We would be happy to provide any of our alignments, cytosine reports, or intermediate files if that would be helpful. Thank you for helping us with this issue. We have found MethylDackel to be very useful software and hope this issue could be fixed.

Best Regards,
Jeff

@dpryan79
Copy link
Owner

I must have overlooked this when you reported it, thanks for that and also for mentioning public datasets. I'm able to fully reproduce this issue locally and I agree that for some reason the chromosome is sometimes simply wrong in the --cytosine_report output. I'll investigate exactly how this is happening.

@dpryan79
Copy link
Owner

While the fix for this turned out to be quite simple, I don't want to mention how long it took to track down what was actually going wrong :p

This is now fixed in the next release. In short, this problem only occurred when there were very large (>=megabase) regions of no coverage and the --cytosine_report option was used. In this case each worker thread had to write out a large number of empty (0 counts) entries, but it did so using whatever the last chromosome name it had processed was. So the position, counts and I think even context were correct, just the chromosome name was wrong. The fix was just changing 2 lines of code, but I definitely only considered WGBS and not RRBS when I wrote that code.

Thanks again for reporting the issue, my goal is to push out a new release that fixes this today or tomorrow at the latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants