Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distinct minimizer heatmap for Kraken2 #1380

Merged
merged 13 commits into from
Mar 5, 2021
Merged

Conversation

maxibor
Copy link
Contributor

@maxibor maxibor commented Feb 23, 2021

Since version 2.1.0, Kraken2 integrated the distinct minimizer count ability of KrakenUniq
This new function comes with a slightly altered report (see Kraken2 docs) format.

This PR adds a heatmap of the duplication rate of the minimizers (i.e. kmers) which is very useful to detect read stacking, and false positive hits.

  • When encountering a "old" report, it behaves as before
  • When encountering a "new report" (with distinct minimizer count) it adds the duplication heatmap.

PR checklist:

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md has been updated

@maxibor
Copy link
Contributor Author

maxibor commented Feb 23, 2021

Screenshot 2021-02-23 at 18 29 58

@maxibor
Copy link
Contributor Author

maxibor commented Feb 23, 2021

I didn't see it before, but this should solve #1379

@ewels
Copy link
Member

ewels commented Feb 24, 2021

I could have sworn I added Black CI to MultiQC, but I can't see it now. Could you please run black on your code?

Also, please some test data to https://github.com/ewels/MultiQC_TestData/tree/master/data/modules/kraken for this.

Otherwise, LGTM from quick read of the code 👍🏻 Will review again once we have test data.

@aeu79
Copy link

aeu79 commented Feb 24, 2021

@maxibor Testing with my data, nice plots! (now I need to figure them out)

BTW, the unclassified disappeared from the top 5 taxa plot. Was it intended?
image
image
And that is, of course, is changing the percentages.

@aeu79
Copy link

aeu79 commented Feb 24, 2021

Another issue is that it crashes when the reports are mixed (I mean, std and with minimizers in the same folder). Sorry, I was being picky with my tests for #1379 😄

@maxibor
Copy link
Contributor Author

maxibor commented Feb 24, 2021

@aeu79 adc7087 should fix these issues

  • The unclassified reads were brought back thanks to a f.seek(0) (reading the file to check the 'log version' was consuming the first line)
  • The mixed reports are now nicely handled

Screenshot 2021-02-24 at 14 35 13

@ewels This is now black formatted

@maxibor
Copy link
Contributor Author

maxibor commented Feb 24, 2021

I've added test data with this PR MultiQC/test-data#187

@aeu79
Copy link

aeu79 commented Feb 24, 2021

Guauu, that was fast, congratulations! Looks great:
image
I reached to the same conclusion by comparing the reads and unique minimizers:
image
(the one in red from your plot doesn't even show up in the top 50 taxa when I order them by minimizers)

BTW, with many files I'm getting this warning (only with your version):

version:  1.10.dev0 (adc7087)
All std
[INFO   ]         multiqc : This is MultiQC v1.10.dev0 (adc7087)
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /home/agu/Desktop/testMQC
[INFO   ]          kraken : Found 104 reports
/home/agu/GitHub/multiqc/MultiQC/multiqc/plots/bargraph.py:486: UserWarning: FixedFormatter should only be used together with FixedLocator
  axes.set_xticklabels(["{:.0f}%".format(x) for x in vals])
[WARNING]          kraken : Kraken2 reports of different versions were found

@ewels ewels mentioned this pull request Mar 4, 2021
11 tasks
@ewels
Copy link
Member

ewels commented Mar 4, 2021

Merge conflicts after merging #1347 - will take a look now.

@ewels
Copy link
Member

ewels commented Mar 4, 2021

ok, conflicts fixed and code working. I can see that the TestData sample metagenome.kreport now has 100% Unclassified reads in the General Stats table (instead of empty cells) and that it's reads are now Other instead of Unclassified for some of the levels in the Top Taxa plot. But the new heatmap looks kind of uninspiring for me 😓

Screenshot 2021-03-04 at 22 39 22

  • Is this just that the test data is not super useful or am I doing something wrong?
  • If it's possible to generate an empty plot like this, can we check for that and skip it if not useful?
  • If it is the test data, please could we add some more interesting test data into the TestData repo?

Thanks all!

Phil

@ewels
Copy link
Member

ewels commented Mar 4, 2021

BTW, with many files I'm getting this warning (only with your version):

version:  1.10.dev0 (adc7087)
All std
[INFO   ]         multiqc : This is MultiQC v1.10.dev0 (adc7087)
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /home/agu/Desktop/testMQC
[INFO   ]          kraken : Found 104 reports
/home/agu/GitHub/multiqc/MultiQC/multiqc/plots/bargraph.py:486: UserWarning: FixedFormatter should only be used together with FixedLocator
  axes.set_xticklabels(["{:.0f}%".format(x) for x in vals])
[WARNING]          kraken : Kraken2 reports of different versions were found

@aeu79 - this warning was unrelated to the changes in this PR. It's a warning from matplotlib due to the core MultiQC code. It's only just started appearing due to an update in their end. I suspect it only showed for you here because you were running with more samples, triggering MultiQC to generate flat image bargraphs instead of interactive.

Anyway, it should now be fixed in master in de0f01b so you should hopefully not see it again. Thanks for reporting 👍🏻

@maxibor
Copy link
Contributor Author

maxibor commented Mar 5, 2021

@ewels
Copy link
Member

ewels commented Mar 5, 2021

Thanks @maxibor! I brought back my check to avoid repeated noticed about mixed reports, which had gotten lost.

I'm running tests on the data in MultiQC_TestData > data/modules/kraken and after your update the report now crashes with a JavaScript error (fairly difficult to do usually!).

I'll take a look now, but generally speaking it's good to also try running on just this data as then you'll come across these same issues as me and can probably save some time 😉

@ewels
Copy link
Member

ewels commented Mar 5, 2021

Ok added an extra loop in c2faaeb to strip out any heatmap samples with zero values. This fixes the report for me and I now get a heatmap with two cells. I also tweaked the heatmap config as I figure it doesn't need to be square (the x- and y-categories are not the same) and I also specified that the xcats are not sample names, so that they are not affected by hiding / renaming / highlighting in the report.

kraken-topfive-duplication_plot

I'm happy with this so will merge as soon as the tests are passing - thank you! 👍🏻 (especially for speedy response after the long delay).

Please check that you're ok with my changes and let me know ASAP if not as we can fix in a follow-up PR.

Phil

@ewels
Copy link
Member

ewels commented Mar 5, 2021

@aeu79 - note that your updated regex from #1383 is now in this PR also, so should hopefully address your comment: #1383 (comment)

Shout if there's anything missing still.

Phil

@ewels ewels merged commit 9086510 into MultiQC:master Mar 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants