Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module Order of DeDup Qualimap DamageProfiler #227

Closed
ktmeaton opened this issue Jun 27, 2019 · 9 comments
Closed

Module Order of DeDup Qualimap DamageProfiler #227

ktmeaton opened this issue Jun 27, 2019 · 9 comments
Assignees
Labels
easier-tasks enhancement New feature or request good first issue Good for newcomers

Comments

@ktmeaton
Copy link

Describe the Question
Hi Eager Team, this isn't a bug so much as it is a question. I'm curious about why the qualimap and damageprofiler modules run on the output of samtools_filter rather than dedup? Perhaps it's a matter of personal preference, but I'd rather my coverage/depth estimates and allele frequency/substitution rates be calculated independent of PCR duplicates. For depth, it's to avoid overinflation of confidence and for damage calculation, it's to avoid duplicate molecules which are not independent allele observations. However, I'm not very familiar with either of these particular tools so any clarity you can provide is greatly appreciated!

Test Data
An ancient mitochondrial genome enrichment (low abundance, high duplication). By default, the pipeline output reports I have a ~1800x genome with very messy damage signatures. This sample was chosen because it's an extreme example to highlight the difference.

"Expected" behavior
Again, perhaps inexperience since I'm not familiar with these modules. I'd rather these run on the dedup output which reports that I have a ~18X genome with the expected terminal damage to match my library prep method.

To Reproduce
I attached my pipeline command here:
command.txt
Running revision: ace20a0 [dev]

DamageProfiler Comparison
Default EAGER Pipeline: DamagePlot_Default.pdf
Plotted from DeDup Output: DamagePlot_DeDup.pdf

Qualimap Comparison
Default EAGER Pipeline:
Qualimap_Default
Plotted from DeDup Output:
Qualimap_DeDup

Additional context
Runlog: nextflow.log

Thank you!

@jfy133
Copy link
Member

jfy133 commented Jun 27, 2019

That's a very good question as I would also have the same logic. @apeltzer would be best to answer this!

@apeltzer apeltzer self-assigned this Jun 27, 2019
@apeltzer apeltzer added easier-tasks enhancement New feature or request good first issue Good for newcomers labels Jun 27, 2019
@apeltzer
Copy link
Member

Actually, that is something that we had differently in EAGERv1 but I agree that it probably makes more sense to compute this AFTER dedup for the reasons mentioned above by @ktmeaton . Wouldn't be a huge change to do so, we would, however, have to have a switch when people turn off dedup that then the non-deduped output will be used for qualimap/damageprofiler nevertheless.

If that is something we anyways tend to agree upon, I guess I could draft a PR for this.

@jfy133
Copy link
Member

jfy133 commented Aug 12, 2019

From the technical side: I'm writing a GATK genotyping section, and I encountered a similar issue that maybe worth considering here.

How would the switch work, without duplicating processes (and thus code)? AFAIK/can tell you can't have optional/conditional input channels. So how would you tell a given process to use non-dedup vs dedup output?

@apeltzer
Copy link
Member

apeltzer commented Sep 4, 2019

You can assume that you have either non-dedup or dedupped files that you want to take downstream for analysis, there are some options available using Channel Operators in Nextflow that we can apply/use here I believe. Could even be a simple mix as turning off dedup will end up having an empty channel for dedup output, whereas the other channel will contain data so mixing these won't harm at all and we can use the "mixed channel" as input for downstream analysis for example. Thats basically no duplication of any code..

@jfy133
Copy link
Member

jfy133 commented Sep 4, 2019

The issue with your second suggestion is if you're mixing you would be running e.g. damage profiler on both pre-dedup and post-dedup. How would you exclude the pre-dedup channel files? Which is my main question basically.

@jfy133
Copy link
Member

jfy133 commented Sep 14, 2019

An attempt to address this is: #236

@ktmeaton if you're feeling adventurous, could you try it out to check it does what you want? run the nextflow command with nextflow run jfy133/eager -r downstream_switch [...]

EDIT: please hold off on this for the moment, Alex pointed out forcing everything through the bam filtering process is not nice computational resource-wise when someone doesn't want to run that step. I'm trying to re-write that now.

@apeltzer apeltzer added this to the V2.1 "Ulm" milestone Sep 16, 2019
@ktmeaton
Copy link
Author

@jfy133 Thanks for the update! I'm happy to help test out the new implementation, but I will hold off for now as it's rewritten.

@jfy133 jfy133 mentioned this issue Sep 21, 2019
25 tasks
@jfy133 jfy133 self-assigned this Sep 21, 2019
@jfy133
Copy link
Member

jfy133 commented Sep 21, 2019

@ktmeaton OK done. I've already done extensive tests my self, but it would be great if someone else independently does this as well in case i've missed a certain use-test case.

My branch is here:

https://github.com/jfy133/eager/tree/skip_logic

@jfy133
Copy link
Member

jfy133 commented Oct 2, 2019

@ktmeaton this has now been integrated into the dev branch with #242

In principle, by default if you turn on Deduplication, those bams will passed into all downstream processes.

I'll close this for now - but feel free to re-open (or comment again) if you have issues.

@jfy133 jfy133 closed this as completed Oct 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
easier-tasks enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants