Minimize memory footprint of merge_featurecounts #243
Conversation
@olgabot Can we not just change the process requirement? 300 samples is actually a lot! I think you have to remember that very few people will actually be running the pipeline at this scale - the default parameters are set mainly for the typical use case. This means that you will most likely need to customise them anyway using a custom config.
@drpatelh It's certainly possible to change the memory label. However, plain unix tools can do this with far less memory, as there is no need to do a true join here. Here's an execution report showing that each cleaning operation took ~3 MB of peak memory and the merging operation took 6.6 MB, four orders of magnitude less than the original version's 21 GB of peak RAM: execution_trace_featurecounts-memory.txt
And here's a screenshot from the execution report HTML. I agree that it may be overkill to have completely separate operations for this, but the other alternative is a bash for loop within the merge_featurecounts process.
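For context, the separated cleaning and merging steps referred to here are along these lines; the flags, field indices and sample names below are assumptions for illustration, not the PR's literal commands:

# per-sample cleaning step: keep the gene identifier and its counts column
csvtk cut -t -f 1,7 sampleA.featureCounts.txt > sampleA.clean.txt
# merging step: join the cleaned files on the gene identifier column
csvtk join -t -f 1 sampleA.clean.txt sampleB.clean.txt sampleC.clean.txt > merged_gene_counts.txt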
FYI this is failing due to a bug fixed by #242
👍 Let's get that merged first then so we can get the tests passing on this PR. Will need to add the …
I'll have a better look at this PR tomorrow.
Yes, I agree that it would be better to replace the csvtk command here (Line 1143 in a77dd45).
This will also mean we can remove the csvtk dependency.
Yeah, we replaced the former Python script with csvtk as it offers quite a lot of handy functionality, but we didn't notice this behaviour when running lots of samples. I agree that we should probably drop the dependency and find a single line (or two lines, it shouldn't be a mess - Nextflow handles that nicely) to get things going 👍
Whoo, I think I figured out a unix-fu solution to this!!!

script:
// gene identifier columns taken from the first input file
gene_ids = "<(cut -f1,2 ${input_files[0]})"
// counts column from every input file, each wrapped in a process substitution
counts = input_files.collect{ filename -> "<(cut -f3 ${filename})" }.join(" ")
"""
paste $gene_ids $counts > merged_gene_counts.txt
"""
Ahh shoot, need to remove the header lines. But still excited about this!!
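For illustration, with three hypothetical input files (the sample names here are invented), the interpolated command above would expand to roughly:

paste <(cut -f1,2 sampleA.featureCounts.txt) <(cut -f3 sampleA.featureCounts.txt) <(cut -f3 sampleB.featureCounts.txt) <(cut -f3 sampleC.featureCounts.txt) > merged_gene_counts.txt

Because paste streams its inputs line by line, memory use stays flat regardless of the number of samples; the header lines of each file are carried straight through as well, which is the issue noted above.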
Okay, once you're ready ping me, I'll do something else until then 👍
Okay, the merged files are looking correct now. Before, I was using field 3, which is the gene's chromosomal Start position for each sample (???). Now it looks right, with the correct header and counts!!
So yep @apeltzer, ready for your review!
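For reference, a hedged sketch of what the corrected script block could look like; the counts column index and the use of tail to skip the featureCounts comment line are assumptions here, not necessarily the exact code that was merged:

script:
// assumption: drop the leading featureCounts comment line, keep the gene id/name columns from the first file
gene_ids = "<(tail -n +2 ${input_files[0]} | cut -f1,2)"
// assumption: drop the comment line in every file and keep only the counts column (field 7 in default featureCounts output)
counts = input_files.collect{ filename -> "<(tail -n +2 ${filename} | cut -f7)" }.join(" ")
"""
paste $gene_ids $counts > merged_gene_counts.txt
"""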
Hello,

In running the current salmon pipeline on a "small" dataset of ~300 RNA-seq samples, the merge_featurecounts step errored out with exit code 255. Looking into it more, it seems that the operation ran out of memory. I was able to run this on a local machine and the .command.trace output showed that it used ~21 GB of memory.

As we will be using this pipeline on ~6000+ samples at a time, it's critical to us that the memory footprint of merging these counts is minimized to prevent failure. This PR separates out the csvtk cut and csvtk join steps to hopefully decrease the memory needed to merge the featurecounts matrices.

PR checklist
- Test suite passes (nextflow run . -profile test,docker)
- Code lints (nf-core lint .)
- docs is updated
- CHANGELOG.md is updated
- README.md is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md