Minimize memory footprint of `merge_featurecounts` #243

olgabot · 2019-06-25T16:31:50Z

Hello,
In running the current salmon pipeline on a "small" dataset of ~300 RNA-seq samples, the merge_featurecounts step errored out with exit code 255. Looking into it more, it seems that the operation ran out of memory. I was able to run this on a local machine and the .command.trace output showed that it used ~21.GB of memory:

nextflow.trace/v2
realtime=140436
%cpu=1645
rchar=12498004788
wchar=163107620
syscr=1531666
syscw=25205
read_bytes=131072
write_bytes=43253760
%mem=10
vmem=21335476
rss=21111964
peak_vmem=21335476
peak_rss=21111964
vol_ctxt=5289
inv_ctxt=227

As we will be using this pipeline on ~6000+ samples at a time, it's critical to us that the memory footprint of merging these counts is minimized to prevent failure. This PR separates out the csvtk cut and csvtk join steps to hopefully decrease the memory needed to merge the featurecounts matrices.

PR checklist

This comment contains a description of changes (with reason)
If you've fixed a bug or added code that should be tested, add tests!
If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md

…eded for merging

drpatelh · 2019-06-25T21:05:13Z

@olgabot Can we not just change the process requirement for merge_featureCounts and salmon_merge from label "low_memory" to label "mid_memory" as you have done already? I think its unnecessary to split these processes up.

300 samples is actually alot! I think you have to remember that very few people will actually be running the pipeline at this scale - the default parameters are set for mainly this purpose. This means that you will most likely need to customise them anyway using a custom config.

olgabot · 2019-06-26T17:50:19Z

@drpatelh It's certainly possible to change the memory label. However, using unix cut and paste is several orders of magnitude or more faster and uses fewer resources than using csvtk. It seems inefficient to me to do a join on 7 fields (csvtk join -t -f "Geneid,Start,Length,End,Chr,Strand,gene_name") only to remove 5 of them in the next step! (csvtk cut -t -f "-Start,-Chr,-End,-Length,-Strand") The unneeded columns could be removed first, and then everything else added.

As there is no need to do a true join on the rows, a simple paste of all the columns together, in a streaming fashion, without checking for rows should be correct as the gene quantifier outputs the gene quantifications in the order observed in the GTF file. Requesting 32.GB for a text merging operation seems like wayyy too much memory when there's really no reason to do a merge or join.

Here's an execution report showing that each cleaning operation took ~3.MB in peak memory and the merging operation took 6.6.MB, 4 orders of magnitude less than the original version of 21.GB peak RAM:

execution_trace_featurecounts-memory.txt

And here's a screenshot from the execution report HTML:

I agree that it may be overkill to have completely separate operations for this, but the other alternative is a bash for loop within the merge_featureCounts process that doesn't check if each individual sample got "cleaned" properly. And it seems to me that keeping the processes separate is the most robust way to do it.

olgabot · 2019-06-26T20:51:27Z

FYI this is failing due to a bug fixed by #242

drpatelh · 2019-06-26T20:56:13Z

👍 Let's get that merged first then so we can get the tests passing on this PR. Will need to add the --gencode parameter in the appropriate places in the pipeline before we merge that.

drpatelh · 2019-06-26T20:56:52Z

I'll have a better look at this PR tomorrow.

drpatelh · 2019-06-27T08:34:03Z

Yes, I agree that it would be better to replace csvtk here with native linux commands. I actually did this in a previous PR for the salmon process:

rnaseq/main.nf

Line 1143 in a77dd45

    
                   cut -f 1,4 ${sample}/quant.sf | sed '1s/TPM/${sample}/g' > ${sample}/${sample}.transcript.tpm.txt

This will also mean we can remove csvtk from environment.yml, and so have one less software dependency 👍 . There must be a way to do this all in the same process though with linux commands 🤔 e.g. use any of the counts files to get the first couple of columns and then paste the appropriate column from all of the files.

apeltzer · 2019-06-27T10:20:04Z

Yeah we replaced the former Python script with the csvtk as it offers quite some handy functionality but didn't notice this behaviour when having lots of samples. I agree that we should then probably drop the dependency and find a single line (or two lines, shouldn't be a mess - Nextflow handles that nicely) to get things going 👍

olgabot · 2019-06-27T14:43:10Z

Whoo I think I figured out a unix-fu solution to this!!! paste can take infinite inputs and we can create a redirected cut for each file:

    script:
    gene_ids = "<(cut -f1,2 ${input_files[0]})"
    counts = input_files.collect{filename -> "<(cut -f3 ${filename})"}.join(" ")
    """
    paste $gene_ids $counts > merged_gene_counts.txt
    """

olgabot · 2019-06-27T14:44:15Z

Ahh shoot need to remove header lines. But still excited about this!!

 ♥ 83%  Thu 27 Jun - 07:43  ~/code/nf-core/rnaseq   origin ☊ olgabot/featurecounts-merge-memory 28☀ 1● 
  head results/featureCounts/merged_gene_counts.txt
# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238355_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238355_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238359_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238359_subsamp.sorted.bam"
Geneid	Chr	Start	Start	Start	Start
YAL069W	I	335	335	335	335
YAL068W-A	I	538	538	538	538
YAL068C	I	1807	1807	1807	1807
YAL067W-A	I	2480	2480	2480	2480
YAL067C	I	7235	7235	7235	7235
YAL066W	I	10091	10091	10091	10091
YAL065C	I	11565	11565	11565	11565
YAL064W-B	I	12046	12046	12046	12046

apeltzer · 2019-06-27T18:25:04Z

Okay once you're ready ping me, I'll do something else until then 👍

olgabot · 2019-06-28T02:49:19Z

Okay the merged files are looking correct now. Before I was using field 3 which is the gene chromosomal Start for each one (???).

Now it looks right with the correct header and counts!!

 ♥ 65%  Thu 27 Jun - 19:50  ~/code/nf-core/rnaseq   origin ☊ olgabot/featurecounts-merge-memory ✔ 29☀ 
  head -n 20 results/featureCounts/merged_gene_counts.txt | csvlook -t -I
| Geneid    | gene_name | SRR4238359_subsamp.sorted.bam | SRR4238379_subsamp.sorted.bam | SRR4238355_subsamp.sorted.bam | SRR4238351_subsamp.sorted.bam |
| --------- | --------- | ----------------------------- | ----------------------------- | ----------------------------- | ----------------------------- |
| YAL069W   | YAL069W   | 0                             | 0                             | 0                             | 0                             |
| YAL068W-A | YAL068W-A | 0                             | 0                             | 0                             | 0                             |
| YAL068C   | PAU8      | 0                             | 0                             | 0                             | 0                             |
| YAL067W-A | YAL067W-A | 0                             | 0                             | 0                             | 0                             |
| YAL067C   | SEO1      | 0                             | 0                             | 1                             | 0                             |
| YAL066W   | YAL066W   | 0                             | 0                             | 0                             | 0                             |
| YAL065C   | YAL065C   | 0                             | 0                             | 0                             | 0                             |
| YAL064W-B | YAL064W-B | 1                             | 0                             | 0                             | 0                             |
| YAL064C-A | TDA8      | 0                             | 1                             | 0                             | 0                             |
| YAL064W   | YAL064W   | 0                             | 0                             | 0                             | 0                             |
| YAL063C-A | YAL063C-A | 0                             | 1                             | 0                             | 0                             |
| YAL063C   | FLO9      | 0                             | 1                             | 0                             | 0                             |
| YAL062W   | GDH3      | 1                             | 0                             | 0                             | 1                             |
| YAL061W   | BDH2      | 4                             | 0                             | 6                             | 0                             |
| YAL060W   | BDH1      | 21                            | 6                             | 16                            | 11                            |
| YAL059C-A | YAL059C-A | 0                             | 0                             | 0                             | 0                             |
| YAL059W   | ECM1      | 0                             | 1                             | 0                             | 3                             |
| YAL058W   | CNE1      | 1                             | 1                             | 7                             | 4                             |
| YAL056C-A | YAL056C-A | 0                             | 0                             | 0                             | 0                             |

So yep @apeltzer ready for your review!

Add separate step to clean featurecounts output to minimize memory ne…

a1de008

…eded for merging

olgabot requested a review from drpatelh June 25, 2019 16:33

olgabot added 3 commits June 25, 2019 09:38

Make salmon_merge also mid_memory

e60c8c2

Get clean_featureCounts to work with test data

aaab9ad

Use paste to merge everything

02d3ce9

no "markdups" in filename

5b2de6f

olgabot added 4 commits June 27, 2019 07:44

Use unix-fu to merge featurecounts

7bc52ec

Remove first line of featurecounts files

8c4ad54

Remove csvtk from requirements

d8242ec

add a note about redirection

a3c30ad

olgabot mentioned this pull request Jun 27, 2019

GENCODE annotation compatibility with salmon #242

Merged

8 tasks

Use 7th column for gene namec

beab23f

actually use the gene counts to merge ..

2be0b27

apeltzer changed the title ~~[WIP] Minimize memory footprint of merge_featurecounts~~ Minimize memory footprint of merge_featurecounts Jul 7, 2019

apeltzer approved these changes Jul 7, 2019

View reviewed changes

apeltzer merged commit 034ee2a into nf-core:salmon Jul 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minimize memory footprint of `merge_featurecounts` #243

Minimize memory footprint of `merge_featurecounts` #243

olgabot commented Jun 25, 2019 •

edited

Loading

drpatelh commented Jun 25, 2019

olgabot commented Jun 26, 2019 •

edited

Loading

olgabot commented Jun 26, 2019

drpatelh commented Jun 26, 2019

drpatelh commented Jun 26, 2019

drpatelh commented Jun 27, 2019

apeltzer commented Jun 27, 2019

olgabot commented Jun 27, 2019

olgabot commented Jun 27, 2019

apeltzer commented Jun 27, 2019

olgabot commented Jun 28, 2019 •

edited

Loading

Minimize memory footprint of merge_featurecounts #243

Minimize memory footprint of merge_featurecounts #243

Conversation

olgabot commented Jun 25, 2019 • edited Loading

PR checklist

drpatelh commented Jun 25, 2019

olgabot commented Jun 26, 2019 • edited Loading

olgabot commented Jun 26, 2019

drpatelh commented Jun 26, 2019

drpatelh commented Jun 26, 2019

drpatelh commented Jun 27, 2019

apeltzer commented Jun 27, 2019

olgabot commented Jun 27, 2019

olgabot commented Jun 27, 2019

apeltzer commented Jun 27, 2019

olgabot commented Jun 28, 2019 • edited Loading

Minimize memory footprint of `merge_featurecounts` #243

Minimize memory footprint of `merge_featurecounts` #243

olgabot commented Jun 25, 2019 •

edited

Loading

olgabot commented Jun 26, 2019 •

edited

Loading

olgabot commented Jun 28, 2019 •

edited

Loading