Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimize memory footprint of merge_featurecounts #243

Merged

Conversation

olgabot
Copy link
Contributor

@olgabot olgabot commented Jun 25, 2019

Hello,
In running the current salmon pipeline on a "small" dataset of ~300 RNA-seq samples, the merge_featurecounts step errored out with exit code 255. Looking into it more, it seems that the operation ran out of memory. I was able to run this on a local machine and the .command.trace output showed that it used ~21.GB of memory:

nextflow.trace/v2
realtime=140436
%cpu=1645
rchar=12498004788
wchar=163107620
syscr=1531666
syscw=25205
read_bytes=131072
write_bytes=43253760
%mem=10
vmem=21335476
rss=21111964
peak_vmem=21335476
peak_rss=21111964
vol_ctxt=5289
inv_ctxt=227

As we will be using this pipeline on ~6000+ samples at a time, it's critical to us that the memory footprint of merging these counts is minimized to prevent failure. This PR separates out the csvtk cut and csvtk join steps to hopefully decrease the memory needed to merge the featurecounts matrices.

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md

@olgabot olgabot requested a review from drpatelh June 25, 2019 16:33
@drpatelh
Copy link
Member

@olgabot Can we not just change the process requirement for merge_featureCounts and salmon_merge from label "low_memory" to label "mid_memory" as you have done already? I think its unnecessary to split these processes up.

300 samples is actually alot! I think you have to remember that very few people will actually be running the pipeline at this scale - the default parameters are set for mainly this purpose. This means that you will most likely need to customise them anyway using a custom config.

@olgabot
Copy link
Contributor Author

olgabot commented Jun 26, 2019

@drpatelh It's certainly possible to change the memory label. However, using unix cut and paste is several orders of magnitude or more faster and uses fewer resources than using csvtk. It seems inefficient to me to do a join on 7 fields (csvtk join -t -f "Geneid,Start,Length,End,Chr,Strand,gene_name") only to remove 5 of them in the next step! (csvtk cut -t -f "-Start,-Chr,-End,-Length,-Strand") The unneeded columns could be removed first, and then everything else added.

As there is no need to do a true join on the rows, a simple paste of all the columns together, in a streaming fashion, without checking for rows should be correct as the gene quantifier outputs the gene quantifications in the order observed in the GTF file. Requesting 32.GB for a text merging operation seems like wayyy too much memory when there's really no reason to do a merge or join.

Here's an execution report showing that each cleaning operation took ~3.MB in peak memory and the merging operation took 6.6.MB, 4 orders of magnitude less than the original version of 21.GB peak RAM:

execution_trace_featurecounts-memory.txt

And here's a screenshot from the execution report HTML:

Screen Shot 2019-06-26 at 10 45 35 AM

I agree that it may be overkill to have completely separate operations for this, but the other alternative is a bash for loop within the merge_featureCounts process that doesn't check if each individual sample got "cleaned" properly. And it seems to me that keeping the processes separate is the most robust way to do it.

@olgabot
Copy link
Contributor Author

olgabot commented Jun 26, 2019

FYI this is failing due to a bug fixed by #242

@drpatelh
Copy link
Member

👍 Let's get that merged first then so we can get the tests passing on this PR. Will need to add the --gencode parameter in the appropriate places in the pipeline before we merge that.

@drpatelh
Copy link
Member

I'll have a better look at this PR tomorrow.

@drpatelh
Copy link
Member

Yes, I agree that it would be better to replace csvtk here with native linux commands. I actually did this in a previous PR for the salmon process:

rnaseq/main.nf

Line 1143 in a77dd45

cut -f 1,4 ${sample}/quant.sf | sed '1s/TPM/${sample}/g' > ${sample}/${sample}.transcript.tpm.txt

This will also mean we can remove csvtk from environment.yml, and so have one less software dependency 👍 . There must be a way to do this all in the same process though with linux commands 🤔 e.g. use any of the counts files to get the first couple of columns and then paste the appropriate column from all of the files.

@apeltzer
Copy link
Member

Yeah we replaced the former Python script with the csvtk as it offers quite some handy functionality but didn't notice this behaviour when having lots of samples. I agree that we should then probably drop the dependency and find a single line (or two lines, shouldn't be a mess - Nextflow handles that nicely) to get things going 👍

@olgabot
Copy link
Contributor Author

olgabot commented Jun 27, 2019

Whoo I think I figured out a unix-fu solution to this!!! paste can take infinite inputs and we can create a redirected cut for each file:

    script:
    gene_ids = "<(cut -f1,2 ${input_files[0]})"
    counts = input_files.collect{filename -> "<(cut -f3 ${filename})"}.join(" ")
    """
    paste $gene_ids $counts > merged_gene_counts.txt
    """

@olgabot
Copy link
Contributor Author

olgabot commented Jun 27, 2019

Ahh shoot need to remove header lines. But still excited about this!!

 ♥ 83%  Thu 27 Jun - 07:43  ~/code/nf-core/rnaseq   origin ☊ olgabot/featurecounts-merge-memory 28☀ 1● 
  head results/featureCounts/merged_gene_counts.txt
# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238351_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238351_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238355_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238355_subsamp.sorted.bam" 	# Program:featureCounts v1.6.4; Command:"featureCounts" "-a" "genes.gtf" "-g" "gene_id" "-o" "SRR4238359_subsamp.sorted_gene.featureCounts.txt" "--extraAttributes" "gene_name" "-p" "-s" "0" "SRR4238359_subsamp.sorted.bam"
Geneid	Chr	Start	Start	Start	Start
YAL069W	I	335	335	335	335
YAL068W-A	I	538	538	538	538
YAL068C	I	1807	1807	1807	1807
YAL067W-A	I	2480	2480	2480	2480
YAL067C	I	7235	7235	7235	7235
YAL066W	I	10091	10091	10091	10091
YAL065C	I	11565	11565	11565	11565
YAL064W-B	I	12046	12046	12046	12046

@apeltzer
Copy link
Member

Okay once you're ready ping me, I'll do something else until then 👍

@olgabot
Copy link
Contributor Author

olgabot commented Jun 28, 2019

Okay the merged files are looking correct now. Before I was using field 3 which is the gene chromosomal Start for each one (???).

Now it looks right with the correct header and counts!!

 ♥ 65%  Thu 27 Jun - 19:50  ~/code/nf-core/rnaseq   origin ☊ olgabot/featurecounts-merge-memory ✔ 29☀ 
  head -n 20 results/featureCounts/merged_gene_counts.txt | csvlook -t -I
| Geneid    | gene_name | SRR4238359_subsamp.sorted.bam | SRR4238379_subsamp.sorted.bam | SRR4238355_subsamp.sorted.bam | SRR4238351_subsamp.sorted.bam |
| --------- | --------- | ----------------------------- | ----------------------------- | ----------------------------- | ----------------------------- |
| YAL069W   | YAL069W   | 0                             | 0                             | 0                             | 0                             |
| YAL068W-A | YAL068W-A | 0                             | 0                             | 0                             | 0                             |
| YAL068C   | PAU8      | 0                             | 0                             | 0                             | 0                             |
| YAL067W-A | YAL067W-A | 0                             | 0                             | 0                             | 0                             |
| YAL067C   | SEO1      | 0                             | 0                             | 1                             | 0                             |
| YAL066W   | YAL066W   | 0                             | 0                             | 0                             | 0                             |
| YAL065C   | YAL065C   | 0                             | 0                             | 0                             | 0                             |
| YAL064W-B | YAL064W-B | 1                             | 0                             | 0                             | 0                             |
| YAL064C-A | TDA8      | 0                             | 1                             | 0                             | 0                             |
| YAL064W   | YAL064W   | 0                             | 0                             | 0                             | 0                             |
| YAL063C-A | YAL063C-A | 0                             | 1                             | 0                             | 0                             |
| YAL063C   | FLO9      | 0                             | 1                             | 0                             | 0                             |
| YAL062W   | GDH3      | 1                             | 0                             | 0                             | 1                             |
| YAL061W   | BDH2      | 4                             | 0                             | 6                             | 0                             |
| YAL060W   | BDH1      | 21                            | 6                             | 16                            | 11                            |
| YAL059C-A | YAL059C-A | 0                             | 0                             | 0                             | 0                             |
| YAL059W   | ECM1      | 0                             | 1                             | 0                             | 3                             |
| YAL058W   | CNE1      | 1                             | 1                             | 7                             | 4                             |
| YAL056C-A | YAL056C-A | 0                             | 0                             | 0                             | 0                             |

So yep @apeltzer ready for your review!

@apeltzer apeltzer changed the title [WIP] Minimize memory footprint of merge_featurecounts Minimize memory footprint of merge_featurecounts Jul 7, 2019
@apeltzer apeltzer merged commit 034ee2a into nf-core:salmon Jul 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants