
Much slower than previous versions #527

Closed
red-plant opened this issue May 30, 2020 · 10 comments


red-plant commented May 30, 2020

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon
Describe the bug
It has been running about an order of magnitude slower than when I last used it.
I am aligning reads against Arabidopsis thaliana, using the Araport11 annotation. I usually had 20 M reads aligned within an hour or two. I am now aligning a 46 M read library and it has been running for 20 hours using 4 threads of my humble i5-3210M, yet after those 20 hours it is barely at 38.5 M fragments. The index was constructed with the default k-mer size and no decoys. I have had this problem with other libraries since upgrading from v1.0.0.
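For reference, the index build was along these lines (the transcript FASTA name here is a placeholder; -k 31 is the default k-mer size, and no decoy file was passed):

    salmon index -t Araport11_transcripts.fa -i ara11 -k 31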

To Reproduce
I guess just try to align anything against Araport11; this problem comes up with any fq.gz I have tried, and it takes hours and hours to align.

Specifically, please provide at least the following information:

  • Which version of salmon was used?
    1.2.1
  • How was salmon installed (compiled, downloaded executable, through bioconda)?
    downloaded executable
  • Which reference (e.g. transcriptome) was used?
    Araport 11, from A. thaliana
  • Which read files were used?
    regular fastq.gz ( SRR7985407)
  • Which program options were used?
    --validateMappings
    -p 4
    --seqBias
    --gcBias
    --posBias

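Put together, the full invocation was along these lines (index and output paths shortened here; the exact paths are in the run info below):

    salmon quant -i ara11 -l A \
        -1 SRR7985407_1.fq.gz -2 SRR7985407_2.fq.gz \
        --validateMappings -p 4 --seqBias --gcBias --posBias \
        -o SRR7985407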
Expected behavior
Much faster alignment; it is salmon!
Screenshots
This is the run info so far:

Version Info: This is the most recent version of salmon.

salmon (mapping-based) v1.2.1

[ program ] => salmon

[ command ] => quant

[ index ] => { /home/jaimealaniz/Documents/indexes/salmon/ara11/ }

[ libType ] => { A }

[ mates1 ] => { SRR7985407_1.fq.gz }

[ mates2 ] => { SRR7985407_2.fq.gz }

[ validateMappings ] => { }

[ threads ] => { 4 }

[ seqBias ] => { }

[ gcBias ] => { }

[ posBias ] => { }

[ output ] => { /home/jaimealaniz/Documents/salmon.embryo/SRR7985407/ }

Logs will be written to /home/jaimealaniz/Documents/salmon.embryo/SRR7985407/logs
[2020-05-29 20:14:24.283] [jointLog] [info] setting maxHashResizeThreads to 4
[2020-05-29 20:14:24.283] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2020-05-29 20:14:24.283] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2020-05-29 20:14:24.283] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2020-05-29 20:14:24.283] [jointLog] [info] parsing read library format
[2020-05-29 20:14:24.283] [jointLog] [info] There is 1 library.
[2020-05-29 20:14:24.341] [jointLog] [info] Loading pufferfish index
[2020-05-29 20:14:24.342] [jointLog] [info] Loading dense pufferfish index.

| Loading contig table | Time = 41.693 ms

size = 357712

| Loading contig offsets | Time = 657 us


| Loading reference lengths | Time = 202.87 us


| Loading mphf table | Time = 16.335 ms

size = 64786639
Number of ones: 357711
Number of ones per inventory item: 512
Inventory entries filled: 699

| Loading contig boundaries | Time = 154.69 ms

size = 64786639

| Loading sequence | Time = 6.3275 ms

size = 54055309

| Loading positions | Time = 68.544 ms

size = 74351806

| Loading reference sequence | Time = 7.0924 ms


| Loading reference accumulative lengths | Time = 325.83 us

[2020-05-29 20:14:24.638] [jointLog] [info] done
[2020-05-29 20:14:24.638] [jointLog] [info] Index contained 57,190 targets
[2020-05-29 20:14:25.202] [jointLog] [info] Number of decoys : 0

processed 38,500,000 fragments
[jointLog] [info] Automatically detected most likely library type as IU
hits: 40,221,178, hits per frag: 1.04508

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 18.04

Additional context
I have previously used salmon on this laptop and it was much, much faster; that's the main reason I use salmon.

@red-plant

Finally: it took about 25 hours. I am attaching the log.
salmon_quant.log


rob-p commented May 31, 2020

@red-plant ,

Thanks for the detailed report. Has anything else changed on the system since you upgraded? Does this same dataset finish much faster with 1.0.0? There have been no large-scale changes to alignment since 1.0.0 that should affect the runtime, so this is completely unexpected behavior. We will also try to reproduce this locally on this data.

@red-plant

I will try to reproduce the run with v1.0.0 and will come back with a report. I should have done this first; apologies.


red-plant commented May 31, 2020

OK, salmon v1.0.0 finished in 5 h 15 min, so about 5 times faster with the exact same library and parameters, and it achieved almost the same mapping rate (85.1058% with v1.2.1 vs 84.6341% with v1.0.0). Attaching the log. I should add that I did not trim this library for adapters or quality, nor did I do anything else to it; I just mapped it as is. FastQC, however, showed excellent quality even at the read ends and little to no adapter content.
Also, no changes have been made to my OS other than regular updates; it is still Ubuntu 18.04. I do not remember making any specific changes to it.
Pearson's correlation of transcript abundances (isoform level) between the two versions is 0.9984013; Spearman's is 0.9899048.
I also checked that salmon was actually using 4 threads in both cases, and it was fully using them.
salmon_quant.log


rob-p commented Jun 1, 2020

Hi @red-plant,

Thank you for providing the follow-up here. This behavior is very strange! However, I am not sure which is stranger: that 1.2.1 takes ~5x longer for you, or that 1.0.0 takes >5 h in the first place! I can confirm that I am also seeing really strange (i.e. slow) behavior on this sample under both versions. It's unexpected because I've never seen a sample take this long before. We are looking into it and will report back. However, since digging into the data to figure out precisely what is going on may take a little while, I wanted to ask if you could try something. I think the issue may result from some highly expressed, "pathologically" repetitive sequence. Could you run salmon with the extra command-line option --hitFilterPolicy BOTH (as in the command below), and let me know if / how the performance profile changes for you?
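Concretely, that is just your original invocation with one extra flag, e.g. (index and output names here are placeholders; keep your actual paths):

    salmon quant -i ara11 -l A -1 SRR7985407_1.fq.gz -2 SRR7985407_2.fq.gz \
        --validateMappings -p 4 --seqBias --gcBias --posBias \
        --hitFilterPolicy BOTH -o SRR7985407_hfp_both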

Thanks!
Rob


red-plant commented Jun 1, 2020

So, it took ~30 min (impressively fast) with v1.0.0 and --hitFilterPolicy BOTH; it seems you are right. Attaching the log and running again with v1.2.1. Thanks again,
José
salmon_quant.log


red-plant commented Jun 1, 2020

It took ~1.5 h for v1.2.1 with --hitFilterPolicy BOTH. Attaching the log and the fastp report, which shows normal tetramer over-representation.

Based on the insert size determined by fastp, I suspect there is quite a bit of dovetailing. I have had extremely slow performance with dovetailed libraries (for example SRR7945268, which has an insert size of 100 and is PE 150 [not my data]) even when allowing dovetails, to the point that I ended up mapping them as single-end and not using one of the pairs. Even then it took its time.

Edit: allowing dovetails only increased the mapping rate by 0.0277%.

As an additional note, neither --minAlnProb 0.1 nor --hardFilter helps.
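For reference, the dovetail-allowing runs mentioned above were done with salmon's --allowDovetail flag, i.e. something like the following (paths are placeholders):

    salmon quant -i ara11 -l A -1 reads_1.fq.gz -2 reads_2.fq.gz \
        --validateMappings --allowDovetail -p 4 -o dovetail_quant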

fastp.pdf
salmon_quant.log


rob-p commented Jun 2, 2020

Hi @red-plant,

So, I have an update from our end. @mohsenzakeri dug into the data a bit (specifically SRR7985407). What he found is that a considerable number of reads (~13%) have long stretches of polyA or polyT that match in a hyper-repetitive manner internally within a certain set of transcripts (i.e. these are not matching polyA tails, because those are already trimmed). These matches are, obviously, minimally informative, but we had not yet special-cased ignoring them. Specifically, what seems to be prevalent in these reads are read pairs where one read has polyA, the other has polyT, and they keep matching to the same positions. However, the rest of each read does not match the transcript, so a great deal of time is wasted validating (and discarding) these mappings.

To test this hypothesis, we made a small change to the mapping algorithm to special-case and ignore k-mers that are purely homopolymers. I'll note that on this data, this has no effect on the mapping rate. I get the following performance profile running the trimmed version of this data (trimmed with fastp) using 4 threads and without the additional --hitFilterPolicy BOTH flag.

1306.86user 4.79system 4:42.54elapsed 464%CPU (0avgtext+0avgdata 592704maxresident)k

I was wondering if you might test this altered version and see if it has a similarly beneficial effect for you. The time will probably differ, since the processors themselves differ and since I elided all non-essential flags here, but I would hope this version is faster than the current one (even with the altered hitFilterPolicy). A quick way to check how common these homopolymer-containing reads are in a given sample is sketched at the end of this comment.

You can find a tarball with the pre-compiled binary here. It should work on any relatively recent Linux system.
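As a rough illustration of the diagnosis (not the exact analysis Mohsen ran), you can estimate the fraction of reads containing a long polyA/polyT run with something like the following; the 20-base run length is an arbitrary threshold:

    # count R1 reads containing a run of >= 20 consecutive A's or T's
    zcat SRR7985407_1.fq.gz | awk 'NR % 4 == 2' | grep -cE 'A{20,}|T{20,}'
    # total number of R1 reads, for the denominator
    zcat SRR7985407_1.fq.gz | awk 'NR % 4 == 2' | wc -l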

@red-plant

I ran it with only the --validateMappings and -p 4 options. Wow, it took less than 15 minutes! Thanks a lot to you and your team for looking into this.
Best,
Jose
salmon_quant.log

@rob-p
Copy link
Collaborator

rob-p commented Jun 2, 2020

Awesome! Thank you so much for the detailed report and for finding this data that exposed this strange (but interesting) performance case. We'll fold these improvements into the next release as well.
