
Much slower than previous versions #527

Closed
red-plant opened this issue May 30, 2020 · 10 comments


red-plant commented May 30, 2020

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon
Describe the bug
It has been running about an order of magnitude slower than when I last used it.
I am aligning reads against Arabidopsis thaliana, using the Araport11 annotation. I usually had 20 M reads aligned within an hour or two. I am now aligning a 46 M read library and it has been running for 20 hours using 4 threads of my humble i5-3210M, yet after those 20 hours it is barely at 38.5 M fragments. The index was constructed with the default k-mer size and no decoys. I have had this problem with other libraries since upgrading from v1.0.0.
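For reference, the index build was along these lines (the transcript FASTA name here is a placeholder; -k 31 is the default k-mer size, and no decoy file was passed):

    salmon index -t Araport11_transcripts.fa -i ara11 -k 31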

To Reproduce
I guess just try to align anything against Araport11; this problem comes up with any fq.gz I have tried, and it takes hours and hours to align.

Specifically, please provide at least the following information:

  • Which version of salmon was used?
    1.2.1
  • How was salmon installed (compiled, downloaded executable, through bioconda)?
    downloaded executable
  • Which reference (e.g. transcriptome) was used?
    Araport 11, from A. thaliana
  • Which read files were used?
    regular fastq.gz ( SRR7985407)
  • Which program options were used?
    --validateMappings
    -p 4
    --seqBias
    --gcBias
    --posBias

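Put together, the full invocation was along these lines (index and output paths shortened here; the exact paths are in the run info below):

    salmon quant -i ara11 -l A \
        -1 SRR7985407_1.fq.gz -2 SRR7985407_2.fq.gz \
        --validateMappings -p 4 --seqBias --gcBias --posBias \
        -o SRR7985407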
Expected behavior
Much faster alignment; it is salmon!
Screenshots
This is the run info so far:

Version Info: This is the most recent version of salmon.

salmon (mapping-based) v1.2.1

[ program ] => salmon

[ command ] => quant

[ index ] => { /home/jaimealaniz/Documents/indexes/salmon/ara11/ }

[ libType ] => { A }

[ mates1 ] => { SRR7985407_1.fq.gz }

[ mates2 ] => { SRR7985407_2.fq.gz }

[ validateMappings ] => { }

[ threads ] => { 4 }

[ seqBias ] => { }

[ gcBias ] => { }

[ posBias ] => { }

[ output ] => { /home/jaimealaniz/Documents/salmon.embryo/SRR7985407/ }

Logs will be written to /home/jaimealaniz/Documents/salmon.embryo/SRR7985407/logs
[2020-05-29 20:14:24.283] [jointLog] [info] setting maxHashResizeThreads to 4
[2020-05-29 20:14:24.283] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2020-05-29 20:14:24.283] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2020-05-29 20:14:24.283] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2020-05-29 20:14:24.283] [jointLog] [info] parsing read library format
[2020-05-29 20:14:24.283] [jointLog] [info] There is 1 library.
[2020-05-29 20:14:24.341] [jointLog] [info] Loading pufferfish index
[2020-05-29 20:14:24.342] [jointLog] [info] Loading dense pufferfish index.

| Loading contig table | Time = 41.693 ms

size = 357712

| Loading contig offsets | Time = 657 us


| Loading reference lengths | Time = 202.87 us


| Loading mphf table | Time = 16.335 ms

size = 64786639
Number of ones: 357711
Number of ones per inventory item: 512
Inventory entries filled: 699

| Loading contig boundaries | Time = 154.69 ms

size = 64786639

| Loading sequence | Time = 6.3275 ms

size = 54055309

| Loading positions | Time = 68.544 ms

size = 74351806

| Loading reference sequence | Time = 7.0924 ms


| Loading reference accumulative lengths | Time = 325.83 us

[2020-05-29 20:14:24.638] [jointLog] [info] done
[2020-05-29 20:14:24.638] [jointLog] [info] Index contained 57,190 targets
[2020-05-29 20:14:25.202] [jointLog] [info] Number of decoys : 0

processed 38,500,000 fragments
[jointLog] [info] Automatically detected most likely library type as IU
hits: 40,221,178, hits per frag: 1.04508

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 18.04

Additional context
I have previously used salmon on this laptop and it was much, much faster; that's the main reason I use salmon.

@red-plant

Finally: it took about 25 hours. I am attaching the log.
salmon_quant.log


rob-p commented May 31, 2020

@red-plant ,

Thanks for the detailed report. Has anything else changed on the system since you upgraded? Does this same dataset finish much faster with 1.0.0? There have been no large-scale changes to alignment since 1.0.0 that should affect the runtime, so this is completely unexpected behavior. We will also try to reproduce this locally on this data.

@red-plant

I will try to reproduce the run with v1.0.0 and will come back with a report. I should have done this first; apologies.


red-plant commented May 31, 2020

OK, salmon v1.0.0 finished in 5 h 15 min, so about 5 times faster with the exact same library and parameters, and it achieved almost the same mapping rate (85.1058% with v1.2.1 vs 84.6341% with v1.0.0). Attaching the log. I should add that I did not trim this library for adapters or quality, nor did I do anything else to it; I just mapped it as is. FastQC, however, showed excellent quality even at the read ends and little to no adapter content.
Also, no changes have been made to my OS other than regular updates; it is still Ubuntu 18.04. I do not remember making any specific changes to it.
Pearson's correlation of transcript abundances (isoform level) between the two versions is 0.9984013; Spearman's is 0.9899048.
I also checked that salmon was actually using 4 threads in both cases, and it was fully using them.
salmon_quant.log


rob-p commented Jun 1, 2020

Hi @red-plant,

Thank you for providing the follow-up here. This behavior is very strange! However, I am not sure which is stranger: that 1.2.1 takes ~5x longer for you, or that 1.0.0 takes >5 h in the first place! I can confirm that I am also seeing really strange (i.e. slow) behavior on this sample under both versions. It's unexpected because I've never seen a sample take this long before. We are looking into it and will report back. However, since digging into the data to figure out precisely what is going on may take a little while, I wanted to ask if you could try something. I think the issue may result from some highly expressed, "pathologically" repetitive sequence. Could you run salmon with the extra command-line option --hitFilterPolicy BOTH (as in the command below), and let me know if / how the performance profile changes for you?
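Concretely, that is just your original invocation with one extra flag, e.g. (index and output names here are placeholders; keep your actual paths):

    salmon quant -i ara11 -l A -1 SRR7985407_1.fq.gz -2 SRR7985407_2.fq.gz \
        --validateMappings -p 4 --seqBias --gcBias --posBias \
        --hitFilterPolicy BOTH -o SRR7985407_hfp_both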

Thanks!
Rob


red-plant commented Jun 1, 2020

So, it took ~30 min (impressively fast) with v1.0.0 and --hitFilterPolicy BOTH; it seems you are right. Attaching the log and running again with v1.2.1. Thanks again,
José
salmon_quant.log


red-plant commented Jun 1, 2020

It took ~1.5 h for v1.2.1 with --hitFilterPolicy BOTH. Attaching the log and the fastp report, which shows normal tetramer over-representation.

Based on the insert size determined by fastp, I suspect there is quite a bit of dovetailing. I have had extremely slow performance with dovetailed libraries (for example SRR7945268, which has an insert size of 100 and is PE 150 [not my data]) even when allowing dovetails, to the point that I ended up mapping them as single-end and not using one of the pairs. Even then it took its time.

Edit: allowing dovetails only increased the mapping rate by 0.0277%.

As an additional note, neither --minAlnProb 0.1 nor --hardFilter helps.
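For reference, the dovetail-allowing runs mentioned above were done with salmon's --allowDovetail flag, i.e. something like the following (paths are placeholders):

    salmon quant -i ara11 -l A -1 reads_1.fq.gz -2 reads_2.fq.gz \
        --validateMappings --allowDovetail -p 4 -o dovetail_quant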

fastp.pdf
salmon_quant.log


rob-p commented Jun 2, 2020

Hi @red-plant,

So, I have an update from our end. @mohsenzakeri dug into the data a bit (specifically SRR7985407). What he found is that a considerable number of reads (~13%) have long stretches of polyA or polyT that match in a hyper-repetitive manner internally within a certain set of transcripts (i.e. these are not matching polyA tails, because those are already trimmed). These matches are, obviously, minimally informative, but we had not yet special-cased ignoring them. Specifically, what seems to be prevalent in these reads are read pairs where one read has polyA, the other has polyT, and they keep matching to the same positions. However, the rest of each read does not match the transcript, so a great deal of time is wasted validating (and discarding) these mappings.

To test this hypothesis, we made a small change to the mapping algorithm to special-case and ignore k-mers that are purely homopolymers. I'll note that on this data, this has no effect on the mapping rate. I get the following performance profile running the trimmed version of this data (trimmed with fastp) using 4 threads and without the additional --hitFilterPolicy BOTH flag.

1306.86user 4.79system 4:42.54elapsed 464%CPU (0avgtext+0avgdata 592704maxresident)k

I was wondering if you might test this altered version and see if it has a similarly beneficial effect for you. The time will probably differ, since the processors themselves differ and since I elided all non-essential flags here, but I would hope this version is faster than the current one (even with the altered hitFilterPolicy). A quick way to check how common these homopolymer-containing reads are in a given sample is sketched at the end of this comment.

You can find a tarball with the pre-compiled binary here. It should work on any relatively recent Linux system.
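As a rough illustration of the diagnosis (not the exact analysis Mohsen ran), you can estimate the fraction of reads containing a long polyA/polyT run with something like the following; the 20-base run length is an arbitrary threshold:

    # count R1 reads containing a run of >= 20 consecutive A's or T's
    zcat SRR7985407_1.fq.gz | awk 'NR % 4 == 2' | grep -cE 'A{20,}|T{20,}'
    # total number of R1 reads, for the denominator
    zcat SRR7985407_1.fq.gz | awk 'NR % 4 == 2' | wc -l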

@red-plant

I ran it with only the --validateMappings and -p 4 options. Wow, it took less than 15 minutes! Thanks a lot to you and your team for looking into this.
Best,
Jose
salmon_quant.log

@rob-p
Copy link
Collaborator

rob-p commented Jun 2, 2020

Awesome! Thank you so much for the detailed report and for finding this data that exposed this strange (but interesting) performance case. We'll fold these improvements into the next release as well.
