
Increasing --threads increases execution time #437

Closed
bede opened this issue Jun 9, 2023 · 14 comments
@bede

bede commented Jun 9, 2023

Firstly, thank you for developing and maintaining Bowtie2! I noticed that 2.5.1 performs strangely when varying the --threads parameter. Beyond a certain number of threads, execution time increases considerably and CPU utilisation decreases. I've observed this with multiple read datasets using both an x86_64 Ubuntu VM and my arm64 MacOS machine, both using the appropriate GitHub release binaries. The behaviour is reproducible on any one machine, although a given read dataset will not necessarily trigger the problem on both my laptop and the VM.

Here I used ~20 million mixed bacterial and human read pairs with an index built from the human T2T reference plus HLA sequences.

time bowtie2 -x human-index --threads ${threads} -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz > /dev/null

[Attached plot: execution time vs. --threads]
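The timings above could be reproduced with a loop like the following sketch. It reuses the index and FASTQ names from the command above (which you would need locally); the actual bowtie2 invocation is commented out so the loop itself is harmless to run.

```shell
# Hypothetical benchmark loop: run the same alignment at several thread
# counts and compare wall time and CPU utilisation.
for threads in 1 2 4 8 16 32; do
    echo "== ${threads} threads =="
    # Uncomment to run for real; /usr/bin/time reports elapsed time and CPU%.
    # /usr/bin/time -v bowtie2 -x human-index --threads "${threads}" \
    #     -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz > /dev/null
done
```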

VM info

Ubuntu 22.04 LTS

$ bowtie2 --version
/data/bowtie2/bin/bowtie2-align-s version 2.5.1
64-bit
Built on 0ba86a911637
Wed Jan 18 03:20:56 UTC 2023
Compiler: gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) 
Options: -O3 -msse2 -funroll-loops -g3 -g -O2 -fvisibility=hidden -I/hbb_exe_gc_hardened/include -ffunction-sections -fdata-sections -fstack-protector -D_FORTIFY_SOURCE=2 -fPIE -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1 -DWITH_ZSTD
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
$ uname -r
5.15.0-1030-oracle
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7J13 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4890.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good
                          nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                         drand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmca
                         ll fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd a
                         rat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
Virtualization features: 
  Virtualization:        AMD-V
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1 MiB (16 instances)
  L1i:                   1 MiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (4 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
@snayfach

Rolling back to bowtie2-2.4.4-linux-x86_64 solved the issue for me. More recent versions failed to use the number of threads specified: on a machine with 224 CPUs, only 2 were being used with the latest version.

@bede

bede commented Jul 25, 2023

Thank you @snayfach, rolling back to 2.4.4 also worked for me. Great news.

@ch4rr0

ch4rr0 commented Jul 25, 2023

Hello all,

Thank you for your patience. My initial hunch is that this issue was introduced in 2.5.0 with the async changes we made to input reading and writing. @bede, since you are able to reproduce this issue, would you be willing to test v2.4.5 and v2.5.0?

@bede

bede commented Jul 26, 2023

Thanks @ch4rr0, I tested v2.4.5 and v2.5.0 as you suggested on a 32-core machine, specifying 32 threads. 2.4.5 consistently used ~3200% CPU (as reported by top), as expected.

However, 2.5.0 was erratic, jumping between 400% and 3200%. I haven't compared 2.5.0 and 2.5.1, but it is clear to me that this issue began with 2.5.0.

@ch4rr0

ch4rr0 commented Jul 27, 2023

@bede, I pushed a change-set to the bug_fixes branch. Would you be willing to test whether it has any effect on thread behavior?

@ch4rr0

ch4rr0 commented Jul 31, 2023

@snayfach -- would you be willing to test since I have not heard back from bede yet?

@bede

bede commented Aug 2, 2023

Sorry for the delay @ch4rr0, I just got time to test.

I'm afraid the same behaviour remains with the bug_fixes branch in my testing: 800-1600% CPU usage with 32 physical cores, whereas 2.4.5 gives >3100%.

$ git status
On branch bug_fixes
Your branch is up to date with 'origin/bug_fixes'.

nothing to commit, working tree clean
$ bowtie2 --version
/data16/bowtie2-bug_fixes_2023-08-02/bowtie2-align-s version 2.5.1
64-bit
Built on pikachu
Wed Aug  2 13:27:40 UTC 2023
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Options: -O3 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

@ch4rr0

ch4rr0 commented Aug 3, 2023

I have been able to recreate this issue, and indeed my latest push does not resolve it. I have found the cause and am currently working on a solution. In the meantime, increasing --reads-per-batch seems to keep the threads busy for longer and decreases contention on the producer. The current default is 16; increasing that number to 1024 or 2048 seemed to be a sweet spot on my hardware.

Thank you all for your patience.
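A sketch of that workaround, applied to the command from the original report (larger batches mean each worker pulls from the shared input queue less often, so the producer lock is contended less). The command is only echoed here for inspection; drop the echo to actually run it.

```shell
# Workaround sketch: raise --reads-per-batch from its default of 16 to
# one of the values suggested above.
threads=32
batch=2048
cmd="bowtie2 -x human-index --threads ${threads} --reads-per-batch ${batch} \
-1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz"
echo "${cmd}"
# eval "${cmd}" > /dev/null   # uncomment to run against a real index
```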

@ch4rr0

ch4rr0 commented Aug 15, 2023

I committed a few changes to bug_fixes that, in my testing, seem to resolve the issue. Would you be willing to test these changes?

@bede

bede commented Aug 16, 2023

Thanks @ch4rr0, your changes in 6f6458c have resolved the issue for my test case. Looks good to me!

@bede

bede commented Oct 23, 2023

I'm reasonably satisfied that this issue has been fixed in 2.5.2, thanks! 🙏

@sfiligoi

sfiligoi commented Sep 12, 2024

Just a note on "CPUs busy" vs. "faster time to completion": the async changes in 2.5.0 were aimed squarely at reducing time to completion. If IO is not fast enough, idle CPUs are expected, as there is nothing to compute in that case.

That said, it is obvious from the original submission that time to completion was also going in the wrong direction at large thread counts, so I am not saying it was not problematic. But it is not clear from the follow-up comments whether the fix actually resulted in faster time to completion, rather than just all CPUs being busy.

@sfiligoi

sfiligoi commented Sep 12, 2024

@bede Could you also post a "time to completion" comparison for the original use case above, including 2.5.2?

Also, would you be able to post the results when you pin bowtie2 to a fixed number of cores (e.g. using taskset -c on Linux), for both 2.5.1 and 2.5.2?
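A sketch of the suggested pinning experiment: taskset -c restricts a process (and its threads) to the listed cores, so CPU utilisation is bounded by the pinned set regardless of what the thread pool does. Core list and thread count here are illustrative; the command is echoed for inspection rather than executed.

```shell
# Pin bowtie2 to 8 cores while asking for 8 threads (Linux only).
pinned="taskset -c 0-7 bowtie2 -x human-index --threads 8 \
-1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz"
echo "${pinned}"
# eval "${pinned}" > /dev/null   # uncomment on a Linux box with the index
```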

@bede

bede commented Sep 12, 2024

Hi @sfiligoi, I do not have these results, but I am curious too. I'll need to repeat the runs using both versions.
