
Increasing --threads increases execution time #437

Closed
bede opened this issue Jun 9, 2023 · 14 comments
@bede

bede commented Jun 9, 2023

Firstly, thank you for developing and maintaining Bowtie2! I noticed that 2.5.1 performs strangely when varying the --threads parameter. Beyond a certain number of threads, execution time increases considerably and CPU utilisation decreases. I've observed this with multiple read datasets using both an x86_64 Ubuntu VM and my arm64 MacOS machine, both using the appropriate GitHub release binaries. The behaviour is reproducible on any one machine, although a given read dataset will not necessarily trigger the problem on both my laptop and the VM.

Here I used ~20 million mixed bacterial and human read pairs with an index built from the human T2T reference plus HLA sequences.

time bowtie2 -x human-index --threads ${threads} -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz > /dev/null

[Attached plot: execution time vs. --threads]
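The timings above could be reproduced with a loop like the following sketch. It reuses the index and FASTQ names from the command above (which you would need locally); the actual bowtie2 invocation is commented out so the loop itself is harmless to run.

```shell
# Hypothetical benchmark loop: run the same alignment at several thread
# counts and compare wall time and CPU utilisation.
for threads in 1 2 4 8 16 32; do
    echo "== ${threads} threads =="
    # Uncomment to run for real; /usr/bin/time reports elapsed time and CPU%.
    # /usr/bin/time -v bowtie2 -x human-index --threads "${threads}" \
    #     -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz > /dev/null
done
```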

VM info

Ubuntu 22.04 LTS

$ bowtie2 --version
/data/bowtie2/bin/bowtie2-align-s version 2.5.1
64-bit
Built on 0ba86a911637
Wed Jan 18 03:20:56 UTC 2023
Compiler: gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) 
Options: -O3 -msse2 -funroll-loops -g3 -g -O2 -fvisibility=hidden -I/hbb_exe_gc_hardened/include -ffunction-sections -fdata-sections -fstack-protector -D_FORTIFY_SOURCE=2 -fPIE -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1 -DWITH_ZSTD
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
$ uname -r
5.15.0-1030-oracle
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7J13 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4890.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good
                          nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                         drand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmca
                         ll fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd a
                         rat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
Virtualization features: 
  Virtualization:        AMD-V
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1 MiB (16 instances)
  L1i:                   1 MiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (4 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
@snayfach

Rolling back to bowtie2-2.4.4-linux-x86_64 solved the issue for me. More recent versions failed to use the number of threads specified: on a machine with 224 CPUs, only 2 were being used with the latest version.

@bede

bede commented Jul 25, 2023

Thank you @snayfach, rolling back to 2.4.4 also worked for me. Great news.

@ch4rr0

ch4rr0 commented Jul 25, 2023

Hello all,

Thank you for your patience. My initial hunch is that this issue was introduced in 2.5.0 with the async changes we made to input reading and writing. @bede, since you are able to reproduce this issue, would you be willing to test v2.4.5 and v2.5.0?

@bede

bede commented Jul 26, 2023

Thanks @ch4rr0, I tested v2.4.5 and v2.5.0 as you suggested on a 32-core machine, specifying 32 threads. 2.4.5 consistently used ~3200% CPU (as reported by top), as expected.

However, 2.5.0 was erratic, jumping between 400% and 3200%. I haven't compared 2.5.0 and 2.5.1, but it is clear to me that this issue began with 2.5.0.

@ch4rr0

ch4rr0 commented Jul 27, 2023

@bede, I pushed a change-set to the bug_fixes branch. Would you be willing to test whether it has any effect on thread behavior?

@ch4rr0

ch4rr0 commented Jul 31, 2023

@snayfach -- would you be willing to test since I have not heard back from bede yet?

@bede

bede commented Aug 2, 2023

Sorry for the delay @ch4rr0, I just got time to test.

I'm afraid the same behaviour remains with the bug_fixes branch in my testing: 800-1600% CPU usage with 32 physical cores, whereas 2.4.5 gives >3100%.

$ git status
On branch bug_fixes
Your branch is up to date with 'origin/bug_fixes'.

nothing to commit, working tree clean
$ bowtie2 --version
/data16/bowtie2-bug_fixes_2023-08-02/bowtie2-align-s version 2.5.1
64-bit
Built on pikachu
Wed Aug  2 13:27:40 UTC 2023
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Options: -O3 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

@ch4rr0

ch4rr0 commented Aug 3, 2023

I have been able to recreate this issue, and indeed my latest push does not resolve it. I have found the cause and am currently working on a solution. In the meantime, increasing --reads-per-batch seems to keep the threads busy for longer and decreases contention on the producer. The current default is 16; increasing that number to 1024 or 2048 seemed to be a sweet spot on my hardware.

Thank you all for your patience.
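A sketch of that workaround, applied to the command from the original report (larger batches mean each worker pulls from the shared input queue less often, so the producer lock is contended less). The command is only echoed here for inspection; drop the echo to actually run it.

```shell
# Workaround sketch: raise --reads-per-batch from its default of 16 to
# one of the values suggested above.
threads=32
batch=2048
cmd="bowtie2 -x human-index --threads ${threads} --reads-per-batch ${batch} \
-1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz"
echo "${cmd}"
# eval "${cmd}" > /dev/null   # uncomment to run against a real index
```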

@ch4rr0

ch4rr0 commented Aug 15, 2023

I committed a few changes to bug_fixes that, in my testing, seem to resolve the issue. Would you be willing to test these changes?

@bede

bede commented Aug 16, 2023

Thanks @ch4rr0, your changes in 6f6458c have resolved the issue for my test case. Looks good to me!

@bede

bede commented Oct 23, 2023

I'm reasonably satisfied that this issue has been fixed in 2.5.2, thanks! 🙏

@sfiligoi

sfiligoi commented Sep 12, 2024

Just a note on "CPUs busy" vs. "faster time to completion": the async changes in 2.5.0 were aimed squarely at reducing time to completion. If IO is not fast enough, idle CPUs are expected, as there is nothing to compute in that case.

That said, it is obvious from the original submission that time to completion was also going in the wrong direction at large thread counts, so I am not saying it was not problematic. But it is not clear from the follow-up comments whether the fix actually resulted in faster time to completion, rather than just all CPUs being busy.

@sfiligoi

sfiligoi commented Sep 12, 2024

@bede Could you also post a "time to completion" comparison for the original use case above, including 2.5.2?

Also, would you be able to post the results when you pin bowtie2 to a fixed number of cores (e.g. using taskset -c on Linux), for both 2.5.1 and 2.5.2?
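A sketch of the suggested pinning experiment: taskset -c restricts a process (and its threads) to the listed cores, so CPU utilisation is bounded by the pinned set regardless of what the thread pool does. Core list and thread count here are illustrative; the command is echoed for inspection rather than executed.

```shell
# Pin bowtie2 to 8 cores while asking for 8 threads (Linux only).
pinned="taskset -c 0-7 bowtie2 -x human-index --threads 8 \
-1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz"
echo "${pinned}"
# eval "${pinned}" > /dev/null   # uncomment on a Linux box with the index
```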

@bede

bede commented Sep 12, 2024

Hi @sfiligoi, I do not have these results, but I am curious too. I'll need to repeat the runs using both versions.
