[FR] improve thread scheduling and affinity #305

Closed
genivia-inc opened this issue Oct 19, 2023 · 2 comments
Labels
enhancement New feature or request

Comments


genivia-inc commented Oct 19, 2023

I was asked by a ugrep user whether I plan to further speed up recursive searching by setting thread affinity in ugrep and by improving worker job scheduling, if that helps. Yes, I have thought about it. It is relatively easy to do with pthreads (I taught it in my HPC class at FSU), but the pthreads code needed alongside C++11 threads is not portable, so I held off on doing this until later.

We all know that setting thread affinity can improve threading performance when certain conditions are met in the way threads are used (thread lifetime, RAM/cache use, memory access time versus CPU time ratio).

A preliminary test of ugrep on Ubuntu with a quad-core x64 CPU with HTT (8 logical cores) shows that a search of /usr with thread affinity runs in 0.12 seconds, which is up to 2x faster than without affinity. So yes, it is worthwhile to set the thread affinity of the worker thread pool in ugrep. However, macOS and Windows do not seem to benefit much, if at all.

On Debian and Ubuntu with a 2.9GHz Intel Core i7 quad core with hyperthreading (8 logical cores) and 16GB 2133MHz LPDDR3, I get around 700% to 750% CPU utilization when running ugrep -Ilr zodiaq /usr in a container, which completes in 0.12 seconds (20437 files and 2011 directories, none of them matching zodiaq). Now, we all know that CPU percentage by itself is generally meaningless. A program could just be running a busy-wait loop, so it is easy to increase CPU% without doing useful work. The real time (a.k.a. wall clock time) is ultimately the measure of performance for getting the work done in parallel. It is the time elapsed for the work performed at a given level of parallelism, and dividing the single-CPU time by it gives the speedup.
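
The difference between CPU% and wall clock time can be made concrete with a small sketch. This is not code from ugrep; the names Timing, measure, speedup and cpu_percent are made up for illustration. It assumes a POSIX-like system where std::clock() reports processor time summed over all threads of the process (on Windows, std::clock() reports wall clock time instead, so the percentage would be meaningless there).

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>

// Elapsed times for one run of a workload (illustrative type, not part of ugrep).
struct Timing {
  double wall_seconds; // real (wall clock) time
  double cpu_seconds;  // processor time used by the process, as reported by std::clock()
};

// Run `work` once and measure both wall clock time and process CPU time.
inline Timing measure(const std::function<void()>& work)
{
  auto wall_start = std::chrono::steady_clock::now();
  std::clock_t cpu_start = std::clock();
  work();
  std::clock_t cpu_end = std::clock();
  auto wall_end = std::chrono::steady_clock::now();
  Timing t;
  t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
  t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
  return t;
}

// Speedup = wall clock time on a single CPU / wall clock time of the parallel run.
inline double speedup(double serial_wall_seconds, double parallel_wall_seconds)
{
  assert(parallel_wall_seconds > 0.0);
  return serial_wall_seconds / parallel_wall_seconds;
}

// CPU utilization in percent; 700% means about seven cores busy on average.
inline double cpu_percent(const Timing& t)
{
  assert(t.wall_seconds > 0.0);
  return 100.0 * t.cpu_seconds / t.wall_seconds;
}
```

A busy-wait loop raises cpu_percent without changing the amount of useful work done, whereas speedup only improves when the wall clock time of the parallel run actually drops.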

To my surprise, recursively searching with option -z (--decompress) also runs a lot faster. I did not expect this, because decompression threads are used in ugrep to feed decompressed streams or plain input (when not compressed) to the search engines of the worker threads, thereby increasing the concurrency of ugrep beyond the number of available physical cores. Still, this makes an even stronger case for setting the thread affinity of the worker thread pool in ugrep.

This optimization will be included in the upcoming 4.3.2 release. Obviously, I need more time to test different programmatic ways to set thread affinity and measure the performance impact.

@genivia-inc genivia-inc added the enhancement New feature or request label Oct 19, 2023
@genivia-inc genivia-inc pinned this issue Oct 19, 2023

genivia-inc commented Oct 21, 2023

Completed the optimization and tested on macOS (M1/arm64 and x64), Windows x64, Debian x64, Ubuntu x64, Android Termux arm64, RPi3 w/ Debian-based Linux, and Cygwin. [old: This should compile on FreeBSD, but I have not been able to test with FreeBSD since we don't have a machine available to do so.] New: I received confirmation that the code is correct for FreeBSD.

The thread affinity and priority are set for the calling thread as follows in the code I wrote (updated to support DragonFly and NetBSD):

// set this thread's affinity and priority, if supported by the OS, ignore errors to leave scheduling to the OS
static void set_this_thread_affinity_and_priority(size_t cpu)
{
  // set affinity

#if defined(OS_WIN) || defined(__CYGWIN__)

  (void)SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpu);

#elif defined(__APPLE__)

  (void)pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);

#elif defined(__FreeBSD__) && defined(HAVE_CPUSET_SETAFFINITY)

  cpuset_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  (void)cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(cpuset), &cpuset);

#elif defined(__DragonFly__) && defined(HAVE_PTHREAD_SETAFFINITY_NP)

  cpuset_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  (void)pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

#elif defined(__NetBSD__) && defined(HAVE_PTHREAD_SETAFFINITY_NP)

  cpuset_t *cpuset = cpuset_create();
  cpuset_set(cpu, cpuset);
  (void)pthread_setaffinity_np(pthread_self(), cpuset_size(cpuset), cpuset);
  cpuset_destroy(cpuset);

#elif defined(HAVE_SCHED_SETAFFINITY)

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  (void)sched_setaffinity(0, sizeof(cpuset), &cpuset);

#elif defined(HAVE_CPUSET_SETAFFINITY)

  cpuset_t cpuset; // cpuset_setaffinity() takes a cpuset_t mask
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  (void)cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(cpuset), &cpuset);

#elif defined(HAVE_PTHREAD_SETAFFINITY_NP)

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  (void)pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

#endif

  // set priority

#if defined(OS_WIN) || defined(__CYGWIN__)

  (void)SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

#elif defined(__APPLE__)

  (void)setpriority(PRIO_DARWIN_THREAD, 0, -20);

#elif defined(HAVE_PTHREAD_SETSCHEDPRIO)

  (void)pthread_setschedprio(pthread_self(), -20);

#elif defined(HAVE_SETPRIORITY)

  (void)setpriority(PRIO_PROCESS, 0, -20);

#endif

  (void)cpu;
}
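
As a usage illustration, here is a minimal sketch of how a worker thread pool might call such a function at the start of each worker. It is not the ugrep implementation: pin_this_thread and run_pinned_workers are hypothetical names, only the Linux sched_setaffinity case is covered, and the mapping of worker i to logical CPU i modulo the CPU count is an assumption, since the issue does not say how the cpu argument is chosen.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // glibc: expose sched_setaffinity and the CPU_* macros
#endif
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>
#if defined(__linux__)
#include <sched.h>
#endif

// Hypothetical stand-in for the function above, reduced to the Linux case.
// Returns true if the OS accepted the affinity request for the calling thread.
inline bool pin_this_thread(std::size_t cpu)
{
#if defined(__linux__)
  if (cpu >= static_cast<std::size_t>(CPU_SETSIZE))
    return false;
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  return sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0; // 0 = calling thread
#else
  (void)cpu;
  return false; // no affinity call in this sketch, leave scheduling to the OS
#endif
}

// Start `num_workers` threads; worker i requests logical CPU i modulo the CPU count
// (an assumed mapping). Returns how many workers were pinned successfully.
inline std::size_t run_pinned_workers(std::size_t num_workers)
{
  std::size_t ncpus = std::thread::hardware_concurrency();
  if (ncpus == 0)
    ncpus = 1; // hardware_concurrency() may return 0 when the count is unknown
  assert(ncpus > 0);
  std::vector<char> pinned(num_workers, 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_workers; ++i)
    workers.emplace_back([i, ncpus, &pinned] {
      pinned[i] = pin_this_thread(i % ncpus); // each worker writes only its own slot
      // ... the worker's search loop would run here ...
    });
  for (auto& w : workers)
    w.join();
  std::size_t count = 0;
  for (char p : pinned)
    count += (p != 0);
  return count;
}
```

Pinning is done from inside each worker because sched_setaffinity with pid 0 applies to the calling thread. A failed request is simply counted and otherwise ignored, in the same spirit as the code above.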

Note: the performance impact on macOS is not observable in my tests and the results don't differ from the current benchmarks. This is not entirely unexpected, because macOS doesn't offer a thread affinity API as far as I know, and I've found the thread performance on macOS quite optimal already. Setting the macOS thread QoS and priority is probably a good idea anyway, and it can't hurt.

Android with Termux is a different story. The code above won't always succeed in setting affinity there. The number of available cores changes all the time. Several online resources confirm this.
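
One way to observe this kind of behavior is to ask the kernel which logical CPUs the calling thread may run on right now before choosing one. The sketch below is an assumption-laden illustration and not part of ugrep: allowed_cpus and choose_cpu are hypothetical helpers, it uses the Linux sched_getaffinity call only, and the returned set can already be stale by the time it is used if the set of usable cores changes.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // glibc: expose sched_getaffinity and the CPU_* macros
#endif
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__linux__)
#include <sched.h>
#endif

// Logical CPUs the calling thread is currently allowed to run on.
// Returns an empty vector when the query fails or is not supported in this sketch.
inline std::vector<std::size_t> allowed_cpus()
{
  std::vector<std::size_t> cpus;
#if defined(__linux__)
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == 0) // 0 = calling thread
    for (std::size_t cpu = 0; cpu < static_cast<std::size_t>(CPU_SETSIZE); ++cpu)
      if (CPU_ISSET(cpu, &cpuset))
        cpus.push_back(cpu);
#endif
  return cpus;
}

// Pick a CPU for worker `index` by cycling through the allowed set.
// Precondition: `allowed` is not empty; the caller skips pinning otherwise.
inline std::size_t choose_cpu(std::size_t index, const std::vector<std::size_t>& allowed)
{
  assert(!allowed.empty());
  return allowed[index % allowed.size()];
}
```

A caller would skip pinning when allowed_cpus() returns an empty vector and would still ignore a failed affinity call afterwards, leaving scheduling to the OS.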

Updated ugrep v4.3.1-1 is committed and can be cloned from this repo to build and test. I will release v4.3.2 later, after more testing.

genivia-inc added a commit that referenced this issue Oct 21, 2023
set thread affinity and priority #305 & improve TUI regex syntax highlighting of --bool AND/OR/NOT
@genivia-inc

Included with release v4.3.2.

@genivia-inc genivia-inc unpinned this issue Nov 3, 2023
stdedos pushed a commit to stdedos/ugrep that referenced this issue Jan 18, 2024