Describe the bug
For whatever reason, across multiple compilers (msvc, clang, gcc) and C runtime libraries (ucrt, cygwin, msvcrt), the qsort step of --train-cover takes an extremely long time. Profiling an msvc-built zstd shows that the qsort function body is not inlinable, which results in an excessive number of forced function calls from qsort into zstd and then back out to a non-inlined memcmp. That said, I could be misinterpreting the data and need to look at the actual asm to be sure. It may instead just be pessimal pivot selection leading to worst-case performance.
Training on a dataset of 100MB takes almost 45 minutes even with full parallelism. A naive reimplementation in C++ which calls std::stable_sort(std::execution::par_unseq, ...) completes the sort step in seconds.
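For illustration, the kind of sort described above can be sketched roughly as follows. This is a minimal sketch and not the code from zstd or from the linked gist: the function name sortDmerPositions and the USE_PARALLEL_SORT macro are made up for this example, and it assumes the sort orders every d-byte window of the training data by its bytes. The comparator is a lambda, so the compiler can inline it together with the call to memcmp. The parallel policy is only used when USE_PARALLEL_SORT is defined, because some standard libraries need an extra backend (for example TBB with libstdc++) to run it in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>
#ifdef USE_PARALLEL_SORT
#include <execution>
#endif

// Hypothetical helper, not part of zstd: returns the start position of every
// d-byte window in `data`, ordered by the bytes of that window.
std::vector<std::uint32_t> sortDmerPositions(const std::vector<unsigned char>& data,
                                             std::size_t d) {
    assert(d > 0 && d <= data.size());

    // One entry per window: positions 0 .. size - d.
    std::vector<std::uint32_t> positions(data.size() - d + 1);
    std::iota(positions.begin(), positions.end(), 0u);

    // Compare two windows by their first d bytes.
    const unsigned char* base = data.data();
    auto less = [base, d](std::uint32_t a, std::uint32_t b) {
        return std::memcmp(base + a, base + b, d) < 0;
    };

#ifdef USE_PARALLEL_SORT
    // Assumption: the toolchain provides a working parallel backend.
    std::stable_sort(std::execution::par_unseq,
                     positions.begin(), positions.end(), less);
#else
    std::stable_sort(positions.begin(), positions.end(), less);
#endif
    return positions;
}
```

Because the sort is stable and the positions start out in ascending order, windows with identical bytes stay ordered by position, so the result is deterministic.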
To Reproduce
Steps to reproduce the behavior:
Get a 100+ MB training dataset.
Run zstd.exe --train-cover=steps=512,d=10 -T0 -r data --maxdict=100KB -o data.dict --memory-limit 100MB -v
Observe functional hang at Constructing partial suffix array.
Compare against the same training set on Linux.
Expected behavior
--train-cover can be used on larger datasets without issue on all operating systems.
Screenshots and charts
qsort implementation performance (training did not finish; this is a 3 minute 21 second snapshot of the Constructing partial suffix array step):
C++ implementation with par_unseq (completes in a few seconds with less than half the total CPU time; RtlUserThreadStart is the rollup of the threads created to do the parallel sort):
C++ single-threaded (~30 seconds wall time):
Desktop (please complete the following information):
OS: Windows
Version: 11
Compiler: msvc VS2022 (also reproduces with zstd prebuilts in msys2 environments built with clang, gcc against cygwin, msvc, and ucrt C runtimes: https://www.msys2.org/docs/environments/)
Flags: "Release" visual studio configuration
Other relevant hardware specs: Threadripper, 32 cores / 64 threads
Build system: visual studio
Additional context
I don't think I botched the C++ implementation, but I may have broken something with all the void* casts happening. The biggest difference is I had to change the 'tiebreaker' because the pointer value is not passed to the comparator. https://gist.github.com/akrieger/c023dad6ffe5eac7d44a3a34a3dc7721
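One possible way to express such a tiebreaker, shown here only as a hedged sketch and not necessarily what the gist does: a qsort comparator receives pointers to the two elements and can compare those addresses, whereas a C++ comparator receives the element values. If the elements are positions into the training data, the position itself can serve as the tiebreaker. The names StrictDmerLess and sortPositionsStrict are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical comparator: orders positions by their d-byte window and
// breaks ties by the position value, which makes it a strict total order.
struct StrictDmerLess {
    const unsigned char* base;
    std::size_t d;

    bool operator()(std::uint32_t a, std::uint32_t b) const {
        const int c = std::memcmp(base + a, base + b, d);
        if (c != 0) return c < 0;
        return a < b;  // tiebreaker: lower position first
    }
};

// Sorts `positions` in place; every position must leave room for d bytes.
void sortPositionsStrict(const std::vector<unsigned char>& data, std::size_t d,
                         std::vector<std::uint32_t>& positions) {
    for (std::uint32_t p : positions) {
        assert(p + d <= data.size());
    }
    std::sort(positions.begin(), positions.end(),
              StrictDmerLess{data.data(), d});
}
```

With distinct positions no two elements compare equal, so a non-stable std::sort gives the same order that a stable sort over ascending positions would.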