Use link-time optimization for the production build #10942
Comments
A preliminary, naive single-node performance test of a yb-tserver binary built with Clang 12 link-time optimization (LTO) shows about a 30% throughput improvement on a write-only CassandraKeyValue workload.

LTO works by putting LLVM bitcode into .o files instead of native code; at link time, the entire program is loaded into memory and optimized as a whole. This allows, for example, better inlining and devirtualization (replacing virtual function calls with direct calls when the class is known at compile time). For a dynamically linked program (the way we build code today), these optimizations are not possible because in theory any function could be replaced by a different implementation, e.g. through LD_PRELOAD.

https://gist.githubusercontent.com/mbautin/6d2debaef1286aa045afde0c08853760/raw -- and linking the remaining shared libraries statically could be even better (right now a few libraries are still dynamically linked: https://gist.githubusercontent.com/mbautin/bc8769a9ae93f8d6d1f2244591a08376/raw ). Potentially we could even link statically with glibc (we'll have to rebuild it). On the flip side, this statically linked binary is 480 MB with debug info (but only 61 MB without it).

I was thinking of creating a "busybox style" binary (busybox is a single executable that provides lots of Unix utilities -- https://en.wikipedia.org/wiki/BusyBox ). We could create one binary that acts as yb-master, yb-tserver, or postgres, depending on argv[0]. The rest of the tools in our release tarball could still use dynamic linking the same way they do today.
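The "busybox style" dispatch mentioned above can be sketched roughly as follows. This is only an illustration of selecting an entry point from argv[0]; the real binary would be C++, and the names `run_master`, `run_tserver`, `run_postgres`, `ENTRY_POINTS`, and `select_entry_point` are hypothetical, not taken from the YugabyteDB code base.

```python
import posixpath

# Hypothetical stand-ins for the real server entry points.
def run_master(args):
    return "running yb-master with %d args" % len(args)

def run_tserver(args):
    return "running yb-tserver with %d args" % len(args)

def run_postgres(args):
    return "running postgres with %d args" % len(args)

# Map the name the binary was invoked under to an entry point.
ENTRY_POINTS = {
    "yb-master": run_master,
    "yb-tserver": run_tserver,
    "postgres": run_postgres,
}

def select_entry_point(argv0):
    """Pick the entry point based on the base name of argv[0]."""
    name = posixpath.basename(argv0)
    if name not in ENTRY_POINTS:
        raise ValueError("unknown program name: %r" % name)
    return ENTRY_POINTS[name]

def main(argv):
    # argv[0] decides the role; the remaining arguments are passed through.
    return select_entry_point(argv[0])(argv[1:])
```

With this shape, symlinks or hard links named yb-master, yb-tserver, and postgres could all point at the same executable, which is how busybox itself is typically installed.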
…lang 12

Summary: Link-time optimization ( https://llvm.org/docs/LinkTimeOptimization.html ) allows for more aggressive optimizations, including inlining, compared to the shared library based model that we currently use. This diff enables link-time optimization for the Clang 12 Linuxbrew-based release build for the yb-tserver executable only, producing a binary that statically links all object files needed by yb-tserver, including those that are included in the yb_pgbackend library. Third-party libraries are linked statically but are not LTO-enabled yet.

The linking of the final LTO-enabled binary is currently done outside of the CMake build system, using the dependency_graph.py tool, which can access the dependency graph of targets and object files and therefore has all the information needed to construct the linker command line. This also gives us more flexibility in customizing the linker command line compared to attempts to do this in the CMake build system. Moving this linking step to CMake may be a future project.

Refactored the dependency_graph.py script into multiple modules: dependency_graph.py, dep_graph_common.py, source_files.py, as well as lto.py (with the new LTO logic). Also refactored master_main.cc and tablet_server_main.cc and extracted common initialization code to tserver/server_main_util.cc. It is in the tserver directory because the master code currently uses the tserver code.

For building LTO-enabled binaries, we need to use LLVM's lld linker. It has issues with our distributed compilation framework ( #11034 ). We work around this by always running LLD-enabled linking commands locally and not on a remote build worker.

Various static initialization issues were identified and fixed as part of this work. If not fixed, these would result in the yb-tserver binary crashing immediately with a core dump.
- In consensus_queue.cc, the RpcThrottleThresholdBytesValidator function for validating the rpc_throttle_threshold_bytes flag was trying to access other flags before they were fully initialized. Moved this validation to the main program.
- The webserver_doc_root flag was calling yb::GetDefaultDocumentRoot() to determine its default value. Moved that default value determination to where the flag is being used.
- [ #11033 ] The INTERNAL_TRACE_EVENT_ADD_SCOPED macro, when invoked during static initialization, led to a crash in std::string construction. Added a new atomic trace_events_enabled for enabling trace events and only turned it on after main() started executing. The INTERNAL_TRACE_EVENT_ADD_SCOPED macro is a no-op before trace_events_enabled is set to true.
- [ #10964 ] The kGlobalTransactionTableName global constant of the YBTableName type relied on the statically initialized string constant, kGlobalTransactionsTableName, which turned out to be empty during initialization. As a result, the transaction status table could not be properly located. Changed kGlobalTransactionsTableName to be a `const char*`.

In addition, in the LTO-enabled build, it became apparent that some symbols were duplicated between the gperftools library and the gutil part of YugabyteDB code ( #10956 ):

- AtomicOps_Internalx86CPUFeatures -- renamed to YbAtomicOps_Internalx86CPUFeatures
- RunningOnValgrind -- renamed to YbRunningOnValgrind
- ValgrindSlowdown -- renamed to YbValgrindSlowdown
- base::internal::SpinLockDelay, base::internal::SpinLockWake -- added a top-level yb namespace

To make it easy to switch between regular and LTO binaries, we are updating yb-ctl to support the YB_CTL_TSERVER_DAEMON_FILE_NAME and YB_CTL_MASTER_DAEMON_FILE_NAME environment variables. For example, by setting YB_CTL_TSERVER_DAEMON_FILE_NAME=yb-tserver-lto, you can tell yb-ctl to launch the tablet server using build/latest/bin/yb-tserver-lto.
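As a rough illustration of the environment-variable override described above, a lookup like the following would behave as the example says. This is a sketch under assumptions, not the actual yb-ctl code: the function `daemon_binary_path`, the two dictionaries, and the default `bin_dir` are hypothetical; only the environment variable names and the yb-tserver-lto example come from the description.

```python
import os
import posixpath

# Default executable names, used when no override is set (assumed defaults).
DEFAULT_DAEMON_FILE_NAMES = {
    "master": "yb-master",
    "tserver": "yb-tserver",
}

# Environment variables named in the description above.
ENV_VAR_NAMES = {
    "master": "YB_CTL_MASTER_DAEMON_FILE_NAME",
    "tserver": "YB_CTL_TSERVER_DAEMON_FILE_NAME",
}

def daemon_binary_path(daemon_type, bin_dir="build/latest/bin", env=None):
    """Return the path of the daemon executable, honoring the override if set."""
    if env is None:
        env = os.environ
    # An unset or empty variable falls back to the default file name.
    file_name = env.get(ENV_VAR_NAMES[daemon_type]) or DEFAULT_DAEMON_FILE_NAMES[daemon_type]
    return posixpath.join(bin_dir, file_name)
```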
However, for the release package, the LTO-enabled yb-tserver executable will still be named yb-tserver, replacing the previous shared library based executable.

Another tooling change in this diff is how we handle the `--no-tests` flag passed to `yb_build.sh`. That flag sets the YB_DO_NOT_BUILD_TESTS environment variable to 1, which makes our CMake scripts skip all the test targets. However, it is easy to forget to keep specifying that flag. In this diff, we are storing the variable BUILD_TESTS in CMake's build cache and reusing it during future CMake runs, without the developer having to specify `--no-tests`. It can be reset by setting YB_DO_NOT_BUILD_TESTS=0.

Test Plan: Jenkins

```
# ./yb_build.sh --clang12 release
# build-support/tserver_lto.sh
# ldd build/latest/bin/yb-tserver-lto
linux-vdso.so.1 (0x00007fff535bf000)
libm.so.6 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/libm.so.6 (0x00007f1b85b7d000)
libgcc_s.so.1 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/libgcc_s.so.1 (0x00007f1b85966000)
libc.so.6 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/libc.so.6 (0x00007f1b855ca000)
/opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/ld.so => /lib64/ld-linux-x86-64.so.2 (0x00007f1b85e80000)
libdl.so.2 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/libdl.so.2 (0x00007f1b853c6000)
libpthread.so.0 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/libpthread.so.0 (0x00007f1b851a9000)
librt.so.1 => /opt/yb-build/brew/linuxbrew-20181203T161736v9/lib/librt.so.1 (0x00007f1b84fa1000)
```

The yb-tserver-lto binary is ~326 MiB.

Microbenchmark
--------------

The test was done on a dual-socket Xeon E5-2670 machine (16 cores total, 32 hyper-threads) running AlmaLinux 8.5. Details: https://gist.githubusercontent.com/mbautin/7f9784fb2ea4173539d2e2656cfe117f/raw

Results (CassandraKeyValue workload): 78K ops/sec with GCC 5.5, 85K ops/sec with Clang 12 without LTO, 104K ops/sec with Clang 12 with LTO.
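To put the three throughput numbers above side by side, the relative improvements can be computed as below. The figures are the ones reported in the results; the helper name `speedup_percent` is just illustrative.

```python
def speedup_percent(baseline_ops, new_ops):
    """Relative throughput improvement of new_ops over baseline_ops, in percent."""
    return (new_ops / baseline_ops - 1.0) * 100.0

# Throughput figures from the microbenchmark results (ops/sec).
results = {
    "GCC 5.5": 78_000,
    "Clang 12 without LTO": 85_000,
    "Clang 12 with LTO": 104_000,
}

lto_ops = results["Clang 12 with LTO"]
for name, ops in results.items():
    if name != "Clang 12 with LTO":
        print("LTO vs %s: +%.1f%%" % (name, speedup_percent(ops, lto_ops)))
```

On these numbers, the LTO build is roughly 22% faster than the non-LTO Clang 12 build and roughly 33% faster than the GCC 5.5 build.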
Reviewers: sergei

Reviewed By: sergei

Subscribers: sergei, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D14616
Enabled for yb-tserver. Will create follow-up issues for doing more LTO for yb-master and postgres.
It would probably be very good for YugabyteDB performance to compile all YB + postgres code using Clang's LTO (link time optimization).
Preliminary results produced with Clang 12, thin LTO, x86_64, using the Linuxbrew glibc and other libraries. First and third experiments are on the LTO build, the middle one is the non-LTO release build.
The build is done with `-fwhole-program`, and YB + postgres code is currently included in LTO (could also include third-party libraries). Only yb-tserver is compiled with LTO.