Optimization progress
This page documents the progress made while working on the code. For now it only shows results for local (single-node) operation. All results below were gathered using this command line:
-Disableoutput -Problem=moving_star -Max_level=6 -Stopstep=1 -Xscale=32 \
-Odt=0.5 -Stoptime=0.1 --hpx:threads=6
The results were gathered on a two-socket Nehalem system with 6 cores per socket. Note that this means all vector operations are limited to SSE/SSE2.
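To make that limitation concrete, here is a minimal sketch (assuming Vc 1.x is available; it is not part of the original write-up) that prints the SIMD width Vc actually uses on such a machine:

#include <Vc/Vc>
#include <iostream>

int main()
{
    // Number of doubles per native SIMD register for this build target.
    // On a Nehalem (SSE/SSE2-only) build this is 2; with AVX it would be 4.
    std::cout << "native double lanes: " << Vc::double_v::size() << '\n';

    // The SimdArray<double, 8> used in taylor<>::set_basis always exposes
    // 8 lanes; on SSE2 it is implemented as several 2-wide vectors.
    std::cout << "SimdArray<double, 8> lanes: "
              << Vc::SimdArray<double, 8>::size() << '\n';
    return 0;
}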
Here is the baseline data, commit 850bf4 (click on the image to see it full size):
This clearly shows that the overall runtime is dominated by five functions:
- taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis
- grid::compute_boundary_interactions_multipole_multipole
- grid::compute_interactions
- grid::compute_boundary_interactions_monopole_multipole
- grid::compute_boundary_interactions_monopole_monopole
After applying some optimizations to the taylor loops and restructuring the code by lifting index operations out of loops, we get this (commit 3c24cd):
which is a clear improvement.
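As a hedged illustration of what "lifting index operations out of loops" means (the function and names below are made up, not the actual Octotiger code), the loop-invariant part of a flattened 3D index is computed once before the inner loop instead of in every iteration:

#include <cstddef>
#include <vector>

// Illustrative only: scale one (i, j) row of a flattened 3D array.
void scale_row(std::vector<double>& a, std::size_t nj, std::size_t nk,
               std::size_t i, std::size_t j, double factor)
{
    // Before: the full index i * nj * nk + j * nk + k was recomputed
    // inside the loop, redoing the i/j part on every iteration.
    // After: hoist the loop-invariant part of the index out of the loop.
    const std::size_t base = i * nj * nk + j * nk;
    for (std::size_t k = 0; k < nk; ++k)
        a[base + k] *= factor;
}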
The next step focused on reducing the impact of calling pow (which was now among the top five functions). This mainly involved converting const variables and functions to constexpr and pre-calculating certain expressions. The result can be seen here (commit 432888):
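As an illustration of this kind of change (the names below are hypothetical, not taken from the actual commit): once the inputs are constexpr, an integer power that previously went through a run-time pow call can be pre-calculated by the compiler:

// Before: run-time constants force a run-time std::pow call in the hot path.
//   const double dx = 1.0 / 64.0;
//   const double cell_volume = std::pow(dx, 3);

// After: a constexpr helper plus constexpr inputs let the compiler fold the
// whole expression; no pow call is left at run time.
constexpr double ipow(double base, int exp)
{
    return exp == 0 ? 1.0 : base * ipow(base, exp - 1);
}

constexpr double dx = 1.0 / 64.0;
constexpr double cell_volume = ipow(dx, 3);

static_assert(cell_volume > 0.0, "pre-calculated at compile time");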
Now we do the same for std::copysign. The result can be seen here (commit f846e7):
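A minimal sketch of the same idea applied to copysign (again a hypothetical helper, not the code from the commit): std::copysign is not constexpr in this C++ standard, so a simple branch-based replacement can be evaluated at compile time wherever the arguments are known:

// Ignores the sign of negative zero and NaNs, which std::copysign handles;
// for pre-calculating ordinary constants that is sufficient.
constexpr double copysign_c(double mag, double sgn)
{
    return sgn < 0.0 ? (mag < 0.0 ? mag : -mag)
                     : (mag < 0.0 ? -mag : mag);
}

static_assert(copysign_c(3.0, -1.0) == -3.0, "evaluated at compile time");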
All of the above changes have improved the overall runtime by almost 30%.
The next steps should focus on further optimizing the three remaining hotspot functions:
Top Hotspots:
- taylor<5,class Vc_1::SimdArray<double,8,class Vc_1::Vector<double,struct Vc_1::VectorAbi::Sse>,2> >::set_basis: 174.351s
- grid::compute_boundary_interactions_multipole_multipole: 138.763s
- grid::compute_interactions: 105.589s
The following figures show Octotiger running on one node of the KNL system at UO. HPX was configured with:
cmake -DCMAKE_CXX_COMPILER="icpc" \
-DCMAKE_C_COMPILER="icc" \
-DCMAKE_Fortran_COMPILER="ifort" \
-DCMAKE_LINKER="xild" \
-DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DCMAKE_BUILD_TYPE=Release \
-DHPX_WITH_MAX_CPU_COUNT=272 \
-DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
-DCMAKE_INSTALL_PREFIX=${startdir}/install-knl \
-DHPX_WITH_APEX=TRUE \
-DAPEX_WITH_ACTIVEHARMONY=TRUE \
-DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
-DAPEX_WITH_OTF2=TRUE \
-DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
-DHPX_WITH_MALLOC=jemalloc \
-DJEMALLOC_ROOT=${HOME}/install/jemalloc-3.5.1 \
-DHWLOC_ROOT=${HOME}/install/hwloc-1.8 \
-DHPX_WITH_TOOLS=ON \
${HOME}/src/hpx-lsu
Octotiger was configured with:
cmake -DCMAKE_PREFIX_PATH=$HOME/src/tmp/build-knl \
-DCMAKE_CXX_COMPILER="icpc" \
-DCMAKE_C_COMPILER="icc" \
-DCMAKE_AR="xiar" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DCMAKE_Fortran_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DCMAKE_CXX_FLAGS="-xMIC-AVX512 -march=native -fast" \
-DOCTOTIGER_WITH_SILO=OFF \
$HOME/src/octotiger
Octotiger was executed with:
-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=68
Here is a view of an OTF2 trace of Octotiger in Vampir 8.5:
Here is a zoomed view (~10 ms) of the same OTF2 trace in Vampir 8.5:
Here is an APEX concurrency view of the same execution (sampled every 1 second):
Several questions/observations about this execution:
- I requested two iterations. Why does it look like four? There appears to be a period halfway through an iteration where concurrency drops to near zero.
- One worker (#27) appears to do nothing. That is actually the APEX background task updating the APEX profile; none of that work is measured by APEX, hence the apparently "idle" thread. This has since been fixed (see below).
- The overall concurrency is poor - the hardware is less than half utilized on average.
- When zooming in on the trace in Vampir, the poor concurrency is obvious.
The build process was streamlined and improved. I created a KNL toolchain file for HPX that provides the settings for building on "normal" socket-based KNLs.
The new toolchain file:
# Copyright (c) 2016 Kevin Huck
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#
# This is the default toolchain file to be used with Intel Xeon KNLs. It sets
# the appropriate compile flags and compiler such that HPX will compile.
# Note that you still need to provide Boost, hwloc and other utility libraries
# like a custom allocator yourself.
#
# Set the Intel compilers
set(CMAKE_CXX_COMPILER icpc)
set(CMAKE_C_COMPILER icc)
set(CMAKE_Fortran_COMPILER ifort)
set(CMAKE_C_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_C_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_CXX_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_CXX_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_Fortran_FLAGS_INIT "-xMIC-AVX512" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_Fortran_FLAGS "-fPIC" CACHE STRING "")
set(CMAKE_SHARED_LIBRARY_CREATE_Fortran_FLAGS "-shared" CACHE STRING "")
set(HPX_WITH_PARCELPORT_TCP ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI ON CACHE BOOL "")
set(HPX_WITH_PARCELPORT_MPI_MULTITHREADED OFF CACHE BOOL "")
# We default to system as our allocator on the KNL
if(NOT DEFINED HPX_WITH_MALLOC)
set(HPX_WITH_MALLOC "system" CACHE STRING "")
endif()
# Set the TBBMALLOC_PLATFORM correctly so that find_package(TBBMalloc) sets the
# right hints
set(TBBMALLOC_PLATFORM "mic-knl" CACHE STRING "")
# We have a bunch of cores on the MIC ... increase the default
set(HPX_WITH_MAX_CPU_COUNT "512" CACHE STRING "")
# RDTSC is available on Xeon/Phis
set(HPX_WITH_RDTSC ON CACHE BOOL "")
The new HPX config:
cmake -DCMAKE_TOOLCHAIN_FILE=$HOME/src/hpx-lsu/cmake/toolchains/KNL.cmake \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBOOST_ROOT=/usr/local/packages/boost/1.61.0-knl \
-DHPX_WITH_DATAPAR_VC=On -DVc_ROOT=$HOME/src/operation_gordon_bell/Vc-icc \
-DCMAKE_INSTALL_PREFIX=. \
-DHPX_WITH_MALLOC=tcmalloc \
-DTCMALLOC_ROOT=${startdir}/gperftools \
-DHWLOC_ROOT=${startdir}/hwloc \
-DHPX_WITH_APEX=TRUE \
-DHPX_WITH_APEX_NO_UPDATE=TRUE \
-DAPEX_WITH_ACTIVEHARMONY=TRUE \
-DACTIVEHARMONY_ROOT=${HOME}/install/activeharmony/4.6.0-knl \
-DAPEX_WITH_OTF2=TRUE \
-DOTF2_ROOT=${HOME}/install/otf2/2.0-knl \
${HOME}/src/hpx-lsu
Octotiger was configured with:
cmake -DCMAKE_PREFIX_PATH=$HOME/src/operation_gordon_bell/build-knl \
-DCMAKE_BUILD_TYPE=Release \
-DOCTOTIGER_WITH_SILO=OFF \
$HOME/src/octotiger
Octotiger was executed with (${threads} was set to 68, 136, 208, 272):
-Disableoutput -Problem=moving_star -Max_level=4 -Stopstep=1 --hpx:threads=${threads} -Ihpx.stacks.use_guard_pages=0
Kevin executed Octotiger on a 24-core node with two Sandy Bridge processors. The run was tracked with ThreadSpotter, a performance tool for measuring cache performance and contention between threads. The report is available here. While bandwidth does not seem to be a problem, latency, locality, and contention are.
Unfortunately, the binary compiled with the Intel compilers did not have debug information compatible with the analysis tool, so this executable (and HPX) were compiled with GCC 5.3 instead of Intel 17. The CMake build type was "RelWithDebInfo" for both HPX and Octotiger.