Allow to choose fine-grained CPU intrinsics on as CMake options #849

Merged
merged 3 commits into from
Apr 9, 2021
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [Unreleased]

 ### Added
+- Allow for fine-grained CPU intrinsics overrides when BUILD_ARCH != native e.g. -DBUILD_ARCH=x86-64 -DCOMPILE_AVX512=off
 - Better suppression of unwanted output symbols, specifically "\n" from SentencePiece with byte-fallback. Can be deactivated with --allow-special
 - Display decoder time statistics with marian-decoder --stat-freq 10 ...
 - Support for MS-internal binary shortlist
@@ -34,6 +35,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 - Broken links to MNIST data sets

 ### Changed
+- For BUILD_ARCH != native enable all intrinsics types by default, can be disabled like this: -DCOMPILE_AVX512=off
 - Moved FBGEMM pointer to commit c258054 for gcc 9.3+ fix
 - Change compile options a la -DCOMPILE_CUDA_SM35 to -DCOMPILE_KEPLER, -DCOMPILE_MAXWELL,
 -DCOMPILE_PASCAL, -DCOMPILE_VOLTA, -DCOMPILE_TURING and -DCOMPILE_AMPERE
91 changes: 68 additions & 23 deletions CMakeLists.txt
@@ -124,50 +124,95 @@ else(MSVC)

 # Detect supported CPU intrinsics for the current platform. This will
 # only be used with BUILD_ARCH=native. For overridden BUILD_ARCH we
-# minimally use -msse4.1. This seems to work with MKL.
+# force intrinsics as set in the options.
 set(INTRINSICS "")
 list(APPEND INTRINSICS_NVCC)

+option(COMPILE_SSE2 "Compile CPU code with SSE2 support" ON)
+option(COMPILE_SSE3 "Compile CPU code with SSE3 support" ON)
+option(COMPILE_SSE4_1 "Compile CPU code with SSE4.1 support" ON)
+option(COMPILE_SSE4_2 "Compile CPU code with SSE4.2 support" ON)
+option(COMPILE_AVX "Compile CPU code with AVX support" ON)
+option(COMPILE_AVX2 "Compile CPU code with AVX2 support" ON)
+option(COMPILE_AVX512 "Compile CPU code with AVX512 support" ON)
+
 if(BUILD_ARCH STREQUAL "native")
+  # @TODO: if we are building "-march=native" anyway is the whole shebang here even useful?
Member

maybe not? :)

Contributor

-march=native already enables every flag the host CPU supports, so specifying the additional flags won't do anything.

Member Author

Yeah, I think the only thing that's useful is that the XXX_FOUND vars get added to the build and can be displayed with --build-info. I think I will leave the messages in but remove the flags maybe.

Member Author (@emjotde, Apr 9, 2021)

OK, the messages here should rather inform the user that with -march=native their request to build with e.g. -DCOMPILE_AVX512=off will essentially be ignored if AVX512 was detected, since the compiler will add it anyway.
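
One way to check what the compiler actually enabled, independent of the COMPILE_* options, is to look at its predefined macros. The sketch below is not part of this PR. It assumes GCC or Clang, which define `__SSE2__`, `__AVX2__`, `__AVX512F__` and similar macros when the corresponding `-m` flag (or `-march=native` on a capable host) is in effect. The function name `enabledIntrinsics` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from Marian): lists the instruction-set
// extensions enabled for this translation unit, based on macros that
// GCC and Clang predefine for flags such as -msse4.1 or -mavx2.
std::vector<std::string> enabledIntrinsics() {
  std::vector<std::string> found;
#ifdef __SSE2__
  found.push_back("SSE2");
#endif
#ifdef __SSE3__
  found.push_back("SSE3");
#endif
#ifdef __SSE4_1__
  found.push_back("SSE4.1");
#endif
#ifdef __SSE4_2__
  found.push_back("SSE4.2");
#endif
#ifdef __AVX__
  found.push_back("AVX");
#endif
#ifdef __AVX2__
  found.push_back("AVX2");
#endif
#ifdef __AVX512F__
  found.push_back("AVX512F");
#endif
  return found;
}
```

Built with -march=native on an AVX512-capable host, this would be expected to report AVX512F even if -DCOMPILE_AVX512=off was passed, which is the situation described in the comment above.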

Collaborator

In the long run I'd still like to see fat binaries that determine CPU features at run time, so that we can actually deliver binaries that run everywhere with the best code path for the given architecture. For dockerization, I always have to compile with the lowest common set, because I don't know ahead of time what architecture the container will run on.

Member Author

Worth taking a look at.

Member Author

We actually might have similar requirements for something like that in the very near future. We might want to sync?

Contributor

This would be quite time consuming, but very rewarding potentially.

The way we do it in intgemm is that at run time you have a bunch of function pointers that get initialized to the kernel that corresponds to your architecture. We would have to do that for all the performance-critical Marian functions and then make sure that the non-performance-critical parts only generate generic x86 instructions.
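
A minimal sketch of the function-pointer approach described above, assuming GCC or Clang on x86 (`__builtin_cpu_supports` and the `target` attribute are compiler extensions). The names `DotFn`, `dotGeneric`, `dotAvx2`, `selectDot`, `dot` and the macro `HAVE_X86_DISPATCH` are hypothetical and not taken from intgemm or Marian.

```cpp
#include <cstddef>

// Hypothetical kernel signature; not taken from intgemm or Marian.
using DotFn = float (*)(const float*, const float*, std::size_t);

// Portable fallback that uses no ISA-specific instructions.
float dotGeneric(const float* a, const float* b, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
// The target attribute allows AVX2 code generation in this function only,
// so the rest of the binary stays runnable on older CPUs. Whether the loop
// is actually vectorized is up to the compiler.
__attribute__((target("avx2")))
float dotAvx2(const float* a, const float* b, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}
#endif

// Picks a kernel by querying the CPU at run time.
DotFn selectDot() {
#ifdef HAVE_X86_DISPATCH
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return dotAvx2;
#endif
  return dotGeneric;
}

// Public entry point: the pointer is resolved once, on first use.
float dot(const float* a, const float* b, std::size_t n) {
  static const DotFn kernel = selectDot();
  return kernel(a, b, n);
}
```

With this layout only the kernels carry ISA-specific code, so the translation unit itself can be built with a conservative baseline.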

Collaborator

If you want to be able to distribute binaries that make the most of the available hardware, you either have to maintain a zoo of binaries and educate users how to determine which of the many versions available is the right one for them, or have the software make that decision for them. In that sense it's not only rewarding, but inevitable. We should focus on decoding first (anyone with the technical knowledge to set up training will be competent to compile).

My hunch is that we can replace preprocessor switches with specializations of (inline) template functions. MKL is currently another obstacle, as it insists on linking to a dynamic system library. It's currently not possible to create a fully static executable even if you know the CPU intrinsics available and/or are willing to go with the minimum set of intrinsics required.
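
A rough sketch of what specialization instead of preprocessor switches could look like. It is an illustration only: the `Isa` enum, `addVectors` and `addVectorsFor` are hypothetical names, and the AVX2 specialization contains a plain loop as a placeholder. Real AVX2 intrinsics would additionally need that function to be compiled with AVX2 enabled, via a flag or a per-function target attribute.

```cpp
#include <cstddef>

// Hypothetical ISA tags; not taken from Marian.
enum class Isa { Generic, Avx2 };

// The primary template is only declared; each supported ISA gets its own
// explicit specialization instead of an #ifdef branch.
template <Isa I>
inline void addVectors(const float* a, const float* b, float* out, std::size_t n);

template <>
inline void addVectors<Isa::Generic>(const float* a, const float* b,
                                     float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Placeholder body: a real version would use AVX2 intrinsics here.
template <>
inline void addVectors<Isa::Avx2>(const float* a, const float* b,
                                  float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Maps a run-time ISA value onto the matching compile-time specialization.
inline void addVectorsFor(Isa isa, const float* a, const float* b,
                          float* out, std::size_t n) {
  switch (isa) {
    case Isa::Avx2:    addVectors<Isa::Avx2>(a, b, out, n); break;
    case Isa::Generic: addVectors<Isa::Generic>(a, b, out, n); break;
  }
}
```

All specializations end up in the same binary, so the choice between them can be deferred to run time.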

   message(STATUS "Checking support for CPU intrinsics")
   include(FindSSE)
-  if(SSE2_FOUND)
-    message(STATUS "SSE2 support found")
+  if(SSE2_FOUND AND COMPILE_SSE2)
+    message(STATUS "SSE2 support requested and found")
     set(INTRINSICS "${INTRINSICS} -msse2")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse2)
-  endif(SSE2_FOUND)
-  if(SSE3_FOUND)
-    message(STATUS "SSE3 support found")
+  endif(SSE2_FOUND AND COMPILE_SSE2)
+  if(SSE3_FOUND AND COMPILE_SSE3)
+    message(STATUS "SSE3 support requested and found")
     set(INTRINSICS "${INTRINSICS} -msse3")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse3)
-  endif(SSE3_FOUND)
-  if(SSE4_1_FOUND)
-    message(STATUS "SSE4.1 support found")
+  endif(SSE3_FOUND AND COMPILE_SSE3)
+  if(SSE4_1_FOUND AND COMPILE_SSE4_1)
+    message(STATUS "SSE4.1 support requested and found")
     set(INTRINSICS "${INTRINSICS} -msse4.1")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse4.1)
-  endif(SSE4_1_FOUND)
-  if(SSE4_2_FOUND)
-    message(STATUS "SSE4.2 support found")
+  endif(SSE4_1_FOUND AND COMPILE_SSE4_1)
+  if(SSE4_2_FOUND AND COMPILE_SSE4_2)
+    message(STATUS "SSE4.2 support requested and found")
     set(INTRINSICS "${INTRINSICS} -msse4.2")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse4.2)
-  endif(SSE4_2_FOUND)
-  if(AVX_FOUND)
-    message(STATUS "AVX support found")
+  endif(SSE4_2_FOUND AND COMPILE_SSE4_2)
+  if(AVX_FOUND AND COMPILE_AVX)
+    message(STATUS "AVX support requested and found")
     set(INTRINSICS "${INTRINSICS} -mavx")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx)
-  endif(AVX_FOUND)
-  if(AVX2_FOUND)
-    message(STATUS "AVX2 support found")
+  endif(AVX_FOUND AND COMPILE_AVX)
+  if(AVX2_FOUND AND COMPILE_AVX2)
+    message(STATUS "AVX2 support requested and found")
     set(INTRINSICS "${INTRINSICS} -mavx2")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx2)
-  endif(AVX2_FOUND)
-  if(AVX512_FOUND)
-    message(STATUS "AVX512 support found")
+  endif(AVX2_FOUND AND COMPILE_AVX2)
+  if(AVX512_FOUND AND COMPILE_AVX512)
+    message(STATUS "AVX512 support requested and found")
     set(INTRINSICS "${INTRINSICS} -mavx512f")
     list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx512f)
-  endif(AVX512_FOUND)
+  endif(AVX512_FOUND AND COMPILE_AVX512)
 else()
-  set(INTRINSICS "-msse4.1")
+  # force to build with the requested intrinsics, requires compiler support
+  message(STATUS "Building for ${BUILD_ARCH} and forcing intrinsics as requested")
+  if(COMPILE_SSE2)
+    message(STATUS "SSE2 support requested")
+    set(INTRINSICS "${INTRINSICS} -msse2")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse2)
+  endif(COMPILE_SSE2)
+  if(COMPILE_SSE3)
+    message(STATUS "SSE3 support requested")
+    set(INTRINSICS "${INTRINSICS} -msse3")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse3)
+  endif(COMPILE_SSE3)
+  if(COMPILE_SSE4_1)
+    message(STATUS "SSE4.1 support requested")
+    set(INTRINSICS "${INTRINSICS} -msse4.1")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse4.1)
+  endif(COMPILE_SSE4_1)
+  if(COMPILE_SSE4_2)
+    message(STATUS "SSE4.2 support requested")
+    set(INTRINSICS "${INTRINSICS} -msse4.2")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -msse4.2)
+  endif(COMPILE_SSE4_2)
+  if(COMPILE_AVX)
+    message(STATUS "AVX support requested")
+    set(INTRINSICS "${INTRINSICS} -mavx")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx)
+  endif(COMPILE_AVX)
+  if(COMPILE_AVX2)
+    message(STATUS "AVX2 support requested")
+    set(INTRINSICS "${INTRINSICS} -mavx2")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx2)
+  endif(COMPILE_AVX2)
+  if(COMPILE_AVX512)
+    message(STATUS "AVX512 support requested")
+    set(INTRINSICS "${INTRINSICS} -mavx512f")
+    list(APPEND INTRINSICS_NVCC -Xcompiler\ -mavx512f)
+  endif(COMPILE_AVX512)
 endif()

 if(USE_FBGEMM)