Unlike SSE/AVX, the AMX feature is not enabled by default for every process on Linux, because its register state needs far more space in the XSAVE area and therefore lengthens context switches. A user application must request permission for this dynamically enabled feature through the XSTATE system call (arch_prctl() with ARCH_REQ_XCOMP_PERM).
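As a rough illustration of what such a permission request can look like outside of oneDNN, the sketch below calls arch_prctl() directly. The numeric constants are copied from the Linux kernel's asm/prctl.h and XSTATE feature numbering. The names request_amx_permission and xfeature_permitted are made up for this example and are not part of oneDNN or OpenVINO.

```cpp
#include <cerrno>
#include <cstdint>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Values from the Linux kernel's asm/prctl.h, repeated here because
// older installed headers may not define them yet.
constexpr long kArchGetXcompPerm = 0x1022;  // ARCH_GET_XCOMP_PERM
constexpr long kArchReqXcompPerm = 0x1023;  // ARCH_REQ_XCOMP_PERM
constexpr int kXfeatureXtileData = 18;      // AMX tile data state component

// True if bit `feature_nr` is set in an XSTATE permission bitmask.
constexpr bool xfeature_permitted(std::uint64_t mask, int feature_nr) {
    return ((mask >> feature_nr) & 1ULL) != 0;
}

// Hypothetical helper: asks the kernel for permission to use AMX tile data.
// Returns 0 on success, otherwise an errno value (for example ENOSPC when
// some thread's alternate signal stack is too small, as described later).
int request_amx_permission() {
#if defined(__linux__) && defined(__x86_64__)
    if (syscall(SYS_arch_prctl, kArchReqXcompPerm,
                static_cast<long>(kXfeatureXtileData)) != 0)
        return errno;
    // Read the permission bitmask back to confirm the request took effect.
    std::uint64_t mask = 0;
    if (syscall(SYS_arch_prctl, kArchGetXcompPerm, &mask) != 0)
        return errno;
    return xfeature_permitted(mask, kXfeatureXtileData) ? 0 : EOPNOTSUPP;
#else
    return ENOSYS;  // not Linux/x86-64: the XSTATE permission API does not exist
#endif
}
```

On kernels or CPUs without AMX support the call simply reports an error code, so a caller can fall back to another ISA.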
oneDNN performs this initialization the first time a primitive calls mayiuse with an AMX-related cpu_isa, for example mayiuse(avx512_core_bf16_amx_bf16). This may happen in the following stages:
- graph compilation stage: in the stream's scheduling thread.
- prepare parameter stage: also in the stream's scheduling thread.
This is fine for a C/C++ OpenVINO application, but a Python application using OpenVINO would fail to use AMX because of the following known issue:
Insufficient sigaltstack size used by CPython prevents extensions from using new ISA.
In brief, CPython sets its alternate signal stack too small in Modules/faulthandler.c, so there is not enough space to store the AMX state in the XSAVE area, and the Linux XSTATE implementation returns ENOSPC. This issue was probably not fixed until Python 3.9.
The stream scheduling thread is created by pthread_create() and does not inherit the creating thread's alternate signal stack, so it has no undersized alternate stack of its own. However, the Linux XSTATE API requires every thread in the current process to have a big enough signal stack, because the feature is enabled for the whole process. The CPython main thread does not satisfy this requirement because of the issue above, so the fix has to be applied in the CPython thread rather than in the stream scheduling thread.
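The claim about inheritance can be checked with a small experiment. The sketch below installs an alternate signal stack on the calling thread, starts a thread with pthread_create(), and asks that thread whether it has an alternate stack. The helper names (current_altstack, probe_altstack, new_thread_has_altstack) are invented for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <pthread.h>
#include <signal.h>
#include <system_error>
#include <vector>

// Returns the calling thread's alternate signal stack settings (query only).
inline stack_t current_altstack() {
    stack_t st{};
    const int rc = sigaltstack(nullptr, &st);
    assert(rc == 0);  // a pure query with a valid pointer should not fail
    (void)rc;
    return st;
}

// Thread entry point: records whether this thread starts with an alternate stack.
static void* probe_altstack(void* out) {
    *static_cast<bool*>(out) = (current_altstack().ss_flags & SS_DISABLE) == 0;
    return nullptr;
}

// Installs an alternate stack of `size` bytes on the calling thread, then
// reports whether a thread created afterwards sees an alternate stack.
inline bool new_thread_has_altstack(std::size_t size) {
    std::vector<char> buf(size);
    stack_t mine{};
    mine.ss_sp = buf.data();
    mine.ss_size = buf.size();
    mine.ss_flags = 0;
    stack_t old{};
    if (sigaltstack(&mine, &old) != 0)
        throw std::system_error(errno, std::generic_category(), "sigaltstack");

    bool child_has = false;
    pthread_t tid;
    const int err = pthread_create(&tid, nullptr, probe_altstack, &child_has);
    if (err == 0) pthread_join(tid, nullptr);

    sigaltstack(&old, nullptr);  // restore the previous setting before buf is freed
    if (err != 0)
        throw std::system_error(err, std::generic_category(), "pthread_create");
    return child_has;
}
```

On Linux the new thread is expected to report that its alternate stack is disabled, which is why only the thread that CPython set up needs attention.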
We choose to fix it in the constructor/destructor of the Engine class: a member variable of type std::shared_ptr<void> holds a reference to a CPUSpecialSetup instance, which manages signal stack replacement.
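To make the idea concrete, here is a minimal sketch of a signal stack replacement object held through std::shared_ptr<void>. It is not the actual CPUSpecialSetup implementation: the class name AltStackGuard, the factory make_altstack_guard, and the min_size parameter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <signal.h>
#include <utility>

// Hypothetical stand-in for the signal stack replacement described above.
// It must be created and destroyed on the same thread, because an alternate
// signal stack is a per-thread setting.
class AltStackGuard {
public:
    explicit AltStackGuard(std::size_t min_size) {
        assert(min_size > 0);
        stack_t cur{};
        if (sigaltstack(nullptr, &cur) != 0) return;
        // Nothing to do if no alternate stack is installed or it is big enough.
        if ((cur.ss_flags & SS_DISABLE) || cur.ss_size >= min_size) return;

        auto buf = std::make_unique<char[]>(min_size);
        stack_t bigger{};
        bigger.ss_sp = buf.get();
        bigger.ss_size = min_size;
        bigger.ss_flags = 0;
        if (sigaltstack(&bigger, &old_) == 0) {
            buf_ = std::move(buf);
            replaced_ = true;
        }
    }

    ~AltStackGuard() {
        if (!replaced_) return;
        // Put the original stack back. If the kernel refuses (it may reject a
        // too-small stack once a large feature is permitted), our buffer is
        // still installed, so leak it on purpose instead of freeing it.
        if (sigaltstack(&old_, nullptr) != 0) (void)buf_.release();
    }

    AltStackGuard(const AltStackGuard&) = delete;
    AltStackGuard& operator=(const AltStackGuard&) = delete;

private:
    std::unique_ptr<char[]> buf_;
    stack_t old_{};
    bool replaced_ = false;
};

// Type-erased handle: shared_ptr<void> still runs ~AltStackGuard on release.
inline std::shared_ptr<void> make_altstack_guard(std::size_t min_size) {
    return std::make_shared<AltStackGuard>(min_size);
}
```

Holding the guard as std::shared_ptr<void> keeps the owning class's header free of platform-specific types, while the control block still remembers the concrete destructor.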