AddX86Support
Adding new x86_64 architectures to LIKWID is more work than for ARM. The reason is that LIKWID provides multiple counter access backends for x86_64: direct access, daemon access and perf_event.
First, LIKWID needs some IDs from the hardware to identify the platform. Although the hwloc library gathers most of the information, LIKWID additionally reads the following fields from /proc/cpuinfo:
- cpu family: For Intel commonly 0x6, for AMD a different CPU family per CPU generation.
- model: Differentiates the model of the CPU.
- stepping: Vendor-internal revision of the CPU model.
- model name: Name of the CPU model.
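On x86 Linux, /proc/cpuinfo prints cpu family, model and stepping as decimal numbers, while the defines in topology.h are hexadecimal. The following is a minimal sketch of extracting one numeric field from a cpuinfo line so it can be printed in hexadecimal. `cpuinfo_parse_uint` is a hypothetical helper for illustration and not a LIKWID function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of LIKWID): if `line` starts with `key`,
 * followed by optional blanks and a colon, parse the decimal number after
 * the colon into *value and return 1. Return 0 in all other cases. */
static int cpuinfo_parse_uint(const char *line, const char *key, unsigned *value)
{
    assert(line != NULL && key != NULL && value != NULL);
    size_t keylen = strlen(key);
    if (strncmp(line, key, keylen) != 0)
        return 0;
    const char *p = line + keylen;
    while (*p == ' ' || *p == '\t')
        p++;
    /* Rejects e.g. "model name" when searching for "model" */
    if (*p != ':')
        return 0;
    p++;
    char *end = NULL;
    unsigned long v = strtoul(p, &end, 10);
    if (end == p)
        return 0; /* no digits after the colon */
    *value = (unsigned)v;
    return 1;
}
```

A caller would read /proc/cpuinfo line by line, pass each line to the helper and print the result with a format like `"0x%XU"` to get the value for the define.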
When you have these IDs at hand (in hexadecimal format), you add them to src/includes/topology.h
. Here is a snippet of this file:
#define P6_FAMILY 0x6U
#define ZEN_FAMILY 0x17
/* Intel P6 */
#define BROADWELL 0x3DU
#define SKYLAKE1 0x4EU
/* AMD */
#define ZEN2_RYZEN 0x31
[...]
The IDs with FAMILY
correspond to cpu family
and the Intel P6
and AMD
sections to model
. Please use reasonable abbreviations.
With this information, the chip can be identified, but LIKWID adds some more data for the user, such as a chip architecture description and a short name. The description is only used for output, but the short name is used later in the performance monitoring part. Both pieces of information have to be added to src/topology.c
. Snippet:
[...]
static char* broadwell_str = "Intel Core Broadwell processor";
static char* skylake_str = "Intel Skylake processor";
[...]
static char* amd_zen2_str = "AMD K17 (Zen2) architecture";
[...]
static char* short_broadwell = "broadwell";
static char* short_skylake = "skylake";
[...]
static char* short_zen2 = "zen2";
So add a descriptive string and a short name here. If the vendor publishes a short name for the chip, please use it. Intel provides long and short names like Intel Cascadelake X
and CLX
. I failed at using them, so please, be better than me and use official names ;)
The file src/topology.c
contains a function topology_setName()
which contains a set of switch-case statements based on the IDs we have added to src/includes/topology.h
before. Search for P6
(Intel) or ZEN_FAMILY
because the function is quite long. There you add the description and short name for the new chip. Here is a snippet of it:
switch ( cpuid_info.family )
{
case P6_FAMILY:
switch (cpuid_info.model)
{
case BROADWELL:
cpuid_info.name = broadwell_str;
cpuid_info.short_name = short_broadwell;
break;
[...]
}
break;
[...]
}
With these settings, you should be able to run likwid-topology
and get proper output (except cache information). The cache and NUMA information is provided by hwloc and should "just work" for x86_64 systems.
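To make the required change concrete, here is a self-contained sketch of the nested switch with one additional case for a new chip. Everything with `mychip`/`MYCHIP` in its name, including the model ID `0xFFU`, is hypothetical, and `CpuInfoSketch` is only a simplified stand-in for LIKWID's `cpuid_info`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the topology.h snippet above */
#define P6_FAMILY 0x6U
#define BROADWELL 0x3DU
/* Hypothetical model ID of the chip being added; use the real one */
#define MYCHIP 0xFFU

/* Simplified stand-in for LIKWID's cpuid_info (assumption) */
typedef struct {
    uint32_t family;
    uint32_t model;
    const char *name;
    const char *short_name;
} CpuInfoSketch;

static const char *broadwell_str = "Intel Core Broadwell processor";
static const char *short_broadwell = "broadwell";
static const char *mychip_str = "Vendor MyChip processor"; /* hypothetical */
static const char *short_mychip = "mychip";                /* hypothetical */

/* Returns 0 if the family/model pair is known, -1 otherwise. */
static int set_name_sketch(CpuInfoSketch *info)
{
    assert(info != NULL);
    switch (info->family) {
    case P6_FAMILY:
        switch (info->model) {
        case BROADWELL:
            info->name = broadwell_str;
            info->short_name = short_broadwell;
            break;
        case MYCHIP: /* the case added for the new chip */
            info->name = mychip_str;
            info->short_name = short_mychip;
            break;
        default:
            return -1;
        }
        break;
    default:
        return -1;
    }
    return 0;
}
```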
In order to get performance monitoring support for your chip, three new files are required:
-
perfmon_<name>.h
: Main header for the chip -
perfmon_<name>_counters.h
: Counter/register definitions -
perfmon_<name>_events.txt
: Event definitions
The name must not contain underscores or similar characters, but please choose it sensibly so the files can be found again later. If there are multiple versions of the chip with different counters and/or events (like Intel Broadwell
desktop class broadwell
, single-socket server class broadwellD
and multi-socket server class broadwellEP
), please put the logic for all chips in the main header e.g. perfmon_broadwell.h
and use separate files for the counters and/or events.
The first file should be perfmon_<name>_counters.h
. It commonly consists of 3 to 4 tables: the list of counters, the list of units (a unit is a set of counters using the same device), a table which maps LIKWID types to the perf_event units, and the device information. For the direct and daemon access methods, the registers must contain the proper MSR or PCI register offsets in the device. If two 32-bit registers together form one 64-bit counter, there are two COUNTER_REG
slots. Best practice is to define register names in src/includes/registers.h
and use the named registers here. The first entry in the following tables is the template, the second line an example:
- List of counters:
#define NUM_COUNTERS_<UPPERCASE_NAME> X
static RegisterMap <name>_counter_map[NUM_COUNTERS_<UPPERCASE_NAME>] = {
{COUNTERNAME, UNIQUE_ID, UNIT, CONFIG_REG, COUNTER_REG1, COUNTER_REG2, DEVICE_ID, OPTION_MASK, }
{"PMC0", PMC0, PMC, 0x0, 0x0, 0, 0, 0x0},
// for perf_event only the COUNTERNAME, UNIQUE_ID and UNIT are of interest
// for direct and daemon access, the register offsets are required
// if there is only a counter, like free-running counter, the CONFIG_REG is not required
};
The COUNTERNAME
can be chosen freely but the naming should somehow reflect their use-case. The OPTION_MASK
is used if a counter has some specific extensions like thresholds or feature-switches.
- List of units:
static BoxMap <name>_box_map[NUM_UNITS] = {
[UNITNAME] = {CONTROL_REG, STATUS_REG, CLEAR_REG, STATUS_REG_OFFSET, IS_PCI, DEVICE_ID, COUNTER_WIDTH}
[PMC] = {0, 0, 0, 0, 0, 0, 48},
// for perf_event only the COUNTER_WIDTH is of interest
// for direct and daemon access, the register offsets are required
};
The UNITNAME
s are defined in src/includes/register_types.h
. Adding new types is not recommended.
- Translation map:
static char* <name>_translate_types[NUM_UNITS] = {
[UNITNAME] = "path_to_perf_event_directory_containing_the_'type'_file_and_'format'_folder",
[PMC] = "/sys/bus/event_source/devices/cpu",
};
There is a default_translate_types
(src/perfmon.c
) list with basic settings. The list here is
only required if the types differ from the default.
- Device list:
This list defines the access devices and is therefore only required for the direct and daemon access methods. The device names in [] are listed in src/includes/pci_types.h.
static PciDevice <name>_pci_devices[MAX_NUM_PCI_DEVICES] = {
[MSR_DEV] = {NODEVTYPE, "", "MSR", ""}, // line should be always present
[PCI_HA_DEVICE_0] = {HA, "12.1", NULL, NULL, 0x2f30},
[DEVICE_NAME] = {DEVICE_TYPE, "DEVICE_FILENAME", NULL, NULL, PCI_DEVICE_ID},
};
LIKWID tries to find the devices using the DEVICE_FILENAME
(like /proc/bus/pci/7f/12.1
) and the
PCI_DEVICE_ID
(like /sys/bus/pci/devices/0000\:7f\:12.1/device -> 0x2f30
). There is commonly
one PCI bus per socket (like 0x7f
in the last two example paths).
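The two example paths can be assembled from the bus number and the DEVICE_FILENAME as sketched below. The function names are hypothetical, and the fixed PCI domain `0000` in the sysfs path is an assumption taken from the example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds e.g. "/proc/bus/pci/7f/12.1" from bus 0x7f and devfile "12.1".
 * Returns 0 on success, -1 if the buffer is too small. */
static int pci_procfs_path(char *buf, size_t len, unsigned bus, const char *devfile)
{
    assert(buf != NULL && devfile != NULL && strlen(devfile) > 0);
    int n = snprintf(buf, len, "/proc/bus/pci/%02x/%s", bus, devfile);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Builds the sysfs file that holds the PCI device ID, e.g.
 * "/sys/bus/pci/devices/0000:7f:12.1/device" (PCI domain 0000 assumed).
 * Returns 0 on success, -1 if the buffer is too small. */
static int pci_sysfs_id_path(char *buf, size_t len, unsigned bus, const char *devfile)
{
    assert(buf != NULL && devfile != NULL && strlen(devfile) > 0);
    int n = snprintf(buf, len, "/sys/bus/pci/devices/0000:%02x:%s/device", bus, devfile);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Reading the sysfs file and comparing its content with the PCI_DEVICE_ID from the device list (0x2f30 in the example) confirms that the expected device sits at that address.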
The most tedious work when adding a new chip is typing, copying or parsing the list of supported events. The list of events is a plain text file that is transformed into a header during compilation.
The format for the events is fixed:
EVENT_<EVENTNAME> <EVENT_ID> <USABLE_COUNTERS>
UMASK_<EVENTNAME_SUBEVENT1> <UMASK>
UMASK_<EVENTNAME_SUBEVENT2> <UMASK> <CFGBITS> <THRES>
The <CFGBITS>
and <THRES>
are not required but can be used to enhance the counter options.
An example for the Intel events LD_BLOCKS_STORE_FORWARD
and LD_BLOCKS_NO_SR
:
EVENT_LD_BLOCKS 0x03 PMC
UMASK_LD_BLOCKS_STORE_FORWARD 0x02
UMASK_LD_BLOCKS_NO_SR 0x08
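To illustrate the line layout, the following sketch parses the basic forms of an EVENT_ and a UMASK_ line. It is not the code LIKWID uses to generate the header. The struct and function names are hypothetical, and the optional CFGBITS and THRES fields are ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char name[64];      /* event name without the EVENT_ prefix */
    unsigned id;        /* event ID */
    char counters[32];  /* usable counters, e.g. "PMC" */
} EventLine;

typedef struct {
    char name[64];      /* full event name without the UMASK_ prefix */
    unsigned umask;
} UmaskLine;

/* Parses "EVENT_<EVENTNAME> <EVENT_ID> <USABLE_COUNTERS>".
 * Returns 0 on success, -1 if the line does not have that form. */
static int parse_event_line(const char *line, EventLine *ev)
{
    assert(line != NULL && ev != NULL);
    if (strncmp(line, "EVENT_", 6) != 0)
        return -1;
    /* %x accepts an optional 0x prefix */
    if (sscanf(line + 6, "%63s %x %31s", ev->name, &ev->id, ev->counters) != 3)
        return -1;
    return 0;
}

/* Parses "UMASK_<EVENTNAME_SUBEVENT> <UMASK>"; further fields are ignored.
 * Returns 0 on success, -1 if the line does not have that form. */
static int parse_umask_line(const char *line, UmaskLine *um)
{
    assert(line != NULL && um != NULL);
    if (strncmp(line, "UMASK_", 6) != 0)
        return -1;
    if (sscanf(line + 6, "%63s %x", um->name, &um->umask) != 2)
        return -1;
    return 0;
}
```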
The <USABLE_COUNTERS>
is compared to the counter names and only the beginning has to match, so PMC
matches for PMC0
, PMC1
, ... It depends how you named the counters in perfmon_<name>_counters.h
's list of counters.
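The prefix comparison described here can be sketched as follows. The function names are hypothetical, and the sketch covers only a single counter prefix as in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `counter_name` begins with `limit`, otherwise 0.
 * Only the beginning has to match, so "PMC" matches "PMC0", "PMC1", ... */
static int counter_matches(const char *limit, const char *counter_name)
{
    assert(limit != NULL && counter_name != NULL);
    return strncmp(counter_name, limit, strlen(limit)) == 0;
}

/* Counts how many of the `n` counter names can be used for an event
 * whose USABLE_COUNTERS field is `limit`. */
static size_t count_usable(const char *limit, const char *const *counters, size_t n)
{
    assert(counters != NULL);
    size_t usable = 0;
    for (size_t i = 0; i < n; i++) {
        if (counter_matches(limit, counters[i]))
            usable++;
    }
    return usable;
}
```

With counters named FIXC0, PMC0, PMC1 and PMC2, an event limited to PMC can use three of them, while an event limited to PMC1 can use only one.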
If an event provides an additional option that is not already specified in the OPTION_MASK
in the counters definition, you can extend the option mask for the event like this:
EVENT_OFFCORE_RESPONSE_1 0xBB PMC
OPTIONS_OFFCORE_RESPONSE_1_OPTIONS EVENT_OPTION_MATCH0_MASK
UMASK_OFFCORE_RESPONSE_1_OPTIONS 0x01
For the event OFFCORE_RESPONSE_1_OPTIONS
the user can use all options provided by the PMC
counter and additionally the EVENT_OPTION_MATCH0
option. The OPTIONS_
line needs to be ahead of the UMASK_
line.
Some events require some options having a specific value to properly record the execution events. You can set default values for all options provided by the counter (and event):
EVENT_MACHINE_CLEARS 0xC3 PMC
DEFAULT_OPTIONS_MACHINE_CLEARS_COUNT EVENT_OPTION_THRESHOLD=0x01,EVENT_OPTION_EDGE=1
UMASK_MACHINE_CLEARS_COUNT 0x01
UMASK_MACHINE_CLEARS_CYCLES 0x01
The <UMASK>
value for MACHINE_CLEARS_CYCLES
and MACHINE_CLEARS_COUNT
is the same, but in order to get the count, two default option values are required (EVENT_OPTION_THRESHOLD
and EVENT_OPTION_EDGE
). The accepted values are hexadecimal. This is similar to running:
$ likwid-perfctr -C 0 -g MACHINE_CLEARS_CYCLES:PMC0,MACHINE_CLEARS_CYCLES:PMC1:THRES=0x01:EDGEDETECT ...
If the event adds another option, the OPTIONS_
line must be before the DEFAULT_OPTIONS_
line.
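As an illustration of the value part of a DEFAULT_OPTIONS_ line, the sketch below splits a comma-separated list of `NAME=VALUE` pairs and reads each value as hexadecimal. `OptionDefault` and `parse_default_options` are hypothetical names, and LIKWID's own processing of this line may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OPT_NAME 64

typedef struct {
    char name[MAX_OPT_NAME]; /* e.g. "EVENT_OPTION_THRESHOLD" */
    uint64_t value;          /* value, read as hexadecimal */
} OptionDefault;

/* Parses "NAME=VALUE,NAME=VALUE,..." into `out` (at most `max` entries).
 * Returns the number of parsed entries or -1 on a malformed list. */
static int parse_default_options(const char *spec, OptionDefault *out, int max)
{
    assert(spec != NULL && out != NULL && max > 0);
    int count = 0;
    const char *p = spec;
    while (*p != '\0' && count < max) {
        const char *eq = strchr(p, '=');
        if (eq == NULL)
            return -1;
        size_t len = (size_t)(eq - p);
        if (len == 0 || len >= MAX_OPT_NAME)
            return -1;
        memcpy(out[count].name, p, len);
        out[count].name[len] = '\0';
        char *end = NULL;
        /* base 16 accepts both "0x01" and "1" */
        out[count].value = strtoull(eq + 1, &end, 16);
        if (end == eq + 1)
            return -1; /* no value after '=' */
        count++;
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1; /* unexpected character after the value */
        p = end;
    }
    return count;
}
```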
For x86, the main header file can be quite large due to the different units. It contains the code to program the registers directly, to start them, to stop them with overflow checks and to read the counter registers. The file requires at least 6 functions for initialization, setup, activation, deactivation, reading and finalizing the support.
#include <topology.h>
#include <access.h>
#include <error.h>
#include <affinity.h>
#include <perfmon_<name>_events.h>
#include <perfmon_<name>_counters.h>
static int perfmon_numCounters<UPPERCASE_NAME> = NUM_COUNTERS_<UPPERCASE_NAME>;
static int perfmon_numArchEvents<UPPERCASE_NAME> = NUM_ARCH_EVENTS_<UPPERCASE_NAME>;
int perfmon_init_<name>(int cpu_id)
{
// Acquire locks for the hardware thread cpu_id
// Determine which hardware thread is responsible for a CPU core
lock_acquire((int*) &tile_lock[affinity_thread2core_lookup[cpu_id]], cpu_id);
// Determine which hardware thread is responsible for a CPU socket
lock_acquire((int*) &socket_lock[affinity_thread2socket_lookup[cpu_id]], cpu_id);
// There are different locks available. Before using them in the setup, start, stop, read or
// finalize function, you have to acquire them here
// Do other initialization work like setting some registers to zero or set function pointers
// for later use.
return 0;
}
int perfmon_setupCounterThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
// thread_id is the offset in the CPUset, so get the cpu_id
int cpu_id = groupSet->threads[thread_id].processorId;
// Check whether cpu_id is responsible for the socket
int haveLock = 0;
if (socket_lock[affinity_thread2socket_lookup[cpu_id]] == cpu_id)
{
haveLock = 1;
}
for (int i=0;i < eventSet->numberOfEvents;i++)
{
// skip non-existing counters (device not available, ...)
RegisterType type = eventSet->events[i].type;
if (!TESTTYPE(eventSet, type))
{
continue;
}
// Index in counter_map (in perfmon_<name>_counters.h)
RegisterIndex index = eventSet->events[i].index;
// Event configuration (struct generated from perfmon_<name>_events.txt)
PerfmonEvent *event = &(eventSet->events[i].event);
// Access device
PciDeviceIndex dev = counter_map[index].device;
// configure event for counter at device
// the current implementation uses a big switch-case here:
switch (type)
{
case PMC:
// configure event for PMC counter at MSR_DEV
break;
default:
break;
}
// Mark the successful setup
eventSet->events[i].threadCounter[thread_id].init = TRUE;
}
return 0;
}
int perfmon_startCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
// get cpu_id and lock status as in setup function
for (int i=0;i < eventSet->numberOfEvents;i++)
{
if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
{
// get type, index, event and dev as in setup function
eventSet->events[i].threadCounter[thread_id].startData = 0;
// start event for counter at device
// if you cannot start/stop a counter, read the current value and store it in
// eventSet->events[i].threadCounter[thread_id].startData
eventSet->events[i].threadCounter[thread_id].counterData = eventSet->events[i].threadCounter[thread_id].startData;
}
}
// commonly here you do something to start the units
return 0;
}
int perfmon_stopCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
// get cpu_id and lock status as in setup function
// commonly here you do something to stop the units and consequently also all counters
for (int i=0;i < eventSet->numberOfEvents;i++)
{
if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
{
// get type, index, event and dev as in setup function
uint64_t raw_value = 0;
// read the counter at dev into raw_value
// store the value truncated to the counter width defined in perfmon_<name>_counters.h
eventSet->events[i].threadCounter[thread_id].counterData = field64(raw_value, 0, box_map[type].regWidth);
}
}
return 0;
}
int perfmon_readCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
// The read function is sometimes only a combination of the start and stop function but
// might be used to provide low overhead reads without pausing the measurements
// The main difference is that the start/stop function commonly resets the counter register to
// zero and with read, we want to keep it counting.
// get cpu_id and lock status as in setup function
// Maybe stop the counters and save current settings
for (int i=0;i < eventSet->numberOfEvents;i++)
{
if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
{
// get type, index, event and dev as in setup function
uint64_t raw_value = 0;
// read the counter at dev into raw_value
// store the value truncated to the counter width defined in perfmon_<name>_counters.h
eventSet->events[i].threadCounter[thread_id].counterData = field64(raw_value, 0, box_map[type].regWidth);
}
}
// Maybe restart counters with previous settings
return 0;
}
int perfmon_finalizeCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
// get cpu_id and lock status as in setup function
for (int i=0;i < eventSet->numberOfEvents;i++)
{
// get type, index, event and dev as in setup function
// reset config and counter register(s) to zero, revert all done changes.
}
return 0;
}
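The truncation to the counter width in the stop and read functions can be pictured with the sketch below. `extract_bits` is a stand-in that assumes `field64(value, start, length)` returns `length` bits beginning at bit `start`. `counter_delta` shows one generic way to obtain a difference across a single wrap-around and is not LIKWID's overflow handling.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for field64() (assumed semantics): returns `length` bits of
 * `value`, starting at bit position `start`. */
static uint64_t extract_bits(uint64_t value, unsigned start, unsigned length)
{
    assert(length >= 1 && start < 64 && length <= 64 - start);
    /* Shifting by 64 is undefined in C, so handle the full width separately */
    uint64_t mask = (length == 64) ? UINT64_MAX : ((UINT64_C(1) << length) - 1);
    return (value >> start) & mask;
}

/* Difference between two readings of a counter that is `width` bits wide.
 * Correct as long as the counter wrapped around at most once in between. */
static uint64_t counter_delta(uint64_t start, uint64_t stop, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
    return (stop - start) & mask;
}
```

For a 48-bit counter as in the `[PMC]` entry of the unit list, `extract_bits(raw_value, 0, 48)` drops the upper 16 bits of the raw register value.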
The performance module is defined in src/perfmon.c
.
At first, add the main header: #include <perfmon_<name>.h>
. The next step is comparable to the topology_setName()
function. The name of the function is perfmon_init_maps()
and also contains a set of nested switch-case statements. Search for P6
or ZEN
as the function is quite long. Here we register, for a CPU family and model, the lists/tables we have defined before in perfmon_<name>_counters.h
and perfmon_<name>_events.txt
.
switch ( cpuid_info.family )
{
[...]
case P6_FAMILY:
switch ( cpuid_info.model)
{
case BROADWELL:
eventHash = broadwell_arch_events; // <name>_arch_events generated at compilation
perfmon_numArchEvents = perfmon_numArchEventsBroadwell; // defined by you in perfmon_<name>.h
perfmon_numCounters = perfmon_numCountersBroadwell; // defined by you in perfmon_<name>.h
counter_map = broadwell_counter_map; // <name>_counter_map defined in perfmon_<name>_counters.h
box_map = broadwell_box_map; // <name>_box_map defined in perfmon_<name>_counters.h
translate_types = broadwell_translate_types; // <name>_translate_types defined in perfmon_<name>_counters.h
break;
[...]
}
[...]
}
Moreover, we need to register the functions from the main header to be called for the architecture. The function perfmon_init_funcs()
is similar to the perfmon_init_maps()
with a big switch-case statement.
switch ( cpuid_info.family )
{
[...]
case P6_FAMILY:
switch ( cpuid_info.model)
{
// the functions work for different models. The lists in perfmon_init_maps() differ
// but the logic is the same
case BROADWELL:
case BROADWELL_E:
case BROADWELL_D:
case BROADWELL_E3:
// call power_init() for that architecture to initialize the energy module
initialize_power = TRUE;
// call thermal_init() for that architecture to initialize the thermal module
initialize_thermal = TRUE;
// register the six functions from src/includes/perfmon_<name>.h
initThreadArch = perfmon_init_broadwell;
perfmon_startCountersThread = perfmon_startCountersThread_broadwell;
perfmon_stopCountersThread = perfmon_stopCountersThread_broadwell;
perfmon_readCountersThread = perfmon_readCountersThread_broadwell;
perfmon_setupCountersThread = perfmon_setupCounterThread_broadwell;
perfmon_finalizeCountersThread = perfmon_finalizeCountersThread_broadwell;
break;
}
break;
}
The above steps should enable the performance monitoring for perf_event and the direct access mode.
The access daemon requires some more work because it is separate code, kept apart to keep its code base small. The
access daemon is in src/access-daemon/accessDaemon.c
. Similar to the previous steps, you have to
register the architecture and specify a function to do access checks for the commands received from
the library.
Similar to the previous registrations, the setting is done based on the cpu family and model. Search
for P6
to find the switch-case block.
switch (family)
{
case P6_FAMILY:
allowed = allowed_intel;
if (model == BROADWELL)
{
allowed = allowed_broadwell;
}
break;
}
There are some flags that can be set as well like:
-
isPCIUncore
: The access daemon tries to find and load the PCI devices. You should also set the allowed_pci(PciDeviceType type, uint32_t reg)
function pointer to an appropriate access checker function. -
isClientMem
: The access daemon tries to load the desktop-class memory controllers of Intel CPUs.
For isPCIUncore
, you need to specify which PCI devices should be used. There is a bigger if-elseif-else block and you can reuse the PCI devices list in perfmon_<name>_counters.h
for that: pci_devices_daemon = broadwelld_pci_devices
. Don't forget to include the counters header in the access daemon.
In the access checker functions, you can use the register names from src/includes/registers.h
.
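An access checker is essentially a whitelist of registers. The sketch below assumes a checker signature of `int f(uint32_t reg)`, which is not confirmed by the text above, and uses the architectural Intel MSR addresses directly. In the daemon you would use the named registers from registers.h instead. All `SK_` names, the counter counts and `check_request` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural Intel MSR addresses, hard-coded only for this sketch */
#define SK_MSR_PMC0             0x0C1U /* IA32_PMC0 */
#define SK_MSR_PERFEVTSEL0      0x186U /* IA32_PERFEVTSEL0 */
#define SK_MSR_FIXED_CTR0       0x309U /* IA32_FIXED_CTR0 */
#define SK_MSR_FIXED_CTR_CTRL   0x38DU /* IA32_FIXED_CTR_CTRL */
#define SK_MSR_PERF_GLOBAL_CTRL 0x38FU /* IA32_PERF_GLOBAL_CTRL */
#define SK_NUM_PMC   4U /* assumed number of general-purpose counters */
#define SK_NUM_FIXED 3U /* assumed number of fixed-purpose counters */

/* Assumed signature of an access checker: 1 = allowed, 0 = denied */
typedef int (*AllowedFunc)(uint32_t reg);

static int allowed_sketch(uint32_t reg)
{
    if (reg >= SK_MSR_PMC0 && reg < SK_MSR_PMC0 + SK_NUM_PMC)
        return 1;
    if (reg >= SK_MSR_PERFEVTSEL0 && reg < SK_MSR_PERFEVTSEL0 + SK_NUM_PMC)
        return 1;
    if (reg >= SK_MSR_FIXED_CTR0 && reg < SK_MSR_FIXED_CTR0 + SK_NUM_FIXED)
        return 1;
    if (reg == SK_MSR_FIXED_CTR_CTRL || reg == SK_MSR_PERF_GLOBAL_CTRL)
        return 1;
    return 0;
}

/* Applies the registered checker to a requested register.
 * Returns 0 if the access may proceed, -1 if it must be rejected. */
static int check_request(AllowedFunc allowed, uint32_t reg)
{
    assert(allowed != NULL);
    return allowed(reg) ? 0 : -1;
}
```

Keeping the whitelist as narrow as possible matters here, because the daemon performs the register accesses on behalf of the library.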
There are some other modules in LIKWID which provide functionality for x86 systems. The power and thermal modules named above are also used for performance monitoring, but there are further modules unrelated to it, such as prefetcher or CPU frequency manipulation.
Intel (since SandyBridge) and AMD (since Zen) support the so-called RAPL interface which provides energy consumption measurements. The vendors use the measurements internally for management reasons like hardware-guided power constraints etc. If the system is a successor of previous Intel or AMD systems, check src/power.c
for switch-case blocks and add your cpu family/model combination there.
The thermal module is currently supported only on Intel architectures, although the feature should be available for AMD as well. You don't have to change anything in src/thermal.c
, just set the initialize_thermal
flag in src/perfmon.c:perfmon_init_funcs()
.
As a final step, add your chip to the print_supportedCPUs()
function in src/topology.c.
Add the new chip to README.md
in section https://github.com/RRZE-HPC/likwid/blob/master/README.md#supported-architectures. Add a line in CHANGELOG
like Support for VENDOR MODEL (LIST_OF_SUPPORTED_UNITS)
.
The LIKWID wiki contains one page per supported architecture with tables of available counters, restrictions and further information. Unfortunately, I had to use HTML tables instead of Markdown tables. Copy one already existing ARM architecture file to get the structure and add all information. If you also use the HTML table syntax, you can copy the tables into the Doxygen documentation in doc/arch/.md. Use the same basic layout as the other architecture files.