The table below lists the features supported on different GPU products.
Feature Group | Tesla | Titan | Quadro | GeForce |
---|---|---|---|---|
Field Value Watches (GPU metrics) | X | X | X | X |
Configuration Management | X | X | X | X |
Active Health Checks (GPU subsystems) | X | X | X | X |
Job Statistics | X | X | X | X |
Topology | X | X | X | X |
Introspection | X | X | X | X |
Policy Notification | X | | | |
GPU Diagnostics (Diagnostic Levels - 1, 2, 3) | All Levels | Level 1 | Level 1 | Level 1 |
> dcgmi -h
Usage: dcgmi
dcgmi subsystem
dcgmi -v
Flags:
-v vv Get DCGMI version information
subsystem The desired subsystem to be accessed.
Subsystems Available:
topo GPU Topology [dcgmi topo -h for more info]
stats Process Statistics [dcgmi stats -h for more info]
diag System Validation/Diagnostic [dcgmi diag -h for more info]
policy Policy Management [dcgmi policy -h for more info]
health Health Monitoring [dcgmi health -h for more info]
config Configuration Management [dcgmi config -h for more info]
group GPU Group Management [dcgmi group -h for more info]
fieldgroup Field Group Management [dcgmi fieldgroup -h for more info]
discovery Discover GPUs on the system [dcgmi discovery -h for more info]
introspect Gather info about DCGM itself [dcgmi introspect -h for more info]
nvlink Displays NvLink link statuses and error counts [dcgmi nvlink -h for more info]
dmon Stats Monitoring of GPUs [dcgmi dmon -h for more info]
modules Control and list DCGM modules
profile Control and list DCGM profiling metrics
set Configure hostengine settings
-- --ignore_rest Ignores the rest of the labeled arguments following this
flag.
--version Displays version information and exits.
-h --help Displays usage information and exits.
Please email [email protected] with any questions, bug reports, etc.
NVIDIA Datacenter GPU Management Interface
> dcgmi -v
Version : 3.1.8
Build ID : 8
Build Date : 2023-04-27
Build Type : Release
Commit ID : c36ef145e4e71092fc106a80566758b1faf6115e
Branch Name : rel_dcgm_3_1
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : fd25da813d4871545eba7aaaf653fcba
Hostengine build info:
Version : 3.1.8
Build ID : 8
Build Date : 2023-04-27
Build Type : Release
Commit ID : c36ef145e4e71092fc106a80566758b1faf6115e
Branch Name : rel_dcgm_3_1
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : fd25da813d4871545eba7aaaf653fcba
> dcgmi topo -h
topo -- Used to find the topology of GPUs on the system.
Usage: dcgmi topo
dcgmi topo --host <IP/FQDN> -g <groupId> -j
dcgmi topo --host <IP/FQDN> --gpuid <gpuId> -j
Flags:
-g --group groupId The group ID to query.
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-h --help Displays usage information and exits.
--gpuid gpuId The GPU ID to query.
-j --json Print the output in a json format
-- --ignore_rest Ignores the rest of the labeled arguments
following this flag.
NVIDIA Datacenter GPU Management Interface
> dcgmi topo -g 0 -j
{
"body" :
{
"CPU Core Affinity" :
{
"value" : "0 - 127"
},
"NUMA Optimal" :
{
"value" : "False"
},
"Worst Path" :
{
"value" : "Connected via a CPU-level link"
}
},
"header" :
[
"Topology Information",
"DCGM_ALL_SUPPORTED_GPUS"
]
}
> dcgmi topo --gpuid 1 -j
{
"body" :
{
"CPU Core Affinity" :
{
"value" : "0 - 31, 64 - 95"
},
"To GPU 0" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 2" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 3" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 4" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 5" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 6" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 7" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
}
},
"header" :
[
"Topology Information",
"GPU ID: 1"
]
}
> dcgmi topo --gpuid 7 -j
{
"body" :
{
"CPU Core Affinity" :
{
"value" : "32 - 63, 96 - 127"
},
"To GPU 0" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 1" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 2" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 3" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 4" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 5" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
},
"To GPU 6" :
{
"overflow" :
[
"Connected via eight NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7)"
],
"value" : "Connected via a CPU-level link"
}
},
"header" :
[
"Topology Information",
"GPU ID: 7"
]
}
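Since -j emits JSON, the topology output can be post-processed directly. A minimal sketch with jq (assuming jq is installed; the key names match the output above):
# Extract the CPU core affinity of GPU 1 from the JSON output
dcgmi topo --gpuid 1 -j | jq -r '.body["CPU Core Affinity"].value'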
> dcgmi stats -h
stats -- Used to view process statistics.
Usage: dcgmi stats
dcgmi stats --host <IP/FQDN> -g <groupId> -e -u <> -m <>
dcgmi stats --host <IP/FQDN> -g <groupId> -d
dcgmi stats --host <IP/FQDN> -g <groupId> -p <pid> -v
dcgmi stats --host <IP/FQDN> -g <groupId> -s <job id>
dcgmi stats --host <IP/FQDN> -x <job id>
dcgmi stats --host <IP/FQDN> -j <job id> -v
dcgmi stats --host <IP/FQDN> -r <job id>
dcgmi stats --host <IP/FQDN> -a -v
Flags:
-g --group groupId The GPU group to query on the specified host.
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-p --pid pid View statistics for the specified pid.
-e --enable Enable system watches and start recording
information.
-d --disable Disable system watches and stop recording
information.
-s --jstart job id Start recording job statistics.
-x --jstop job id Stop recording job statistics.
-j --job job id Display job statistics.
-r --jremove job id Remove job statistics.
-a --jremoveall Remove all job statistics.
-h --help Displays usage information and exits.
-v --verbose Show process information for each GPU.
-u --update-interval How often to update the underlying job stats
in ms.
-m --max-keep-age How long to retain job stats data for in seconds.
This should be longer than your job/process
duration.
-- --ignore_rest Ignores the rest of the labeled arguments
following this flag.
Process Statistics Information:
-- Execution Stats --
Start Time (*) - Process start time
End Time (*) - Process end time
Total Execution Time (*) - Total execution time in seconds
No. Conflicting Processes (*) - Number of other processes that ran
Conflicting Compute PID - PID of conflicting compute process
Conflicting Graphics PID - PID of conflicting graphics process
-- Performance Stats --
Energy Consumed - Total energy consumed during process in joules
Max GPU Memory Used (*) - Maximum amount of GPU memory used in bytes
SM Clock - Statistics for SM clocks(s) in MHz
Memory Clock - Statistics for memory clock(s) in MHz
SM Utilization - Utilization of the GPU's SMs in percent
Memory Utilization - Utilization of the GPU's memory in percent
PCIe Rx Bandwidth - PCIe bytes read from the GPU
PCIe Tx Bandwidth - PCIe bytes written to the GPU
-- Event Stats --
Single Bit ECC Errors - Number of ECC single bit errors that occurred
Double Bit ECC Errors - Number of ECC double bit errors that occurred
PCIe Replay Warnings - Number of PCIe replay warnings
Critical XID Errors - Number of critical XID Errors
XID - Time of XID error in since start of process
-- Slowdown Stats --
Power - Runtime % at reduced clocks due to power violation
Thermal - Runtime % at reduced clocks due to thermal limit
Reliability - Runtime % at reduced clocks due to reliability limit
Board Limit - Runtime % at reduced clocks due to board's voltage limit
Low Utilization - Runtime % at reduced clocks due to low utilization
Sync Boost - Runtime % at reduced clocks due to sync boost
(*) Represents a process statistic. Otherwise device statistic during
process lifetime listed.
NVIDIA Datacenter GPU Management Interface
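The job-statistics flags above combine into a simple workflow. A minimal sketch, assuming the default all-GPU group 0 (shown in the group section below) and an arbitrary job label job01:
# Enable watches, record a job, then inspect it
dcgmi stats -g 0 -e
dcgmi stats -g 0 -s job01
# ... run the GPU workload ...
dcgmi stats -x job01
dcgmi stats -j job01 -v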
Verify that the GPU devices on the system can be discovered:
> dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:18:00.0 |
| | Device UUID: GPU-34bf77d1-c686-6821-79a8-32d326c5039c |
+--------+----------------------------------------------------------------------+
| 1 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:3E:00.0 |
| | Device UUID: GPU-f5046fa5-3db4-45e8-870a-dc1376becaa5 |
+--------+----------------------------------------------------------------------+
| 2 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:51:00.0 |
| | Device UUID: GPU-9de407ad-ba9c-af12-ce09-65828829a67c |
+--------+----------------------------------------------------------------------+
| 3 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:65:00.0 |
| | Device UUID: GPU-b54d703a-dee5-a9da-aeb9-465003acdd4b |
+--------+----------------------------------------------------------------------+
| 4 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:98:00.0 |
| | Device UUID: GPU-09c6e33a-ffcf-b330-e68b-e1e9f745eae6 |
+--------+----------------------------------------------------------------------+
| 5 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:BD:00.0 |
| | Device UUID: GPU-9a8ef0b8-9816-459d-fa13-cda74cf19d37 |
+--------+----------------------------------------------------------------------+
| 6 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:CF:00.0 |
| | Device UUID: GPU-70c5b9a8-82a3-4199-d7f5-adb9186459eb |
+--------+----------------------------------------------------------------------+
| 7 | Name: NVIDIA H800 |
| | PCI Bus ID: 00000000:E2:00.0 |
| | Device UUID: GPU-474d838c-171f-d249-4f45-bbc01a8eb74a |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
> dcgmi group -h
group -- Used to create and maintain groups of GPUs. Groups of GPUs can then be
uniformly controlled through other DCGMI subsystems.
Usage: dcgmi group
dcgmi group --host <IP/FQDN> -l -j
dcgmi group --host <IP/FQDN> -c <groupName> --default
--defaultnvswitches
dcgmi group --host <IP/FQDN> -c <groupName> -a <entityId>
dcgmi group --host <IP/FQDN> -d <groupId>
dcgmi group --host <IP/FQDN> -g <groupId> -i -j
dcgmi group --host <IP/FQDN> -g <groupId> -a <entityId>
dcgmi group --host <IP/FQDN> -g <groupId> -r <entityId>
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-l --list List the groups that currently exist for a host.
-d --delete groupId Delete a group on the remote host.
-c --create groupName Create a group on the remote host.
-h --help Displays usage information and exits.
-i --info Display the information for the specified group
ID.
-r --remove entityId Remove device(s) from group. (csv gpuIds, or
entityIds like gpu:0,nvswitch:994)
-a --add entityId Add device(s) to group. (csv gpuIds or entityIds
similar to gpu:0, instance:1, compute_instance:2,
nvswitch:994)
--default Adds all available GPUs to the group being
created.
--defaultnvswitches Adds all available NvSwitches to the group
being created.
-j --json Print the output in a json format
-- --ignore_rest Ignores the rest of the labeled arguments
following this flag.
NVIDIA Datacenter GPU Management Interface
> dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS |
| 2 groups found. |
+===================+==========================================================+
| Groups | |
| -> 0 | |
| -> Group ID | 0 |
| -> Group Name | DCGM_ALL_SUPPORTED_GPUS |
| -> Entities | GPU 0, GPU 1, GPU 2, GPU 3, GPU 4, GPU 5, GPU 6, GPU 7 |
| -> 1 | |
| -> Group ID | 1 |
| -> Group Name | DCGM_ALL_SUPPORTED_NVSWITCHES |
| -> Entities | None |
+-------------------+----------------------------------------------------------+
> dcgmi group -c GPU_Group_Demo
Successfully created group "GPU_Group_Demo" with a group ID of 18
> dcgmi group -g 18 -a 0,7
Add to group operation successful.
> dcgmi group -g 0 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO |
+===================+==========================================================+
| 0 | |
| -> Group ID | 0 |
| -> Group Name | DCGM_ALL_SUPPORTED_GPUS |
| -> Entities | GPU 0, GPU 1, GPU 2, GPU 3, GPU 4, GPU 5, GPU 6, GPU 7 |
+-------------------+----------------------------------------------------------+
> dcgmi group -g 18 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO |
+===================+==========================================================+
| 18 | |
| -> Group ID | 18 |
| -> Group Name | GPU_Group_Demo |
| -> Entities | GPU 0, GPU 7 |
+-------------------+----------------------------------------------------------+
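When a group is no longer needed, entities can be removed and the group deleted with the -r and -d flags from the help text above. A sketch using the group created here:
# Remove GPU 7 from group 18, then delete the whole group
dcgmi group -g 18 -r 7
dcgmi group -d 18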
The dmon subsystem is used to monitor GPUs and their statistics:
> dcgmi dmon --help
dmon -- Used to monitor GPUs and their stats.
Usage: dcgmi dmon
dcgmi dmon -i <gpuId> -g <groupId> -f <fieldGroupId> -e <fieldId> -d
<delay> -c <count> -l
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-f --field-group-id fieldGroupId The field group to query on the specified
host.
-e --field-id fieldId Field identifier to view/inject.
-l --list List to look up the long names, short names and
field ids.
-h --help Displays usage information and exits.
-i --gpu-id gpuId The comma separated list of GPU/GPU-I/GPU-CI IDs
to run the dmon on. Default is -1 which runs for
all supported GPU. Run dcgmi discovery -c to
check list of available GPU entities
-g --group-id groupId The group to query on the specified host.
-d --delay delay In milliseconds. Integer representing how often
to query results from DCGM and print them for all
of the entities. [default = 1000 msec, Minimum
value = 1 msec.]
-c --count count Integer representing How many times to loop
before exiting. [default- runs forever.]
-- --ignore_rest Ignores the rest of the labeled arguments
following this flag.
NVIDIA Datacenter GPU Management Interface
> dcgmi dmon -l
___________________________________________________________________________________________________________
Long Name Short Name Field ID
___________________________________________________________________________________________________________
driver_version DRVER 1
nvml_version NVVER 2
process_name PRNAM 3
device_count DVCNT 4
cuda_driver_version CDVER 5
name DVNAM 50
brand DVBRN 51
nvml_index NVIDX 52
serial_number SRNUM 53
uuid UUID# 54
minor_number MNNUM 55
oem_inforom_version OEMVR 56
pci_busid PCBID 57
pci_combined_id PCCID 58
pci_subsys_id PCSID 59
system_topology_pci STVCI 60
system_topology_nvlink STNVL 61
system_affinity SYSAF 62
cuda_compute_capability DVCCC 63
compute_mode CMMOD 65
persistance_mode PMMOD 66
mig_mode MGMOD 67
cuda_visible_devices CUVID 68
mig_max_slices MIGMS 69
cpu_affinity_0 CAFF0 70
cpu_affinity_1 CAFF1 71
cpu_affinity_2 CAFF2 72
cpu_affinity_3 CAFF3 73
cc_mode CCMOD 74
mig_attributes MIGATT 75
mig_gi_info MIGGIINFO 76
mig_ci_info MIGCIINFO 77
ecc_inforom_version EIVER 80
power_inforom_version PIVER 81
inforom_image_version IIVER 82
inforom_config_checksum CCSUM 83
inforom_config_valid ICVLD 84
vbios_version VBVER 85
bar1_total B1TTL 90
sync_boost SYBST 91
bar1_used B1USE 92
bar1_free B1FRE 93
sm_clock SMCLK 100
memory_clock MMCLK 101
video_clock VICLK 102
sm_app_clock SACLK 110
mem_app_clock MACLK 111
current_clock_throttle_reasons DVCCTR 112
sm_max_clock SMMAX 113
memory_max_clock MMMAX 114
video_max_clock VIMAX 115
autoboost ATBST 120
supported_clocks SPCLK 130
memory_temp MMTMP 140
gpu_temp TMPTR 150
gpu_mem_max_op_temp GMMOT 151
gpu_max_op_temp GGMOT 152
power_usage POWER 155
total_energy_consumption TOTEC 156
slowdown_temp SDTMP 158
shutdown_temp SHTMP 159
power_management_limit PMLMT 160
power_management_limit_min PMMIN 161
power_management_limit_max PMMAX 162
power_management_limit_default PMDEF 163
enforced_power_limit EPLMT 164
pstate PSTAT 190
fan_speed FANSP 191
pcie_tx_throughput TXTPT 200
pcie_rx_throughput RXTPT 201
pcie_replay_counter RPCTR 202
gpu_utilization GPUTL 203
mem_copy_utilization MCUTL 204
accounting_data ACCDT 205
enc_utilization ECUTL 206
dec_utilization DCUTL 207
mem_util_samples MUSAM 210
gpu_util_samples GUSAM 211
graphics_pids GPIDS 220
compute_pids CMPID 221
xid_errors XIDER 230
pcie_max_link_gen PCIMG 235
pcie_max_link_width PCIMW 236
pcie_link_gen PCILG 237
pcie_link_width PCILW 238
power_violation PVIOL 240
thermal_violation TVIOL 241
sync_boost_violation SBVIO 242
board_limit_violation BLVIO 243
low_util_violation LUVIO 244
reliability_violation RVIOL 245
app_clock_violation TAPCV 246
base_clock_violation TAPBC 247
fb_total FBTTL 250
fb_free FBFRE 251
fb_used FBUSD 252
fb_resv FBRSV 253
fb_USDP FBUSP 254
ecc ECCUR 300
ecc_pending ECPEN 301
ecc_sbe_volatile_total ESVTL 310
ecc_dbe_volatile_total EDVTL 311
ecc_sbe_aggregate_total ESATL 312
ecc_dbe_aggregate_total EDATL 313
ecc_sbe_volatile_l1 ESVL1 314
ecc_dbe_volatile_l1 EDVL1 315
ecc_sbe_volatile_l2 ESVL2 316
ecc_dbe_volatile_l2 EDVL2 317
ecc_sbe_volatile_device ESVDV 318
ecc_dbe_volatile_device EDVDV 319
ecc_sbe_volatile_register ESVRG 320
ecc_dbe_volatile_register EDVRG 321
ecc_sbe_volatile_texture ESVTX 322
ecc_dbe_volatile_texture EDVTX 323
ecc_sbe_aggregate_l1 ESAL1 324
ecc_dbe_aggregate_l1 EDAL1 325
ecc_sbe_aggregate_l2 ESAL2 326
ecc_dbe_aggregate_l2 EDAL2 327
ecc_sbe_aggregate_device ESADV 328
ecc_dbe_aggregate_device EDADV 329
ecc_sbe_aggregate_register ESARG 330
ecc_dbe_aggregate_register EDARG 331
ecc_sbe_aggregate_texture ESATX 332
ecc_dbe_aggregate_texture EDATX 333
retired_pages_sbe RPSBE 390
retired_pages_dbe RPDBE 391
retired_pages_pending RPPEN 392
uncorrectable_remapped_rows URMPS 393
correctable_remapped_rows CRMPS 394
row_remap_failure RRF 395
row_remap_pending RRP 396
nvlink_flit_crc_error_count_l0 NFEL0 400
nvlink_flit_crc_error_count_l1 NFEL1 401
nvlink_flit_crc_error_count_l2 NFEL2 402
nvlink_flit_crc_error_count_l3 NFEL3 403
nvlink_flit_crc_error_count_l4 NFEL4 404
nvlink_flit_crc_error_count_l5 NFEL5 405
nvlink_flit_crc_error_count_l12 NFEL12 406
nvlink_flit_crc_error_count_l13 NFEL13 407
nvlink_flit_crc_error_count_l14 NFEL14 408
nvlink_flit_crc_error_count_total NFELT 409
nvlink_data_crc_error_count_l0 NDEL0 410
nvlink_data_crc_error_count_l1 NDEL1 411
nvlink_data_crc_error_count_l2 NDEL2 412
nvlink_data_crc_error_count_l3 NDEL3 413
nvlink_data_crc_error_count_l4 NDEL4 414
nvlink_data_crc_error_count_l5 NDEL5 415
nvlink_data_crc_error_count_l12 NDEL12 416
nvlink_data_crc_error_count_l13 NDEL13 417
nvlink_data_crc_error_count_l14 NDEL14 418
nvlink_data_crc_error_count_total NDELT 419
nvlink_replay_error_count_l0 NREL0 420
nvlink_replay_error_count_l1 NREL1 421
nvlink_replay_error_count_l2 NREL2 422
nvlink_replay_error_count_l3 NREL3 423
nvlink_replay_error_count_l4 NREL4 424
nvlink_replay_error_count_l5 NREL5 425
nvlink_replay_error_count_l12 NREL12 426
nvlink_replay_error_count_l13 NREL13 427
nvlink_replay_error_count_l14 NREL14 428
nvlink_replay_error_count_total NRELT 429
nvlink_recovery_error_count_l0 NRCL0 430
nvlink_recovery_error_count_l1 NRCL1 431
nvlink_recovery_error_count_l2 NRCL2 432
nvlink_recovery_error_count_l3 NRCL3 433
nvlink_recovery_error_count_l4 NRCL4 434
nvlink_recovery_error_count_l5 NRCL5 435
nvlink_recovery_error_count_l12 NRCL12 436
nvlink_recovery_error_count_l13 NRCL13 437
nvlink_recovery_error_count_l14 NRCL14 438
nvlink_recovery_error_count_total NRCLT 439
nvlink_bandwidth_l0 NBWL0 440
nvlink_bandwidth_l1 NBWL1 441
nvlink_bandwidth_l2 NBWL2 442
nvlink_bandwidth_l3 NBWL3 443
nvlink_bandwidth_l4 NBWL4 444
nvlink_bandwidth_l5 NBWL5 445
nvlink_bandwidth_l12 NBWL12 446
nvlink_bandwidth_l13 NBWL13 447
nvlink_bandwidth_l14 NBWL14 448
nvlink_bandwidth_total NBWLT 449
gpu_nvlink_errors GNVERR 450
nvlink_flit_crc_error_count_l6 NFEL6 451
nvlink_flit_crc_error_count_l7 NFEL7 452
nvlink_flit_crc_error_count_l8 NFEL8 453
nvlink_flit_crc_error_count_l9 NFEL9 454
nvlink_flit_crc_error_count_l10 NFEL10 455
nvlink_flit_crc_error_count_l11 NFEL11 456
nvlink_data_crc_error_count_l6 NDEL6 457
nvlink_data_crc_error_count_l7 NDEL7 458
nvlink_data_crc_error_count_l8 NDEL8 459
nvlink_data_crc_error_count_l9 NDEL9 460
nvlink_data_crc_error_count_l10 NDEL10 461
nvlink_data_crc_error_count_l11 NDEL11 462
nvlink_replay_error_count_l6 NREL6 463
nvlink_replay_error_count_l7 NREL7 464
nvlink_replay_error_count_l8 NREL8 465
nvlink_replay_error_count_l9 NREL9 466
nvlink_replay_error_count_l10 NREL10 467
nvlink_replay_error_count_l11 NREL11 468
nvlink_recovery_error_count_l6 NRCL6 469
nvlink_recovery_error_count_l7 NRCL7 470
nvlink_recovery_error_count_l8 NRCL8 471
nvlink_recovery_error_count_l9 NRCL9 472
nvlink_recovery_error_count_l10 NRCL10 473
nvlink_recovery_error_count_l11 NRCL11 474
nvlink_bandwidth_l6 NBWL6 475
nvlink_bandwidth_l7 NBWL7 476
nvlink_bandwidth_l8 NBWL8 477
nvlink_bandwidth_l9 NBWL9 478
nvlink_bandwidth_l10 NBWL10 479
nvlink_bandwidth_l11 NBWL11 480
nvlink_flit_crc_error_count_l15 NFEL15 481
nvlink_flit_crc_error_count_l16 NFEL16 482
nvlink_flit_crc_error_count_l17 NFEL17 483
nvlink_data_crc_error_count_l15 NDEL15 484
nvlink_data_crc_error_count_l16 NDEL16 485
nvlink_data_crc_error_count_l17 NDEL17 486
nvlink_replay_error_count_l15 NREL15 487
nvlink_replay_error_count_l16 NREL16 488
nvlink_replay_error_count_l17 NREL17 489
nvlink_recovery_error_count_l15 NRCL15 491
nvlink_recovery_error_count_l16 NRCL16 492
nvlink_recovery_error_count_l17 NRCL17 493
nvlink_bandwidth_l15 NBWL15 494
nvlink_bandwidth_l16 NBWL16 495
nvlink_bandwidth_l17 NBWL17 496
virtualization_mode VMODE 500
supported_type_info SPINF 501
creatable_vgpu_type_ids CGPID 502
active_vgpu_instance_ids VGIID 503
vgpu_instance_utilizations VIUTL 504
vgpu_instance_per_process_utilization VIPPU 505
enc_stats ENSTA 506
fbc_stats FBCSTA 507
fbc_sessions_info FBCINF 508
vgpu_type_ids VTID 509
vgpu_type_info VTPINF 510
vgpu_type_name VTPNM 511
vgpu_type_class VTPCLS 512
vgpu_type_license VTPLC 513
vgpu_instance_vm_id VVMID 520
vgpu_instance_vm_name VMNAM 521
vgpu_instance_type VITYP 522
vgpu_instance_uuid VUUID 523
vgpu_instance_driver_version VDVER 524
vgpu_instance_memory_usage VMUSG 525
vgpu_instance_license_status VLCST 526
vgpu_instance_frame_rate_limit VFLIM 527
vgpu_instance_enc_stats VSTAT 528
vgpu_instance_enc_sessions_info VSINF 529
vgpu_instance_fbc_stats VFSTAT 530
vgpu_instance_fbc_sessions_info VFINF 531
vgpu_instance_license_state VLCIST 532
vgpu_instance_pci_id VPCIID 533
vgpu_instance_gpu_instance_id VGII 534
nvswitch_link_bandwidth_tx SWLNKTX 780
nvswitch_link_bandwidth_rx SWLNKRX 781
nvswitch_link_fatal_errors SWLNKFE 782
nvswitch_link_non_fatal_errors SWLNKNF 783
nvswitch_link_replay_errors SWLNKRP 784
nvswitch_link_recovery_errors SWLNKRC 785
nvswitch_link_flit_errors SWLNKFL 786
nvswitch_link_crc_errors SWLNKCR 787
nvswitch_link_ecc_errors SWLNKEC 788
nvswitch_link_latency_low_vc0 SWVCLL0 789
nvswitch_link_latency_low_vc1 SWVCLL1 790
nvswitch_link_latency_low_vc2 SWVCLL2 791
nvswitch_link_latency_low_vc3 SWVCLL3 792
nvswitch_link_latency_medium_vc0 SWVCLM0 793
nvswitch_link_latency_medium_vc1 SWVCLM1 794
nvswitch_link_latency_medium_vc2 SWVCLM2 795
nvswitch_link_latency_medium_vc3 SWVCLM3 796
nvswitch_link_latency_high_vc0 SWVCLH0 797
nvswitch_link_latency_high_vc1 SWVCLH1 798
nvswitch_link_latency_high_vc2 SWVCLH2 799
nvswitch_link_latency_high_vc3 SWVCLH3 800
nvswitch_link_latency_panic_vc0 SWVCLP0 801
nvswitch_link_latency_panic_vc1 SWVCLP1 802
nvswitch_link_latency_panic_vc2 SWVCLP2 803
nvswitch_link_latency_panic_vc3 SWVCLP3 804
nvswitch_link_latency_count_vc0 SWVCLC0 805
nvswitch_link_latency_count_vc1 SWVCLC1 806
nvswitch_link_latency_count_vc2 SWVCLC2 807
nvswitch_link_latency_count_vc3 SWVCLC3 808
nvswitch_link_crc_errors_lane0 SWLACR0 809
nvswitch_link_crc_errors_lane1 SWLACR1 810
nvswitch_link_crc_errors_lane2 SWLACR2 811
nvswitch_link_crc_errors_lane3 SWLACR3 812
nvswitch_link_ecc_errors_lane0 SWLAEC0 813
nvswitch_link_ecc_errors_lane1 SWLAEC1 814
nvswitch_link_ecc_errors_lane2 SWLAEC2 815
nvswitch_link_ecc_errors_lane3 SWLAEC3 816
nvswitch_fatal_error SEN00 856
nvswitch_non_fatal_error SEN01 857
nvswitch_current_temperature TMP01 858
nvswitch_slowdown_temperature TMP02 859
nvswitch_shutdown_temperature TMP03 860
nvswitch_bandwidth_tx SWTX 861
nvswitch_bandwidth_rx SWRX 862
nvswitch_physical_id SWPHID 863
nvswitch_reset_required SWFRMVER 864
nvlink_id LNKID 865
nvswitch_pcie_dom SWPCIEDOM 866
nvswitch_pcie_bus SWPCIEBUS 867
nvswitch_pcie_dev SWPCIEDEV 868
nvswitch_pcie_fun SWPCIEFUN 869
nvswitch_nvlink_status SWNVLNKST 870
nvswitch_nvlink_dev_type SWNVLNKDT 871
link_pcie_remote_dom LNKDOM 872
link_pcie_remote_bus LNKBUS 873
link_pcie_remote_dev LNKDEV 874
link_pcie_remote_func LNKFNC 875
link_dev_link_id SWNVLNKID 876
link_dev_link_sid SWNVLNSID 877
link_dev_link_uuid SWNVLNUID 878
gr_engine_active GRACT 1001
sm_active SMACT 1002
sm_occupancy SMOCC 1003
tensor_active TENSO 1004
dram_active DRAMA 1005
fp64_active FP64A 1006
fp32_active FP32A 1007
fp16_active FP16A 1008
pcie_tx_bytes PCITX 1009
pcie_rx_bytes PCIRX 1010
nvlink_tx_bytes NVLTX 1011
nvlink_rx_bytes NVLRX 1012
tensor_imma_active TIMMA 1013
tensor_hmma_active THMMA 1014
tensor_dfma_active TDFMA 1015
integer_active INTAC 1016
nvdec0_active NVDEC0 1017
nvdec1_active NVDEC1 1018
nvdec2_active NVDEC2 1019
nvdec3_active NVDEC3 1020
nvdec4_active NVDEC4 1021
nvdec5_active NVDEC5 1022
nvdec6_active NVDEC6 1023
nvdec7_active NVDEC7 1024
nvjpg0_active NVJPG0 1025
nvjpg1_active NVJPG1 1026
nvjpg2_active NVJPG2 1027
nvjpg3_active NVJPG3 1028
nvjpg4_active NVJPG4 1029
nvjpg5_active NVJPG5 1030
nvjpg6_active NVJPG6 1031
nvjpg7_active NVJPG7 1032
nvofa0_active NVOFA0 1033
nvlink_l0_tx_bytes NVL0T 1040
nvlink_l0_rx_bytes NVL0R 1041
nvlink_l1_tx_bytes NVL1T 1042
nvlink_l1_rx_bytes NVL1R 1043
nvlink_l2_tx_bytes NVL2T 1044
nvlink_l2_rx_bytes NVL2R 1045
nvlink_l3_tx_bytes NVL3T 1046
nvlink_l3_rx_bytes NVL3R 1047
nvlink_l4_tx_bytes NVL4T 1048
nvlink_l4_rx_bytes NVL4R 1049
nvlink_l5_tx_bytes NVL5T 1050
nvlink_l5_rx_bytes NVL5R 1051
nvlink_l6_tx_bytes NVL6T 1052
nvlink_l6_rx_bytes NVL6R 1053
nvlink_l7_tx_bytes NVL7T 1054
nvlink_l7_rx_bytes NVL7R 1055
nvlink_l8_tx_bytes NVL8T 1056
nvlink_l8_rx_bytes NVL8R 1057
nvlink_l9_tx_bytes NVL9T 1058
nvlink_l9_rx_bytes NVL9R 1059
nvlink_l10_tx_bytes NVL10T 1060
nvlink_l10_rx_bytes NVL10R 1061
nvlink_l11_tx_bytes NVL11T 1062
nvlink_l11_rx_bytes NVL11R 1063
nvlink_l12_tx_bytes NVL12T 1064
nvlink_l12_rx_bytes NVL12R 1065
nvlink_l13_tx_bytes NVL13T 1066
nvlink_l13_rx_bytes NVL13R 1067
nvlink_l14_tx_bytes NVL14T 1068
nvlink_l14_rx_bytes NVL14R 1069
nvlink_l15_tx_bytes NVL15T 1070
nvlink_l15_rx_bytes NVL15R 1071
nvlink_l16_tx_bytes NVL16T 1072
nvlink_l16_rx_bytes NVL16R 1073
nvlink_l17_tx_bytes NVL17T 1074
nvlink_l17_rx_bytes NVL17R 1075
> dcgmi dmon -i 0 -e 1011,1012,1009,1010 -c 5
#Entity NVLTX NVLRX PCITX PCIRX
ID
GPU 0 N/A N/A N/A N/A
GPU 0 N/A N/A N/A N/A
GPU 0 0 0 498948 1555
GPU 0 0 0 449138 2074
GPU 0 0 0 548740 1555
The nvlink subsystem is used to get the NvLink link status or error counts for the GPUs and NvSwitches in the system:
> dcgmi nvlink --help
nvlink -- Used to get NvLink link status or error counts for GPUs and
NvSwitches in the system
NVLINK Error description
=========================
CRC FLIT Error => Data link receive flow control digit CRC error.
CRC Data Error => Data link receive data CRC error.
Replay Error => Data link transmit replay error.
Recovery Error => Data link transmit recovery error.
Usage: dcgmi nvlink
dcgmi nvlink --host <IP/FQDN> -g <gpuId> -e -j
dcgmi nvlink --host <IP/FQDN> -s
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-e --errors Print NvLink errors for a given gpuId (-g).
-s --link-status Print NvLink link status for all GPUs and
NvSwitches in the system.
-h --help Displays usage information and exits.
-g --gpuid gpuId The GPU ID to query. Required for -e
-j --json Print the output in a json format
-- --ignore_rest Ignores the rest of the labeled arguments
following this flag.
NVIDIA Datacenter GPU Management Interface
Output in JSON format:
> dcgmi nvlink -g 0 -e -j
{
"body" :
{
"Link 0" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 1" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 2" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 3" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 4" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 5" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 6" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
},
"Link 7" :
{
"children" :
{
"CRC Data Error" :
{
"value" : "0"
},
"CRC FLIT Error" :
{
"value" : "0"
},
"Recovery Error" :
{
"value" : "0"
},
"Replay Error" :
{
"value" : "0"
}
}
}
},
"header" :
[
"NVLINK Error Counts",
"GPU 0"
]
}
> dcgmi nvlink -s
+----------------------+
| NvLink Link Status |
+----------------------+
GPUs:
gpuId 0:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 1:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 2:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 3:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 4:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 5:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 6:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
gpuId 7:
U U U U U U U U _ _ _ _ _ _ _ _ _ _
NvSwitches:
No NvSwitches found.
Key: Up=U, Down=D, Disabled=X, Not Supported=_
> dcgmi nvlink -g 1 -e
+-----------------------------+------------------------------------------------+
| NVLINK Error Counts |
| GPU 1 |
+=============================+================================================+
| Link 0 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 1 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 2 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 3 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 4 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 5 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 6 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
| Link 7 | |
| -> CRC FLIT Error | 0 |
| -> CRC Data Error | 0 |
| -> Replay Error | 0 |
| -> Recovery Error | 0 |
+-----------------------------+------------------------------------------------+
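The same counters can also be sampled over time with dmon, using the NvLink error-count totals from the field list above (field IDs 409, 419, 429 and 439 are the flit-CRC, data-CRC, replay and recovery totals). A sketch:
# Sample the four NvLink error-count totals on GPU 1 once per second, 10 times
dcgmi dmon -i 1 -e 409,419,429,439 -d 1000 -c 10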
The following new device-level profiling metrics are supported; their definitions and the corresponding DCGM field IDs are listed below.
By default, DCGM provides metrics at a sample rate of 1 Hz (every 1000 ms). Metrics can be queried from DCGM at any configurable frequency down to a minimum of 100 ms (for example via dcgmi dmon -d).
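For example, a sketch that samples at the 100 ms minimum interval (field IDs 1001 and 1002 are taken from the metrics table below):
# Query graphics-engine and SM activity every 100 ms, 20 samples
dcgmi dmon -e 1001,1002 -d 100 -c 20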
The device-level GPU metrics are as follows:
Metric | Definition | DCGM Field Name (DCGM_FI_*) and ID |
---|---|---|
Graphics Engine Activity | The fraction of time any portion of the graphics or compute engines were active. The graphics engine is active if a graphics/compute context is bound and the graphics/compute pipe is busy. The value represents an average over a time interval and is not an instantaneous value. | PROF_GR_ENGINE_ACTIVE (ID: 1001) |
SM Activity | The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. Note that "active" does not necessarily mean a warp is actively computing. For instance, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. Given a simplified GPU architectural view, if a GPU has N SMs then a kernel using N blocks that runs over the entire time interval will correspond to an activity of 1 (100%). A kernel using N/5 blocks that runs over the entire time interval will correspond to an activity of 0.2 (20%). A kernel using N blocks that runs over one fifth of the time interval, with the SMs otherwise idle, will also have an activity of 0.2 (20%). The value is insensitive to the number of threads per block (see DCGM_FI_PROF_SM_OCCUPANCY). | PROF_SM_ACTIVE (ID: 1002) |
SM Occupancy | The fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor. The value represents an average over a time interval and is not an instantaneous value. Higher occupancy does not necessarily indicate better GPU usage. For GPU memory bandwidth limited workloads (see DCGM_FI_PROF_DRAM_ACTIVE), higher occupancy is indicative of more effective GPU usage. However, if the workload is compute limited (i.e. not GPU memory bandwidth or latency limited), then higher occupancy does not necessarily correlate with more effective GPU usage. Calculating occupancy is not simple and depends on factors such as the GPU properties, the number of threads per block, registers per thread, and shared memory per block. Use the CUDA Occupancy Calculator to explore various occupancy scenarios. | PROF_SM_OCCUPANCY (ID: 1003) |
Tensor Activity | The fraction of cycles the tensor (HMMA / IMMA) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the Tensor Cores. An activity of 1 (100%) is equivalent to issuing a tensor instruction every other cycle for the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). | PROF_PIPE_TENSOR_ACTIVE (ID: 1004) |
FP64 Engine Activity | The fraction of cycles the FP64 (double precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP64 cores. An activity of 1 (100%) is equivalent to a FP64 instruction on every SM every fourth cycle on Volta over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). | PROF_PIPE_FP64_ACTIVE (ID: 1006) |
FP32 Engine Activity | The fraction of cycles the FMA (FP32 (single precision), and integer) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP32 cores. An activity of 1 (100%) is equivalent to a FP32 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). | PROF_PIPE_FP32_ACTIVE (ID: 1007) |
FP16 Engine Activity | The fraction of cycles the FP16 (half precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP16 cores. An activity of 1 (100%) is equivalent to a FP16 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). | PROF_PIPE_FP16_ACTIVE (ID: 1008) |
Memory BW Utilization | The fraction of cycles where data was sent to or received from device memory. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of device memory. An activity of 1 (100%) is equivalent to a DRAM instruction every cycle over the entire time interval (in practice a peak of ~0.8 (80%) is the maximum achievable). An activity of 0.2 (20%) indicates that 20% of the cycles are reading from or writing to device memory over the time interval. | PROF_DRAM_ACTIVE (ID: 1005) |
NVLink Bandwidth | The rate of data transmitted/received over NVLink, excluding protocol headers, in bytes per second. The value represents an average over a time interval and is not an instantaneous value. The rate is averaged over the time interval. For example, if 1 GB of data is transferred over 1 second, the rate is 1 GB/s regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link per direction. | PROF_NVLINK_TX_BYTES (ID: 1011) and PROF_NVLINK_RX_BYTES (ID: 1012) |
PCIe Bandwidth | The rate of data transmitted/received over the PCIe bus, including both protocol headers and data payloads, in bytes per second. The value represents an average over a time interval and is not an instantaneous value. The rate is averaged over the time interval. For example, if 1 GB of data is transferred over 1 second, the rate is 1 GB/s regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. | PROF_PCIE_TX_BYTES (ID: 1009) and PROF_PCIE_RX_BYTES (ID: 1010) |
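Not every GPU exposes all of these profiling metrics; the profile subsystem listed in dcgmi -h can report what the installed hardware supports. A sketch (assumed invocation; check dcgmi profile -h for the exact flags of your DCGM version):
# List the profiling metrics supported on this system
dcgmi profile --list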
> dcgmi dmon -i 0,1,2,3 -e 1011,1012
#Entity NVLTX NVLRX
ID
GPU 3 19694075554 19687914629
GPU 2 19777203418 19819177524
GPU 1 19699841766 22070216956
GPU 0 20779220484 21900091841
GPU 3 12945588302 12953884356
GPU 2 12558214740 12560935679
GPU 1 13059621728 10651057317
GPU 0 11576689215 9600734242
GPU 3 11155319776 11155326544
GPU 2 11155466819 11155466298
GPU 1 11040517157 12515409691
GPU 0 11592513041 13925722805
GPU 3 1286216247 1217881887
GPU 2 928524939 860186978
GPU 1 1506174212 50051
GPU 0 31802 911367981
GPU 3 0 0
GPU 2 0 0
GPU 1 0 0
GPU 0 0 0
GPU 3 23309642310 23377912493
GPU 2 23176458503 23176459024
GPU 1 23447369511 23507663607
GPU 0 23508249062 23174848479
...
> dcgmi dmon -e 1011,1012
#Entity NVLTX NVLRX
ID
GPU 7 30570603980 30638829242
GPU 6 30567094640 30635348592
GPU 5 30628398352 33089365519
GPU 4 33098848601 36119516306
GPU 3 33750138990 33825970205
GPU 2 31743752465 31812022474
GPU 1 34030055050 34098309807
GPU 0 32873620375 29632298747
GPU 7 24371477520 24370431480
GPU 6 24443717565 24443653033
GPU 5 24450523113 23160485855
GPU 4 23167734167 23167708130
GPU 3 25744193567 25744198774
GPU 2 25027562441 25027562441
GPU 1 24099003605 24099024433
GPU 0 24669591596 24669655619
...
> dcgmi dmon -e 1011,1012,1009,1010
#Entity NVLTX NVLRX PCITX PCIRX
ID
GPU 7 9012312241 9010146921 3658375705 1470950385
GPU 6 38735656050(36.07GB/s) 38739460219(36.08GB/s) 3715470394(3.46GB/s) 1069653476(0.996GB/s)
GPU 5 37117100494 37114692577 3684018382 1195478083
GPU 4 15832363949 30483540427 3617204053 1084584434
GPU 3 11415357717 11415357717 3762838708 3470626438
GPU 2 32126737331 32124608391 3860671178 1817597475
GPU 1 37055654032 37055676937 3666866771 1201785740
GPU 0 27827206810 27762665999 N/A 1146900782
GPU 7 37300245001 37302405771 3843250109 4599309358
GPU 6 14877616163 14939829270 3919059148 4513192032
GPU 5 17320548737 17382778744 3889129122 4641743864
GPU 4 30487341762 16373502117 3933037804 6115081312
GPU 3 34918736873 34918742079 3910245112 1761934955
GPU 2 16547291813 19112960872 2761505306 3060783203
GPU 1 18380875930 18390091637 148870522 2103742852
GPU 0 19407501485 15881929591 3711808007 1055934784
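The raw counters are in bytes per second, so the GB/s figures annotated by hand above are just a division by 2^30. A minimal awk sketch over the dmon output (assumes data rows of the form "GPU <id> <NVLTX> <NVLRX>" and skips the N/A warm-up rows):
# Print NVLink TX/RX rates in GB/s for 5 samples
dcgmi dmon -e 1011,1012 -c 5 | awk '$1 == "GPU" && $3 ~ /^[0-9]+$/ { printf "GPU %s TX %.2f GB/s RX %.2f GB/s\n", $2, $3/2^30, $4/2^30 }'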
Sample 6000 times and append the results to a file:
dcgmi dmon -e 1011,1012,1009,1010 -c 6000 >> bandwidth.txt
# -g 18 is the group created above
> dcgmi dmon -e 1011,1012,1009,1010 -g 18
#Entity NVLTX NVLRX PCITX PCIRX
ID
GPU 7 N/A N/A N/A N/A
GPU 0 N/A N/A N/A N/A
GPU 7 N/A N/A N/A N/A
GPU 0 N/A N/A N/A N/A
GPU 7 0 0 499346 3629
GPU 0 0 0 498931 2074
GPU 7 0 0 499343 3111