(cudamode) mark@mark:/mnt/data/Dev/cudamode$ /usr/local/NVIDIA-Nsight-Compute-2023.3/ncu $(which python) triton_square.py
==PROF== Connected to process 5311 (/home/mark/anaconda3/envs/cudamode/bin/python3.10)
==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "square_kernel_0d1d234" - 1: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 5311
[5311] [email protected]
void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::native::templates::cuda::normal_and_transform<float, float, (unsigned long)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::<unnamed>::distribution_nullary_kernel<float, float, (int)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_and_transform<float, float, (unsigned long)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4) (768, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name               Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency          cycle/nsecond         9.99
SM Frequency            cycle/nsecond         2.14
Elapsed Cycles                  cycle       12,717
Memory Throughput                   %        43.57
DRAM Throughput                     %         0.00
Duration                      usecond         5.92
L1/TEX Cache Throughput             %        21.98
L2 Cache Throughput                 %        43.57
SM Active Cycles                cycle     9,927.17
Compute (SM) Throughput             %        52.33
----------------------- ------------- ------------
OPT This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
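The Scheduler Statistics and Warp State Statistics sections referenced above are not included in this run's output. They can be collected by repeating the same command with --set full, or more selectively with --section SchedulerStats --section WarpStateStats (available section identifiers can be listed with ncu --list-sections).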
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                    256
Function Cache Configuration                      CachePreferNone
Grid Size                                                     768
Registers Per Thread             register/thread              38
Shared Memory Configuration Size           Kbyte           16.38
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
Threads                                   thread         196,608
Waves Per SM                                                    1
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                        block           24
Block Limit Registers                 block            6
Block Limit Shared Mem                block           16
Block Limit Warps                     block            6
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        67.27
Achieved Active Warps Per SM           warp        32.29
------------------------------- ----------- ------------
OPT Est. Local Speedup: 32.73%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (67.3%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
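The block limits reported above can be reproduced from the Launch Statistics. The sketch below does this in Python, assuming the usual CC 8.9 per-SM limits (65,536 registers, 48 resident warps, 24 resident blocks) and a register allocation granularity of 256 registers per warp; these hardware constants are assumptions based on published figures for this compute capability, not values taken from the log.

    regs_per_sm, max_warps_per_sm, max_blocks_per_sm = 65536, 48, 24  # assumed CC 8.9 per-SM limits
    block_size, regs_per_thread = 256, 38        # from the Launch Statistics section above
    smem_config, smem_per_block = 16384, 1024    # 16.38 KB carveout, 1.02 KB driver overhead per block

    warps_per_block = block_size // 32                                    # 8
    regs_per_warp = -(-regs_per_thread * 32 // 256) * 256                 # 1,280 after rounding up
    block_limit_regs = regs_per_sm // (regs_per_warp * warps_per_block)   # 6  (Block Limit Registers)
    block_limit_warps = max_warps_per_sm // warps_per_block               # 6  (Block Limit Warps)
    block_limit_smem = smem_config // smem_per_block                      # 16 (Block Limit Shared Mem)
    resident_blocks = min(block_limit_regs, block_limit_warps,
                          block_limit_smem, max_blocks_per_sm)            # 6
    theoretical_occupancy = 100 * resident_blocks * warps_per_block / max_warps_per_sm  # 100.0

The gap down to the 67.27% achieved occupancy is therefore a runtime effect (the scheduling overhead and imbalance described above), not a limit imposed by the launch configuration.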
square_kernel_0d1d234 (1823, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name               Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency          cycle/nsecond        10.86
SM Frequency            cycle/nsecond         2.37
Elapsed Cycles                  cycle       27,344
Memory Throughput                   %        86.10
DRAM Throughput                     %        86.10
Duration                      usecond        11.55
L1/TEX Cache Throughput             %        16.92
L2 Cache Throughput                 %        34.57
SM Active Cycles                cycle    17,462.53
Compute (SM) Throughput             %         6.67
----------------------- ------------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
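An 86.1% DRAM throughput with only 6.67% compute throughput is the expected profile for an elementwise square, which reads and writes each element exactly once. The effective bandwidth can be estimated as below; the element count is a placeholder, since the tensor size used by triton_square.py is not recorded in this log.

    n_elements = 1_000_000       # placeholder: substitute the real x.numel() from triton_square.py
    bytes_moved = 2 * n_elements * 4          # one fp32 read plus one fp32 write per element
    duration_s = 11.55e-6                     # "Duration" from the report above
    effective_bw_gb_s = bytes_moved / duration_s / 1e9
    print(f"effective bandwidth ~ {effective_bw_gb_s:.0f} GB/s")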
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                    128
Function Cache Configuration                      CachePreferNone
Grid Size                                                   1,823
Registers Per Thread             register/thread              20
Shared Memory Configuration Size           Kbyte           32.77
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
Threads                                   thread         233,344
Waves Per SM                                                 1.19
-------------------------------- --------------- ---------------
OPT Est. Speedup: 50%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 286 thread blocks.
Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for
up to 50.0% of the total kernel runtime with a lower occupancy of 26.8%. Try launching a grid with no
partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for
a grid. See the Hardware Model
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
details on launch configurations.
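The 1.19 waves per SM follows directly from the launch configuration. A small sketch, assuming a 128-SM device (the SM count is not reported in this log; 128 is typical of the larger CC 8.9 parts) and the 12-resident-blocks-per-SM limit shown in the Occupancy section below:

    num_sms = 128                 # assumption: not reported in this log
    blocks_per_sm = 12            # Block Limit Warps below: 48 warps / 4 warps per 128-thread block
    grid_size = 1823              # from the Launch Statistics section above
    blocks_per_wave = num_sms * blocks_per_sm     # 1,536 blocks make up one full wave
    waves = grid_size / blocks_per_wave           # ~1.19, matching "Waves Per SM"
    tail_blocks = grid_size % blocks_per_wave     # the leftover blocks that form the partial wave

Padding or trimming the grid to a whole number of waves, or changing the block size so that blocks_per_wave divides the grid, is what the "grid with no partial wave" suggestion above amounts to.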
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                        block           24
Block Limit Registers                 block           21
Block Limit Shared Mem                block           32
Block Limit Warps                     block           12
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        73.20
Achieved Active Warps Per SM           warp        35.14
------------------------------- ----------- ------------
OPT Est. Local Speedup: 26.8%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.2%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
(cudamode) mark@mark:/mnt/data/Dev/cudamode$
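triton_square.py itself is not part of this log. The square_kernel profiled above presumably follows the standard Triton elementwise pattern; the sketch below is a minimal stand-in, with the kernel signature, BLOCK_SIZE, and tensor shape all being assumptions rather than a copy of the real script.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def square_kernel(out_ptr, in_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program squares one BLOCK_SIZE-sized chunk of the flattened tensor.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(in_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * x, mask=mask)

    def square(x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        square_kernel[grid](out, x, n, BLOCK_SIZE=1024)
        return out

    if __name__ == "__main__":
        x = torch.randn(1823, 781, device="cuda")   # shape is a guess; the real script's input is unknown
        torch.testing.assert_close(square(x), x * x)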