(cudamode) mark@mark:/mnt/data/Dev/cudamode$ /usr/local/NVIDIA-Nsight-Compute-2023.3/ncu $(which python) triton_square.py
==PROF== Connected to process 5311 (/home/mark/anaconda3/envs/cudamode/bin/python3.10)
==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "square_kernel_0d1d234" - 1: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 5311
[5311] [email protected]
void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::native::templates::cuda::normal_and_transform<float, float, (unsigned long)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::<unnamed>::distribution_nullary_kernel<float, float, (int)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_and_transform<float, float, (unsigned long)4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, T5)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::normal_kernel<at::CUDAGeneratorImpl *>(const at::TensorBase &, double, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(int, at::PhiloxCudaState, T3, T4) (768, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name               Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency          cycle/nsecond         9.99
SM Frequency            cycle/nsecond         2.14
Elapsed Cycles                  cycle       12,717
Memory Throughput                   %        43.57
DRAM Throughput                     %         0.00
Duration                      usecond         5.92
L1/TEX Cache Throughput             %        21.98
L2 Cache Throughput                 %        43.57
SM Active Cycles                cycle     9,927.17
Compute (SM) Throughput             %        52.33
----------------------- ------------- ------------
OPT This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
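The Scheduler Statistics and Warp State Statistics sections referenced above are not included in this run's output. They can be collected by repeating the same command with --set full, or more selectively with --section SchedulerStats --section WarpStateStats (available section identifiers can be listed with ncu --list-sections).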
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                    256
Function Cache Configuration                      CachePreferNone
Grid Size                                                     768
Registers Per Thread             register/thread              38
Shared Memory Configuration Size           Kbyte           16.38
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
Threads                                   thread         196,608
Waves Per SM                                                    1
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                        block           24
Block Limit Registers                 block            6
Block Limit Shared Mem                block           16
Block Limit Warps                     block            6
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        67.27
Achieved Active Warps Per SM           warp        32.29
------------------------------- ----------- ------------
OPT Est. Local Speedup: 32.73%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (67.3%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
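The block limits reported above can be reproduced from the Launch Statistics. The sketch below does this in Python, assuming the usual CC 8.9 per-SM limits (65,536 registers, 48 resident warps, 24 resident blocks) and a register allocation granularity of 256 registers per warp; these hardware constants are assumptions based on published figures for this compute capability, not values taken from the log.

    regs_per_sm, max_warps_per_sm, max_blocks_per_sm = 65536, 48, 24  # assumed CC 8.9 per-SM limits
    block_size, regs_per_thread = 256, 38        # from the Launch Statistics section above
    smem_config, smem_per_block = 16384, 1024    # 16.38 KB carveout, 1.02 KB driver overhead per block

    warps_per_block = block_size // 32                                    # 8
    regs_per_warp = -(-regs_per_thread * 32 // 256) * 256                 # 1,280 after rounding up
    block_limit_regs = regs_per_sm // (regs_per_warp * warps_per_block)   # 6  (Block Limit Registers)
    block_limit_warps = max_warps_per_sm // warps_per_block               # 6  (Block Limit Warps)
    block_limit_smem = smem_config // smem_per_block                      # 16 (Block Limit Shared Mem)
    resident_blocks = min(block_limit_regs, block_limit_warps,
                          block_limit_smem, max_blocks_per_sm)            # 6
    theoretical_occupancy = 100 * resident_blocks * warps_per_block / max_warps_per_sm  # 100.0

The gap down to the 67.27% achieved occupancy is therefore a runtime effect (the scheduling overhead and imbalance described above), not a limit imposed by the launch configuration.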
square_kernel_0d1d234 (1823, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name               Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency          cycle/nsecond        10.86
SM Frequency            cycle/nsecond         2.37
Elapsed Cycles                  cycle       27,344
Memory Throughput                   %        86.10
DRAM Throughput                     %        86.10
Duration                      usecond        11.55
L1/TEX Cache Throughput             %        16.92
L2 Cache Throughput                 %        34.57
SM Active Cycles                cycle    17,462.53
Compute (SM) Throughput             %         6.67
----------------------- ------------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
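An 86.1% DRAM throughput with only 6.67% compute throughput is the expected profile for an elementwise square, which reads and writes each element exactly once. The effective bandwidth can be estimated as below; the element count is a placeholder, since the tensor size used by triton_square.py is not recorded in this log.

    n_elements = 1_000_000       # placeholder: substitute the real x.numel() from triton_square.py
    bytes_moved = 2 * n_elements * 4          # one fp32 read plus one fp32 write per element
    duration_s = 11.55e-6                     # "Duration" from the report above
    effective_bw_gb_s = bytes_moved / duration_s / 1e9
    print(f"effective bandwidth ~ {effective_bw_gb_s:.0f} GB/s")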
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                    128
Function Cache Configuration                      CachePreferNone
Grid Size                                                   1,823
Registers Per Thread             register/thread              20
Shared Memory Configuration Size           Kbyte           32.77
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
Threads                                   thread         233,344
Waves Per SM                                                 1.19
-------------------------------- --------------- ---------------
OPT Est. Speedup: 50%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 286 thread blocks.
Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for
up to 50.0% of the total kernel runtime with a lower occupancy of 26.8%. Try launching a grid with no
partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for
a grid. See the Hardware Model
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
details on launch configurations.
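The 1.19 waves per SM follows directly from the launch configuration. A small sketch, assuming a 128-SM device (the SM count is not reported in this log; 128 is typical of the larger CC 8.9 parts) and the 12-resident-blocks-per-SM limit shown in the Occupancy section below:

    num_sms = 128                 # assumption: not reported in this log
    blocks_per_sm = 12            # Block Limit Warps below: 48 warps / 4 warps per 128-thread block
    grid_size = 1823              # from the Launch Statistics section above
    blocks_per_wave = num_sms * blocks_per_sm     # 1,536 blocks make up one full wave
    waves = grid_size / blocks_per_wave           # ~1.19, matching "Waves Per SM"
    tail_blocks = grid_size % blocks_per_wave     # the leftover blocks that form the partial wave

Padding or trimming the grid to a whole number of waves, or changing the block size so that blocks_per_wave divides the grid, is what the "grid with no partial wave" suggestion above amounts to.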
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                        block           24
Block Limit Registers                 block           21
Block Limit Shared Mem                block           32
Block Limit Warps                     block           12
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        73.20
Achieved Active Warps Per SM           warp        35.14
------------------------------- ----------- ------------
OPT Est. Local Speedup: 26.8%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (73.2%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
(cudamode) mark@mark:/mnt/data/Dev/cudamode$
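triton_square.py itself is not part of this log. The square_kernel profiled above presumably follows the standard Triton elementwise pattern; the sketch below is a minimal stand-in, with the kernel signature, BLOCK_SIZE, and tensor shape all being assumptions rather than a copy of the real script.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def square_kernel(out_ptr, in_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program squares one BLOCK_SIZE-sized chunk of the flattened tensor.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(in_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * x, mask=mask)

    def square(x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        square_kernel[grid](out, x, n, BLOCK_SIZE=1024)
        return out

    if __name__ == "__main__":
        x = torch.randn(1823, 781, device="cuda")   # shape is a guess; the real script's input is unknown
        torch.testing.assert_close(square(x), x * x)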