
Unable to Use Allocation API to Keep Multiple PSUM Banks Active #1010

Open
nandeeka opened this issue Oct 10, 2024 · 2 comments
Labels: bug (Something isn't working), NKI

Comments

@nandeeka

I am trying to use the NKI Allocation API to keep multiple PSUM banks active. In the documentation, I see: "rather than laying out multiple 128x512 tensors in the same partition with offset byte_addr, and making them live with allocated_block_shape, on PSUM, we achieve similar parallelism by mapping the blocks we want live into multiple banks."

However, when I try to implement this, only one PSUM bank is in use at a time, and, despite specifying that banks 0-3 should be used, I see all eight banks in use over the course of the kernel.

Environment: I started with the Neuron 2.20 DLAMI and installed the Allocation API using the .deb and .whl files @aws-serina-tan sent me.

Full Kernel:

from neuronxcc import nki
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as ni
import numpy as np

def matmul(I_DRAM, PW_DRAM, O_DRAM):
  K, N = I_DRAM.shape
  _, M = PW_DRAM.shape

  K0 = 128
  M0 = 128
  N0 = 512

  M1 = 4
  N1 = 4
  K1 = 8

  K2 = K // (K1 * K0)
  M2 = M // (M1 * M0)
  N2 = N // (N1 * N0)

  assert K2 * K1 * K0 == K
  assert M2 * M1 * M0 == M
  assert N2 * N1 * N0 == N

  sbuf_bitwidth = 2

  for n2 in nl.affine_range(N2):
    for m2 in nl.affine_range(M2):

      po_addr = 0
      po_blocks = 1
      PO_SBUF = nl.ndarray((M1, nl.par_dim(M0), N1 * N0), dtype=O_DRAM.dtype,
                  buffer=ni.sbuf.allocate(byte_addr=po_addr,
                          allocated_block_shape=(po_blocks,)))
      for m1 in nl.affine_range(M1):
        PO_SBUF[m1] = ni.memset((M0, N1 * N0), 0, dtype=O_DRAM.dtype)

      for k2 in nl.affine_range(K2):
        pw_addr = po_addr + po_blocks * M1 * M0 * sbuf_bitwidth
        pw_blocks = 1
        PW_SBUF = nl.ndarray((K1, nl.par_dim(K0), M1 * M0), dtype=PW_DRAM.dtype,
                    buffer=ni.sbuf.allocate(byte_addr=pw_addr,
                            allocated_block_shape=(pw_blocks,)))

        i_addr = pw_addr + pw_blocks * M1 * M0 * sbuf_bitwidth
        i_blocks = 1
        I_SBUF = nl.ndarray((K1, nl.par_dim(K0), N1 * N0), dtype=I_DRAM.dtype,
          buffer=ni.sbuf.allocate(byte_addr=i_addr,
                  allocated_block_shape=(i_blocks,)))


        for k1 in nl.affine_range(K1):
          k_start = k2 * K1 * K0 + k1 * K0
          k_end = k_start + K0

          m_start = m2 * M1 * M0
          m_end = m_start + M1 * M0

          n_start = n2 * N1 * N0
          n_end = n_start + N1 * N0

          PW_SBUF[k1] = nl.load(PW_DRAM[k_start:k_end, m_start:m_end])
          I_SBUF[k1] = nl.load(I_DRAM[k_start:k_end, n_start:n_end])

        for m1 in nl.affine_range(M1):
          PO_PSUM = nl.ndarray((N1, nl.par_dim(M0), N0), dtype=nl.float32,
                      # buffer=ni.psum.allocate(byte_addr=0))
                      buffer=ni.psum.allocate(byte_addr=0,
                              allocated_block_shape=(1,),
                              bank_map={(0,): 0, (1,): 1, (2,): 2, (3,): 3}))
          for n1 in nl.affine_range(N1):
            PO_PSUM[n1] = ni.memset((M0, N0), 0, nl.float32)

            m_start = m1 * M0
            m_end = m_start + M0

            n_start = n1 * N0
            n_end = n_start + N0

            for k1 in nl.affine_range(K1):
              PO_PSUM[n1] += ni.nc_matmul(PW_SBUF[k1, :, m_start:m_end], I_SBUF[k1, :, n_start:n_end])

            PO_SBUF[m1, :, n_start:n_end] = nl.loop_reduce(PO_PSUM[n1], op=np.add, loop_indices=[k2], dtype=O_DRAM.dtype)

      for m1 in nl.affine_range(M1):
        m_start = m2 * M1 * M0 + m1 * M0
        m_end = m_start + M0

        n_start = n2 * N1 * N0
        n_end = n_start + N1 * N0

        nl.store(O_DRAM[m_start:m_end, n_start:n_end], value=PO_SBUF[m1])

def benchmark_kernel():
  K, M, N = (4096, 4096, 2048)

  I = np.random.random_sample([K, N]).astype(np.float16)
  PW = np.random.random_sample([K, M]).astype(np.float16)
  O = np.ndarray(shape=[M, N], dtype=np.float16)

  benchmark_func = nki.benchmark(
    save_neff_name="file.neff",
    save_trace_name="profile.ntff",
    additional_compile_opt=" --verbose warning ")(matmul)
  benchmark_func(I, PW, O)

def main():
  benchmark_kernel()

if __name__ == "__main__":
  main()
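For reference, here is the tile-count arithmetic implied by the kernel's tile factors and the benchmark shapes, in plain Python (no NKI dependency):

```python
# Benchmark shapes from benchmark_kernel() and tile factors from matmul().
K, M, N = 4096, 4096, 2048
K0, M0, N0 = 128, 128, 512   # innermost tile sizes
M1, N1, K1 = 4, 4, 8         # blocking factors

K2 = K // (K1 * K0)  # 4096 // 1024 = 4
M2 = M // (M1 * M0)  # 4096 // 512  = 8
N2 = N // (N1 * N0)  # 2048 // 2048 = 1

# Same divisibility checks the kernel asserts.
assert (K2 * K1 * K0, M2 * M1 * M0, N2 * N1 * N0) == (K, M, N)
print(K2, M2, N2)  # 4 8 1
```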

A zoomed in portion of the resulting profile:

[Screenshot: device profile, 2024-10-10 3:20 PM]

Color guide:

  • bank_0: gray
  • bank_1: yellow-green
  • bank_2: bright blue
  • bank_3: dark blue
  • bank_4: orange
  • bank_5: green
  • bank_6: red
  • bank_7: purple
@aws-qieqingy self-assigned this Oct 11, 2024
@aws-qieqingy
Contributor

Hi Nandeeka! When PSUM is under-allocated, as in this case, the compiler has an optimization that rotates the PSUM bank allocation to use all available banks. As for why the profiler shows only one bank in use at a time, I will need to reproduce this myself and take a closer look.
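To illustrate the idea with a hypothetical model (not the compiler's actual policy), suppose each successive allocation shifts its group of banks by one: even a kernel that only ever names banks 0-3 would then touch all eight banks over its lifetime.

```python
NUM_BANKS = 8
requested = [0, 1, 2, 3]  # banks named in the kernel's bank_map

# Hypothetical rotation: shift the whole requested group by one bank
# per allocation, wrapping around the eight physical banks.
def banks_for_allocation(i):
    return [(b + i) % NUM_BANKS for b in requested]

used = set()
for i in range(NUM_BANKS):
    used.update(banks_for_allocation(i))
print(sorted(used))  # all eight banks appear across allocations
```

Under this model, the union of banks used across allocations is all of 0-7, consistent with the profile showing every bank active at some point.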

@aws-qieqingy
Contributor

Hi Nandeeka! It appears that the code you have written is not correct. For example, consider the following section from your code:

po_blocks = 1
PO_SBUF = nl.ndarray((M1, nl.par_dim(M0), N1 * N0), dtype=O_DRAM.dtype,
            buffer=ni.sbuf.allocate(byte_addr=po_addr,
                    allocated_block_shape=(po_blocks,)))
for m1 in nl.affine_range(M1):
  PO_SBUF[m1] = ni.memset((M0, N1 * N0), 0, dtype=O_DRAM.dtype)

With allocated_block_shape=(po_blocks,) and po_blocks = 1, only one block of PO_SBUF is alive at a time, yet the loop attempts to write data into all M1 blocks. This is undefined behaviour and would cause a data race during execution.
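A rough, NKI-free way to see the hazard: if only one physical block is allocated, all M1 logical block indices must land on that single block (modeled here, as an assumption, with a simple modulo mapping rather than the allocator's real policy), so the loop's writes collide.

```python
# Hypothetical model of block aliasing, not the actual compiler mapping.
M1 = 4         # logical blocks the loop writes
allocated = 1  # po_blocks in the kernel above

physical = [m1 % allocated for m1 in range(M1)]
print(physical)  # [0, 0, 0, 0] -> every logical block aliases block 0

# Allocating M1 blocks instead gives each logical block its own slot.
physical_ok = [m1 % M1 for m1 in range(M1)]
print(physical_ok)  # [0, 1, 2, 3]
```

The fix implied by the comment is to make the allocated block count match the number of blocks the loop keeps live (here, allocated_block_shape=(M1,)).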

Note that this feature has not been released yet, so the API signature has changed during development.

Could you please contact @aws-serina-tan and ask her to provide additional documentation to help you understand the behaviour of the allocation API, and a new wheel if possible?

@aws-taylor added the bug (Something isn't working) label Nov 8, 2024