
Unable to Use Allocation API to Keep Multiple PSUM Banks Active #1010

Open
nandeeka opened this issue Oct 10, 2024 · 2 comments
Labels: bug (Something isn't working), NKI

Comments

@nandeeka

I am trying to use the NKI Allocation API to keep multiple PSUM banks active. In the documentation, I see: "rather than laying out multiple 128x512 tensors in the same partition with offset byte_addr, and making them live with allocated_block_shape, on PSUM, we achieve similar parallelism by mapping the blocks we want live into multiple banks."

However, when I try to implement this, only one PSUM bank is in use at a time, and, despite specifying that banks 0-3 should be used, I see all eight banks in use over the course of the kernel.

Environment: I started with the Neuron 2.20 DLAMI and installed the Allocation API using the .deb and .whl files @aws-serina-tan sent me.

Full Kernel:

from neuronxcc import nki
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as ni
import numpy as np

def matmul(I_DRAM, PW_DRAM, O_DRAM):
  K, N = I_DRAM.shape
  _, M = PW_DRAM.shape

  K0 = 128
  M0 = 128
  N0 = 512

  M1 = 4
  N1 = 4
  K1 = 8

  K2 = K // (K1 * K0)
  M2 = M // (M1 * M0)
  N2 = N // (N1 * N0)

  assert K2 * K1 * K0 == K
  assert M2 * M1 * M0 == M
  assert N2 * N1 * N0 == N

  sbuf_bitwidth = 2

  for n2 in nl.affine_range(N2):
    for m2 in nl.affine_range(M2):

      po_addr = 0
      po_blocks = 1
      PO_SBUF = nl.ndarray((M1, nl.par_dim(M0), N1 * N0), dtype=O_DRAM.dtype,
                  buffer=ni.sbuf.allocate(byte_addr=po_addr,
                          allocated_block_shape=(po_blocks,)))
      for m1 in nl.affine_range(M1):
        PO_SBUF[m1] = ni.memset((M0, N1 * N0), 0, dtype=O_DRAM.dtype)

      for k2 in nl.affine_range(K2):
        pw_addr = po_addr + po_blocks * M1 * M0 * sbuf_bitwidth
        pw_blocks = 1
        PW_SBUF = nl.ndarray((K1, nl.par_dim(K0), M1 * M0), dtype=PW_DRAM.dtype,
                    buffer=ni.sbuf.allocate(byte_addr=pw_addr,
                            allocated_block_shape=(pw_blocks,)))

        i_addr = pw_addr + pw_blocks * M1 * M0 * sbuf_bitwidth
        i_blocks = 1
        I_SBUF = nl.ndarray((K1, nl.par_dim(K0), N1 * N0), dtype=I_DRAM.dtype,
          buffer=ni.sbuf.allocate(byte_addr=i_addr,
                  allocated_block_shape=(i_blocks,)))


        for k1 in nl.affine_range(K1):
          k_start = k2 * K1 * K0 + k1 * K0
          k_end = k_start + K0

          m_start = m2 * M1 * M0
          m_end = m_start + M1 * M0

          n_start = n2 * N1 * N0
          n_end = n_start + N1 * N0

          PW_SBUF[k1] = nl.load(PW_DRAM[k_start:k_end, m_start:m_end])
          I_SBUF[k1] = nl.load(I_DRAM[k_start:k_end, n_start:n_end])

        for m1 in nl.affine_range(M1):
          PO_PSUM = nl.ndarray((N1, nl.par_dim(M0), N0), dtype=nl.float32,
                      # buffer=ni.psum.allocate(byte_addr=0))
                      buffer=ni.psum.allocate(byte_addr=0,
                              allocated_block_shape=(1,),
                              bank_map={(0,): 0, (1,): 1, (2,): 2, (3,): 3}))
          for n1 in nl.affine_range(N1):
            PO_PSUM[n1] = ni.memset((M0, N0), 0, nl.float32)

            m_start = m1 * M0
            m_end = m_start + M0

            n_start = n1 * N0
            n_end = n_start + N0

            for k1 in nl.affine_range(K1):
              PO_PSUM[n1] += ni.nc_matmul(PW_SBUF[k1, :, m_start:m_end], I_SBUF[k1, :, n_start:n_end])

            PO_SBUF[m1, :, n_start:n_end] = nl.loop_reduce(PO_PSUM[n1], op=np.add, loop_indices=[k2], dtype=O_DRAM.dtype)

      for m1 in nl.affine_range(M1):
        m_start = m2 * M1 * M0 + m1 * M0
        m_end = m_start + M0

        n_start = n2 * N1 * N0
        n_end = n_start + N1 * N0

        nl.store(O_DRAM[m_start:m_end, n_start:n_end], value=PO_SBUF[m1])

def benchmark_kernel():
  K, M, N = (4096, 4096, 2048)

  I = np.random.random_sample([K, N]).astype(np.float16)
  PW = np.random.random_sample([K, M]).astype(np.float16)
  O = np.ndarray(shape=[M, N], dtype=np.float16)

  benchmark_func = nki.benchmark(
    save_neff_name="file.neff",
    save_trace_name="profile.ntff",
    additional_compile_opt=" --verbose warning ")(matmul)
  benchmark_func(I, PW, O)

def main():
  benchmark_kernel()

if __name__ == "__main__":
  main()
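For reference, here is the tile-count arithmetic implied by the kernel's tile factors and the benchmark shapes, in plain Python (no NKI dependency):

```python
# Benchmark shapes from benchmark_kernel() and tile factors from matmul().
K, M, N = 4096, 4096, 2048
K0, M0, N0 = 128, 128, 512   # innermost tile sizes
M1, N1, K1 = 4, 4, 8         # blocking factors

K2 = K // (K1 * K0)  # 4096 // 1024 = 4
M2 = M // (M1 * M0)  # 4096 // 512  = 8
N2 = N // (N1 * N0)  # 2048 // 2048 = 1

# Same divisibility checks the kernel asserts.
assert (K2 * K1 * K0, M2 * M1 * M0, N2 * N1 * N0) == (K, M, N)
print(K2, M2, N2)  # 4 8 1
```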

A zoomed in portion of the resulting profile:

[Screenshot: device profile, 2024-10-10 3:20 PM]

Color guide:

  • bank_0: gray
  • bank_1: yellow-green
  • bank_2: bright blue
  • bank_3: dark blue
  • bank_4: orange
  • bank_5: green
  • bank_6: red
  • bank_7: purple
@aws-qieqingy self-assigned this Oct 11, 2024
@aws-qieqingy
Contributor

Hi Nandeeka! When PSUM is under-allocated, as in this case, the compiler has an optimization that rotates the PSUM bank allocation to use all available banks. As for why the profiler shows only one bank in use at a time, I will need to reproduce this myself and take a closer look.
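To illustrate the idea with a hypothetical model (not the compiler's actual policy), suppose each successive allocation shifts its group of banks by one: even a kernel that only ever names banks 0-3 would then touch all eight banks over its lifetime.

```python
NUM_BANKS = 8
requested = [0, 1, 2, 3]  # banks named in the kernel's bank_map

# Hypothetical rotation: shift the whole requested group by one bank
# per allocation, wrapping around the eight physical banks.
def banks_for_allocation(i):
    return [(b + i) % NUM_BANKS for b in requested]

used = set()
for i in range(NUM_BANKS):
    used.update(banks_for_allocation(i))
print(sorted(used))  # all eight banks appear across allocations
```

Under this model, the union of banks used across allocations is all of 0-7, consistent with the profile showing every bank active at some point.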

@aws-qieqingy
Contributor

Hi Nandeeka! It appears that the code you have written is not correct. For example, consider the following section from your code:

po_blocks = 1
PO_SBUF = nl.ndarray((M1, nl.par_dim(M0), N1 * N0), dtype=O_DRAM.dtype,
            buffer=ni.sbuf.allocate(byte_addr=po_addr,
                    allocated_block_shape=(po_blocks,)))
for m1 in nl.affine_range(M1):
  PO_SBUF[m1] = ni.memset((M0, N1 * N0), 0, dtype=O_DRAM.dtype)

With allocated_block_shape=(po_blocks,) and po_blocks = 1, only one block of PO_SBUF is alive at a time, yet the loop attempts to write data into all M1 blocks. This is undefined behaviour and would cause a data race during execution.
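A rough, NKI-free way to see the hazard: if only one physical block is allocated, all M1 logical block indices must land on that single block (modeled here, as an assumption, with a simple modulo mapping rather than the allocator's real policy), so the loop's writes collide.

```python
# Hypothetical model of block aliasing, not the actual compiler mapping.
M1 = 4         # logical blocks the loop writes
allocated = 1  # po_blocks in the kernel above

physical = [m1 % allocated for m1 in range(M1)]
print(physical)  # [0, 0, 0, 0] -> every logical block aliases block 0

# Allocating M1 blocks instead gives each logical block its own slot.
physical_ok = [m1 % M1 for m1 in range(M1)]
print(physical_ok)  # [0, 1, 2, 3]
```

The fix implied by the comment is to make the allocated block count match the number of blocks the loop keeps live (here, allocated_block_shape=(M1,)).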

Note that this feature has not been released yet, so the API signature has changed during development.

Could you please contact @aws-serina-tan and ask her to provide additional documentation to help you understand the behaviour of the allocation API, and a new wheel if possible?

@aws-taylor added the bug (Something isn't working) label Nov 8, 2024