
SBUF Data Dependency Issue #1045

Open
MBshara opened this issue Nov 29, 2024 · 8 comments

MBshara commented Nov 29, 2024

Hello, I was loading a tensor into SBUF in the tiling fashion shown below, but ran into an issue where only the last "out_" index returned the correctly loaded values in "weights_mat_mul_2".
I got no compiler error, and yet, since I am loading sequentially, I expected w_temp to always load the correct data into weights_mat_mul_2, as w_temp is overwritten on every iteration.

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

I then tested in simulation, and every tile in weights_mat_mul_2 contained the correct values. I then tried moving the declaration of "w_temp" into the outermost for loop, and all problems were resolved:

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

fayyadd commented Nov 29, 2024

Thanks for reaching out; we are looking into the issue.

@aws-zhehongb

How is w_temp being used? This looks like a loop-carried dependency issue.

@JonathanHenson

Hi @MBshara, do you have the full kernel so I could play with it and reproduce the issue?


AWSNB commented Dec 2, 2024

This is part of a CS149 assignment, so let's not post full kernels on GitHub.

Please share it directly with [email protected].

JonathanHenson self-assigned this Dec 2, 2024

JonathanHenson commented Dec 4, 2024

@MBshara Thank you for the email, I responded there. Just a note for future readers of the issue: external emails coming into my Amazon inbox will have their attachments removed. In that case, inlining the code will likely work. If not, consider uploading the documents to a shared location and sending a link, or sharing a private GitHub repo with me.


JonathanHenson commented Dec 6, 2024

The compiler and the NKI language need to express the semantics here better, and the way you wrote it SHOULD work. We are adding this to the backlog to improve.

I do want to offer an observation/note about the construction you're using here (and in the email), as it may be helpful.

You should be able to delete the nl.ndarray declaration for w_temp altogether and just assign to it with nl.load. If you do this, you can also likely turn the nl.sequential_range() back into nl.affine_range() and get better throughput, since the loop iterations no longer need to wait for a shared chunk of memory to be updated. Regardless, the second option you provided is more correct for your usage: you aren't using w_temp outside of the loop, so allocating it outside of the loop just adds an unnecessary loop-carried dependency for the compiler to figure out.

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

The purpose of this construction as an optimization is to pre-allocate space in SBUF that you then pre-load data into for use in another loop (usually utilizing the "blocking" dimension).

This usually looks like the following high-level pattern:

pre_buffer -> allocated space to hold n iterations' worth of data

loop over n
    iteration -> load one iteration's worth of data into pre_buffer[n]

loop over n
    iteration -> do the computation on pre_buffer[n]

If you line everything up correctly, you can achieve high parallelism with nl.affine_range(), since the data is local to the computation and the compiler can run the loop iterations in parallel, or the chips themselves can execute instructions optimally based on when memory is ready, etc.

The way you're using it here, w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:]), has you loading into the entire tensor, meaning it doesn't really serve a purpose different from a single assignment to w_temp:

w_temp = nl.load(W_reshaped[out_,:,:,:,:,:])

If you want to optimize the memory accesses for the data loaded from W_reshaped, you typically want to declare it outside the loops that use it, utilize the blocking dimension, load in N blocks in a loop, and then use the pre-loaded tiles in subsequent loops.
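For illustration, here is a rough, hypothetical sketch of that pre-buffer pattern. The kernel name, the shapes, and the trivial multiply-by-two computation are all made up for this example and are not taken from your kernel:

import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

# Hypothetical sketch of the pre-buffer / blocking pattern described above.
# The shapes and the computation are placeholders for illustration only.
@nki.jit
def prebuffer_sketch(in_tensor):
    # Assume in_tensor lives in HBM with shape (n_blocks, tile_p, tile_f),
    # where tile_p is at most 128 (the SBUF partition size).
    n_blocks, tile_p, tile_f = in_tensor.shape
    out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype, buffer=nl.shared_hbm)

    # Pre-allocate SBUF space with a leading "blocking" dimension so that
    # each loop iteration owns its own slot.
    pre_buffer = nl.ndarray((n_blocks, nl.par_dim(tile_p), tile_f),
                            dtype=in_tensor.dtype, buffer=nl.sbuf)

    # First loop: load each block's tile into its own slot of pre_buffer.
    # Every iteration writes a disjoint slot, so nl.affine_range is safe.
    for n in nl.affine_range(n_blocks):
        pre_buffer[n, :, :] = nl.load(in_tensor[n, :, :])

    # Second loop: compute on the pre-loaded tiles. Each iteration only
    # touches pre_buffer[n], so there is no loop-carried dependency.
    for n in nl.affine_range(n_blocks):
        nl.store(out_tensor[n, :, :], nl.multiply(pre_buffer[n, :, :], 2.0))

    return out_tensor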

@JonathanHenson

Confirming that running your example with this produces (as best I can tell) correct results on my trn1 instance:

for out_ in nl.affine_range(n_tiles_c_out):
    bias_sbuf[out_] = nl.load(bias_reshaped[out_])
    w_temp = nl.load(W_reshaped[out_,:,:,:,:,:])
    # bring in W_reshaped[i] = (128, in_channels, fH, fW)
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]


MBshara commented Dec 7, 2024 via email
