
SBUF Data Dependency Issue #1045

Open
MBshara opened this issue Nov 29, 2024 · 8 comments

MBshara commented Nov 29, 2024

Hello, I was loading a tensor into SBUF in the tiling fashion shown below, but ran into an issue where only the last "out_" index returned the correctly loaded values in "weights_mat_mul_2".
I got no compiler error, and yet, since I am loading sequentially, I expected w_temp to always load the correct data into weights_mat_mul_2, as w_temp is overwritten on every iteration.

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

I then tested in simulation, and every tile in weights_mat_mul_2 contained the correct values. I then tried moving the declaration of "w_temp" into the outermost for loop, and all problems were resolved:

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

fayyadd commented Nov 29, 2024

Thanks for reaching out; we are looking into the issue.

@aws-zhehongb

How is w_temp being used? This looks like a loop-carried dependency issue.

@JonathanHenson

Hi @MBshara, do you have the full kernel so I could play with it and reproduce the issue?


AWSNB commented Dec 2, 2024

This is part of a CS149 assignment, so let's not post full kernels on GitHub.

Please share it directly with [email protected].

JonathanHenson self-assigned this Dec 2, 2024

JonathanHenson commented Dec 4, 2024

@MBshara Thank you for the email, I responded there. Just a note for future readers of the issue: external emails coming into my Amazon inbox will have their attachments removed. In that case, inlining the code will likely work. If not, consider uploading the documents to a shared location and sending a link, or sharing a private GitHub repo with me.


JonathanHenson commented Dec 6, 2024

The compiler and the NKI language need to express the semantics here better, and the way you wrote it SHOULD work. We are adding this to the backlog to improve.

I do want to offer an observation/note about the construction you're using here (and in the email), as it may be helpful.

You should be able to delete the nl.ndarray declaration for w_temp altogether and just assign to it with nl.load. If you do this, you can also likely turn the nl.sequential_range() back into nl.affine_range() and get better throughput, since the loop iterations no longer need to wait for a shared chunk of memory to be updated. Regardless, the second option you provided is more correct for your usage: you aren't using w_temp outside of the loop, so allocating it outside of the loop just adds an unnecessary loop-carried dependency for the compiler to figure out.

weights_mat_mul_2 = nl.ndarray((filter_height, filter_width, n_tiles_c_out, nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax), dtype=W.dtype, buffer=nl.sbuf)
w_temp = nl.ndarray((nl.par_dim(c_in_pmax), n_tiles_c_in, c_in_pmax, filter_height, filter_width), dtype=W.dtype, buffer=nl.sbuf)
for out_ in nl.sequential_range(n_tiles_c_out):
    w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:])
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]

The purpose of this construction as an optimization is to pre-allocate space in SBUF that you then pre-load data into for use in another loop (usually utilizing the "blocking" dimension).

This usually looks like the following high-level pattern:

pre_buffer -> allocated space to hold n iterations' worth of data

loop over n
    iteration -> load one iteration's worth of data into pre_buffer[n]

loop over n
    iteration -> do the computation on pre_buffer[n]

If you line everything up correctly, you can achieve high parallelism with nl.affine_range(), since the data is local to the computation and the compiler can run the loop iterations in parallel, or the chips themselves can execute instructions optimally based on when memory is ready, etc.

The way you're using it here, w_temp[...] = nl.load(W_reshaped[out_,:,:,:,:,:]), has you loading into the entire tensor, meaning it doesn't really serve a purpose different from a single assignment to w_temp:

w_temp = nl.load(W_reshaped[out_,:,:,:,:,:])

If you want to optimize the memory accesses for the data loaded from W_reshaped, you typically want to declare it outside the loops that use it, utilize the blocking dimension, load in N blocks in a loop, and then use the pre-loaded tiles in subsequent loops.
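For illustration, here is a rough, hypothetical sketch of that pre-buffer pattern. The kernel name, the shapes, and the trivial multiply-by-two computation are all made up for this example and are not taken from your kernel:

import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

# Hypothetical sketch of the pre-buffer / blocking pattern described above.
# The shapes and the computation are placeholders for illustration only.
@nki.jit
def prebuffer_sketch(in_tensor):
    # Assume in_tensor lives in HBM with shape (n_blocks, tile_p, tile_f),
    # where tile_p is at most 128 (the SBUF partition size).
    n_blocks, tile_p, tile_f = in_tensor.shape
    out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype, buffer=nl.shared_hbm)

    # Pre-allocate SBUF space with a leading "blocking" dimension so that
    # each loop iteration owns its own slot.
    pre_buffer = nl.ndarray((n_blocks, nl.par_dim(tile_p), tile_f),
                            dtype=in_tensor.dtype, buffer=nl.sbuf)

    # First loop: load each block's tile into its own slot of pre_buffer.
    # Every iteration writes a disjoint slot, so nl.affine_range is safe.
    for n in nl.affine_range(n_blocks):
        pre_buffer[n, :, :] = nl.load(in_tensor[n, :, :])

    # Second loop: compute on the pre-loaded tiles. Each iteration only
    # touches pre_buffer[n], so there is no loop-carried dependency.
    for n in nl.affine_range(n_blocks):
        nl.store(out_tensor[n, :, :], nl.multiply(pre_buffer[n, :, :], 2.0))

    return out_tensor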

@JonathanHenson

Confirming that running your example with this produces (as best I can tell) correct results on my trn1 instance:

for out_ in nl.affine_range(n_tiles_c_out):
    bias_sbuf[out_] = nl.load(bias_reshaped[out_])
    w_temp = nl.load(W_reshaped[out_,:,:,:,:,:])
    # bring in W_reshaped[i] = (128, in_channels, fH, fW)
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH,fW,out_,:,in_,:] = w_temp[:,in_,:,fH,fW]


MBshara commented Dec 7, 2024 via email
