SBUF Data Dependency Issue #1045
Thanks for reaching out, we are looking into the issue.
How?
Hi @MBshara, do you have the full kernel that I could play with and repro?
This is part of a CS149 assignment, so let's not post full kernels on GitHub. I'll share directly with [email protected]
@MBshara Thank you for the email, I responded there. Just a note for future readers of the issue: external emails coming into my Amazon inbox will have their attachments removed. In that case, if you can inline the code, that will likely work. If not, consider uploading the documents to a shared location and sending a link, or sharing a private GitHub repo with me.
The compiler and the NKI language need to express the semantics here better, and the way you wrote it SHOULD work. We are adding this to the backlog to improve. I do want to offer an observation about the construction you're using here (and in the email), as it may be helpful. You should be able to delete the nl.ndarray declaration for w_temp altogether and just assign to it with nl.load. If you do this, you can likely also turn the nl.sequential_range() back into nl.affine_range() and get better throughput, since the loop iterations no longer need to wait on a shared chunk of memory being updated. Regardless, the second option you provided is more correct for your usage: you aren't using w_temp outside of the loop, so allocating it outside the loop just adds an unnecessary loop-carried dependency for the compiler to figure out.
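The loop-carried-dependency point can be illustrated with a plain-Python/NumPy analogy. This models only the aliasing hazard, not actual NKI scheduling, and all names and sizes here are made up for illustration:

```python
import numpy as np

# Analogy (not NKI): why a buffer allocated once outside a parallel loop
# creates a loop-carried dependency. Names and sizes are illustrative.
n_tiles, tile = 3, 4
src = np.arange(n_tiles * tile).reshape(n_tiles, tile)

# Shared buffer outside the loop: if the scheduler is free to defer the
# copy-out (legal when iterations are treated as independent), every tile
# ends up observing the value from the last iteration.
out_bad = np.zeros_like(src)
shared = np.empty(tile, dtype=src.dtype)
pending = []
for i in range(n_tiles):
    shared[:] = src[i]            # "load" into the one shared buffer
    pending.append((i, shared))   # copy-out deferred; aliases `shared`
for i, buf in pending:            # runs after all loads have finished
    out_bad[i] = buf
assert all((out_bad[i] == src[-1]).all() for i in range(n_tiles))

# Fresh buffer per iteration: no aliasing, so any execution order works.
out_good = np.zeros_like(src)
pending = []
for i in range(n_tiles):
    w_temp = src[i].copy()        # fresh local tile each iteration
    pending.append((i, w_temp))
for i, buf in pending:
    out_good[i] = buf
assert (out_good == src).all()
```

In the first half, every tuple in `pending` holds a reference to the same array, so the deferred reads all see the last write; in the second half, each iteration owns its data and the iterations are genuinely independent.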
The purpose of this construction as an optimization is to pre-allocate space in SBUF that you then pre-load data into for use in another loop, usually utilizing the "blocking" dimension. The high-level pattern is: declare the buffer with a blocking dimension, fill it block by block in one loop, then consume the pre-loaded tiles in subsequent loops.
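As a hedged sketch of that pre-load pattern, in plain NumPy standing in for NKI (`hbm`, `sbuf`, and the sizes are invented for illustration):

```python
import numpy as np

# Sketch of the "pre-allocate and pre-load into a blocked buffer" pattern.
# Plain NumPy analogy; names and sizes are illustrative, not from a real kernel.
n_blocks, block = 4, 8
rng = np.random.default_rng(0)
hbm = rng.random((n_blocks, block))   # stands in for device memory

# Pre-allocate once, with the blocking dimension outermost.
sbuf = np.empty((n_blocks, block))

# Loop 1: load all blocks up front.
for b in range(n_blocks):
    sbuf[b] = hbm[b]

# Loop 2 (and later loops): consume the pre-loaded tiles.
result = np.empty(n_blocks)
for b in range(n_blocks):
    result[b] = sbuf[b].sum()

assert np.allclose(result, hbm.sum(axis=1))
```

The point of the blocking dimension is that each loop touches a disjoint slice of the pre-allocated buffer, so the load loop and the compute loops don't contend for the same memory.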
If you line everything up correctly, you can achieve high parallelism with nl.affine_range(), since the data is local to the computation: the compiler can run the loop iterations in parallel, and the chips themselves can execute instructions optimally based on when memory is ready. The way you're using it here, by contrast, allocates w_temp once outside the loop and loads into it on each iteration, which serializes the iterations on that shared buffer.
If you want to optimize the memory accesses for the data loaded from W_reshaped, you typically want to declare it outside the loops using it, utilize the blocking dimension, load in N blocks in a loop, and then use the pre-loaded tiles in subsequent loops.
Confirmation that running your example with this change produces (as best I can tell) correct results on my trn1 instance (the code is quoted in the reply below).
Yes, this works as expected, thanks for the help!
Munir
From: Jonathan M. Henson
Sent: Thursday, December 5, 2024, 5:59 PM
To: aws-neuron/aws-neuron-sdk
Cc: Munir Nafea Bshara
Subject: Re: [aws-neuron/aws-neuron-sdk] SBUF Data Dependency Issue (Issue #1045)
Confirmation that running your example with this change produces (as best I can tell) correct results on my trn1 instance:
```python
for out_ in nl.affine_range(n_tiles_c_out):
    bias_sbuf[out_] = nl.load(bias_reshaped[out_])
    w_temp = nl.load(W_reshaped[out_, :, :, :, :, :])
    # bring in W_reshaped[i] = (128, in_channels, fH, fW)
    for fH in nl.affine_range(filter_height):
        for fW in nl.affine_range(filter_width):
            for in_ in nl.affine_range(n_tiles_c_in):
                weights_mat_mul_2[fH, fW, out_, :, in_, :] = w_temp[:, in_, :, fH, fW]
```
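For what it's worth, the loop nest above amounts to an axis permutation, which can be checked in plain NumPy. The shapes here are invented for illustration (the 128 partition dimension is shrunk to 8) and this models only the indexing, not NKI execution:

```python
import numpy as np

# NumPy model of the copy in the kernel above; sizes are made up.
P, n_out, n_in, cpt, FH, FW = 8, 2, 3, 4, 3, 3
rng = np.random.default_rng(1)
W_reshaped = rng.random((n_out, P, n_in, cpt, FH, FW))
weights = np.empty((FH, FW, n_out, P, n_in, cpt))

for out_ in range(n_out):
    w_temp = W_reshaped[out_]                 # fresh per-iteration tile
    for fH in range(FH):
        for fW in range(FW):
            for in_ in range(n_in):
                weights[fH, fW, out_, :, in_, :] = w_temp[:, in_, :, fH, fW]

# The whole loop nest is just this axis permutation:
assert np.allclose(weights, W_reshaped.transpose(4, 5, 0, 1, 2, 3))
```

Because each iteration reads a fresh `w_temp` slice, every `(out_, fH, fW, in_)` combination is independent, which is exactly the property that lets the iterations run in any order.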
Hello, I was loading a tensor into SBUF in this tiling fashion, but was encountering an issue where only the last "out_" index returned the correctly loaded values in "weights_mat_mul_2".
I got no compiler error, and since I am loading sequentially, I expected w_temp to always feed correct data into weights_mat_mul_2, as w_temp is overwritten on every iteration.
I then tested in simulation, and every tile in weights_mat_mul_2 contained the correct values. I then tried moving the "w_temp" array declaration into the outermost for loop, and all problems were resolved.