Different values when using nki.simulate_kernel #1051
Could you paste your kernel and the nki.simulate_kernel code in a GitHub secret gist and share it with me? |
Here it is: |
Could you try to use nl.device_print()? |
I can do that to print intermediate values but that doesn't actually help me figure out what is going wrong when I don't use simulate. And the simulate outputs are correct while the non-simulate outputs are not correct. How do you suggest I use nl.device_print()? |
You can probe intermediate values by storing them to HBM and returning them as additional kernel outputs, for example:
|
High-level hints to work around compiler bugs:
|
I can try probing values output by the kernel. However, I'm confused by your comment about compiler bugs. Is this a compiler bug? Because no compile-time error is thrown. In fact, no run-time error is thrown, either. |
@iamsalil using res_psum += nl.matmul is a fine choice. += will work and perform well when the right-hand side is a matmul output, because of a special hardware circuit in psum that allows accumulation in place |
@AWSNB Thank you for the suggestion but unfortunately, that did not fix it. My code no longer has any other += in it. |
@iamsalil I've asked some questions on the gist you shared. Feel free to dialog there if you'd like and I'll be sure to summarize the sharable parts back here when we're done. |
@iamsalil I found at least one thing of interest, and commented on the gist. Give it a try and let us know if that unblocks you or not. I was incorrect in my hint as I was using a different decorator to invoke the kernel. I will rerun in the morning and see what else I can find. |
Hi @JonathanHenson. Thanks so much for taking the time to look at this. I unfortunately don't see any comments on the gist other than a suggestion to share the outputs. What was your suggestion? In the meantime, here are the outputs. The inputs were a (4, 128, 30, 14) batch of images convolved with 256 filters of shape (256, 128, 3, 3), with a bias of shape (256,) that is all zeros (no bias) and a pool_size of 1 (so no pooling occurs). What I've output is the top 5x5 square of the second channel of the first image (i.e. X_out[0, 1, :5, :5]). Note that channel index 1 is not chosen for any particular reason. The output numbers are not just slightly off: for some elements they are wildly off (element [1, 0] is off by ~4%), so it does not seem to me like it's just a rounding issue. Additionally, I know that the simulate results are the correct ones (I am doing this for a homework assignment at my university and the simulate results match the expected results), so the baremetal results are incorrect. |
Hello, can you also share the test inputs and their dtypes? If this is BF16, could you give FP32 a shot? Feel free to link to the exact test case in your assignment as well: https://github.com/stanford-cs149/asst4-test/blob/main/part2/test_harness.py. |
I can reproduce the issue by copying |
Also facing the same discrepancy issue when not simulating. Already replaced ranges with sequential_range, though this did not fix the error. Unfortunately changing precisions is not an option. Repo: https://github.com/andxalex/CS149/blob/main/asst4-trainium/part2/conv2d.py Reproduce: Run in simulation mode: |
I ended up figuring it out. The issue was the only call to reshape() in my kernel, which was used to decompose a free dimension. Replacing this with a for loop as shown below fixed the problem. Bad:
Good:
|
Is this issue resolved? |
I'm glad you got it sorted. My suggestion last night was based on an error in my configuration, so I didn't want to lead you down the wrong path and deleted it. Thanks for sharing the details and I'm glad we were able to help. Feel free to reach out if you encounter any further issues. |
@aws-zhehongb This issue was not resolved. I don't know who andxalex is (I presume another student in the class). It seems they had a similar issue to mine, posted in this thread, and then were able to resolve it. However, my original issue was never resolved, unfortunately. I am not sure how much longer I will pursue solving this issue, though. I may try digging around a little more and will post any updates if I find them. Thank you NKI team for your support. |
looks like it is because reusing Could you try this? In the code remove:
and change the producer to
|
Doing that didn't seem to fix the error, unfortunately. The error occurs even when the bias is all 0s. |
Hi @aws-zhehongb , I've also faced the same problem. Specifically, the small image test of this configuration failed in hardware but passed in simulation:
And when
This only occurred after I modified the code such that I transpose the weights outside of the loop, which helped the performance a lot. This can be found in the preload-weights branch. Prior to this change, the code runs correctly in both hardware and simulation (part2 branch). I've added you to the private repo. Would really appreciate it if you can help take a look or give us some advice on how to debug this kind of issue (i.e. passing in simulation but failing in hardware). |
@allpan3, quick note:
is inside loops b and fx, but the dst of the load X_tensor[ih, ic, :, :] is not indexed by b and fx; this may trigger a compiler bug |
@allpan3 I sent you a pull request. Also, you initialize psum with
then do += on the psum. That is not supported currently: += on psum is only supported when the rhs is a matmul, because the += is done by special circuitry in the psum when the input is a matmul. |
@aws-zhehongb Thanks for the help!
I wasn't aware this is a requirement. I was thinking that since each X row will get reused filter_height times, I would allocate space for all rows and hope the recently loaded ones stay in SBUF. It probably needs direct allocation in order to work as intended. I will explore other options.
I'm a bit confused. So are you saying I cannot do = nl.copy here and then do += later? How else should I preload the bias? I was doing something similar before (https://github.com/allpan3/cs149-asst4/blob/652ab8adfa6863ca5abf13c7f43f1ed4c02d965b/part2/conv2d.py#L105-L110) and it worked. |
Regarding your question about +=, here's one way that may be better.
c += nl.matmul(a, b) is the only case where += works well today (and c must be in psum), and it is the most performant version, because the matmul result goes into psum and special hardware allows accumulating matmul results there.
For any other usage, instead try: c[…] = c + a or c[…] = nl.add(c, a).
While this is described in the docs, we realize it is not clear and not intuitive, so we plan to fix this in the next release. For now, let us know if this helps. |
@AWSNB Thanks for the prompt response. I tried simply changing the code based on what you said, and now the small test passed without using @aws-zhehongb's changes, which were not performant due to duplicated loading of input rows. I'm still facing issues with the larger image size.
Still not entirely sure how allocating an array inside loops without using those indices on the LHS affects the correctness. |
I think the main error still comes from how bias is loaded (even after the suggested modification). I tried the smaller image size tests with |
@allpan3 Also note that there may be compiler errors that are being truncated/hidden by the test harness. We suggest you try to run the conv2d kernel directly, as @aws-serina-tan mentioned in this comment. Can you try that to see if you get a more useful error? |
I just tried that but don't see anything being printed out, even with bias turned on. |
We will likely need to wait for others to chime in in the morning. Just to point out one common issue we saw with indices, usually during load/store/copy, that hongbin put a comment about above, in case that helps. |
@AWSNB Yeah, although I don't exactly understand this, I've both tried hongbin's fix and implemented a more performant version myself based on the suggestion. I'm still failing the same tests, which seem to be related to bias loading/adding. I've added you to the repo if you want to see the code, but huge thanks for staying online for us on Friday. |
The bias test fails because you cannot initialize the psum and then accumulate matmul results into it. |
OK, more context for the psum behavior. When the matmul instruction writes to psum, it has two modes: overwrite mode and accumulation mode.
There is actually an internal flag in the matmul instruction to control overwrite mode vs accumulate mode. When we write
it is actually like:
Because currently we cannot explicitly control this overwrite/accumulate flag, the compiler will always set overwrite=True for the first matmul in the accumulation loop. As a result, any pre-existing value in the psum will be ignored when we enter the psum accumulation loop. |
In your latest code you need to avoid the extra indices on X_tensor to work around the compiler bug, so that it can pass.
For the bias test, you need to do the bias addition without trying to use psum accumulation |
I see, this explains why tests fail with bias. I think I understand what needs to be done now, but I have to move on to other tasks. I may revisit this if I get a chance in the future. Thanks for your help! |
I've written a kernel. It has no compile-time or run-time errors. When I wrap it with nki.simulate_kernel(), it gives me a different output.
What can possibly cause this?
I've changed all my affine_range()'s to sequential_range()'s in case this error/discrepancy is being caused by some parallelism issue. However, this did not fix my issue at all.
Also, how do I even possibly debug this? I can't even look at intermediate values because nki.language.device_print() does not work unless I'm doing kernel simulation.