Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

sherinei · 2024-12-07T01:48:12Z

Hello, we are getting the following error when trying to use nl.add:

`Running correctness test for conv2d kernel with larger images...[GCA035] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB
Memory Location Accessed:
res.48_i0: 888 Bytes per Partition and total of: 113664 Bytes in SB
position_out_i0: 98568 Bytes per Partition and total of: 12616704 Bytes in SB
position_out_i0: 98568 Bytes per Partition and total of: 12616704 Bytes in SB
Total Accessed Bytes per partition by instruction: 198024
Total SB Partition Size: 196608

Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
Traceback (most recent call last):
File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 196, in
test_result = test_correctness_conv2d_kernel(conv2d, use_larger_images=True)
File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 85, in test_correctness_conv2d_kernel
out = kernel(*args, **kwargs)
File "neuronxcc/nki/compile.py", line 92, in neuronxcc.nki.compile.GenericKernel.call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 174, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.Kernel.call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 422, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 425, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 508, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel._compile
RuntimeError: Compilation failed for fused_conv2d_maxpool with error Command '['neuronx-cc', 'compile', '--framework', 'XLA', 'penguin.py', '--internal-tensorizer-opt-level=nki', '--pipeline', 'compile', 'SaveTemps', '--target', 'trn1', '--disable-internal-io-dge', '--output=file.neff']' returned non-zero exit status 70.`

We're not sure what's causing this error. Any help would be appreciated. Thanks.

aws-serina-tan · 2024-12-07T02:42:02Z

Each SBUF partition in NeuronCore-v2 only has 196KiB of physical memory. When a TensorTensor instruction (triggered by a call of nl.add) is executed, all the input and output tensors must fit in SBUF. More info on SBUF: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#trainium-inferentia2-arch.

Here, you will need to reduce the tile size of your nl.add() calls. Use loops to iterate over different chunks of your original tensor.

sherinei · 2024-12-07T05:22:47Z

Hello, Thanks for your response. We don't believe this is a memory issue as we are not increasing the size of the output tensor, we are simply adding in place. Do you mind taking a look at our code? Thank you! Best, Sherine Ismail

…

________________________________ From: aws-serina-tan ***@***.***> Sent: Friday, December 6, 2024 6:42 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Author ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Each SBUF partition only has 196KiB of physical memory. When a TensorTensor instruction (triggered by a call of nl.add) is executed, all the input and output tensors must fit in SBUF. More info on SBUF: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#trainium-inferentia2-arch. Here, you will need to reduce the tile size of your nl.add() calls. Use loops to iterate over different chunks of your original tensor. — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX5ZHPZS3FFV46MUJEZT2EJOA5AVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHAZDAOBXHA>. You are receiving this because you authored the thread.Message ID: ***@***.***>

AWSNB · 2024-12-07T05:58:56Z

@sherinei could you share your latest code via gist or github with @aws-serina-tan @AWSNB @aws-zhehongb @JonathanHenson @aws-qieqingy @EmilyWebber

and if u can share the code you doing to adding in place ? are you using a = a+b, a+=b, a[...] = a+b, or a[...] += b ?

sherinei · 2024-12-07T05:59:40Z

Also, we noticed that our code works (runs and passes tests without errors) only when we use nki.simulate_kernel. Let me know if you think of anything that could be causing this issue. Thanks again! Best, Sherine Ismail

…

________________________________ From: Sherine M Ismail ***@***.***> Sent: Friday, December 6, 2024 9:22 PM To: aws-neuron/aws-neuron-sdk ***@***.***>; aws-neuron/aws-neuron-sdk ***@***.***> Cc: Author ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Hello, Thanks for your response. We don't believe this is a memory issue as we are not increasing the size of the output tensor, we are simply adding in place. Do you mind taking a look at our code? Thank you! Best, Sherine Ismail

________________________________ From: aws-serina-tan ***@***.***> Sent: Friday, December 6, 2024 6:42 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Author ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Each SBUF partition only has 196KiB of physical memory. When a TensorTensor instruction (triggered by a call of nl.add) is executed, all the input and output tensors must fit in SBUF. More info on SBUF: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#trainium-inferentia2-arch. Here, you will need to reduce the tile size of your nl.add() calls. Use loops to iterate over different chunks of your original tensor. — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX5ZHPZS3FFV46MUJEZT2EJOA5AVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHAZDAOBXHA>. You are receiving this because you authored the thread.Message ID: ***@***.***>

sherinei · 2024-12-07T06:10:54Z

Just shared it. You can find the code in part2/conv2d.py in the fused_conv2d_maxpool() function. We specifically do: # store to output array in hbm res = nl.add(position_out, bias_i) position_out_reshaped = res.reshape((c_out_pmax, tiled_out_height, out_width)) nl.store(X_out[b, out_i*c_out_pmax:out_i*c_out_pmax + c_out_pmax, start_out_height:end_out_height], value = position_out_reshaped) # account for pooling later Thanks again.

…

________________________________ From: AWSNB ***@***.***> Sent: Friday, December 6, 2024 9:59 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) @sherinei<https://github.com/sherinei> could you share your latest code via gist or github with @aws-serina-tan<https://github.com/aws-serina-tan> @AWSNB<https://github.com/AWSNB> @aws-zhehongb<https://github.com/aws-zhehongb> @JonathanHenson<https://github.com/JonathanHenson> @aws-qieqingy<https://github.com/aws-qieqingy> @EmilyWebber<https://github.com/EmilyWebber> and if u can share the code you doing to adding in place ? are you using a = a+b, a+=b, a[...] = a+b, or a[...] += b ? — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX5764W2WFYBFQPQP6FT2EKFDPAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE2TKMZXGI>. You are receiving this because you were mentioned.Message ID: ***@***.***>

AWSNB · 2024-12-07T07:00:04Z

@sherinei couple of other comments on the code:

Line 185: nl.add(row_out, bias_i) ==> you are not assigning the result of add to any destination.
this should be: c = nl.add(a,b)

Line 184-192: try adding the bias after copying matmul to sbuf, and instead of +=, try nl.add

                             
                                row_out[...] = nl.matmul(w[:, in_i*c_in_pmax:in_i*c_in_pmax + c_in_pmax, i, j], x_row)
                                # nl.add(row_out, bias_i) -- move this to add after data is in sbuf
                                # copy per row output into corresponding index in sbuf array
                                row_out_sbuf = nl.ndarray(shape=row_out.shape, dtype=row_out.dtype, buffer=nl.sbuf)
                                row_out_sbuf = nl.copy(row_out, dtype=row_out.dtype) # from psum to sbuf
                                row_out_sbuf[...] = nl.add(row_out_sbuf, bias_i); # putting this here to do sbuf to sbuf
                                # print(row_out_sbuf.shape, bias_i.shape)
                                po_start_index = h * out_width
                                po_end_index = po_start_index + out_width
                                position_out[:, po_start_index:po_end_index] = nl.add( position_out[:, po_start_index:po_end_index], row_out_sbuf)   # changing from += to nl.add or can do something like a[...] = a + b

sherinei · 2024-12-07T07:08:10Z

Hi sorry for the confusion, the issue we have is actually the nl.add on line 193 in the code, we forgot to delete that earlier nl.add which shouldn't be there

…

________________________________ From: AWSNB ***@***.***> Sent: Friday, December 6, 2024 11:00 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) @sherinei<https://github.com/sherinei> couple of other comments on the code: Line 185: nl.add(row_out, bias_i) ==> you are not assigning the result of add to any destination. this should be: c = nl.add(a,b) Line 184-192: try adding the bias after copying matmul to sbuf, and instead of +=, try nl.add row_out[...] = nl.matmul(w[:, in_i*c_in_pmax:in_i*c_in_pmax + c_in_pmax, i, j], x_row) # nl.add(row_out, bias_i) -- move this to add after data is in sbuf # copy per row output into corresponding index in sbuf array row_out_sbuf = nl.ndarray(shape=row_out.shape, dtype=row_out.dtype, buffer=nl.sbuf) row_out_sbuf = nl.copy(row_out, dtype=row_out.dtype) # from psum to sbuf row_out_sbuf[...] = nl.add(row_out_sbuf, bias_i); # putting this here to do sbuf to sbuf # print(row_out_sbuf.shape, bias_i.shape) po_start_index = h * out_width po_end_index = po_start_index + out_width position_out[:, po_start_index:po_end_index] = nl.add( position_out[:, po_start_index:po_end_index], row_out_sbuf) # changing from += to nl.add or can do something like a[...] = a + b — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX54765IACVCSIOYFGOD2EKMIVAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE3TONBYHE>. You are receiving this because you were mentioned.Message ID: ***@***.***>

AWSNB · 2024-12-07T07:15:44Z

Got it 1. You should still change line 191 to use nl.add instead of += 2. Did you confirm position_out & bias_i have the same shape ? From: sherinei ***@***.***> Reply-To: aws-neuron/aws-neuron-sdk ***@***.***> Date: Friday, December 6, 2024 at 11:09 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: "Bshara, Nafea" ***@***.***>, Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Hi sorry for the confusion, the issue we have is actually the nl.add on line 193 in the code, we forgot to delete that earlier nl.add which shouldn't be there

…

________________________________ From: AWSNB ***@***.***> Sent: Friday, December 6, 2024 11:00 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) @sherinei<https://github.com/sherinei> couple of other comments on the code: Line 185: nl.add(row_out, bias_i) ==> you are not assigning the result of add to any destination. this should be: c = nl.add(a,b) Line 184-192: try adding the bias after copying matmul to sbuf, and instead of +=, try nl.add row_out[...] = nl.matmul(w[:, in_i*c_in_pmax:in_i*c_in_pmax + c_in_pmax, i, j], x_row) # nl.add(row_out, bias_i) -- move this to add after data is in sbuf # copy per row output into corresponding index in sbuf array row_out_sbuf = nl.ndarray(shape=row_out.shape, dtype=row_out.dtype, buffer=nl.sbuf) row_out_sbuf = nl.copy(row_out, dtype=row_out.dtype) # from psum to sbuf row_out_sbuf[...] = nl.add(row_out_sbuf, bias_i); # putting this here to do sbuf to sbuf # print(row_out_sbuf.shape, bias_i.shape) po_start_index = h * out_width po_end_index = po_start_index + out_width position_out[:, po_start_index:po_end_index] = nl.add( position_out[:, po_start_index:po_end_index], row_out_sbuf) # changing from += to nl.add or can do something like a[...] = a + b — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX54765IACVCSIOYFGOD2EKMIVAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE3TONBYHE>. You are receiving this because you were mentioned.Message ID: ***@***.***> — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFTRWCOXPGWHKWWABFQJBTD2EKNHBAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE4DANBSGM>. You are receiving this because you were mentioned.Message ID: ***@***.***>

sherinei · 2024-12-07T07:24:36Z

For (1), we tried changing it but got the following error: position_out[:, po_start_index:po_end_index] = nl.add(position_out[:, po_start_index:po_end_index], row_out_sbuf) SyntaxError: Unexpected output dependencies, missing indices in the dst access: i, j For (2), position_out is (128, tiled_out_height * out_width) and bias_i is (128, 1); however, we were told that nl.add(...) would broadcast bias_i to the same shape as position_out.

…

________________________________ From: AWSNB ***@***.***> Sent: Friday, December 6, 2024 11:16 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Got it 1. You should still change line 191 to use nl.add instead of += 2. Did you confirm position_out & bias_i have the same shape ? From: sherinei ***@***.***> Reply-To: aws-neuron/aws-neuron-sdk ***@***.***> Date: Friday, December 6, 2024 at 11:09 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: "Bshara, Nafea" ***@***.***>, Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) Hi sorry for the confusion, the issue we have is actually the nl.add on line 193 in the code, we forgot to delete that earlier nl.add which shouldn't be there

________________________________ From: AWSNB ***@***.***> Sent: Friday, December 6, 2024 11:00 PM To: aws-neuron/aws-neuron-sdk ***@***.***> Cc: Sherine M Ismail ***@***.***>; Mention ***@***.***> Subject: Re: [aws-neuron/aws-neuron-sdk] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error (Issue #1060) @sherinei<https://github.com/sherinei> couple of other comments on the code: Line 185: nl.add(row_out, bias_i) ==> you are not assigning the result of add to any destination. this should be: c = nl.add(a,b) Line 184-192: try adding the bias after copying matmul to sbuf, and instead of +=, try nl.add row_out[...] = nl.matmul(w[:, in_i*c_in_pmax:in_i*c_in_pmax + c_in_pmax, i, j], x_row) # nl.add(row_out, bias_i) -- move this to add after data is in sbuf # copy per row output into corresponding index in sbuf array row_out_sbuf = nl.ndarray(shape=row_out.shape, dtype=row_out.dtype, buffer=nl.sbuf) row_out_sbuf = nl.copy(row_out, dtype=row_out.dtype) # from psum to sbuf row_out_sbuf[...] = nl.add(row_out_sbuf, bias_i); # putting this here to do sbuf to sbuf # print(row_out_sbuf.shape, bias_i.shape) po_start_index = h * out_width po_end_index = po_start_index + out_width position_out[:, po_start_index:po_end_index] = nl.add( position_out[:, po_start_index:po_end_index], row_out_sbuf) # changing from += to nl.add or can do something like a[...] = a + b — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX54765IACVCSIOYFGOD2EKMIVAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE3TONBYHE>. You are receiving this because you were mentioned.Message ID: ***@***.***> — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFTRWCOXPGWHKWWABFQJBTD2EKNHBAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE4DANBSGM>. You are receiving this because you were mentioned.Message ID: ***@***.***> — Reply to this email directly, view it on GitHub<#1060 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BEITX5232QQZCIUMEU24PPL2EKODNAVCNFSM6AAAAABTFW55WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRUHE4DEOJVGA>. You are receiving this because you were mentioned.Message ID: ***@***.***>

AWSNB · 2024-12-07T08:01:31Z

you are right on (2) and these specific dimensions are indeed broadcastable so the code is good

re (1), see hongbin's comments about indices inside loops in case that helps

sherinei · 2024-12-07T08:20:48Z

don't think those comments help with our issue, position_out[:, po_start_index:po_end_index] += row_out_sbuf runs fine but position_out[:, po_start_index:po_end_index] = position_out[:, po_start_index:po_end_index] + row_out_sbuf and position_out[:, po_start_index:po_end_index] = nl.add(position_out[:, po_start_index:po_end_index], row_out_sbuf) return errors. position_out is of shape(c_out_pmax, tiled_out_height * out_width)

aws-zhehongb · 2024-12-08T00:28:34Z

in

                res = nl.add(position_out, bias_i)

for test Running correctness test for conv2d kernel with larger images, both res and position_out have shape (128, 24642). In order for the addition to happen, you need both position_out and res in sbuf at the same time.

for fp32 datatype, you need 24642 * 4 per partition for position_out and 24642 * 4 per partition for res, i.e. you need 24642 * 4 * 2 bytes, which exceed the sbuf capacity = 192k per partition minus some extra overhead for the RT.

You need to consider to tile the computation into multiple smaller tiles to overcome this problem

aws-serina-tan added the NKI label Dec 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

sherinei commented Dec 7, 2024 •

edited

Loading

aws-serina-tan commented Dec 7, 2024 •

edited

Loading

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024 via email

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024 via email

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024

aws-zhehongb commented Dec 8, 2024

Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

Comments

sherinei commented Dec 7, 2024 • edited Loading

aws-serina-tan commented Dec 7, 2024 • edited Loading

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024 via email

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024 via email

sherinei commented Dec 7, 2024 via email

AWSNB commented Dec 7, 2024

sherinei commented Dec 7, 2024

aws-zhehongb commented Dec 8, 2024

sherinei commented Dec 7, 2024 •

edited

Loading

aws-serina-tan commented Dec 7, 2024 •

edited

Loading