V0.10 FINN Build Flow fails at step_hw_ipgen due to Error in threshold hls layer. #1060

Arbiter-glitch · 2024-04-24T05:32:07Z

Prerequisites

Possible Bug in the current version v0.10 of FINN.

Quick summary

Ipgen step for the threshold hls layers in the latest version of finn (v0.10) fails. The stack trace is included below.

This shows the error in one layer. But when the logs of other Threshold hls layers were checked, same error was found. The log files of some layers are included here.
--> ERROR: [HLS 207-814] "Bitwidth exceeds 32768 (1 << 15), the maximum allowed value" <--
logs.zip

Steps to Reproduce

Clone the FINN repository main branch
Start the docker container with the command: sudo service docker start
Set the following export varibles
export FINN_XILINX_PATH="/your/xilinx/path"
export FINN_XILINX_VERSION=2022.2
export PYNQ_BOARD=ZCU104
Before running the docker script, I have added a custom step in the buid_dataflow steps. This step is used for preprocessing images by dividing the input by 255. For this step to be part of the default build steps add it to the build_dataflow_config .py file in the builder folder in src/finn.
Go to the cloned Finn directory and run docker script file with build_dataflow and path to the folder with the model and the config file. The model file, the required config files, and the build dataflow steps python file with the streamlining steps used are included here.
files.zip

Expected behavior

Completion of the Finn steps

Actual behavior

Fails at Ipgen step due to error in hls layers, resulting in not making the corresponding required IP.

Additional Info

I came across this bug because of an issue here. I tried to make it work in v0.10 by replacing the hls threshold layers that throw the error, with RTL variants but the expected image output was not produced. Since, I was getting the expected output in v0.9, I wanted to find the layer responsible for the wrong output in v0.10, by making all layers as hls like v0.9 but it threw an error in the threshold layers. Refer to steps above to reproduce the error.

auphelia · 2024-04-24T09:29:32Z

Hi @Arbiter-glitch, from the logfiles that you have provided, I see this error: ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/mnt/f/workdir/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:520:35) . This looks like you are trying to use a different datatype than expected, are you using the code generation from the FINN flow?
How does the generated .cpp code for the layer that fails look like? The one from this repository: /mnt/f/linux/test/code_gen_ipgen_Thresholding_hls_5_f33du_vs/top_Thresholding_hls_5.cpp

Arbiter-glitch · 2024-04-25T02:44:03Z

Hi @auphelia, I have given "test" as my build directory by setting it through export FINN_HOST_BUILD_DIR command. Apart from changes mentioned earlier, i.e. in the config file for specializing layers and the build_dataflow file. I have made no changes anywhere else in the FINN flow. I set "test" as my build dir because I am using FINN in WSL and the default host build dir temp is inside the Linux environment which cannot be accessed after the WSL has ended.

I saw this particular info INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<25245>' requested here (/mnt/f/linux/test/code_gen_ipgen_Thresholding_hls_5_f33du_vs/top_Thresholding_hls_5.cpp:28:1) in the log file as well but I am using the code generation from FINN flow itself.

fpjentzsch · 2024-04-29T18:05:28Z

HLS has a maximum stream width limitation of 8192 bits. You could be running into this because at least one internal stream within the FINN accelerator is too wide. If this is the cause, I'm afraid you'll have to decrease the parallelism (PE or SIMD) for the offending layers or switch to the newly released RTL back-end, which shouldn't have this limitation.

It's just weird that this was working in v0.9. Maybe some change caused incorrect folding of the thresholding layers?

Arbiter-glitch · 2024-04-29T18:16:09Z

I used the same folding factors that I used in version v0.9. I tried replacing the layers that threw the error with RTL layers in v0.10, it gave the synthesized bitfile but not the correct output whereas in v0.9 when using old layers (HLS), even though each thresholding layers were taking a lot of resources like around 13K luts each, it gave the correct output. I had this doubt that there might be something wrong with Thresholding RTL layers in v0.10 because they were using way fewer resources like around 800Luts each when compared to old layers in v0.9.

But its also entirely possible that in v0.10 the incorrect output may also not be due to Threshold layer (RTL) but some other layer.

Arbiter-glitch · 2024-04-30T09:44:16Z

[Update]:

To find out why the output is different in finn v0.10. I took Finn v0.9 and started to replace each of the old layers with new RTL layers, I checked at each stage. And it was only after replacing the old Thresholding hls layers with RTL layers, the output varied from the expected to give the white one again. So the problem might be with the thresholding rtl layers.

auphelia · 2024-04-30T09:47:43Z

Thanks @Arbiter-glitch for looking closer into this!
Could you provide an isolated example? So, with which settings of the Thresholding layer do you see a difference between HLS and RTL?
Maybe using the Thresholding test case can help with that: https://github.com/Xilinx/finn/blob/dev/tests/fpgadataflow/test_fpgadataflow_thresholding.py#L141

Arbiter-glitch · 2024-04-30T13:22:29Z

I have replaced all the Thresholding hls layers (mem_mode "const") of v0.9 with v0.10 rtl layers. The attributes of the different threshold rtl layers are given in the images below.

For image 1 in v0.10

n_inp_vecs = [1, 40, 40]

Function calls

test_fpgadataflow_thresholding(impl_style="hls", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="internal_embedded")

Result:

Too many warnings. Cannot list them all.

test_fpgadataflow_thresholding(impl_style="hls", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="internal_decoupled")

Result:

INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<18360>' requested here (/mnt/f/linux/test5/code_gen_ipgen_Thresholding_hls_0_7gbhs8ae/top_Thresholding_hls_0.cpp:28:1)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive

test_fpgadataflow_thresholding(impl_style="rtl", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="")

Result:

A bunch of warnings and an assertion error "assert np.isclose(exp_cycles, cycles_rtlsim, atol=15)" which seems not related to variation at output. But final output went back to white colored one as soon as I used the Threshold RTL layers in v0.9. And Threshold RTL layers in v0.10 is also giving the same white output??

                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  132 |  localparam int unsigned  MAX_PENDING = (DEEP_PIPELINE+1)*N + 3;
      |                                                       ^
                ... For warning description see https://verilator.org/warn/WIDTH?v=4.224
                ... Use "/* verilator lint_off WIDTH */" and lint_on around source to disable this message.
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:220:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_URAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  220 |      DEPTH_TRIGGER_URAM && (DEPTH >= DEPTH_TRIGGER_URAM)? "ultra" :
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:221:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_BRAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  221 |      DEPTH_TRIGGER_BRAM && (DEPTH >= DEPTH_TRIGGER_BRAM)? "block" :
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:223:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_BRAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  223 |      DEPTH_TRIGGER_BRAM && (DEPTH >= 64)? "distributed" : "auto";
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:137:63: Operator ASSIGN expects 6 bits on the Assign RHS, but Assign RHS's SUB generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  137 |   logic signed [$clog2(MAX_PENDING):0]  GuardSem = MAX_PENDING-1;  
      |                                                               ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:140:22: Operator ASSIGNDLY expects 6 bits on the Assign RHS, but Assign RHS's SUB generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  140 |    if(rst)  GuardSem <= MAX_PENDING-1;
      |                      ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:213:29: Operator EQ expects 32 or 2 bits on the LHS, but LHS's SEL generates 1 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  213 |    uwire  cs = (p.ptr[SN:0] == 2**SN-1);
      |                             ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:344:17: Operator LTS expects 32 bits on the LHS, but LHS's VARREF 'APtr' generates 6 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  344 |     assert(APtr < $signed(A_DEPTH-1)) else begin
      |                 ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:362:17: Bit extraction of array[17:0] requires 5 bit index, not 6 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  362 |     BDat <= ADat[APtr];
      |                 ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:369:31: Operator ADD expects 32 bits on the LHS, but LHS's SEL generates 8 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  369 |    assign odat[pe] = BDat[pe] + BIAS;
      |                               ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:369:20: Operator ASSIGNW expects 8 bits on the Assign RHS, but Assign RHS's ADD generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  369 |    assign odat[pe] = BDat[pe] + BIAS;
      |                    ^

Note: I tried the test in v0.9 as well, but the hls layers that worked and gave correct output used mem mode "const" even though there were many warnings. But for mem_mode "decoupled". It gave the same bitwidth error in v0.9.

[Update]:

In light of the above tests, I tried the finn v0.10 with Thresholding hls and mem_mode="internal_embedded" and it works.
The problem is with the mem_mode="internal_decoupled" just like in the tests. And thresholding RTL even though the tests do not show any variation in output. My final output is very varied.

Arbiter-glitch · 2024-05-02T04:49:38Z

I have posted the following images to show the variation in output due to threshold RTL layers. The first image is obtained with all threshold_HLS layers and mem_mode=internal_embedded. The Pynq board that I am targeting is the ZCU104 evaluation board. If you zoom in on the images obtained from FPGA, you can see square patches since I am processing it patch by patch and then joining them together. The second image is obtained after replacing one HLS with a Threshold RTL layer. The onnx model image is also given.

All Threshold_HLS layer

One RTL Threshold layer

Corresponding Onnx model image [from step_specialize_layers]

auphelia · 2024-05-13T10:54:57Z

Hi @Arbiter-glitch ,
We just pushed a fix for the RTL Thresholding layer, can you please check it out if that solves your problem.

Arbiter-glitch · 2024-05-14T05:23:30Z

Hi @auphelia I checked it with the dev branch, but it has not solved the problem.
However, there is a difference from the earlier all-white output.

Earlier bad output

Current bad output

The expected Correct Output taken from threshold hls layers (internal_embedded)

auphelia · 2024-05-14T07:46:17Z

Hi @Arbiter-glitch,

You are dealing with quite a high bit width (for QNNs on FPGAs) and that causes (at least in the HLS Thresholding) the error in the HLS synthesis. I reproduced according to your setting above and to represent the generated thresholds, you will need the data type INT24 (like the input). When you use the internal_decoupled mode, the threshold memory is not part of the HLS code (see here: https://finn.readthedocs.io/en/latest/internals.html#hls-variant-of-matrixvectoractivation-mem-mode) and you will have a streaming interface which is defined by the setting of PE, the threshold bit width (INT24) and the number of steps (coming from the output activation). You can find the code here: https://github.com/Xilinx/finn/blob/dev/src/finn/custom_op/fpgadataflow/hls/thresholding_hls.py#L117
This results in a weight stream width of 24480 (hls::stream<ap_uint<24480>> &weights_V) in the generated top_Thresholding_hls_0.cpp file, which exceeds the allowed 8191 (limitation from Vitis HLS). Using internal_embedded with HLS solves the problem, like you mentioned above.

Regarding the RTL Thresholding:
Could you please check again, if you could provide a test case for the RTL Thresholding that shows incorrect behaviour? You could isolate a specific node from your network.

For debugging purposes, you could execute the model either right after streamlining or after convert to hw layers with return_full_exec_context=True and doing the same with node-by-node rtlsim after specializing the layers. Then a dictionary will be returned with every intermediate tensor and you can better compare different network settings.
There is a verification notebook that can help setting up the execution manually, you can use the onnx files from the intermediate_models folder:

Arbiter-glitch · 2024-05-15T11:41:00Z

I did the execution of model after convert_to_hw layer and compared the output tensors according to the verfication notebook. The golden output was the output from the, all threshold hls layer network. Both results were the same for convert_to_hw layers.
But in node-by-node rtl sim, the output result for network model with threshold rtl was different from the threshold hls layer. I tried to get dictionary values of intermediate tensors for rtlsim by using [return_full_exec_context=True] in the onnx execution but all the displayed intermediate tensors were showing all zeros except for the final global_out. I tried using the finn.core.onnx_exec.execute_onnx_and_make_model(model, input_dict) to get the intermediate tensor in visual form in netron but it shows the same. This execute_onnx was done to get the tensor values from the threshold_hls layer network for comparison with the wrong output threshold rtl layer network. Also giving start_node and end_node for execute_onnx did not stop execution at the end_node.
check the zip file for the model file with intermediate tensor values as well as the verification notebook ipynb file.
Verification.zip

auphelia · 2024-05-16T09:20:59Z

Hi @Arbiter-glitch,

I know it's getting a bit into trial & error territory, but could you try your example with the latest dev again?
In the previous RTL Thresholding implementation, we had the requirement that input data type = threshold data type.
We implemented now the option to set these data types independent from each other. I can see in one of the images you shared above that your threshold nodes have different values for input data type and threshold data type, so the error could be coming from this.

Regarding your trial with the execution flow:
The execution functions in FINN are based on tensor names, an error that often happens is that a certain name for a tensor is assumed e.g. "global_in" but the actual name of the tensor is different, you can prevent that by not using a fixed string but instead using e.g. model.graph.input[0].name in the input_dict.

Arbiter-glitch · 2024-05-17T06:12:41Z

Hi @auphelia
Thanks...Its working now. Could you tell me whether it's possible to split a 3-channel tensor to three separate single channel tensors and process them separately in separate networks using finn. Because when a split node is introduced, I don't know how to convert it. How do I introduce a fork in onnx model. I found a concat layer but no split?

auphelia · 2024-05-17T08:20:38Z

Thanks @Arbiter-glitch,
Glad to hear that it's working with the RTL Thresholding now! Your project looks interesting, if you like you could make a post in the Show & Tell section about it to share with other users.

Could you post your next question in GitHubDiscussions please? I will close this issue and related issue #1055 and mark it as solved by PR #1077.
Keeping the discussions separately makes the repo better searchable for other users with similar questions/issues.

Arbiter-glitch added the bug Something isn't working label Apr 24, 2024

azizb-xlnx self-assigned this May 3, 2024

auphelia closed this as completed May 17, 2024

auphelia mentioned this issue May 17, 2024

The model works with V0.9 but not with new layers in V0.10. #1055

Closed

Arbiter-glitch mentioned this issue May 24, 2024

Drop in accuracy of Output between quantized Brevitas model and the streamlined Finn onnx model output. #1088

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V0.10 FINN Build Flow fails at step_hw_ipgen due to Error in threshold hls layer. #1060

V0.10 FINN Build Flow fails at step_hw_ipgen due to Error in threshold hls layer. #1060

Arbiter-glitch commented Apr 24, 2024 •

edited

Loading

auphelia commented Apr 24, 2024 •

edited

Loading

Arbiter-glitch commented Apr 25, 2024 •

edited

Loading

fpjentzsch commented Apr 29, 2024 •

edited

Loading

Arbiter-glitch commented Apr 29, 2024 •

edited

Loading

Arbiter-glitch commented Apr 30, 2024 •

edited

Loading

auphelia commented Apr 30, 2024

Arbiter-glitch commented Apr 30, 2024 •

edited

Loading

Arbiter-glitch commented May 2, 2024

auphelia commented May 13, 2024

Arbiter-glitch commented May 14, 2024 •

edited

Loading

auphelia commented May 14, 2024

Arbiter-glitch commented May 15, 2024 •

edited

Loading

auphelia commented May 16, 2024

Arbiter-glitch commented May 17, 2024 •

edited

Loading

auphelia commented May 17, 2024

V0.10 FINN Build Flow fails at step_hw_ipgen due to Error in threshold hls layer. #1060

V0.10 FINN Build Flow fails at step_hw_ipgen due to Error in threshold hls layer. #1060

Comments

Arbiter-glitch commented Apr 24, 2024 • edited Loading

Prerequisites

Quick summary

Steps to Reproduce

Expected behavior

Actual behavior

Additional Info

auphelia commented Apr 24, 2024 • edited Loading

Arbiter-glitch commented Apr 25, 2024 • edited Loading

fpjentzsch commented Apr 29, 2024 • edited Loading

Arbiter-glitch commented Apr 29, 2024 • edited Loading

Arbiter-glitch commented Apr 30, 2024 • edited Loading

auphelia commented Apr 30, 2024

Arbiter-glitch commented Apr 30, 2024 • edited Loading

For image 1 in v0.10

Function calls

Result:

Result:

Result:

Note: I tried the test in v0.9 as well, but the hls layers that worked and gave correct output used mem mode "const" even though there were many warnings. But for mem_mode "decoupled". It gave the same bitwidth error in v0.9.

[Update]:

Arbiter-glitch commented May 2, 2024

All Threshold_HLS layer

One RTL Threshold layer

Corresponding Onnx model image [from step_specialize_layers]

auphelia commented May 13, 2024

Arbiter-glitch commented May 14, 2024 • edited Loading

Earlier bad output

Current bad output

The expected Correct Output taken from threshold hls layers (internal_embedded)

auphelia commented May 14, 2024

Arbiter-glitch commented May 15, 2024 • edited Loading

auphelia commented May 16, 2024

Arbiter-glitch commented May 17, 2024 • edited Loading

auphelia commented May 17, 2024

Arbiter-glitch commented Apr 24, 2024 •

edited

Loading

auphelia commented Apr 24, 2024 •

edited

Loading

Arbiter-glitch commented Apr 25, 2024 •

edited

Loading

fpjentzsch commented Apr 29, 2024 •

edited

Loading

Arbiter-glitch commented Apr 29, 2024 •

edited

Loading

Arbiter-glitch commented Apr 30, 2024 •

edited

Loading

Arbiter-glitch commented Apr 30, 2024 •

edited

Loading

Arbiter-glitch commented May 14, 2024 •

edited

Loading

Arbiter-glitch commented May 15, 2024 •

edited

Loading

Arbiter-glitch commented May 17, 2024 •

edited

Loading