
Problems with inference in PYNQ-Z1 and emulation stitched IP #855

Open
ISPRrj opened this issue Jul 15, 2023 · 13 comments
Labels: bug (Something isn't working)
ISPRrj commented Jul 15, 2023

Versions

  • PYNQ Z1: v3.0.1
  • FINN: v0.9
  • Xilinx tools: 2022.2
  • Ubuntu: 20.04

Commit hash

commit e76f20d (HEAD -> dev, origin/dev)
Merge: a3b6a7f 3873325
Author: auphelia [email protected]
Date: Tue Jul 11 10:21:26 2023 +0100

Merge pull request #852 from Xilinx/fix/alveo_build

Set axilite address range to a minimum of 4K

commit 3873325
Author: auphelia [email protected]
Date: Tue Jul 11 09:44:30 2023 +0100

[AlveoBuild] Set axilite address range to a minimum of 4K

commit a3b6a7f
Merge: e56e813 96fc4f5
Author: auphelia [email protected]
Date: Mon Jul 10 09:14:01 2023 +0100

Merge pull request #844 from Xilinx/feature/2022_2

Dev PR to update Docker environment to Ubuntu 22, Python 3.10 and Xilinx tool version

commit 96fc4f5
Author: auphelia [email protected]
Date: Fri Jul 7 15:54:13 2023 +0100

[Deps] Update qonnx version

commit 7924bf7
Author: auphelia [email protected]
Date: Fri Jul 7 14:31:14 2023 +0100

[NBs] Update notebooks to only use QONNX export

commit 391cd76
Author: auphelia [email protected]
Date: Fri Jul 7 12:07:42 2023 +0100

[deps] Bump clize to 5.0.1 and sigtools to 4.0.1

commit a48b503
Author: auphelia [email protected]
Date: Thu Jul 6 16:50:29 2023 +0100

[Tests] Update tests to only use qonnx export

commit 0cd757f
Author: auphelia [email protected]
Date: Thu Jul 6 15:50:01 2023 +0100

Quick summary

I am trying to implement the LeNet-5 network on the PYNQ-Z1 board. For that purpose I have created the network using Brevitas and obtained an accuracy of about 55% after training.

[screenshot: Brevitas training accuracy, about 55%]

I have followed all the steps of the FINN end-to-end flow, including all the intermediate checks (among them emulation via PyVerilator).

At first I thought that all the intermediate checks were passing, so I deployed to the PYNQ board, where I got only 8% accuracy, well below the 55% obtained with Brevitas.

But I later realised that when performing the stitched-IP emulation I always get the same output value, regardless of the input.

Details

I'm using the following dataset: https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz

Steps to Reproduce


  1. I create the LeNet network in Brevitas (note that I am using QuantIdentity with a bit width of 8 at the input, and that I use biases in all layers except the last one, where bias is disabled to avoid problems in the subsequent conversion to HLS layers):
# imports as used in the Brevitas examples
import torch
import torch.nn.functional as F
from torch.nn import Module
import brevitas.nn as qnn

BIT_WIDTH = 2

class QuantWeightActBiasLeNet(Module):
    def __init__(self):
        super(QuantWeightActBiasLeNet, self).__init__()
        self.quant_inp = qnn.QuantIdentity(bit_width=8, return_quant_tensor=True)
        self.conv1 = qnn.QuantConv2d(3, 6, 5, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu1 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.conv2 = qnn.QuantConv2d(6, 16, 5, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu2 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc1   = qnn.QuantLinear(16*5*5, 120, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu3 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc2   = qnn.QuantLinear(120, 84, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu4 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc3   = qnn.QuantLinear(84, 5, bias=False, weight_bit_width=BIT_WIDTH)

    def forward(self, x):
        out = self.quant_inp(x)
        out = self.relu1(self.conv1(out))
        out = F.max_pool2d(out, 2)
        out = self.relu2(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = torch.flatten(out,1)
        out = self.relu3(self.fc1(out))
        out = self.relu4(self.fc2(out))
        out = self.fc3(out)
       
        return out
  2. Network training

  3. Brevitas export

# imports as in the FINN end-to-end notebooks
from brevitas.export import export_qonnx
from qonnx.util.cleanup import cleanup as qonnx_cleanup

ready_model_filename = "Lenet_quantized.onnx"
export_qonnx(model, torch.randn(1, 3, 32, 32), ready_model_filename)
qonnx_cleanup(ready_model_filename, out_file=ready_model_filename)
  4. Tidy-up, pre- and post-processing.

[screenshot: model after tidy-up and pre-/post-processing]
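The exact code for this step is not shown in the issue; below is a minimal sketch of what it typically looks like, following the FINN end-to-end example notebooks (the file names and the UINT8 input annotation are assumptions, not the author's code):

import torch
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.core.datatype import DataType
from qonnx.util.cleanup import cleanup as qonnx_cleanup
from qonnx.transformation.infer_shapes import InferShapes
from qonnx.transformation.infer_datatypes import InferDataTypes
from qonnx.transformation.fold_constants import FoldConstants
from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames, RemoveStaticGraphInputs
from qonnx.transformation.insert_topk import InsertTopK
from qonnx.transformation.merge_onnx_models import MergeONNXModels
from finn.transformation.qonnx.convert_qonnx_to_finn import ConvertQONNXtoFINN
from finn.util.pytorch import ToTensor
from brevitas.export import export_qonnx

# tidy-up
model = ModelWrapper("Lenet_quantized.onnx")
model = model.transform(ConvertQONNXtoFINN())
model = model.transform(InferShapes())
model = model.transform(FoldConstants())
model = model.transform(GiveUniqueNodeNames())
model = model.transform(GiveReadableTensorNames())
model = model.transform(RemoveStaticGraphInputs())

# pre-processing: divide UINT8 pixels by 255, merged in front of the network
global_inp_name = model.graph.input[0].name
ishape = model.get_tensor_shape(global_inp_name)
export_qonnx(ToTensor(), torch.randn(ishape), "preproc.onnx")
qonnx_cleanup("preproc.onnx", out_file="preproc.onnx")
pre_model = ModelWrapper("preproc.onnx")
pre_model = pre_model.transform(ConvertQONNXtoFINN())
model = model.transform(MergeONNXModels(pre_model))
model.set_tensor_datatype(model.graph.input[0].name, DataType["UINT8"])

# post-processing: insert a TopK node so the output is the predicted class index
model = model.transform(InsertTopK(k=1))
model = model.transform(InferShapes())
model = model.transform(InferDataTypes())
model.save("lenet_quantized_pre_post.onnx")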

  5. Lowering and streamlining transformations
model = ModelWrapper("lenet_quantized_pre_post.onnx")
model = model.transform(MoveScalarLinearPastInvariants())
model = model.transform(Streamline())
model = model.transform(LowerConvsToMatMul())
model = model.transform(MakeMaxPoolNHWC())
model = model.transform(absorb.AbsorbTransposeIntoMultiThreshold())

model = model.transform(MakeMaxPoolNHWC())
model = model.transform(absorb.AbsorbConsecutiveTransposes())

model = model.transform(Streamline())

model = model.transform(absorb.AbsorbScalarMulAddIntoTopK())
model = model.transform(InferDataLayouts())
model = model.transform(RemoveUnusedTensors())
model.save("lenet_quantized_streamlined.onnx")

[screenshot: streamlined model]

  6. Conversion to HLS layers

# imports as in the FINN end-to-end notebooks
import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls
from finn.transformation.move_reshape import RemoveCNVtoFCFlatten
from qonnx.transformation.general import GiveUniqueNodeNames

mem_mode = "decoupled"

model = model.transform(to_hls.InferBinaryMatrixVectorActivation(mem_mode))
model = model.transform(to_hls.InferQuantizedMatrixVectorActivation(mem_mode))

model = model.transform(to_hls.InferLabelSelectLayer())
model = model.transform(to_hls.InferThresholdingLayer())
model = model.transform(GiveUniqueNodeNames())

model = model.transform(to_hls.InferThresholdingLayer())
model = model.transform(to_hls.InferConvInpGen())
model = model.transform(to_hls.InferStreamingMaxPool())

model = model.transform(RemoveCNVtoFCFlatten())

model = model.transform(absorb.AbsorbConsecutiveTransposes())

model = model.transform(InferDataLayouts())

model.save("lenet_hls_layers.onnx")

[screenshot: model after conversion to HLS layers]

  7. Dataflow partitioning (see the sketch below)
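The partitioning code is not shown in the issue; a minimal sketch following the FINN end-to-end notebooks, where the extracted child model is presumably what gets saved as flowers_dataflow_model.onnx for step 8:

from finn.transformation.fpgadataflow.create_dataflow_partition import CreateDataflowPartition
from qonnx.custom_op.registry import getCustomOp

# split the graph into a parent model and a StreamingDataflowPartition child model
parent_model = model.transform(CreateDataflowPartition())
sdp_node = parent_model.get_nodes_by_op_type("StreamingDataflowPartition")[0]
dataflow_model_filename = getCustomOp(sdp_node).get_nodeattr("model")
# the child model at dataflow_model_filename contains only the HLS layers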

  8. Folding

model = ModelWrapper("flowers_dataflow_model.onnx")
fc_layers = model.get_nodes_by_op_type("MatrixVectorActivation")
# each tuple is (PE, SIMD, in_fifo_depth) for a layer
folding = [
    (6, 3),
    (2, 6),
    (2, 2),
    (1, 1),
    (1, 1),
  
]
for fcl, (pe, simd) in zip(fc_layers, folding):
    fcl_inst = getCustomOp(fcl)
    fcl_inst.set_nodeattr("PE", pe)
    fcl_inst.set_nodeattr("SIMD", simd)
    

# use same SIMD values for the sliding window operators
swg_layers = model.get_nodes_by_op_type("ConvolutionInputGenerator")
for i in range(len(swg_layers)):
    swg_inst = getCustomOp(swg_layers[i])
    simd = folding[i][1]
    swg_inst.set_nodeattr("SIMD", simd)
    

model = model.transform(GiveUniqueNodeNames())
model.save("flowers_lenet_folded.onnx")
  9. Simulation with cppsim: works correctly (a sketch of this check is shown after this list)

  10. Node-by-node emulation with PyVerilator: works correctly

  11. Stitched-IP emulation with PyVerilator: PROBLEM: I always get the same output value, regardless of the input

  12. Deployment on the PYNQ board: PROBLEM: 8% inference accuracy
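For reference, a minimal sketch of how the cppsim check in step 9 can be run (following the FINN verification notebooks; the file name and input preparation are assumptions):

import numpy as np
import finn.core.onnx_exec as oxe
from qonnx.core.modelwrapper import ModelWrapper
from finn.transformation.fpgadataflow.set_exec_mode import SetExecMode
from finn.transformation.fpgadataflow.prepare_cppsim import PrepareCppSim
from finn.transformation.fpgadataflow.compile_cppsim import CompileCppSim

model = ModelWrapper("flowers_lenet_folded.onnx")
model = model.transform(SetExecMode("cppsim"))
model = model.transform(PrepareCppSim())
model = model.transform(CompileCppSim())

# integer-valued input in a float32 container, NHWC layout
x = np.load("input.npy").astype(np.float32)
input_dict = {model.graph.input[0].name: x}
output_dict = oxe.execute_onnx(model, input_dict)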

ISPRrj added the bug label on Jul 15, 2023
auphelia (Collaborator)

Hi @ISPRrj ,

Could you also provide an example input .npy file with the corresponding output reference .npy file?
FINN expects integer values for all components. Is your dataset quantized, or are you trying to use the first MultiThreshold for quantization?

auphelia self-assigned this on Jul 18, 2023

ISPRrj commented Jul 18, 2023

Hi @auphelia,

inputoutput.zip

In this zip file you can find input.npy and the corresponding output.npy produced by the Brevitas model.

To obtain the output of the Brevitas model I first normalized the input by dividing by 255, because I trained the Brevitas network with normalized tensors (via the ToTensor() transformation). I also reshaped the input to (1, 3, 32, 32) to match the dimensions expected by Brevitas.

When performing the inference in FINN, on the other hand, I do not apply any normalization to the data, because this is already handled by the preprocessing transformation.

As for the dataset question, I have not performed any quantization. I was trying to use the first MultiThreshold for quantization.
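A minimal sketch (assuming these file names and that model is the trained Brevitas network) of how such a reference pair can be produced:

import numpy as np
import torch

raw = np.load("input.npy")                              # raw pixel values in [0, 255]
x = torch.from_numpy(raw.astype(np.float32) / 255.0).reshape(1, 3, 32, 32)
with torch.no_grad():
    ref = model(x)                                      # trained Brevitas model
np.save("output.npy", ref.numpy())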

fpjentzsch (Collaborator)

Hi @ISPRrj,

what do you mean by the "application of the preprocessing transformation"? The division by 255 that normalizes UINT8 inputs to FLOAT [0,1]? Does that mean the primary input datatype (going into the first MultiThreshold) is annotated as UINT8?

I have two suggestions:

  1. Do you transpose the input data before feeding it to the stitched-ip and hardware accelerator? The initial "Transpose" node after HLS conversion will not be handled by the accelerator. FINN's HLS layers all operate on NHWC data layout, while you train on NCHW. Reshaping will not be enough.
  2. There might be something wrong with the way you normalize/quantize the input. Could you try to train directly on UINT8 inputs, so that you do not need the initial MultiThreshold and the input pixels (0-255) can be consumed without normalization by the first ConvolutionInputGenerator?

In general (for FLOAT inputs), I use a QuantIdentity or similar as the input quantization node in Brevitas. Since FINN's MultiThreshold node does not support FLOAT inputs, I remove this node manually from the graph and do the input quantization somewhere else (in software). I know the exact quantization range by reading it from the node's properties (if it was dynamically determined during training) or by setting a fixed range (using min_val, max_val, and scaling_impl_type = ScalingImplType.CONST).
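A minimal sketch of the host-side preparation these two suggestions amount to (the UINT8 range and the scale value are assumptions for illustration, not taken from the issue):

import numpy as np

def prepare_input(x_nchw, scale=1.0 / 255.0):
    """x_nchw: float NCHW batch in [0, 1]; scale: quantization scale of the input quantizer."""
    # FINN's HLS layers operate on NHWC, so transpose (a plain reshape is not enough)
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
    # quantize in software to the integer grid the accelerator expects (here assumed UINT8)
    x_int = np.round(x_nhwc / scale).clip(0, 255)
    # keep float32 as the container datatype; the values themselves are integers
    return x_int.astype(np.float32)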


ISPRrj commented Jul 28, 2023

Hi @fpjentzsch,
thank you very much for your feedback!

> what do you mean by the "application of the preprocessing transformation"? The division by 255 that normalizes UINT8 inputs to FLOAT [0,1]? Does that mean the primary input datatype (going into the first MultiThreshold) is annotated as UINT8?

Yes, I mean the division by 255 that normalizes UINT8 inputs to FLOAT [0,1]. But that does not mean that the input datatype going into the first MultiThreshold is annotated as UINT8. In fact it is annotated as FLOAT32, as you can see in the image below.
[screenshot: graph input annotated as FLOAT32]

> Do you transpose the input data before feeding it to the stitched-ip and hardware accelerator?
Yes, I was aware of that and transposed the data before passing it to the accelerator.

> There might be something wrong with the way you normalize/quantize the input. Could you try to train directly on UINT8 inputs, so that you do not need the initial MultiThreshold and the input pixels (0-255) can be consumed without normalization by the first ConvolutionInputGenerator?

I have tried to train directly on UINT8 inputs, but when I try to run the inference in FINN I get the following error. Is this because I have to manually remove the initial MultiThreshold? If so, how can I do it?
[screenshots: error traceback]

> In general (for FLOAT inputs), I will use a QuantIdentity or similar as the input quantization node in Brevitas. Since FINN's MultiThreshold node does not support FLOAT inputs, I remove this node manually from the graph and do the input quantization somewhere else (in software). I know the exact quantization range by reading it from the node's properties (if it was dynamically determined during training) or by setting a fixed range (using min_val, max_val, and scaling_impl_type = ScalingImplType.CONST).

So in my case, would you replace the QuantReLU with a QuantIdentity?

Thank you very much for your time!

fpjentzsch (Collaborator)

Hi,

I would not change the QuantReLU within your model, but you may try to set a fixed quantization range for the input quant node like this (maybe adjust it to use UINT8 instead of INT8):

from brevitas.inject.defaults import Int8ActPerTensorFloatMinMaxInit
from brevitas.inject.enum import ScalingImplType

class InputQuantizer(Int8ActPerTensorFloatMinMaxInit):
    min_val = -config["in_quant_range"]
    max_val = config["in_quant_range"]
    scaling_impl_type = ScalingImplType.CONST
    bit_width = config["in_quant_bits"]

self.quant_inp = qnn.QuantHardTanh(act_quant=InputQuantizer, return_quant_tensor=True)

then you could remove the initial MultiThreshold manually like this (after converting from QONNX to FINN-ONNX):

from qonnx.core.datatype import DataType  # needed for the annotation in the last line

first_node = model.graph.node[0]
if first_node.op_type == "MultiThreshold":
    quantized_input_dtype = model.get_tensor_datatype(first_node.output[0])
    # remove nodes up to first Mul (= MT + Add used for input quant)
    new_input_node = model.get_nodes_by_op_type("Mul")[0]
    new_input_tensor = model.get_tensor_valueinfo(new_input_node.input[0])
    old_input_tensor = model.graph.input[0]
    model.graph.input.remove(old_input_tensor)
    model.graph.input.append(new_input_tensor)
    model.graph.value_info.remove(new_input_tensor) # remove redundant value_info
    new_input_index = model.get_node_index(new_input_node)
    del model.graph.node[0:new_input_index]
    # make sure input datatype is set correctly
    model.set_tensor_datatype(model.graph.input[0].name, quantized_input_dtype)
else:
    model.set_tensor_datatype(model.graph.input[0].name, DataType["UINT8"])

In any case, the input data type should be properly annotated (see the last line). I'm a bit confused that you didn't run into other issues down the line while your input is set to FLOAT32...


ISPRrj commented Aug 3, 2023

Hi @fpjentzsch !

I have made the changes and now I get the following error during inference with FINN:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(uint8)) , expected: (tensor(float))

fpjentzsch (Collaborator)

It looks like the actual tensor datatype is UINT8 now, so ONNX Runtime complains. The tensor datatype should still be float throughout the whole model; FINN uses it as a container datatype to hold integers. The true datatype from FINN's perspective is only annotated in the "quantization: finn_datatype" attribute, which is set using model.set_tensor_datatype(model.graph.input[0].name, DataType["UINT8"]), for example.

What if you cast the input tensor to float datatype before feeding it to the model?
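A minimal sketch of that cast (assuming the integer-valued input is stored in input.npy and model is the loaded ModelWrapper):

import numpy as np
import finn.core.onnx_exec as oxe

x_uint8 = np.load("input.npy")                 # integer-valued input data
iname = model.graph.input[0].name
# float32 container, values themselves stay integer
input_dict = {iname: x_uint8.astype(np.float32)}
output_dict = oxe.execute_onnx(model, input_dict)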


ISPRrj commented Aug 3, 2023

If I cast the input to float datatype before feeding it to the model, the inference runs. But now the problem I see is that out of the 100 test cases I am running, the FINN and Brevitas outputs only match in 40.


rassB commented Sep 28, 2023

Hello @ISPRrj @fpjentzsch, any developments regarding this issue? I have reproduced all the steps so far with quite a similar NN, and it runs at 10% accuracy on MNIST in hardware (it always infers 0).

fpjentzsch (Collaborator)

Hi @rassB, did you also run verification/simulation at different build steps to narrow down where the error is introduced? If the problem is related to input/output shaping and quantization, maybe our new tutorial notebook could be a useful resource, as it covers a custom build step to deal with 8-bit RGB inputs.
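For reference, a minimal sketch of enabling the built-in verification at several build steps with finn.builder (the file names and the reduced set of config options are assumptions, not the tutorial's exact code):

import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="build_output",
    synth_clk_period_ns=10.0,
    board="Pynq-Z1",
    shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
    generate_outputs=[build_cfg.DataflowOutputType.BITFILE],
    # reference input/output pair used to check each verification step
    verify_input_npy="input.npy",
    verify_expected_output_npy="expected_output.npy",
    verify_steps=[
        build_cfg.VerificationStepType.QONNX_TO_FINN_PYTHON,
        build_cfg.VerificationStepType.STREAMLINED_PYTHON,
        build_cfg.VerificationStepType.FOLDED_HLS_CPPSIM,
        build_cfg.VerificationStepType.STITCHED_IP_RTLSIM,
    ],
)
build.build_dataflow_cfg("lenet_quantized.onnx", cfg)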

shakeelakram00

Hi @fpjentzsch @rassB @ISPRrj @maltanar @auphelia ,
I have been closely following the ongoing discussion and encountered a similar accuracy discrepancy with my setup. Specifically, the accuracy of the Brevitas model is 86%, while the accelerator on the ZCU102 shows only 66%. After performing the initial tidy-up transformations below, the accuracy of the ONNX-converted model remains consistent with the Brevitas model. However, applying the remaining transformations and building the accelerator results in a drop to 66% accuracy on the board.

Initial tidy-up transformations:
bo.export_finn_onnx(brevitas_model, (1, 1, 14, 14), "export.onnx")
model = ModelWrapper("export.onnx")
model = model.transform(InferShapes())
model = model.transform(FoldConstants())
...
output_dict = oxe.execute_onnx(model_t, input_dict)

I am seeking assistance to identify and resolve this issue. Here are some details about my environment:

ZCU102: PYNQ Linux, based on Ubuntu 18.04 (GNU/Linux 4.19.0-xilinx-v2019.1 a)
FINN: v0.9
Xilinx tools: 2022.2
Ubuntu: 22.04.1 LTS

I have been working with the cnv_end2end_example and successfully modified it to build the accelerator for a different dataset. The Brevitas model was trained on a dataset with shape 1x1x14x14 and dtype torch.float32.

Following the cnv_end2end_example, the first layer does the quantization, and the ONNX conversion includes pre-processing (ToTensor(), i.e. division by 255 to normalize UINT8 inputs to FLOAT [0,1]) and post-processing (TopK=1). After create_dataflow_partition, the ONNX model has all blocks converted into HLS layers, except the initial Transpose.

Given that the first Transpose was not converted to an HLS layer, and the accelerator works with a dataset of shape 1x14x14x1 and dtype UINT8, I reshaped the dataset accordingly for inference on the ZCU102 (1x14x14x1 and dtype np.uint8, via (dataset*255).astype(np.uint8)).

Runtime-writable weights are enabled (set to 1) in the .json folding configuration for the MVAUs of the conv and linear layers, following the guidelines in 4_advanced_builder_settings and cnv-w1a1_folding_config.

I would appreciate any assistance in debugging this issue.

@fpjentzsch, you mentioned in your previous reply that reshaping alone might not be sufficient. Could you please provide further guidance, considering my specific setup, to achieve the desired accuracy on the accelerator?

Thank you in advance for your help.



shakeelakram00 commented Mar 13, 2024

Hi @fpjentzsch @rassB @ISPRrj @maltanar @auphelia @heborras @Tobi-Alonso @quetric @mmrahorovic @preusser,
I've been diligently verifying each stage of the FINN flow for the issue above, and I've run into a perplexing problem that I could use some guidance on.

Initially, during ONNX execution, I achieved an accuracy of 86% after applying the tidy-up and the pre- and post-processing transformations. However, after the streamlining transformations, the accuracy dropped significantly to 68%. This drop persisted when deploying the model onto the FPGA.

To give you a clearer picture, here are the streamline transformations I've implemented:
model = model.transform(MoveScalarLinearPastInvariants())
model = model.transform(Streamline())
model = model.transform(LowerConvsToMatMul())
model = model.transform(MakeMaxPoolNHWC())
model = model.transform(Streamline())
model = model.transform(absorb.AbsorbTransposeIntoMultiThreshold())
model = model.transform(ConvertBipolarMatMulToXnorPopcount())
model = model.transform(Streamline())
model = model.transform(absorb.AbsorbScalarMulAddIntoTopK())
model = model.transform(InferDataLayouts())
model = model.transform(RemoveUnusedTensors())

I also tried finn.builder.build_dataflow; it showed the same issue, i.e. as soon as the streamlining transformations are applied there is a drop in accuracy.

Only when I remove the "model = model.transform(LowerConvsToMatMul())" transformation do I get the same 86% accuracy. I know that to convert the model to HLS-compatible nodes we have to lower the convolutions to MatMuls, so this transformation is needed. The only other difference I see with and without this transformation is that the finn_datatype of MultiThreshold_1 and MultiThreshold_2 is Binary (with LowerConvsToMatMul, giving 68% accuracy) versus Bipolar (without LowerConvsToMatMul, giving 86% accuracy).

I'm at a loss as to why this transformation causes such a significant accuracy drop. Is it due to the even kernel size (6x6) I am using in QuantConv2d? Any insights or suggestions would be greatly appreciated.

Thank you for your time and assistance.

joannapng

I had the same issue and I think the reason is the bias in the convolution layers. During the export to QONNX format, the bias quant initializer was not exported, so the ExtractConvBias transformation that happens during the conversion from QONNX to FINN-ONNX failed to add an "Add" node in front of the "Conv" node. Check the logs to see if you encounter a "Could not extract bias from node" warning, and remove the biases from the network if so.
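If that warning appears, a minimal sketch of the corresponding change in the Brevitas model (mirroring the layer definitions from the first post; this is illustrative and the network would need retraining):

# disable biases in the quantized conv layers so the QONNX export does not
# need to extract a bias initializer
self.conv1 = qnn.QuantConv2d(3, 6, 5, bias=False, weight_bit_width=BIT_WIDTH)
self.conv2 = qnn.QuantConv2d(6, 16, 5, bias=False, weight_bit_width=BIT_WIDTH)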
