Possible ways to improve FPS #433

rpitonak · 2021-11-16T19:39:43Z


Hi,

We are trying to deploy a quantized CNN for classification using FINN and have probably reached the FPS limit we can achieve (we are targeting the Pynq-Z2 board), but we want to be sure we are not missing something. The bottleneck seems to be the first convolutional layer, where we already use maximal folding; synthesis showed that increasing the folding of the subsequent layers did not improve throughput.

When we tried AllocateResources from finn-experimental, it produced the following warning and set maximum folding on the initial layer and PE=1, SIMD=1 on the following layers, which seems to confirm our assumption that the bottleneck is the first layer.

/workspace/finn/src/finn/transformation/fpgadataflow/set_folding.py:195: UserWarning: Node ConvolutionInputGenerator_0 is bottleneck with 6556180 cycles, running second pass

Because we already use maximum folding, we would like to ask whether anything can be done to increase the parallelization of the initial layer, or more generally how to avoid bottlenecks when on-board resources are still available.

We have also tried InsertAndSetFIFODepths, which inserted StreamingFIFO nodes without changing the StreamingFCLayer FIFO depths (they stayed at 0) and again did not improve throughput.

We are processing 512x512x3 input, and the first two layers have the following parameters:

initial layer:
                kernel size: 5
                stride: 1
                input channels: 3
                output channels: 10
                padding: 2
                weight_bit: 8
                weight_quant: CommonIntWeightPerChannelQuant
                bias: false
                folding (PE, SIMD): (10,3)
next_layer:
                kernel size: 3
                stride: 1
                input channels: 10
                output channels: 6
                padding: 1
                weight_bit: 4
                weight_quant: CommonIntWeightPerChannelQuant
                bias: false
                folding (PE, SIMD): (6,10)

Thank you for any help,
Rado.


rbcarlos · 2021-11-17T11:13:29Z


Hi (Ahoj) Rado,

As you said, there is no point in parallelizing the remaining layers further, as the slowest layer determines the throughput of the whole network. Ideally, all layers should take the same number of cycles; otherwise you are not using the resources effectively.
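To make the "slowest layer wins" point concrete, here is a rough sketch for the two layers from the question. The cycle formula (output pixels x (k*k*ch_in / SIMD) x (ch_out / PE)) is an assumption modelled on the usual matrix-vector estimate, and the names `pipeline_stats`, `conv0` and `conv1` are made up for illustration; real numbers should come from FINN's own estimates or synthesis reports.

```python
def pipeline_stats(cycles_per_layer):
    """Return (bottleneck name, its cycles, busy fraction of every layer)."""
    bottleneck = max(cycles_per_layer, key=cycles_per_layer.get)
    worst = cycles_per_layer[bottleneck]
    # A layer that needs fewer cycles than the bottleneck idles the rest of the time.
    busy = {name: c / worst for name, c in cycles_per_layer.items()}
    return bottleneck, worst, busy


# ASSUMPTION: cycles = out_pixels * (k*k*ch_in / SIMD) * (ch_out / PE)
out_pixels = 512 * 512
cycles = {
    "conv0": out_pixels * (5 * 5 * 3 // 3) * (10 // 10),   # PE=10, SIMD=3
    "conv1": out_pixels * (3 * 3 * 10 // 10) * (6 // 6),   # PE=6,  SIMD=10
}

bottleneck, worst, busy = pipeline_stats(cycles)
print(bottleneck, worst, busy)
```

Under these assumptions the second layer is busy only about a third of the time, so it could be folded more heavily (fewer PEs/SIMD lanes) without lowering the frame rate.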

However, you could look into pruning the conv layers. This would reduce the resources used and therefore allow you to increase SIMD or PE further. You would incur some accuracy loss, but if throughput is your main concern, this option might be worth exploring as well.

Note that pruning is currently not part of Brevitas or FINN.
However, some ground has already been covered, so some resources are available.

1 reply

rbcarlos Nov 17, 2021

For the first point, we wrote a script that runs in a loop and keeps increasing SIMD or PE for the current bottleneck layer. It does not only increase them by powers of 2, but in the smallest possible increments. When doing so, you have to be aware of all the constraints, because the possible choices for the folding factors are also determined by other nodes in the graph.
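A minimal sketch of what such a loop could look like. This is not the actual script, and it is not FINN's SetFolding either. It assumes a simplified model: PE must divide the output channels, SIMD must divide the input channels, and cycles follow the matrix-vector style estimate. Resource limits and constraints coming from neighbouring nodes are ignored, and all names (`est_cycles`, `next_step`, `balance`, the layer dicts) are hypothetical.

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def est_cycles(layer, pe, simd):
    # ASSUMPTION: out_pixels * (k*k*ch_in / SIMD) * (ch_out / PE)
    return layer["out_pixels"] * (layer["k"] ** 2 * layer["ch_in"] // simd) * (layer["ch_out"] // pe)


def next_step(layer, pe, simd):
    """Smallest legal increase of PE or SIMD, or None if nothing is left."""
    bigger_pe = [d for d in divisors(layer["ch_out"]) if d > pe]
    bigger_simd = [d for d in divisors(layer["ch_in"]) if d > simd]
    options = []
    if bigger_pe:
        options.append((bigger_pe[0], simd))
    if bigger_simd:
        options.append((pe, bigger_simd[0]))
    if not options:
        return None
    # Prefer the option that adds the least parallel hardware (smallest PE*SIMD).
    return min(options, key=lambda ps: ps[0] * ps[1])


def balance(layers, target_cycles):
    """Unfold the current bottleneck step by step until the target is met or it is maxed out."""
    folding = {name: (1, 1) for name in layers}
    while True:
        cyc = {n: est_cycles(layers[n], *folding[n]) for n in layers}
        name = max(cyc, key=cyc.get)
        if cyc[name] <= target_cycles:
            return folding, cyc
        step = next_step(layers[name], *folding[name])
        if step is None:
            return folding, cyc  # bottleneck cannot be parallelized further
        folding[name] = step


layers = {
    "conv0": {"k": 5, "ch_in": 3, "ch_out": 10, "out_pixels": 512 * 512},
    "conv1": {"k": 3, "ch_in": 10, "ch_out": 6, "out_pixels": 512 * 512},
}
folding, cyc = balance(layers, target_cycles=512 * 512)
print(folding, cyc)
```

With the two layers from the question and an unreachable target of 512x512 cycles, the loop stops once the first layer is at (PE, SIMD) = (10, 3) and leaves the second layer below its maximum, because unfolding it further would not change the bottleneck.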

fpjentzsch · 2021-11-17T13:55:34Z

Collaborator

Hi,

Indeed, you have exhausted the (currently) available parallelism via PE & SIMD. There are two bottlenecks you would need to overcome for the input layer:

First bottleneck:
The current sliding window generator (SWG) (aka "ConvolutionInputGenerator") implementation (https://github.com/Xilinx/finn-hlslib/blob/master/slidingwindow.h#L172) outputs each window element (5x5 in your case) in a separate clock cycle and not in parallel. This limits you to roughly 512x512x5x5 = 6553600 cycles, which is in line with the 6556180 cycles you are seeing. Theoretically, this IP core could be modified to run in 512x512x1 ~ 262k cycles, but of course the following layers would also have to be parallelized accordingly.

For 1D convolutions, we already have a SWG variant that works in parallel (https://github.com/Xilinx/finn-hlslib/blob/master/slidingwindow.h#L1663). I'm currently working on this problem for the general 2D case, but can't give you a timeline.

Second bottleneck:
Even with such a parallel-output SWG, we are still limited to 1 input pixel per cycle. Support for parallelism across this dimension is still very experimental and not usable at this point.
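As a back-of-the-envelope check of the two limits above, the cycle counts can be turned into frame rates. The 100 MHz clock is purely an assumption for illustration (the thread does not state the clock), and `frames_per_second` is a made-up helper name.

```python
def frames_per_second(cycles_per_frame, clock_hz):
    """Steady-state frame rate of a pipeline limited by one stage."""
    return clock_hz / cycles_per_frame


H = W = 512          # input resolution from the question
K = 5                # kernel size of the first layer
CLOCK_HZ = 100e6     # ASSUMPTION: 100 MHz fabric clock

serial_window = H * W * K * K   # one window element per cycle (current SWG)
parallel_window = H * W         # whole window per cycle, still 1 pixel per cycle
reported = 6556180              # cycle count from the warning in the question

for label, cycles in [
    ("reported", reported),
    ("serial window estimate", serial_window),
    ("parallel window estimate", parallel_window),
]:
    print(f"{label}: {cycles} cycles, {frames_per_second(cycles, CLOCK_HZ):.2f} FPS")
```

At the assumed clock this gives roughly 15 FPS for the current SWG and roughly 381 FPS for a parallel-output SWG; going beyond that would require processing more than one input pixel per cycle.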


rpitonak · 2021-11-18T07:41:36Z

Author

Thank you both for your replies @rbcarlos @fpjentzsch!

