Possible ways to improve FPS #433
-
Hi, we are trying to deploy a quantized CNN for classification using FINN and have probably reached the limit of the FPS we can achieve (we are targeting the Pynq-Z2 board), but we want to be sure that we are not missing something. The bottleneck seems to be the first convolutional layer, where we have used maximal folding; synthesis showed that increasing the folding of the following layers did not improve throughput. When we tried to use …
Because we already used maximum folding, we would like to ask whether something can be done to increase the parallelization of the initial layer, or how to avoid bottlenecks when there are still on-board resources available. We have also tried to use … We are processing a 512x512x3 input, and the initial layer has the following parameters: …
Thank you for any help,
Replies: 3 comments 1 reply
-
Hi (Ahoj) Rado,

As you said, there is no point in parallelizing the rest of the layers, as the slowest layer determines the throughput of the whole network. Ideally, all layers should run in the same number of cycles; otherwise you are not utilizing the resources effectively.

However, you could look into pruning the conv layers. This would decrease the resources used and therefore allow you to increase SIMD or PE further. You would incur some accuracy loss, but if throughput is your main concern, it might be worth exploring this option as well. Note that pruning is currently not part of Brevitas, or FINN for that matter.
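To illustrate the idea, here is a minimal NumPy sketch of structured (channel) pruning: rank a conv layer's output channels by L1 norm and keep the strongest half. All shapes and the 50% ratio are made-up placeholders, and this is not a FINN or Brevitas API — just the general technique:

```python
import numpy as np

# Hypothetical conv weight tensor: (out_channels, in_channels, kH, kW).
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 3, 5, 5))

# Score each output channel by the L1 norm of its filter.
scores = np.abs(w).sum(axis=(1, 2, 3))

# Keep the 8 strongest channels (sorted back into their original order).
keep = np.sort(np.argsort(scores)[8:])
w_pruned = w[keep]

print(w_pruned.shape)  # (8, 3, 5, 5)
```

With half the channels, the layer needs roughly half the compute, which frees resources that could go toward larger PE/SIMD values; in practice you would fine-tune the network afterwards to recover accuracy.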
-
Hi,
indeed, you exhausted the (currently) available parallelism via PE & SIMD. There are 2 bottlenecks you would need to overcome for the input layer:

First bottleneck:
The current sliding window generator (SWG) (aka "ConvolutionInputGenerator") implementation (https://github.com/Xilinx/finn-hlslib/blob/master/slidingwindow.h#L172) outputs each window element (5x5 in your case) in a separate clock cycle, not in parallel. This limits you to the 512x512x5x5 ~ 6556180 cycles you are seeing. Theoretically, this IP core could be modified to run in 512x512x1 ~ 262k cycles, but of course the following layers would also have to be parallelized accordingly. For 1D convolutions, we already have an SWG variant that works in parallel (https://github.com/Xilinx/finn-hlslib/blob/master/slidingwindow.h#L1663). I'm currently working on this problem for the general 2D case, but can't give you a timeline.

Second bottleneck: …
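For context, the arithmetic behind those cycle counts can be sketched as follows. The 100 MHz clock is an assumption for illustration only, and the "parallel" figure assumes a hypothetical 2D SWG emitting one full window per cycle, which is exactly the variant described above as not yet available:

```python
# SWG cycle counts for a 512x512 input with a 5x5 kernel.
ifm_dim, k = 512, 5
clk_hz = 100_000_000  # assumed 100 MHz clock, for illustration

serial_cycles = ifm_dim * ifm_dim * k * k  # one window element per cycle
parallel_cycles = ifm_dim * ifm_dim        # one full window per cycle (hypothetical 2D SWG)

print(serial_cycles)    # 6553600 -- close to the ~6556180 cycles observed
print(parallel_cycles)  # 262144, i.e. ~262k
print(round(clk_hz / serial_cycles, 1))    # 15.3 FPS at the assumed clock
print(round(clk_hz / parallel_cycles, 1))  # 381.5 FPS, a 25x speedup
```

The small gap between the theoretical 6553600 cycles and the observed ~6556180 would come from pipeline and control overhead.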
-
Thank you both for your replies @rbcarlos @fpjentzsch!