Unrolled CNN implementation #600
Conversation
Generally this is looking good to me, but there are pytests, including on convolutions, that failed.

The qkeras pytest failure has been occurring recently, so it's not related to this pull request. I am curious, though, about the conv1d and sepconv2d failures.

I addressed some issues, but the conv1d test still fails. I need to investigate more.

Conv1D failed because the model used is rather big, so the generated code ends up being huge, taking a lot of time to compile, and this causes the test to time out. I've replaced it with a smaller model from the example-models repo.
Description
This is the refined version of a Conv1D/2D implementation that unrolls the input feature matrix of the im2col algorithm for the `io_parallel` implementation. The general idea is to generate code for the im2col transformation with exact instructions for each layer, instead of synthesizing a generic C++ function, because the HLS compiler has issues with the latter. With this implementation, I was able to synthesize layers with <= 4096 elements (the usual partitioning limit). The old implementations had trouble with far smaller layers.

Based on the unrolled im2col step, the implementation further uses an adapted matrix-vector multiplication for the `Resource` or `Latency` strategy. Note that using the overall `Latency` strategy won't work, as that will pipeline the entire design and cause all the loops to be unrolled, which breaks the synthesis. Therefore, using the `Latency` strategy for the model will issue a warning and switch to the `Resource` strategy (aka "dataflow"). Individual layers may still use the `Latency` strategy.

A new tuning knob, `ParallelizationFactor`, is introduced to be combined with the `ReuseFactor` to control the amount of parallelism. It controls the number of output pixels processed in parallel and defaults to 1, implying no parallelization. Valid values are divisors of `out_height * out_width`, though hls4ml will warn if an incorrect `ParallelizationFactor` is used.

One feature of this implementation that wasn't part of the original implementation from last year is the predictable II. In general, for the `Resource` strategy, `II = (ReuseFactor + C) * out_height * out_width / ParallelizationFactor + 1`, where `C` is ~4. For the `Latency` strategy, `C` is 1-2. The +1 is for the function call itself.

This only touches the base Conv1D/2D layers; SeparableConv1D/2D will come in a later PR. PointwiseConv1D/2D needs investigation into whether it should be a special case at all with this implementation.
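As a quick sanity check on the II formula, here is a small Python sketch that evaluates it for a made-up layer (the output dimensions and knob values below are illustrative, not from the PR):

```python
def estimate_ii(reuse_factor, out_height, out_width, parallelization_factor, c=4):
    """II = (ReuseFactor + C) * out_height * out_width / ParallelizationFactor + 1.

    C is ~4 for the Resource strategy and 1-2 for Latency; the +1 accounts
    for the function call itself.
    """
    # ParallelizationFactor must be a divisor of out_height * out_width
    assert (out_height * out_width) % parallelization_factor == 0, \
        "ParallelizationFactor must divide out_height * out_width"
    return (reuse_factor + c) * out_height * out_width // parallelization_factor + 1

# A made-up 8x8 output feature map with ReuseFactor = 4, Resource strategy:
print(estimate_ii(4, 8, 8, 1))  # -> 513 (no parallelization)
print(estimate_ii(4, 8, 8, 8))  # -> 65  (8 output pixels in parallel)
```

Increasing `ParallelizationFactor` divides the pixel loop across parallel compute, which is why it enters the formula as a divisor.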
Limitations:
In order to wire all this up, the core of the layers had to be extended. A new type of attribute is introduced, `Source`, representing generated source code. Layers can have any number of generated sources, and the writer can pick up this information.

Type of change
Breaking, in the sense that it replaces the previous implementations and slightly changes the mechanics of how strategy is handled.
Tests
The existing tests confirm the accuracy of the implementation.
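For reference, a hypothetical hls4ml configuration exercising the new knobs might look like the sketch below. The layer name `conv2d_1` and all values are made up, and the exact config layout may differ between hls4ml versions:

```python
# Hypothetical per-layer hls4ml config; 'conv2d_1' is an illustrative layer name.
config = {
    'Model': {
        'Strategy': 'Resource',  # a model-level Latency strategy would be
                                 # switched to Resource with a warning anyway
        'ReuseFactor': 4,
    },
    'LayerName': {
        'conv2d_1': {
            'Strategy': 'Latency',       # individual layers may still use Latency
            'ParallelizationFactor': 8,  # must divide out_height * out_width
        },
    },
}

# io_parallel is selected at conversion time, e.g. (not executed here):
# hls_model = hls4ml.converters.convert_from_keras_model(
#     keras_model, hls_config=config, io_type='io_parallel')
```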
Test Configuration:
Run any Conv1D/2D tests, just ensure `io_parallel` is used. Play with `ParallelizationFactor` and `ReuseFactor` as desired. Don't forget the limitations above!

Checklist