
Unrolled CNN implementation #600

Merged: 16 commits into fastmachinelearning:main on Oct 4, 2022
Conversation

@vloncar (Contributor) commented on Jul 13, 2022

Description

This is the refined version of a Conv1D/2D implementation that unrolls the input feature matrix of the im2col algorithm for the io_parallel implementation. The general idea is to generate code for the im2col transformation with exact instructions for each layer, instead of synthesizing a generic C++ function, because the HLS compiler has issues with the generic version. With this implementation, I was able to synthesize layers with <= 4096 elements (the usual partitioning limit). The old implementations had trouble with far smaller layers.
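To make the idea concrete, here is a minimal Python sketch of the unrolling for a toy 1D layer. This is purely illustrative and not the actual hls4ml code generator; the function and variable names are hypothetical.

```python
# Illustrative only: emit one explicit C++-style assignment per element of the
# im2col buffer for a fixed layer shape, instead of a generic loop nest.
def generate_unrolled_im2col_1d(in_width, n_chan, kernel_size):
    lines = []
    idx = 0
    for out_x in range(in_width - kernel_size + 1):  # one block per output pixel
        for k in range(kernel_size):
            for c in range(n_chan):
                in_idx = (out_x + k) * n_chan + c
                lines.append(f"buffer[{idx}] = data[{in_idx}];")
                idx += 1
    return "\n".join(lines)

print(generate_unrolled_im2col_1d(in_width=4, n_chan=1, kernel_size=2))
```

Because every index is a compile-time constant, the HLS compiler sees straight-line copies rather than a generic loop nest it struggles to partition.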

Based on the unrolled im2col step, the implementation further uses an adapted matrix-vector multiplication for the Resource or Latency strategy. Note that using the Latency strategy for the overall model won't work, as that will pipeline the entire design, causing all loops to be unrolled, which breaks the synthesis. Therefore, using the Latency strategy at the model level will issue a warning and switch to the Resource strategy (aka "dataflow"). Individual layers may still use the Latency strategy.
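As a sketch of how the strategies are selected in practice, using the standard hls4ml configuration dictionary (the layer name 'conv2d_1' and the toy model are placeholders):

```python
import hls4ml
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.models import Sequential

# Toy model: one 3x3 conv on an 8x8x1 input -> 6x6x4 output
model = Sequential([Conv2D(4, (3, 3), input_shape=(8, 8, 1), name='conv2d_1')])
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

config['Model']['Strategy'] = 'Resource'                 # model-level "dataflow"
config['LayerName']['conv2d_1']['Strategy'] = 'Latency'  # per-layer override
```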

A new tuning knob, ParallelizationFactor, is introduced to be combined with the ReuseFactor to control the amount of parallelism. It controls the number of output pixels processed in parallel. It defaults to 1, implying no parallelization. Valid values are divisors of out_height * out_width; hls4ml will warn if an invalid ParallelizationFactor is used.
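Continuing the configuration sketch above, setting the new knob might look like this (for the toy layer, out_height * out_width = 6 * 6 = 36, so 4 is a valid divisor):

```python
config['LayerName']['conv2d_1']['ParallelizationFactor'] = 4  # must divide 36
config['LayerName']['conv2d_1']['ReuseFactor'] = 2
```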

One feature of this implementation that wasn't part of the original implementation from last year is a predictable II. In general, for the Resource strategy, II = (ReuseFactor + C) * out_height * out_width / ParallelizationFactor + 1, where C is roughly 4; for the Latency strategy, C is 1-2. The +1 accounts for the function call itself.
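For example, plugging the toy layer above into the Resource-strategy formula:

```python
reuse_factor = 2
C = 4                           # empirical constant for the Resource strategy
out_height, out_width = 6, 6
parallelization_factor = 4

ii = (reuse_factor + C) * out_height * out_width // parallelization_factor + 1
print(ii)  # (2 + 4) * 36 / 4 + 1 = 55 cycles
```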

This only touches the base Conv1D/2D layers; SeparableConv1D/2D will come in a later PR. Whether PointwiseConv1D/2D should remain a special case at all with this implementation needs investigation.

Limitations:

in_height  * in_width  * n_chan <= 4096
out_height * out_width * n_filt <= 4096
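
A quick way to check a layer against these limits (a hypothetical helper, shown only for illustration):

```python
def fits_unrolled_conv(in_height, in_width, n_chan,
                       out_height, out_width, n_filt, limit=4096):
    return (in_height * in_width * n_chan <= limit
            and out_height * out_width * n_filt <= limit)

print(fits_unrolled_conv(8, 8, 1, 6, 6, 4))  # True: 64 and 144 are both <= 4096
```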

To wire all this up, the core of the layers had to be extended. A new attribute type, Source, is introduced, representing generated source code. Layers can have any number of generated sources, and the writer can pick up this information.
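Conceptually, the new attribute works roughly like this (a simplified sketch; the real hls4ml classes and method names differ):

```python
class Source:
    """Attribute type holding one piece of generated source code."""
    def __init__(self, code: str):
        self.code = code

class Layer:
    def __init__(self):
        self.sources = []          # a layer may carry any number of sources

    def add_source(self, code: str):
        self.sources.append(Source(code))

# The writer iterates over layer.sources and emits each code fragment
# into the generated project.
```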

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Breaking in the sense that it replaces the previous implementations and slightly changes the mechanics of how the strategy is handled.

Tests

The existing tests confirm the accuracy of the implementation.

Test Configuration:

Run any Conv1D/2D tests, just ensure io_parallel is used. Play with ParallelizationFactor and ReuseFactor as desired. Don't forget the limitations above!
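A minimal end-to-end sketch of such a test setup (the layer name, sizes, and output directory are placeholders):

```python
import hls4ml
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.models import Sequential

model = Sequential([Conv1D(4, 3, input_shape=(16, 1), name='conv1d_1')])
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['LayerName']['conv1d_1']['ParallelizationFactor'] = 2  # out_width = 14

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, io_type='io_parallel', output_dir='test_prj')
hls_model.compile()
```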

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

@thesps mentioned this pull request on Jul 18, 2022
@vloncar mentioned this pull request on Aug 4, 2022
@jmitrevs (Contributor)

Generally this is looking good to me, but some pytests failed, including ones on convolutions.

@jmitrevs (Contributor)

The qkeras pytest failure has been occurring recently, so it's not related to this pull request. I am curious, though, about the conv1d and sepconv2d failures.

@vloncar (Contributor, Author) commented on Oct 4, 2022

I addressed some issues, but the conv1d test still fails. I need to investigate more.

@vloncar (Contributor, Author) commented on Oct 4, 2022

Conv1D failed because the model used was rather big, so the generated code ended up being huge and took a long time to compile, causing the test to time out. I've replaced it with a smaller model from the example-models repo.

@vloncar vloncar requested a review from jmitrevs October 4, 2022 14:59
@jmitrevs jmitrevs merged commit 90d760a into fastmachinelearning:main Oct 4, 2022
@vloncar vloncar deleted the instruct_cnn branch March 5, 2023 17:42
calad0i pushed a commit to calad0i/hls4ml that referenced this pull request Jul 1, 2023