Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fpga, prim] Add support for Ultrascale and CW340 + CW341 #19295

Merged
merged 6 commits into from
Aug 10, 2023

Conversation

a-will
Copy link
Contributor

@a-will a-will commented Jul 26, 2023

Add primitives for a new Ultrascale library. The clocking architecture and primitive set is new vs. 7 series.

In the interest of moving more of the clocking out of LUTs / keeping it on dedicated clock channels, make prim_clock_div abstract and add Ultrascale-specific clock_div and clock_inv implementations.

Add hardware support for the CW340 + CW341 boards.

Note: The build times are going to be higher for this board than need to be. We might want to seriously consider removing generic mode (#15452), as the current spi_device structure makes it difficult to close timing and poses various CDC challenges. The timing for generic mode is wildly different from flash and TPM modes, and they mix poorly together.

On my machine, the hour-plus build goes down to 40-45 mins if generic_mode is removed. CI will be much slower that that, though ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to have these empty waivers? Should we just delete them? (carryover from prim_xilinx)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way on that. Fine with removing them (just have to update / comment the associated lines in the corefiles).

@a-will a-will force-pushed the cw340 branch 2 times, most recently from 3a7f4c5 to 4fe8a14 Compare July 27, 2023 02:48
@a-will
Copy link
Contributor Author

a-will commented Jul 27, 2023

Just noting: Build time for CW340 in CI was 1h46m with spi_device still containing generic mode. But that seems fast, since CW310 took only 1h01m.

@a-will a-will force-pushed the cw340 branch 2 times, most recently from b15ace9 to eca5e1a Compare July 28, 2023 17:29
@a-will a-will self-assigned this Jul 28, 2023
@a-will a-will marked this pull request as ready for review July 28, 2023 20:31
@a-will
Copy link
Contributor Author

a-will commented Jul 28, 2023

CC @engdoreis

hw/bitstream/vivado/BUILD Outdated Show resolved Hide resolved
@engdoreis
Copy link
Contributor

CC @GregAC @johngt

Copy link
Contributor

@engdoreis engdoreis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a-will, This PR seems sensible to me, but I don't know enough to approved it.

hw/bitstream/vivado/BUILD Outdated Show resolved Hide resolved
hw/bitstream/vivado/BUILD Outdated Show resolved Hide resolved
hw/top_earlgrey/data/pins_cw341.xdc Show resolved Hide resolved
@@ -509,6 +509,50 @@ jobs:
displayName: Upload artifacts for CW310
condition: failed()

- job: chip_earlgrey_cw340
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a template in the integrated repo, and it would be good to bring it over, tweak it, and apply it to all bitstream jobs. However, this PR is rather large already, hehe.

I plan to do that separately.

Copy link
Contributor

@msfschaffner msfschaffner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work here - this LGTM! Just a few questions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way on that. Fine with removing them (just have to update / comment the associated lines in the corefiles).

if (HasScanMode) begin : gen_scan
BUFGCTRL #(
.IS_I0_INVERTED(1'b1),
.IS_S0_INVERTED(1'b1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this takes care of selection inversion below (just because the same clock / selects are assigned to both channels)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! I0 having inverted sense means that when the I0 pin is selected, O will be inverted. Likewise, S0 is set to have inverted sense, so the same select signal can be used for both S0 and S1.

@@ -4,7 +4,7 @@

`include "prim_assert.sv"

module prim_clock_div #(
module prim_generic_clock_div #(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool idea! I bet this results in better timing since the tool can use proper clocking resources now...


`include "prim_assert.sv"

module prim_xilinx_ultrascale_clock_div #(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this also available for the 7 series? should we add that as well?

Copy link
Contributor Author

@a-will a-will Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but at least for Earl Grey, we wouldn't be able to make use of it, I think. Unlike Ultrascale, 7 series only has 32 BUFGs total, to use across muxes, dividers, general global buffers, etc. And in CW310, we're using almost all of them already.

Edit: Actually, I forgot. The dividers are only on regional buffers (BUFR) in 7series. They're more complicated to use, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use a BUFH e.g. for AES to enforce a specific AES placement on CW310. The issue is that if you don't enforce the placement when using BUFH, Vivado (or at least the version we are currently using) tends to do irrational things, see #8138 (comment). I expect the same will happen with BUFH. Maybe it would be worth to at one point check if this still happens with newer Vivado releases and if not, upgrade. Of course not as part of this PR.

@a-will
Copy link
Contributor Author

a-will commented Aug 10, 2023

For the curious about achievable performance, a quick build for CW340 closed timing for 30 MHz CLK_SYS and 33.333 MHz CLK_IO with a lot of slack still on CLK_IO (enough that the build could hit 50 MHz for CLK_IO). The build took no more time than the 10 MHz variant.

CLK_SYS had less headroom; some OTBN paths with 55 layers of logic left it with just over 4 ns of slack. Though, I should say that 4 ns is still large enough that it might not reflect the end of what's easy.

Meanwhile, on that build, SPI passthrough passed timing for better than 24 MHz (full cycle sampling).

Copy link
Contributor

@vogelpi vogelpi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a-will , LGTM!

@@ -1069,7 +1100,7 @@ module chip_${top["name"]}_${target["name"]} #(
// Also need to add AST simulation and FPGA emulation models for things like entropy source -
// otherwise Verilator / FPGA will hang.
top_${top["name"]} #(
% if target["name"] == "cw310":
% if target["name"] in ["cw310", "cw340"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, we will want to add a separate branch for the CW340/CW341 here to enable also KMAC masking on FPGA. For this, both KeymgrKmacEnMasking and KmacEnMasking need to be set to 1. But I recommend doing this as a follow-up. It will have a big impact on utilization, timing and maybe also build time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With those enabled, it consumed another 3% of the available logic, and timing was not materially affected. The build time did go up another 15-20% locally (about 6-8 minutes) for the 30 MHz SYS / 33.333 MHz IO build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow! The FPGA must be really big then :-) Thanks for giving it a try and reporting back @a-will !

@@ -0,0 +1,19 @@
// Copyright lowRISC contributors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have been nice if we found a way to share those primitives between 7-series and ultrascale that can be shared - probably everything except for the clock/reset stuff. This would save quite some code duplication. But I understand this would require additional tooling work and it isn't high prio IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could probably lump that in with a wider primgen overhaul. We kind of need to either fix fusesoc to handle dependency injection after running generators or to move primgen activity outside fusesoc (i.e. something that runs prior to fusesoc).


`include "prim_assert.sv"

module prim_xilinx_ultrascale_clock_div #(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use a BUFH e.g. for AES to enforce a specific AES placement on CW310. The issue is that if you don't enforce the placement when using BUFH, Vivado (or at least the version we are currently using) tends to do irrational things, see #8138 (comment). I expect the same will happen with BUFH. Maybe it would be worth to at one point check if this still happens with newer Vivado releases and if not, upgrade. Of course not as part of this PR.

Copy link
Contributor

@jwnrt jwnrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We went through this in the software team and didn't have any concerns, just one nitpick.

Thanks @a-will.

hw/bitstream/vivado/BUILD Outdated Show resolved Hide resolved
@a-will a-will force-pushed the cw340 branch 2 times, most recently from d914ed4 to 8ec2dbf Compare August 10, 2023 18:00
Add a library specific to the Xilinx / AMD Ultrascale FPGAs. The
clocking architecture is quite different for Ultrascale vs. 7 series,
and the available primitives are not the same. Thus, it needed its own
set of primitives.

Signed-off-by: Alexander Williams <[email protected]>
Create a primitive for clock inverters on Ultrascale devices that use
clock buffers.

Use the clock buffers for clock inverters where possible, so clocks may
continue to be routed on dedicated clock routing tracks. Unlike the case
with 7-series, there are many clock buffers available in Ultrascale
design.

Signed-off-by: Alexander Williams <[email protected]>
Make prim_clock_div an abstract core, allowing the use of
technology-specific clock divider primitives that meet the requirements.

Add prim_clock_inv to prim:all. fusesoc appears to not gather all
dependencies when generated cores depend on abstract cores.
prim_generic_clock_div depends on prim_clock_inv, but the core was left
behind when building clkmgr sims.

Signed-off-by: Alexander Williams <[email protected]>
Use the clock dividers in Ultrascale for small divisors, so most
clocking can remain on the dedicated routing channels.

Signed-off-by: Alexander Williams <[email protected]>
Add FPGA build support for the CW340 + CW341 with Earl Grey. Port the
SPI constraints from the ASIC to the CW340, and move the USB timing
constraints out of the physical constraints file.

Adjust build scripts for the CW340's primitives for tasks like MMI
generation.

Use CW310 software for now. The device-side software should be mostly
compatible, but this could change in the future.

Signed-off-by: Alexander Williams <[email protected]>
This will add CW340 builds to the bitstream caches, so tooling and
software development may proceed without each developer needing to build
the design.

Signed-off-by: Alexander Williams <[email protected]>
@a-will
Copy link
Contributor Author

a-will commented Aug 10, 2023

Since CW310 is now at 24 MHz, I've increased the frequency here to match.

@a-will a-will merged commit 0eb1a55 into lowRISC:master Aug 10, 2023
24 checks passed
@a-will a-will deleted the cw340 branch August 10, 2023 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants