-
Notifications
You must be signed in to change notification settings - Fork 781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fpga, prim] Add support for Ultrascale and CW340 + CW341 #19295
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to have these empty waivers? Should we just delete them? (carryover from prim_xilinx)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way on that. Fine with removing them (just have to update / comment the associated lines in the corefiles).
3a7f4c5
to
4fe8a14
Compare
Just noting: Build time for CW340 in CI was 1h46m with spi_device still containing generic mode. But that seems fast, since CW310 took only 1h01m. |
b15ace9
to
eca5e1a
Compare
CC @engdoreis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @a-will, This PR seems sensible to me, but I don't know enough to approved it.
553b3a4
to
d2a1549
Compare
@@ -509,6 +509,50 @@ jobs: | |||
displayName: Upload artifacts for CW310 | |||
condition: failed() | |||
|
|||
- job: chip_earlgrey_cw340 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be achieved using matrix? https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/jobs-job-strategy?view=azure-pipelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually a template in the integrated repo, and it would be good to bring it over, tweak it, and apply it to all bitstream jobs. However, this PR is rather large already, hehe.
I plan to do that separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work here - this LGTM! Just a few questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way on that. Fine with removing them (just have to update / comment the associated lines in the corefiles).
if (HasScanMode) begin : gen_scan | ||
BUFGCTRL #( | ||
.IS_I0_INVERTED(1'b1), | ||
.IS_S0_INVERTED(1'b1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly that this takes care of selection inversion below (just because the same clock / selects are assigned to both channels)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right! I0
having inverted sense means that when the I0
pin is selected, O
will be inverted. Likewise, S0
is set to have inverted sense, so the same select signal can be used for both S0
and S1
.
@@ -4,7 +4,7 @@ | |||
|
|||
`include "prim_assert.sv" | |||
|
|||
module prim_clock_div #( | |||
module prim_generic_clock_div #( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool idea! I bet this results in better timing since the tool can use proper clocking resources now...
|
||
`include "prim_assert.sv" | ||
|
||
module prim_xilinx_ultrascale_clock_div #( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this also available for the 7 series? should we add that as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but at least for Earl Grey, we wouldn't be able to make use of it, I think. Unlike Ultrascale, 7 series only has 32 BUFGs total, to use across muxes, dividers, general global buffers, etc. And in CW310, we're using almost all of them already.
Edit: Actually, I forgot. The dividers are only on regional buffers (BUFR) in 7series. They're more complicated to use, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we use a BUFH e.g. for AES to enforce a specific AES placement on CW310. The issue is that if you don't enforce the placement when using BUFH, Vivado (or at least the version we are currently using) tends to do irrational things, see #8138 (comment). I expect the same will happen with BUFH. Maybe it would be worth to at one point check if this still happens with newer Vivado releases and if not, upgrade. Of course not as part of this PR.
For the curious about achievable performance, a quick build for CW340 closed timing for 30 MHz CLK_SYS and 33.333 MHz CLK_IO with a lot of slack still on CLK_IO (enough that the build could hit 50 MHz for CLK_IO). The build took no more time than the 10 MHz variant. CLK_SYS had less headroom; some OTBN paths with 55 layers of logic left it with just over 4 ns of slack. Though, I should say that 4 ns is still large enough that it might not reflect the end of what's easy. Meanwhile, on that build, SPI passthrough passed timing for better than 24 MHz (full cycle sampling). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @a-will , LGTM!
@@ -1069,7 +1100,7 @@ module chip_${top["name"]}_${target["name"]} #( | |||
// Also need to add AST simulation and FPGA emulation models for things like entropy source - | |||
// otherwise Verilator / FPGA will hang. | |||
top_${top["name"]} #( | |||
% if target["name"] == "cw310": | |||
% if target["name"] in ["cw310", "cw340"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, we will want to add a separate branch for the CW340/CW341
here to enable also KMAC masking on FPGA. For this, both KeymgrKmacEnMasking
and KmacEnMasking
need to be set to 1. But I recommend doing this as a follow-up. It will have a big impact on utilization, timing and maybe also build time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With those enabled, it consumed another 3% of the available logic, and timing was not materially affected. The build time did go up another 15-20% locally (about 6-8 minutes) for the 30 MHz SYS / 33.333 MHz IO build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow! The FPGA must be really big then :-) Thanks for giving it a try and reporting back @a-will !
@@ -0,0 +1,19 @@ | |||
// Copyright lowRISC contributors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have been nice if we found a way to share those primitives between 7-series and ultrascale that can be shared - probably everything except for the clock/reset stuff. This would save quite some code duplication. But I understand this would require additional tooling work and it isn't high prio IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could probably lump that in with a wider primgen overhaul. We kind of need to either fix fusesoc to handle dependency injection after running generators or to move primgen activity outside fusesoc (i.e. something that runs prior to fusesoc).
|
||
`include "prim_assert.sv" | ||
|
||
module prim_xilinx_ultrascale_clock_div #( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we use a BUFH e.g. for AES to enforce a specific AES placement on CW310. The issue is that if you don't enforce the placement when using BUFH, Vivado (or at least the version we are currently using) tends to do irrational things, see #8138 (comment). I expect the same will happen with BUFH. Maybe it would be worth to at one point check if this still happens with newer Vivado releases and if not, upgrade. Of course not as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went through this in the software team and didn't have any concerns, just one nitpick.
Thanks @a-will.
d914ed4
to
8ec2dbf
Compare
Add a library specific to the Xilinx / AMD Ultrascale FPGAs. The clocking architecture is quite different for Ultrascale vs. 7 series, and the available primitives are not the same. Thus, it needed its own set of primitives. Signed-off-by: Alexander Williams <[email protected]>
Create a primitive for clock inverters on Ultrascale devices that use clock buffers. Use the clock buffers for clock inverters where possible, so clocks may continue to be routed on dedicated clock routing tracks. Unlike the case with 7-series, there are many clock buffers available in Ultrascale design. Signed-off-by: Alexander Williams <[email protected]>
Make prim_clock_div an abstract core, allowing the use of technology-specific clock divider primitives that meet the requirements. Add prim_clock_inv to prim:all. fusesoc appears to not gather all dependencies when generated cores depend on abstract cores. prim_generic_clock_div depends on prim_clock_inv, but the core was left behind when building clkmgr sims. Signed-off-by: Alexander Williams <[email protected]>
Use the clock dividers in Ultrascale for small divisors, so most clocking can remain on the dedicated routing channels. Signed-off-by: Alexander Williams <[email protected]>
Add FPGA build support for the CW340 + CW341 with Earl Grey. Port the SPI constraints from the ASIC to the CW340, and move the USB timing constraints out of the physical constraints file. Adjust build scripts for the CW340's primitives for tasks like MMI generation. Use CW310 software for now. The device-side software should be mostly compatible, but this could change in the future. Signed-off-by: Alexander Williams <[email protected]>
This will add CW340 builds to the bitstream caches, so tooling and software development may proceed without each developer needing to build the design. Signed-off-by: Alexander Williams <[email protected]>
Since CW310 is now at 24 MHz, I've increased the frequency here to match. |
Add primitives for a new Ultrascale library. The clocking architecture and primitive set is new vs. 7 series.
In the interest of moving more of the clocking out of LUTs / keeping it on dedicated clock channels, make prim_clock_div abstract and add Ultrascale-specific clock_div and clock_inv implementations.
Add hardware support for the CW340 + CW341 boards.
Note: The build times are going to be higher for this board than need to be. We might want to seriously consider removing generic mode (#15452), as the current spi_device structure makes it difficult to close timing and poses various CDC challenges. The timing for generic mode is wildly different from flash and TPM modes, and they mix poorly together.
On my machine, the hour-plus build goes down to 40-45 mins if generic_mode is removed. CI will be much slower that that, though ;)