Stall-vs-Swap approach to atomic updates of large register banks, with fields wider than CPU bus #1
Comments
Here are a few relevant excerpts from other discussion threads on this question: ----1----
----2----
----3----
----4----
----5----
I have updated README.md to include a description of the atomic CSR update mechanism. Here I will also present a draft of latency calculations and buffer sizing. If we assume that the DPE just started processing a 1500B frame, PAUSE setup latency at the multiplexer can be expressed as:
Furthermore, PAUSE to READY latency can be expressed as:
For the cut-through pipeline:
For the store-and-forward DPE (worst-case scenario):
Hence, PAUSE to READY latency can be calculated as follows:
Update latency per byte of written CSR data if a 32-bit bus is used:
Given that the Routing Table contains 16 entries of 300 bytes each, the update latency can be calculated as follows:
Finally, the total CSR update latency can be calculated as follows:
Since no packets are received in the DPE during the FCR handshake procedure, from the establishment of the pause until the completion of the CSR update, it is necessary to size the input FIFOs to at least:
Given that there are four Rx FIFOs connected to 1GbE interfaces, the total required capacity is equal to:
As the CSR update latency is the largest contributor to the total latency, we can conclude that the additional BRAM needed for the Rx FIFOs is approximately 2 bytes of memory for each byte of written CSR data. If we want to support jumbo frames, each Rx FIFO must be at least 9000 bytes in size anyway, which leaves a large margin for CSR update latency (4-5 times more than needed). Since many of the variables depend on implementation details, I will make a spreadsheet for easier calculation of the expected latency and the required buffer sizes.
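Until the spreadsheet exists, the sizing chain described above (handshake latency plus CSR write time, converted into bytes arriving at line rate) can be sketched in a few lines of Python. This is only a rough model: the line rate, FIFO count, bus width, and table size come from this thread, while `bus_clock_hz`, `cycles_per_write`, and `handshake_cycles` are hypothetical placeholders that must be replaced with real implementation figures. It does not reproduce the exact formulas or numbers quoted above.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids float rounding surprises)."""
    return -(-a // b)


@dataclass(frozen=True)
class StallBudget:
    # Values taken from the thread
    line_rate_bps: int = 1_000_000_000  # 1GbE per Rx port
    num_rx_fifos: int = 4               # four Rx FIFOs
    bus_width_bytes: int = 4            # 32-bit CPU bus
    table_entries: int = 16             # Routing Table entries
    entry_bytes: int = 300              # bytes per entry
    # ASSUMPTIONS: hypothetical placeholders, not from the thread
    bus_clock_hz: int = 100_000_000     # CPU bus clock
    cycles_per_write: int = 2           # bus cycles per 32-bit write
    handshake_cycles: int = 0           # PAUSE setup + PAUSE-to-READY

    def csr_bytes(self) -> int:
        """Total CSR payload written during one update."""
        return self.table_entries * self.entry_bytes

    def stall_cycles(self) -> int:
        """Bus cycles during which the DPE accepts no packets."""
        writes = ceil_div(self.csr_bytes(), self.bus_width_bytes)
        return self.handshake_cycles + writes * self.cycles_per_write

    def fifo_bytes_per_port(self) -> int:
        """Bytes arriving on one port at line rate during the stall."""
        bits = self.stall_cycles() * self.line_rate_bps
        return ceil_div(bits, 8 * self.bus_clock_hz)

    def fifo_bytes_total(self) -> int:
        """Extra Rx FIFO capacity summed over all ports."""
        return self.num_rx_fifos * self.fifo_bytes_per_port()


budget = StallBudget()
print(budget.stall_cycles(), budget.fifo_bytes_per_port(), budget.fifo_bytes_total())
```

With the placeholder defaults this prints `2400 3000 12000`. These figures are illustrative only, since they scale directly with the assumed clock and cycles per write.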
While a spreadsheet is a great idea that will facilitate design space exploration, could we, for the sake of argument, also specify a ballpark number of peers that commercial products of this kind typically support, or need?
Commercial-grade devices in this range (1Gbps) support up to 20-25 peers:
From what I have seen of user experiences, most deployments need up to 100 peers, so commercial vendors are working to support that.
We are looking for a way to ensure atomic writes to registers wider than 32 bits (the width of our CPU access bus).
Some control structures (such as the crypto key routing table) must be updated as a whole, since a partial update could cause unexpected behavior in the DataPlane Engine (DPE).
SystemRDL supports write-buffered registers. They, however, seem inefficient in their use of FPGA resources.
Here is an example that specifies a table of 16 records, each 224 bits in size, for a total of 3584 flops. Those are the control bits from the DPE viewpoint. However, the SystemRDL-generated RTL output expends almost 3x that number of flops:
- live values that go out to the DPE
- shadow flops
- bit-enable flops
Granted, these per-bit enables can be rationalized, and even completely eliminated when full 32-bit writes (without even the byte enables) are acceptable.
For the best resource utilization, we are most seriously considering the option of stalling the DPE while the CPU is updating its control registers. That would ensure access atomicity using ordinary, i.e. the least expensive, SystemRDL reg types. Granted, this flop gain would be paid for in BRAM, as the depth of the Rx FIFOs would need to be enlarged to absorb the slack while DPE processing is held off.
As an alternative, we are also assessing the cost of the register bank swap method. This is similar to the Z80 register swaps. The first thought is to have SystemRDL create two full sets of ordinary registers: A and B. We would then implement a mux in our RTL, outside of SystemRDL, so that an A<->B swap can be executed instantly on CPU command.
Ideally, SystemRDL would introduce a Swap register type, so that the user defines only one set and the flow creates the other set under the hood, along with a mux and a register for selecting the bank.
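To make the swap idea concrete, here is a minimal behavioral sketch in Python (not RTL, and not something SystemRDL generates). It assumes the CPU always writes 32-bit words into the inactive bank and that one select bit decides which bank the DPE sees. The names `SwapBank`, `cpu_write`, `cpu_swap`, and `dpe_read_record` are hypothetical and exist only for illustration.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


class SwapBank:
    """Behavioral model of an A/B register bank with a one-bit select."""

    def __init__(self, num_words: int) -> None:
        self._banks = [[0] * num_words, [0] * num_words]  # bank A, bank B
        self._active = 0  # select bit: the bank currently visible to the DPE

    def cpu_write(self, word_index: int, value: int) -> None:
        """CPU writes one 32-bit word into the inactive (hidden) bank."""
        self._banks[1 - self._active][word_index] = value & WORD_MASK

    def cpu_swap(self) -> None:
        """Flip the select bit, so all pending words become live at once."""
        self._active = 1 - self._active

    def dpe_read_record(self, first_word: int, num_words: int) -> int:
        """DPE view: assemble a wide record from the active bank only."""
        words = self._banks[self._active][first_word:first_word + num_words]
        value = 0
        for i, word in enumerate(words):
            value |= word << (WORD_BITS * i)
        return value


def write_record(bank: SwapBank, first_word: int, num_words: int, value: int) -> None:
    """Split a wide value into 32-bit CPU writes, least significant word first."""
    for i in range(num_words):
        bank.cpu_write(first_word + i, (value >> (WORD_BITS * i)) & WORD_MASK)


# 16 records of 224 bits each, i.e. 7 words per record
WORDS_PER_RECORD = 224 // WORD_BITS
bank = SwapBank(16 * WORDS_PER_RECORD)
new_key = (1 << 224) - 1

write_record(bank, 0, WORDS_PER_RECORD, new_key)
before_swap = bank.dpe_read_record(0, WORDS_PER_RECORD)  # DPE still sees the old value
bank.cpu_swap()
after_swap = bank.dpe_read_record(0, WORDS_PER_RECORD)   # DPE sees the full new value
```

In this model the DPE never observes a partially written record, at the cost of two full register sets plus the mux and the select bit. One caveat of this simple sketch is that after a swap the hidden bank holds the previous table, so the CPU would have to rewrite every record (or copy the live bank) before the next swap.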