Hammerblade #101

Open
amithmath opened this issue Sep 27, 2024 · 11 comments
Comments

@amithmath

I am trying to implement the HammerBlade example on the pynqz2, ultra96v2, and vu47p, but it runs out of resources. This issue was raised in #76, but even on the ultra96v2 the design runs out of resources. Please let me know which board to use. Thanks.

@dpetrisko
Collaborator

It should definitely fit on vu47p. Perhaps the manycore config is set too large. Do you have a utilization report you can post?

@amithmath
Author

Following is the report:

ERROR: [DRC UTLZ-1] Resource utilization: LUT as Logic over-utilized in Top Level Design (This design requires more LUT as Logic cells than are available in the target device. This design requires 71984 of such cell types but only 70560 compatible sites are available in the target device. Please analyze your synthesis results and constraints to ensure the design is mapped to Xilinx primitives as expected. If so, please consider targeting a larger device. Please set tcl parameter "drc.disableLUTOverUtilError" to 1 to change this error to warning.)

@dpetrisko
Collaborator

Can you post the actual reports and not just the error? I would need to see the hierarchical breakdown to see where the LUTs are going.

@amithmath
Author

amithmath commented Sep 28, 2024

Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.

| Tool Version : Vivado v.2022.1 (lin64) Build 3526262 Mon Apr 18 15:47:01 MDT 2022
| Date : Sat Sep 28 20:11:13 2024
| Host : amd64 running 64-bit CentOS Linux release 7.9.2009 (Core)
| Command : report_utilization -file hammerblade_bd_1_wrapper_utilization_synth.rpt -pb hammerblade_bd_1_wrapper_utilization_synth.pb
| Design : hammerblade_bd_1_wrapper
| Device : xczu3eg-sbva484-1-e
| Speed File : -1
| Design State : Synthesized

Utilization Design Information

Table of Contents

1. CLB Logic
1.1 Summary of Registers by Type
2. BLOCKRAM
3. ARITHMETIC
4. I/O
5. CLOCK
6. ADVANCED
7. CONFIGURATION
8. Primitives
9. Black Boxes
10. Instantiated Netlists

1. CLB Logic


+----------------------------+-------+-------+------------+-----------+--------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+--------+
| CLB LUTs* | 81591 | 0 | 0 | 70560 | 115.63 |
| LUT as Logic | 72479 | 0 | 0 | 70560 | 102.72 |
| LUT as Memory | 9112 | 0 | 0 | 28800 | 31.64 |
| LUT as Distributed RAM | 8933 | 0 | | | |
| LUT as Shift Register | 179 | 0 | | | |
| CLB Registers | 26938 | 0 | 0 | 141120 | 19.09 |
| Register as Flip Flop | 26669 | 0 | 0 | 141120 | 18.90 |
| Register as Latch | 269 | 0 | 0 | 141120 | 0.19 |
| CARRY8 | 1059 | 0 | 0 | 8820 | 12.01 |
| F7 Muxes | 591 | 0 | 0 | 35280 | 1.68 |
| F8 Muxes | 78 | 0 | 0 | 17640 | 0.44 |
| F9 Muxes | 0 | 0 | 0 | 8820 | 0.00 |
+----------------------------+-------+-------+------------+-----------+--------+

* Warning! The Final LUT count, after physical optimizations and full implementation, is typically lower. Run opt_design after synthesis, if not already completed, for a more realistic count.

1.1 Summary of Registers by Type

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0 | _ | - | - |
| 0 | _ | - | Set |
| 0 | _ | - | Reset |
| 0 | _ | Set | - |
| 0 | _ | Reset | - |
| 0 | Yes | - | - |
| 0 | Yes | - | Set |
| 395 | Yes | - | Reset |
| 1140 | Yes | Set | - |
| 25403 | Yes | Reset | - |
+-------+--------------+-------------+--------------+

2. BLOCKRAM

+-------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 40.5 | 0 | 0 | 216 | 18.75 |
| RAMB36/FIFO* | 38 | 0 | 0 | 216 | 17.59 |
| RAMB36E2 only | 38 | | | | |
| RAMB18 | 5 | 0 | 0 | 432 | 1.16 |
| RAMB18E2 only | 5 | | | | |
+-------------------+------+-------+------------+-----------+-------+

* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E2 or one FIFO18E2. However, if a FIFO18E2 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E2

3. ARITHMETIC

+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs | 19 | 0 | 0 | 360 | 5.28 |
| DSP48E2 only | 19 | | | | |
+----------------+------+-------+------------+-----------+-------+

4. I/O

+------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+------------+------+-------+------------+-----------+-------+
| Bonded IOB | 0 | 0 | 0 | 82 | 0.00 |
+------------+------+-------+------------+-----------+-------+

5. CLOCK

+----------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------+------+-------+------------+-----------+-------+
| GLOBAL CLOCK BUFFERs | 9 | 0 | 0 | 196 | 4.59 |
| BUFGCE | 2 | 0 | 0 | 88 | 2.27 |
| BUFGCE_DIV | 0 | 0 | 0 | 12 | 0.00 |
| BUFG_PS | 1 | 0 | 0 | 72 | 1.39 |
| BUFGCTRL* | 3 | 0 | 0 | 24 | 12.50 |
| PLL | 0 | 0 | 0 | 6 | 0.00 |
| MMCM | 1 | 0 | 0 | 3 | 33.33 |
+----------------------+------+-------+------------+-----------+-------+

* Note: Each used BUFGCTRL counts as two GLOBAL CLOCK BUFFERs. This table does not include global clocking resources, only buffer cell usage. See the Clock Utilization Report (report_clock_utilization) for detailed accounting of global clocking resource availability.

6. ADVANCED

+-----------+------+-------+------------+-----------+--------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-----------+------+-------+------------+-----------+--------+
| PS8 | 1 | 0 | 0 | 1 | 100.00 |
| SYSMONE4 | 0 | 0 | 0 | 1 | 0.00 |
+-----------+------+-------+------------+-----------+--------+

7. CONFIGURATION

+-------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------+------+-------+------------+-----------+-------+
| BSCANE2 | 0 | 0 | 0 | 4 | 0.00 |
| DNA_PORTE2 | 0 | 0 | 0 | 1 | 0.00 |
| EFUSE_USR | 0 | 0 | 0 | 1 | 0.00 |
| FRAME_ECCE4 | 0 | 0 | 0 | 1 | 0.00 |
| ICAPE3 | 0 | 0 | 0 | 2 | 0.00 |
| MASTER_JTAG | 0 | 0 | 0 | 1 | 0.00 |
| STARTUPE3 | 0 | 0 | 0 | 1 | 0.00 |
+-------------+------+-------+------------+-----------+-------+

8. Primitives

+------------+-------+---------------------+
| Ref Name | Used | Functional Category |
+------------+-------+---------------------+
| LUT6 | 36979 | CLB |
| FDRE | 25403 | Register |
| LUT5 | 14887 | CLB |
| RAMD32 | 13976 | CLB |
| LUT4 | 13086 | CLB |
| LUT3 | 10708 | CLB |
| LUT2 | 7359 | CLB |
| RAMS32 | 2078 | CLB |
| LUT1 | 1278 | CLB |
| FDSE | 1140 | Register |
| CARRY8 | 1059 | CLB |
| RAMD64E | 868 | CLB |
| MUXF7 | 591 | CLB |
| LDCE | 269 | Register |
| FDCE | 126 | Register |
| SRL16E | 111 | CLB |
| MUXF8 | 78 | CLB |
| SRLC32E | 68 | CLB |
| RAMB36E2 | 38 | BLOCKRAM |
| RAMS64E | 35 | CLB |
| DSP48E2 | 19 | Arithmetic |
| RAMB18E2 | 5 | BLOCKRAM |
| BUFGCTRL | 3 | Clock |
| BUFGCE | 2 | Clock |
| PS8 | 1 | Advanced |
| MMCME4_ADV | 1 | Clock |
| BUFG_PS | 1 | Clock |
+------------+-------+---------------------+

9. Black Boxes

+----------+------+
| Ref Name | Used |
+----------+------+

10. Instantiated Netlists

+----------+------+
| Ref Name | Used |
+----------+------+

@amithmath
Author

amithmath commented Sep 29, 2024

The report above is from the ultra96v2. There is no vps_zynq_bd.vu47p.tcl in https://github.com/black-parrot-hdk/zynq-parrot/tree/master/cosim/tcl/bd

@dpetrisko
Collaborator

Yes, so it should fit on the vu47p. There is no vps configuration file because the vu47p is not a Zynq part (it has no PS).

For the vu47p, we use a UART bridge to emulate the PS. You can see the connection here: https://github.com/black-parrot-hdk/zynq-parrot/blob/master/cosim/xdc/board.vu47p.xdc and the cosimulation here: https://github.com/black-parrot-hdk/zynq-parrot/blob/master/cosim/include/bridge/bsg_zynq_pl.h. We haven’t open-sourced a hardware configuration, as it’s a fairly custom solution.

For the ultra96v2, that report indicates it is very close to fitting. Reducing the sizes of the structures in the BlackParrot cores may get you there. Take a look at the TinyParrot configuration in the aviary and experiment with reducing the branch predictors and caches.
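
To put a number on how far the design is from fitting, the sketch below takes the Used and Available counts from the CLB Logic table in the ultra96v2 report posted earlier in this thread and computes the utilization and the number of cells over budget for each row. The names `CLB_ROWS`, `overage`, and `summarize` are illustrative only; they are not part of zynq-parrot or Vivado.

```python
# Used / Available figures copied from the CLB Logic table of the
# ultra96v2 (xczu3eg) synthesis report posted earlier in this thread.
CLB_ROWS = {
    "CLB LUTs": (81591, 70560),
    "LUT as Logic": (72479, 70560),
    "LUT as Memory": (9112, 28800),
    "CLB Registers": (26938, 141120),
}


def overage(used: int, available: int) -> int:
    """Return how many cells must be removed to fit (0 if it already fits)."""
    return max(0, used - available)


def summarize(rows):
    """Return {site type: (utilization %, cells over budget)} for every row."""
    return {
        name: (round(100.0 * used / avail, 2), overage(used, avail))
        for name, (used, avail) in rows.items()
    }


if __name__ == "__main__":
    for name, (util, over) in summarize(CLB_ROWS).items():
        status = f"over by {over}" if over else "fits"
        print(f"{name:<14} {util:7.2f}%  {status}")
```

On these post-synthesis figures, the LUT as Logic row is over by 1919 cells, while the CLB LUTs row (logic plus memory LUTs) is over by 11031. The report's own warning notes that the final count after opt_design is typically lower, so these numbers are an upper estimate rather than a final result.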

@amithmath
Author

I am wondering whether it is possible to port the hardware to the Alveo U250 data center card (https://www.amd.com/en/products/accelerators/alveo/u250/a-u250-a64g-pq-g.html). If so, what changes would be needed? Also, the bsg_manycore accelerator is 32-bit; can it be changed to 64-bit? If so, what changes would be needed?

@dpetrisko
Collaborator

These are both very substantial projects.

The U250 has a pynq port, so beginning there and working through the cosim examples is the way to start. Once cosim is working, the hardware examples should port in a straightforward manner.

There was a student who ported the manycore toolchain to 64b:
bespoke-silicon-group/bsg_manycore#720

The hardware would require more changes, but primarily in parameterization. The actual RV64I ISA difference is minimal, especially if only F support is needed.

Both projects would likely require a highly motivated student for two or more quarters. Feel free to reach out to discuss funding for these efforts.

@amithmath
Author

amithmath commented Sep 30, 2024

Thanks, let me see. I was running the VCS simulation from /home/zynq-parrot/cosim/hammerblade-example/vcs and I am getting the following errors:

"/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[0].unnamed$$_0: started at 0ps failed at 0ps
Offending '(((~S_AXI_ARESETN) | (~slv_rd_sel_one_hot[(num_regs_ps_to_pl_p + 0)])) | pl_to_ps_fifo_valid_lo[0])'
Error: "/home/sonal/ViBram/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[0].unnamed$$_0: at time 0 ps
read from empty fifo
"/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[1].unnamed$$_0: started at 0ps failed at 0ps
Offending '(((~S_AXI_ARESETN) | (~slv_rd_sel_one_hot[(num_regs_ps_to_pl_p + 1)])) | pl_to_ps_fifo_valid_lo[1])'
Error: "/home/zynq-parrot/cosim/v/bsg_zynq_pl_shell.sv", 405: bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.zps.pl_to_ps[1].unnamed$$_0: at time 0 ps
read from empty fifo

bsg_tag_master transitioning to error state; be sure to run gate-level netlist to avoid sim/synth mismatch (bsg_nonsynth_zynq_testbench.dut.top_fpga_inst.master)
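
For readers decoding the "read from empty fifo" messages, the Offending expression in the log can be read as a plain boolean: the check holds when reset is asserted (S_AXI_ARESETN low), when the register is not selected for a read, or when the FIFO has valid data. The sketch below is a two-state Python model of that expression only. The function and argument names are illustrative, and this is not the code in bsg_zynq_pl_shell.sv.

```python
from itertools import product


def read_is_legal(aresetn: bool, rd_sel: bool, fifo_valid: bool) -> bool:
    """Two-state model of the logged expression:
    (~S_AXI_ARESETN) | (~slv_rd_sel_one_hot[i]) | pl_to_ps_fifo_valid_lo[i]
    """
    return (not aresetn) or (not rd_sel) or fifo_valid


def failing_cases():
    """Enumerate every (aresetn, rd_sel, fifo_valid) combination that fails."""
    return [
        combo
        for combo in product([False, True], repeat=3)
        if not read_is_legal(*combo)
    ]


if __name__ == "__main__":
    # The only failing combination is: out of reset, register selected
    # for a read, FIFO not valid.
    print(failing_cases())
```

In a four-state simulator an expression that evaluates to X is also treated as a failed assertion, so failures reported at 0 ps may come from uninitialized signals rather than a real read from an empty FIFO. The log alone does not establish which case applies here.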

@amithmath
Author

amithmath commented Oct 3, 2024

Please help, I am getting these errors in VCS:

BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.wready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bresp_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bresp_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.bready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.araddr_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.araddr_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arprot_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arprot_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.arready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.arready_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rdata_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rdata_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rresp_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rresp_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rvalid_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rvalid_gpio): final block executed before fini() was called
Fatal: "/home/sonal/ViBram/zynq-parrot/import/basejump_stl/bsg_test/bsg_nonsynth_dpi_gpio.sv", 64: bsg_nonsynth_zynq_testbench.axil4.rready_gpio: at time 56425001 ps
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rready_gpio): final block executed before fini() was called
V C S S i m u l a t i o n R e p o r t
Time: 56425001 ps
CPU Time: 1.510 seconds; Data structure size: 4.5Mb
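
The log above repeats one message for many GPIO instances, so it can help to collapse it before reading. The sketch below is a small, hypothetical helper (not part of zynq-parrot or basejump_stl) that groups BSG ERROR lines by message and leaf instance name. The regular expression is an assumption based only on the line format shown in this log.

```python
import re
from collections import Counter

# Matches lines such as:
# BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rready_gpio): final block executed before fini() was called
BSG_ERROR_RE = re.compile(r"BSG ERROR \((?P<inst>[\w.$\[\]]+)\): (?P<msg>.+)")


def summarize_bsg_errors(log_text: str) -> Counter:
    """Count BSG ERROR messages, keyed by (message, leaf instance name)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = BSG_ERROR_RE.search(line)
        if m:
            leaf = m.group("inst").rsplit(".", 1)[-1]
            counts[(m.group("msg").strip(), leaf)] += 1
    return counts


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.wready_gpio): final block executed before fini() was called
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.bresp_gpio): final block executed before fini() was called
BSG ERROR (bsg_nonsynth_zynq_testbench.axil4.rready_gpio): final block executed before fini() was called
"""

if __name__ == "__main__":
    for (msg, leaf), n in summarize_bsg_errors(SAMPLE).items():
        print(f"{n}x {leaf}: {msg}")
```

Every line in this excerpt carries the same message, "final block executed before fini() was called", all at 56425001 ps. That pattern suggests a single early termination of the simulation rather than independent failures, though the excerpt does not show what triggered it.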

@amithmath
Author

Can you please point me to the file and line number where I can experiment with reducing the branch predictors and caches in the aviary?
