[VTA] de10-nano driver #3394

liangfu · 2019-06-19T05:33:51Z

Following the compilation script for Intel FPGA ( #3494 ), this PR brings the driver for the DE10-Nano.

vegaluisjose · 2019-06-19T16:47:53Z

Pretty cool @liangfu , just out of curiosity how did you test the driver code? did you use the HLS VTA source on Intel's tools? or did you have a small design to test read/write registers/memory?

liangfu · 2019-06-19T23:09:57Z

Hi @vegaluisjose, similar to your chisel design of VTA, I have a PR #1694 . I have tested the driver with my own implement of VTA, which is completely compatible with your design.

vegaluisjose · 2019-06-19T23:25:45Z

Awesome @liangfu , last question is the nano-board using 32-bit addresses vta_phy_addr_t like the pynq board?

I don't know if I have rights to review code yet or how we would want to do this @tmoreau89 ?

liangfu · 2019-06-19T23:40:43Z

it is implemented in 32-bit address.

I think everyone is welcome to perform code review.

vegaluisjose · 2019-06-20T00:31:47Z

I think overall looks fine, just to double check is the purpose of the "kernel_module" directory to host the source code for the cma driver?

If that is the case and this cma implementation is particular to this board, then what do you think on renaming that to cma?

liangfu · 2019-06-20T01:10:05Z

The kernel module would be compiled to cma.ko, which could be installed with

insmod cma.ko

on the device. I don't think the kernel module is particularly designed for the de10-nano board, it might be useful for many other boards as well, as the interface between Cortex-A9 and FPGA is almost the same among different implements.

My concern is that Linux kernel modules are mostly released under GPL License, it may not fit in the place here that TVM is released under Apache License.

tmoreau89 · 2019-06-20T01:49:09Z

This is good work @liangfu! I'd love to test it on a DE10 Nano; however can you also upload the FPGA compilation scripts? It seems like the old PR was closed, do you want to merge your old flow with @vegaluisjose 's new Chisel based design? I think we would need to push both compilation support with the metal tests.

In addition, it looks like you re-wrote parts of the metal test Makefile. Did you ensure that it didn't break compilation for the pynq?

I can test that, once I have the means to reproduce your test (which would require the sources, and the bitstream compilation scripts).

tmoreau89 · 2019-06-20T01:49:13Z

Re: license. We need @tqchen's take on that. Where did you get the cma file from? Can we include from an altera quartus library assuming it's on the path (this is ultimately what we do with the pynq).

tqchen · 2019-06-20T03:31:23Z

Re: License, To be cautious, it would be great to put the additional file in 3rdparty or make it as an include dependency rather than source

liangfu · 2019-06-20T06:18:56Z

@tmoreau89 I would definitely merge with @vegaluisjose 's new design, and I would like to send the compilation script for DE10-Nano in another PR, since the driver is relatively independent (except for the address space alignment) with the VTA implementation and the compilation script.

I was very careful not to break up the compilation for pynq, but it has to be tested to ensure the corrected.

For the License part, as suggested by @tqchen , I would put the kernel module implement in an independent repo and import it into the 3rdparty directory .

liangfu · 2019-06-20T12:14:11Z

I have fixed a bug in mapping phy address, and moved kernel module source code into 3rdparty directory.

liangfu · 2019-06-21T07:22:16Z

@tmoreau89 when I migrate to use Chisel implement of VTA, the address space are quite different see here for tsim_driver.cc and here for test_lib.cc in metal_test script. Shall we migrate to use the address space newly implemented with Chisel and tsim, and modify the one implemented in metal_test and Xilinx compilation script?

tmoreau89 · 2019-06-21T07:28:34Z

Thank you @liangfu ; I think that placing the compilation scripts in another PR is totally fine.

Re: address maps. These are different indeed; we'll need to check with @vegaluisjose but ideally we should be using the same maps so the software drivers stay the same.

I suggest that we merge the other compilation script PR before this one so we can test the complete metal tests on the DE10 board. Does that work with your timeline?

liangfu · 2019-06-21T07:37:15Z

@tmoreau89 Sure

vegaluisjose · 2019-06-21T17:26:34Z

Hey @liangfu @tmoreau89 , yeah address maps are different in the Chisel version. They are now contiguous (4 bytes increment). These three lines show how they are generated.

I think we could use this new addressing?

@liangfu given the fact that these metal tests work on your end, why not trying the unit-tests right away?

tmoreau89 · 2019-06-21T18:26:16Z

I agree, let's use the new addressing imposed by the Chisel design memory map. The HLS version will get phased out once we have FPGA support on the Zynq and Ultra-96.

Re: metal tests - these have bit-rotted; and it would be good to use the python based test infrastructure (unit-tests). I'll need to address the metal tests in a separate PR.

liangfu · 2019-07-04T09:07:25Z

@vegaluisjose okay, i think i would follow your unit test script, since I just come to aware that the instruction layout is not compatible with HLS implement as well.

@tmoreau89 please refer to #3494 for compilation script for de10-nano.

vegaluisjose · 2019-07-04T17:11:53Z

Hey @liangfu, did you find something about the instruction layout that is not the same? It should be otherwise we would not be able to compile VTA programs.

liangfu · 2019-07-12T11:16:46Z

@vegaluisjose sorry, I think I've made a mistake in comparing the instruction layout.

For now, after changing dram_base to phy_addr directly, in FlushUopLoad, LoadBuffer2D, StoreBuffer2D functions in vta/src/runtime.cc, I could have a successful evaluation with Chisel-based VTA design on the DE10-Nano board.

vegaluisjose · 2019-07-12T16:27:40Z

Hey @liangfu , no worries. Thanks for letting me know.

liangfu · 2019-07-15T11:34:18Z

@vegaluisjose You're welcome.
I constantly had problem release uop memory buffer, even when running test_runtime_array function. Do you have any idea? BTW, how did you evaluate tes_vta_insn.py on PYNQ while the pynq_driver has not been modified to reflect the change in address space?

vegaluisjose · 2019-07-15T15:15:44Z

Hey @liangfu

What do you mean by releasing uop memory buffer? do you mean running VTAMemFree() in this board? or?

I did some testing, a hacked version of metal tests, on AWS F1 directly but we decided to go with just hardware simulation (TSIM) because we wanted to have python support for these from the beginning. We are evaluating/working on the changes needed for edge (PYNQ) and cloud (F1) in terms of runtime.

liangfu · 2019-07-15T23:09:51Z

Hi @vegaluisjose, yes, I meant running VTAMemFree to release uop buffer on the de10-nano board. It might be the issue that modified XFER to 1<<21 rather than 1<<22, but I've seen other buffer are released successfully.

On the other hand, I'm very glad to see python support in the very beginning. However, AFAIK, there is no ARM processor on F1 instance, how would you implement rpc to communicate with the FPGA device?

vegaluisjose · 2019-07-16T00:13:28Z

@liangfu oh I see, yeah it seems 4GB limit? perhaps a limitation of that cma library you are using?

In terms of AWS F1, we are still working on runtime side of things because the system is based on discrete FPGA card. We do have a host but not under the same shared memory environment as the SoC, so we have to dma instructions/data instead. Seems trivial, but there are some synchronization challenge we are figuring out.

liangfu · 2019-07-16T01:20:37Z

@vegaluisjose Thanks for your explanation, that makes total sense.

tmoreau89 · 2019-07-16T23:46:44Z

@liangfu I'm refactoring the runtime to make it easier to plug in different backends. It should now allocate the CMA buffers to be sized correctly rather than being sized according to the maximum DMA transfer defined by XFER. This should make the support of other drivers easier. I will update you when it's merged.

liangfu · 2019-07-18T07:08:03Z

@tmoreau89 I'm very glad to see there would be an update in the runtime, and thanks for letting me know.

@vegaluisjose When I'm debugging with test_vta_insn.py, I've seen this line defined the unpadded input with

x_np = np.random.randint(1, 2, size=(n, m, env.BATCH, env.BLOCK_OUT)).astype(x.dtype)

This actually didn't generate any random number, but instead produced ones in the output. But if I modify the random value range to 1~9, the test case can't be successful. I wonder whether producing ones is a purposeful design, or there might be a mistake.

tmoreau89 · 2019-07-18T16:59:59Z

this is probably a mistake, and should be changed to increase the range of the matrix values

tmoreau89 · 2019-07-18T17:04:04Z

I'll revisit this test and identify the issue; for now please ignore the padded load test

tmoreau89 · 2019-09-01T01:56:32Z

@liangfu it's likely test_padded_load is problematic, I think it also fails on HLS VTA hardware design (but passes in SIM).

Have you tried running the end to end resnet test on the DE10?

liangfu · 2019-09-01T02:38:26Z

@tmoreau89 Thanks for your attention. I don't think current driver could pass the resnet test. But with small tensor sizes, the driver can pass the tests one-by-one except the padded load test. Therefore, I think the problem exists in the cma implement, and i'm currently debugging into this.

vegaluisjose · 2019-09-01T03:35:20Z

I see, btw @liangfu do you think we are missing -fPIC on these two rules?

I am getting compilation errors on both Mac/Linux now when I try to create the share library. I managed to run the unittest by reverting the Makefile before this

liangfu · 2019-09-01T04:01:34Z

@vegaluisjose Ah, yes. The -fPIC is missing for now, and requires running make command twice, since the list of .cpp files is missing the time of first make command. These mistakes are introduced in the PR #3797 . A potential fix to this is to use for loop inside the compile command, and add the -fPIC as well. I can follow up another PR for fixing this.

vegaluisjose · 2019-09-01T04:12:44Z

Well, running the Makefile twice only works on OSX but it fails on Linux. I have some takes about having loops on Makefile rules.

What do you think about finding a middle ground with the previous version before #3797 and the debugging features you wanted to add (OSX/SDK)?

liangfu · 2019-09-01T04:20:49Z

@vegaluisjose I think it's fine to have such middle ground.

vegaluisjose · 2019-09-01T18:43:55Z

Great, do you mind taking a stab at it?. I am going to cleanup the `USE_TSIM` macro from runtime and also add `scalafmt` support for lint/format the codebase.

On Sat, Aug 31, 2019 at 9:20 PM Liangfu Chen ***@***.***> wrote: @vegaluisjose <https://github.com/vegaluisjose> I think it's fine to have such middle ground. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3394>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAS5B72JM3UO4NH36I7VFSTQHM7KNANCNFSM4HZGAOZA> .

-- Luis Vega

liangfu · 2019-09-03T11:40:15Z

@tmoreau89 I think the memory issue has been fixed. I can run the test_vta_insn.py test script successfully, but not yet tested with resnet-18. (The mxnet version I have does not seemingly compatible with current VTA library, and it shows segment fault on importing). Since current implement passed test_vta_insn.py, it would be easy to test with resnet-18 . Do you mind taking a review on current implementation ?

tmoreau89 · 2019-09-04T17:39:23Z

docs/vta/install.md

@@ -20,8 +20,9 @@ VTA Installation Guide

 We present three installation guides, each extending on the previous one:
 1. [Simulator installation](#vta-simulator-installation)
-2. [Hardware test setup](#vta-pynq-based-test-setup)
-3. [FPGA toolchain installation](#vta-fpga-toolchain-installation)
+2. [PYNQ-based test setup](#vta-pynq-based-test-setup)


Perhaps we can rename these categories to "FPGA test setup" and have sub categories for "Pynq board", "DE10Nano board" etc.?

tmoreau89 · 2019-09-04T17:46:21Z

@liangfu I reviewed the PR and overall it looks good. There's a bit of a risk involved in getting this merged when e2e isn't tested, but we need to merge this work in to make sure it doesn't bit rot. There are known issues with the Chisel design that prevent e2e resnet to run successfully on the pynq target. One worry I have is that people will start using the designs and be confused by the errors they will get. Should we perhaps start an issue that tracks platform coverage (pynq, ultra96, de10nano) and what tests are currently passing failing? This will ensure that we provide full visibility over the current state of progress to the rest of the community. Throughts @vegaluisjose ?

vegaluisjose · 2019-09-04T18:17:24Z

One of the challenges of TVM/VTA is that software is developed incrementally while hardware is asked to be built all at once. This is reflected on how the whole VTA repository is currently organized.

One way of dealing with this currently is to create a tuple of (language, configuration, backend) that will be checked at the beginning of each python script against a coverage file or something?

The idea is that a person tries to run one of the scatter python test files in the repo, i.e. tutorials or tests/python, and the tuple is not supported, then a message will be displayed saying that this tuple is not supported currently?

Side note, maybe we should follow a hierarchy similar to what TPU has for models and benchmarks

tmoreau89 · 2019-09-04T18:28:54Z

Having a coverage file is a good idea, at least to provide some indication about the state of tested coverage. I think we can cross reference the coverage file with a Github issue that will track coverage. This will be helpful to a lot of people jumping on board. Also we can track who's working on what in the community. I'll take a stab at it by the e.o.w.

tmoreau89 · 2019-09-04T18:30:54Z

@liangfu if you want this merged, please remove the WIP label

liangfu · 2019-09-05T08:58:32Z

@vegaluisjose @tmoreau89 Thanks for your attention and valuable comments on this.

I reverted the MXNet version back to v1.3.1, and I could have a end-to-end test on the board. However, the test is not yet successful. My observation is that, I can't have sufficient contiguous memory allocated for the inference of resnet-18. I would like to have this fixed in a separate PR. @tmoreau89 What do you think?

On the test coverage topic, I think it's a good idea to keep track of tested tuples in a issue. In addition, I think it would be more helpful to provide board support packages (BSPs) for the supported platforms, so that users could have a jump start before taking a deep dive. Those BSPs could have precompiled FPGA bitstreams builtin and programmed by default when Linux system is loaded.

tmoreau89 · 2019-09-05T16:51:28Z

Thanks for the update @liangfu . I propose that we move ahead with this PR and get it merged, while we address the memory allocation issue in a separate PR. Do you know how large of a buffer the CMA library can allocate on the DE10Nano? And do you have an idea if it fails in a data buffer or instruction buffer, or micro-op buffer allocation?

* rework; * `de10-nano` -> `de10nano`; * fix compilation error; * bug fix; * Update install.md * Update install.md * Update install.md * update with current runtime; * add debug messages; * bug fix in cma kernel module;

liangfu · 2019-09-10T08:43:00Z

Do you know how large of a buffer the CMA library can allocate on the DE10Nano? And do you have an idea if it fails in a data buffer or instruction buffer, or micro-op buffer allocation?

@tmoreau89 I think the maximum buffer size that the CMA library can allocate on the DE10Nano board is 16MB for now. This is what I can get on the board.

root@de10-nano:~# cat /proc/meminfo | grep -i cma
CmaTotal:          16384 kB
CmaFree:           15972 kB

In addition, when I run end-to-end test with resnet-18, the CmaFree variable dropped quickly until it's trying to allocate 0x240000 bytes of contiguous memory and reported an error in memory allocation.

The debug message shows that it is a kind of DataBuffer that requires such amount of memory.

liangfu · 2019-09-10T09:39:42Z

AKAIK, PYNQ has as large as 128MB for cma_alloc, see its documentation here.

To increase the buffer size for contiguous memory allocation, I guess we need to recompile the kernel and set FORCE_MAX_ZONEORDER to 16 for 128 MiB CmaTotal, or set CONFIG_CMA_SIZE_MBYTES=128 directly in .config file.

tmoreau89 · 2019-09-10T17:38:19Z

Thanks @liangfu; indeed if we want to increase the cma_alloc pool size we'd need to rebuild the kernel. It's already a limitation for some larger networks that have large parameter footprints.

* rework; * `de10-nano` -> `de10nano`; * fix compilation error; * bug fix; * Update install.md * Update install.md * Update install.md * update with current runtime; * add debug messages; * bug fix in cma kernel module;

liangfu mentioned this pull request Jul 4, 2019

[VTA] VTA Compilation Script for Intel FPGA #3494

Merged

liangfu added 3 commits September 2, 2019 18:19

add debug messages;

7f90931

bug fix in cma kernel module;

355ada7

Merge branch 'master' into de10-nano_driver

6f3f2b1

tmoreau89 reviewed Sep 4, 2019

View reviewed changes

tmoreau89 approved these changes Sep 4, 2019

View reviewed changes

tmoreau89 removed the status: WIP label Sep 5, 2019

tmoreau89 merged commit 734df8d into apache:master Sep 5, 2019

tqchen mentioned this pull request Nov 8, 2019

[RELEASE][DRAFT] TVM v0.6 Release candidate #4259

Closed

[VTA] de10-nano driver #3394

[VTA] de10-nano driver #3394

Conversation

liangfu commented Jun 19, 2019 • edited Loading

vegaluisjose commented Jun 19, 2019

liangfu commented Jun 19, 2019 • edited Loading

vegaluisjose commented Jun 19, 2019

liangfu commented Jun 19, 2019

vegaluisjose commented Jun 20, 2019

liangfu commented Jun 20, 2019

tmoreau89 commented Jun 20, 2019

tmoreau89 commented Jun 20, 2019

tqchen commented Jun 20, 2019

liangfu commented Jun 20, 2019 • edited Loading

liangfu commented Jun 20, 2019

liangfu commented Jun 21, 2019

tmoreau89 commented Jun 21, 2019

liangfu commented Jun 21, 2019

vegaluisjose commented Jun 21, 2019

tmoreau89 commented Jun 21, 2019

liangfu commented Jul 4, 2019

vegaluisjose commented Jul 4, 2019

liangfu commented Jul 12, 2019

vegaluisjose commented Jul 12, 2019

liangfu commented Jul 15, 2019

vegaluisjose commented Jul 15, 2019

liangfu commented Jul 15, 2019

vegaluisjose commented Jul 16, 2019

liangfu commented Jul 16, 2019

tmoreau89 commented Jul 16, 2019

liangfu commented Jul 18, 2019

tmoreau89 commented Jul 18, 2019

tmoreau89 commented Jul 18, 2019

tmoreau89 commented Sep 1, 2019

liangfu commented Sep 1, 2019

vegaluisjose commented Sep 1, 2019

liangfu commented Sep 1, 2019

vegaluisjose commented Sep 1, 2019

liangfu commented Sep 1, 2019

vegaluisjose commented Sep 1, 2019 via email

liangfu commented Sep 3, 2019

tmoreau89 Sep 4, 2019

Choose a reason for hiding this comment

tmoreau89 commented Sep 4, 2019

vegaluisjose commented Sep 4, 2019

tmoreau89 commented Sep 4, 2019

tmoreau89 commented Sep 4, 2019

liangfu commented Sep 5, 2019

tmoreau89 commented Sep 5, 2019

liangfu commented Sep 10, 2019

liangfu commented Sep 10, 2019 • edited Loading

tmoreau89 commented Sep 10, 2019

liangfu commented Jun 19, 2019 •

edited

Loading

liangfu commented Jun 19, 2019 •

edited

Loading

liangfu commented Jun 20, 2019 •

edited

Loading

liangfu commented Sep 10, 2019 •

edited

Loading