Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VTA] de10-nano driver #3394

Merged
merged 14 commits into from
Sep 5, 2019
Merged

[VTA] de10-nano driver #3394

merged 14 commits into from
Sep 5, 2019

Conversation

liangfu
Copy link
Member

@liangfu liangfu commented Jun 19, 2019

Following the compilation script for Intel FPGA ( #3494 ), this PR brings the driver for the DE10-Nano.

@vegaluisjose
Copy link
Member

Pretty cool @liangfu , just out of curiosity how did you test the driver code? did you use the HLS VTA source on Intel's tools? or did you have a small design to test read/write registers/memory?

@liangfu
Copy link
Member Author

liangfu commented Jun 19, 2019

Hi @vegaluisjose, similar to your chisel design of VTA, I have a PR #1694 . I have tested the driver with my own implement of VTA, which is completely compatible with your design.

@vegaluisjose
Copy link
Member

Awesome @liangfu , last question is the nano-board using 32-bit addresses vta_phy_addr_t like the pynq board?

I don't know if I have rights to review code yet or how we would want to do this @tmoreau89 ?

@liangfu
Copy link
Member Author

liangfu commented Jun 19, 2019

it is implemented in 32-bit address.

I think everyone is welcome to perform code review.

@vegaluisjose
Copy link
Member

I think overall looks fine, just to double check is the purpose of the "kernel_module" directory to host the source code for the cma driver?

If that is the case and this cma implementation is particular to this board, then what do you think on renaming that to cma?

@liangfu
Copy link
Member Author

liangfu commented Jun 20, 2019

The kernel module would be compiled to cma.ko, which could be installed with

insmod cma.ko

on the device. I don't think the kernel module is particularly designed for the de10-nano board, it might be useful for many other boards as well, as the interface between Cortex-A9 and FPGA is almost the same among different implements.

My concern is that Linux kernel modules are mostly released under GPL License, it may not fit in the place here that TVM is released under Apache License.

@tmoreau89
Copy link
Contributor

This is good work @liangfu! I'd love to test it on a DE10 Nano; however can you also upload the FPGA compilation scripts? It seems like the old PR was closed, do you want to merge your old flow with @vegaluisjose 's new Chisel based design? I think we would need to push both compilation support with the metal tests.

In addition, it looks like you re-wrote parts of the metal test Makefile. Did you ensure that it didn't break compilation for the pynq?

I can test that, once I have the means to reproduce your test (which would require the sources, and the bitstream compilation scripts).

@tmoreau89
Copy link
Contributor

Re: license. We need @tqchen's take on that. Where did you get the cma file from? Can we include from an altera quartus library assuming it's on the path (this is ultimately what we do with the pynq).

@tqchen
Copy link
Member

tqchen commented Jun 20, 2019

Re: License, To be cautious, it would be great to put the additional file in 3rdparty or make it as an include dependency rather than source

@liangfu
Copy link
Member Author

liangfu commented Jun 20, 2019

@tmoreau89 I would definitely merge with @vegaluisjose 's new design, and I would like to send the compilation script for DE10-Nano in another PR, since the driver is relatively independent (except for the address space alignment) with the VTA implementation and the compilation script.

I was very careful not to break up the compilation for pynq, but it has to be tested to ensure the corrected.

For the License part, as suggested by @tqchen , I would put the kernel module implement in an independent repo and import it into the 3rdparty directory .

@liangfu
Copy link
Member Author

liangfu commented Jun 20, 2019

I have fixed a bug in mapping phy address, and moved kernel module source code into 3rdparty directory.

@liangfu
Copy link
Member Author

liangfu commented Jun 21, 2019

@tmoreau89 when I migrate to use Chisel implement of VTA, the address space are quite different see here for tsim_driver.cc and here for test_lib.cc in metal_test script. Shall we migrate to use the address space newly implemented with Chisel and tsim, and modify the one implemented in metal_test and Xilinx compilation script?

@tmoreau89
Copy link
Contributor

Thank you @liangfu ; I think that placing the compilation scripts in another PR is totally fine.

Re: address maps. These are different indeed; we'll need to check with @vegaluisjose but ideally we should be using the same maps so the software drivers stay the same.

I suggest that we merge the other compilation script PR before this one so we can test the complete metal tests on the DE10 board. Does that work with your timeline?

@liangfu
Copy link
Member Author

liangfu commented Jun 21, 2019

@tmoreau89 Sure

@vegaluisjose
Copy link
Member

Hey @liangfu @tmoreau89 , yeah address maps are different in the Chisel version. They are now contiguous (4 bytes increment). These three lines show how they are generated.

I think we could use this new addressing?

@liangfu given the fact that these metal tests work on your end, why not trying the unit-tests right away?

@tmoreau89
Copy link
Contributor

I agree, let's use the new addressing imposed by the Chisel design memory map. The HLS version will get phased out once we have FPGA support on the Zynq and Ultra-96.

Re: metal tests - these have bit-rotted; and it would be good to use the python based test infrastructure (unit-tests). I'll need to address the metal tests in a separate PR.

@liangfu
Copy link
Member Author

liangfu commented Jul 4, 2019

@vegaluisjose okay, i think i would follow your unit test script, since I just come to aware that the instruction layout is not compatible with HLS implement as well.

@tmoreau89 please refer to #3494 for compilation script for de10-nano.

@vegaluisjose
Copy link
Member

Hey @liangfu, did you find something about the instruction layout that is not the same? It should be otherwise we would not be able to compile VTA programs.

@liangfu
Copy link
Member Author

liangfu commented Jul 12, 2019

@vegaluisjose sorry, I think I've made a mistake in comparing the instruction layout.

For now, after changing dram_base to phy_addr directly, in FlushUopLoad, LoadBuffer2D, StoreBuffer2D functions in vta/src/runtime.cc, I could have a successful evaluation with Chisel-based VTA design on the DE10-Nano board.

@vegaluisjose
Copy link
Member

Hey @liangfu , no worries. Thanks for letting me know.

@liangfu
Copy link
Member Author

liangfu commented Jul 15, 2019

@vegaluisjose You're welcome.
I constantly had problem release uop memory buffer, even when running test_runtime_array function. Do you have any idea? BTW, how did you evaluate tes_vta_insn.py on PYNQ while the pynq_driver has not been modified to reflect the change in address space?

@vegaluisjose
Copy link
Member

Hey @liangfu

What do you mean by releasing uop memory buffer? do you mean running VTAMemFree() in this board? or?

I did some testing, a hacked version of metal tests, on AWS F1 directly but we decided to go with just hardware simulation (TSIM) because we wanted to have python support for these from the beginning. We are evaluating/working on the changes needed for edge (PYNQ) and cloud (F1) in terms of runtime.

@liangfu
Copy link
Member Author

liangfu commented Jul 15, 2019

Hi @vegaluisjose, yes, I meant running VTAMemFree to release uop buffer on the de10-nano board. It might be the issue that modified XFER to 1<<21 rather than 1<<22, but I've seen other buffer are released successfully.

On the other hand, I'm very glad to see python support in the very beginning. However, AFAIK, there is no ARM processor on F1 instance, how would you implement rpc to communicate with the FPGA device?

@vegaluisjose
Copy link
Member

@liangfu oh I see, yeah it seems 4GB limit? perhaps a limitation of that cma library you are using?

In terms of AWS F1, we are still working on runtime side of things because the system is based on discrete FPGA card. We do have a host but not under the same shared memory environment as the SoC, so we have to dma instructions/data instead. Seems trivial, but there are some synchronization challenge we are figuring out.

@liangfu
Copy link
Member Author

liangfu commented Jul 16, 2019

@vegaluisjose Thanks for your explanation, that makes total sense.

@tmoreau89
Copy link
Contributor

@liangfu I'm refactoring the runtime to make it easier to plug in different backends. It should now allocate the CMA buffers to be sized correctly rather than being sized according to the maximum DMA transfer defined by XFER. This should make the support of other drivers easier. I will update you when it's merged.

@liangfu
Copy link
Member Author

liangfu commented Jul 18, 2019

@tmoreau89 I'm very glad to see there would be an update in the runtime, and thanks for letting me know.

@vegaluisjose When I'm debugging with test_vta_insn.py, I've seen this line defined the unpadded input with

x_np = np.random.randint(1, 2, size=(n, m, env.BATCH, env.BLOCK_OUT)).astype(x.dtype)

This actually didn't generate any random number, but instead produced ones in the output. But if I modify the random value range to 1~9, the test case can't be successful. I wonder whether producing ones is a purposeful design, or there might be a mistake.

@tmoreau89
Copy link
Contributor

this is probably a mistake, and should be changed to increase the range of the matrix values

@tmoreau89
Copy link
Contributor

I'll revisit this test and identify the issue; for now please ignore the padded load test

@tmoreau89
Copy link
Contributor

@liangfu it's likely test_padded_load is problematic, I think it also fails on HLS VTA hardware design (but passes in SIM).

Have you tried running the end to end resnet test on the DE10?

@liangfu
Copy link
Member Author

liangfu commented Sep 1, 2019

@tmoreau89 Thanks for your attention. I don't think current driver could pass the resnet test. But with small tensor sizes, the driver can pass the tests one-by-one except the padded load test. Therefore, I think the problem exists in the cma implement, and i'm currently debugging into this.

@vegaluisjose
Copy link
Member

I see, btw @liangfu do you think we are missing -fPIC on these two rules?

I am getting compilation errors on both Mac/Linux now when I try to create the share library. I managed to run the unittest by reverting the Makefile before this

@liangfu
Copy link
Member Author

liangfu commented Sep 1, 2019

@vegaluisjose Ah, yes. The -fPIC is missing for now, and requires running make command twice, since the list of .cpp files is missing the time of first make command. These mistakes are introduced in the PR #3797 . A potential fix to this is to use for loop inside the compile command, and add the -fPIC as well. I can follow up another PR for fixing this.

@vegaluisjose
Copy link
Member

Well, running the Makefile twice only works on OSX but it fails on Linux. I have some takes about having loops on Makefile rules.

What do you think about finding a middle ground with the previous version before #3797 and the debugging features you wanted to add (OSX/SDK)?

@liangfu
Copy link
Member Author

liangfu commented Sep 1, 2019

@vegaluisjose I think it's fine to have such middle ground.

@vegaluisjose
Copy link
Member

vegaluisjose commented Sep 1, 2019 via email

@liangfu
Copy link
Member Author

liangfu commented Sep 3, 2019

@tmoreau89 I think the memory issue has been fixed. I can run the test_vta_insn.py test script successfully, but not yet tested with resnet-18. (The mxnet version I have does not seemingly compatible with current VTA library, and it shows segment fault on importing). Since current implement passed test_vta_insn.py, it would be easy to test with resnet-18 . Do you mind taking a review on current implementation ?

@@ -20,8 +20,9 @@ VTA Installation Guide

We present three installation guides, each extending on the previous one:
1. [Simulator installation](#vta-simulator-installation)
2. [Hardware test setup](#vta-pynq-based-test-setup)
3. [FPGA toolchain installation](#vta-fpga-toolchain-installation)
2. [PYNQ-based test setup](#vta-pynq-based-test-setup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can rename these categories to "FPGA test setup" and have sub categories for "Pynq board", "DE10Nano board" etc.?

@tmoreau89
Copy link
Contributor

@liangfu I reviewed the PR and overall it looks good. There's a bit of a risk involved in getting this merged when e2e isn't tested, but we need to merge this work in to make sure it doesn't bit rot. There are known issues with the Chisel design that prevent e2e resnet to run successfully on the pynq target. One worry I have is that people will start using the designs and be confused by the errors they will get. Should we perhaps start an issue that tracks platform coverage (pynq, ultra96, de10nano) and what tests are currently passing failing? This will ensure that we provide full visibility over the current state of progress to the rest of the community. Throughts @vegaluisjose ?

@vegaluisjose
Copy link
Member

One of the challenges of TVM/VTA is that software is developed incrementally while hardware is asked to be built all at once. This is reflected on how the whole VTA repository is currently organized.

One way of dealing with this currently is to create a tuple of (language, configuration, backend) that will be checked at the beginning of each python script against a coverage file or something?

The idea is that a person tries to run one of the scatter python test files in the repo, i.e. tutorials or tests/python, and the tuple is not supported, then a message will be displayed saying that this tuple is not supported currently?

Side note, maybe we should follow a hierarchy similar to what TPU has for models and benchmarks

@tmoreau89
Copy link
Contributor

Having a coverage file is a good idea, at least to provide some indication about the state of tested coverage. I think we can cross reference the coverage file with a Github issue that will track coverage. This will be helpful to a lot of people jumping on board. Also we can track who's working on what in the community. I'll take a stab at it by the e.o.w.

@tmoreau89
Copy link
Contributor

@liangfu if you want this merged, please remove the WIP label

@liangfu
Copy link
Member Author

liangfu commented Sep 5, 2019

@vegaluisjose @tmoreau89 Thanks for your attention and valuable comments on this.

I reverted the MXNet version back to v1.3.1, and I could have a end-to-end test on the board. However, the test is not yet successful. My observation is that, I can't have sufficient contiguous memory allocated for the inference of resnet-18. I would like to have this fixed in a separate PR. @tmoreau89 What do you think?

On the test coverage topic, I think it's a good idea to keep track of tested tuples in a issue. In addition, I think it would be more helpful to provide board support packages (BSPs) for the supported platforms, so that users could have a jump start before taking a deep dive. Those BSPs could have precompiled FPGA bitstreams builtin and programmed by default when Linux system is loaded.

@tmoreau89
Copy link
Contributor

Thanks for the update @liangfu . I propose that we move ahead with this PR and get it merged, while we address the memory allocation issue in a separate PR. Do you know how large of a buffer the CMA library can allocate on the DE10Nano? And do you have an idea if it fails in a data buffer or instruction buffer, or micro-op buffer allocation?

@tmoreau89 tmoreau89 merged commit 734df8d into apache:master Sep 5, 2019
MarisaKirisame pushed a commit to MarisaKirisame/tvm that referenced this pull request Sep 7, 2019
* rework;

* `de10-nano` -> `de10nano`;

* fix compilation error;

* bug fix;

* Update install.md

* Update install.md

* Update install.md

* update with current runtime;

* add debug messages;

* bug fix in cma kernel module;
@liangfu
Copy link
Member Author

liangfu commented Sep 10, 2019

Do you know how large of a buffer the CMA library can allocate on the DE10Nano? And do you have an idea if it fails in a data buffer or instruction buffer, or micro-op buffer allocation?

@tmoreau89 I think the maximum buffer size that the CMA library can allocate on the DE10Nano board is 16MB for now. This is what I can get on the board.

root@de10-nano:~# cat /proc/meminfo | grep -i cma
CmaTotal:          16384 kB
CmaFree:           15972 kB

In addition, when I run end-to-end test with resnet-18, the CmaFree variable dropped quickly until it's trying to allocate 0x240000 bytes of contiguous memory and reported an error in memory allocation.

The debug message shows that it is a kind of DataBuffer that requires such amount of memory.

@liangfu
Copy link
Member Author

liangfu commented Sep 10, 2019

AKAIK, PYNQ has as large as 128MB for cma_alloc, see its documentation here.

To increase the buffer size for contiguous memory allocation, I guess we need to recompile the kernel and set FORCE_MAX_ZONEORDER to 16 for 128 MiB CmaTotal, or set CONFIG_CMA_SIZE_MBYTES=128 directly in .config file.

@tmoreau89
Copy link
Contributor

Thanks @liangfu; indeed if we want to increase the cma_alloc pool size we'd need to rebuild the kernel. It's already a limitation for some larger networks that have large parameter footprints.

wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
* rework;

* `de10-nano` -> `de10nano`;

* fix compilation error;

* bug fix;

* Update install.md

* Update install.md

* Update install.md

* update with current runtime;

* add debug messages;

* bug fix in cma kernel module;
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
* rework;

* `de10-nano` -> `de10nano`;

* fix compilation error;

* bug fix;

* Update install.md

* Update install.md

* Update install.md

* update with current runtime;

* add debug messages;

* bug fix in cma kernel module;
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019
* rework;

* `de10-nano` -> `de10nano`;

* fix compilation error;

* bug fix;

* Update install.md

* Update install.md

* Update install.md

* update with current runtime;

* add debug messages;

* bug fix in cma kernel module;
tqchen pushed a commit to tqchen/tvm that referenced this pull request Mar 29, 2020
* rework;

* `de10-nano` -> `de10nano`;

* fix compilation error;

* bug fix;

* Update install.md

* Update install.md

* Update install.md

* update with current runtime;

* add debug messages;

* bug fix in cma kernel module;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants