[VTA] de10-nano driver #3394
Pretty cool @liangfu , just out of curiosity how did you test the driver code? did you use the HLS VTA source on Intel's tools? or did you have a small design to test read/write registers/memory? |
Hi @vegaluisjose, similar to your Chisel design of VTA, I have PR #1694. I have tested the driver with my own implementation of VTA, which is completely compatible with your design. |
Awesome @liangfu, last question: is the nano board using 32-bit addresses? I don't know if I have rights to review code yet, or how we would want to do this, @tmoreau89? |
Yes, it is implemented with 32-bit addresses. I think everyone is welcome to perform code review. |
I think it looks fine overall. Just to double check: is the purpose of the "kernel_module" directory to host the source code for the If that is the case and this cma implementation is particular to this board, then what do you think about renaming that to |
The kernel module would be compiled to
on the device. I don't think the kernel module is particularly designed for the My concern is that Linux kernel modules are mostly released under the GPL, which may not fit here, since TVM is released under the Apache License. |
This is good work @liangfu! I'd love to test it on a DE10 Nano; however can you also upload the FPGA compilation scripts? It seems like the old PR was closed, do you want to merge your old flow with @vegaluisjose 's new Chisel based design? I think we would need to push both compilation support with the metal tests. In addition, it looks like you re-wrote parts of the metal test Makefile. Did you ensure that it didn't break compilation for the pynq? I can test that, once I have the means to reproduce your test (which would require the sources, and the bitstream compilation scripts). |
Re: license. We need @tqchen's take on that. Where did you get the cma file from? Can we include it from an Altera Quartus library, assuming it's on the path (this is ultimately what we do with the pynq)? |
Re: license. To be cautious, it would be great to put the additional file in 3rdparty or make it an include dependency rather than source. |
@tmoreau89 I would definitely merge with @vegaluisjose's new design, and I would like to send the compilation script for DE10-Nano in another PR, since the driver is relatively independent of the VTA implementation and the compilation script (except for the address space alignment). I was very careful not to break compilation for pynq, but it has to be tested to ensure correctness. For the license part, as suggested by @tqchen, I would put the kernel module implementation in an independent repo and import it into the |
I have fixed a bug in mapping physical addresses, and moved the kernel module source code into |
@tmoreau89 When I migrate to the Chisel implementation of VTA, the address spaces are quite different; see here for tsim_driver.cc and here for test_lib.cc in |
Thank you @liangfu ; I think that placing the compilation scripts in another PR is totally fine. Re: address maps. These are different indeed; we'll need to check with @vegaluisjose but ideally we should be using the same maps so the software drivers stay the same. I suggest that we merge the other compilation script PR before this one so we can test the complete metal tests on the DE10 board. Does that work with your timeline? |
@tmoreau89 Sure |
Hey @liangfu @tmoreau89, yeah, the address maps are different in the Chisel version. They are now contiguous (4-byte increments). These three lines show how they are generated. I think we could use this new addressing? @liangfu, given that these metal tests work on your end, why not try the unit tests right away? |
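To illustrate what a contiguous map with 4-byte increments looks like, here is a minimal Python sketch. The register names, the base offset and the helper `build_address_map` are hypothetical and chosen only for illustration; they are not taken from the Chisel sources or the driver.

```python
WORD_BYTES = 4  # each register sits 4 bytes after the previous one

# Hypothetical register names, for illustration only (not the real VTA set).
REGISTERS = ["launch", "insn_count", "insn_addr", "uop_addr"]


def build_address_map(names, base=0x00, stride=WORD_BYTES):
    """Assign each register a byte offset: base, base + stride, ..."""
    return {name: base + i * stride for i, name in enumerate(names)}


address_map = build_address_map(REGISTERS)
for name, offset in address_map.items():
    print(f"{name:>12s} @ 0x{offset:02x}")
```

With a map generated this way, a driver only needs the register order and the base address; the offsets follow from the index.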
I agree, let's use the new addressing imposed by the Chisel design memory map. The HLS version will get phased out once we have FPGA support on the Zynq and Ultra-96. Re: metal tests - these have bit-rotted; and it would be good to use the python based test infrastructure (unit-tests). I'll need to address the metal tests in a separate PR. |
@vegaluisjose Okay, I think I will follow your unit test script, since I just became aware that the instruction layout is not compatible with the HLS implementation either. @tmoreau89 Please refer to #3494 for the compilation script for de10-nano. |
Hey @liangfu, did you find something about the instruction layout that is not the same? It should be the same; otherwise we would not be able to compile VTA programs. |
@vegaluisjose sorry, I think I've made a mistake in comparing the instruction layout. For now, after changing |
Hey @liangfu , no worries. Thanks for letting me know. |
@vegaluisjose You're welcome. |
Hey @liangfu What do you mean by releasing I did some testing, a hacked version of metal tests, on AWS F1 directly but we decided to go with just hardware simulation (TSIM) because we wanted to have python support for these from the beginning. We are evaluating/working on the changes needed for edge (PYNQ) and cloud (F1) in terms of runtime. |
Hi @vegaluisjose, yes, I meant running On the other hand, I'm very glad to see Python support from the very beginning. However, AFAIK, there is no ARM processor on an F1 instance, so how would you implement RPC to communicate with the FPGA device? |
@liangfu oh I see, yeah it seems 4GB limit? perhaps a limitation of that In terms of AWS F1, we are still working on the runtime side of things because the system is based on a discrete FPGA card. We do have a host, but not under the same shared memory environment as the SoC, so we have to DMA instructions/data instead. Seems trivial, but there are some synchronization challenges we are figuring out. |
@vegaluisjose Thanks for your explanation, that makes total sense. |
@liangfu I'm refactoring the runtime to make it easier to plug in different backends. It should now allocate the CMA buffers to be sized correctly rather than being sized according to the maximum DMA transfer defined by |
@tmoreau89 I'm very glad to see there would be an update in the runtime, and thanks for letting me know. @vegaluisjose When I'm debugging with `x_np = np.random.randint(1, 2, size=(n, m, env.BATCH, env.BLOCK_OUT)).astype(x.dtype)` This actually didn't generate any random number, but instead produced |
this is probably a mistake, and should be changed to increase the range of the matrix values |
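For reference, NumPy's `randint(low, high)` samples from the half-open interval `[low, high)`, so the call quoted above can only ever return 1. The sketch below shows this with a small hypothetical shape standing in for `(n, m, env.BATCH, env.BLOCK_OUT)`; the `int8` dtype and the wider bounds are illustrative assumptions, not the values the test was eventually changed to.

```python
import numpy as np

# Hypothetical small shape standing in for (n, m, env.BATCH, env.BLOCK_OUT).
shape = (2, 3, 1, 16)

# randint samples from [low, high): with low=1 and high=2 the only
# possible value is 1, so the tensor is constant.
constant = np.random.randint(1, 2, size=shape).astype("int8")

# Widening the range produces values that actually vary. The bounds here
# are illustrative only.
np.random.seed(0)
varied = np.random.randint(-8, 8, size=shape).astype("int8")

print(np.unique(constant))     # the distinct values in the constant tensor
print(varied.min(), varied.max())
```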
I'll revisit this test and identify the issue; for now please ignore the padded load test |
@liangfu it's likely Have you tried running the end to end resnet test on the DE10? |
@tmoreau89 Thanks for your attention. I don't think the current driver could pass the resnet test. But with small tensor sizes, the driver can pass the tests one by one, except the padded load test. Therefore, I think the problem lies in the cma implementation, and I'm currently debugging this. |
I see. By the way @liangfu, do you think we are missing I am getting compilation errors on both Mac/Linux now when I try to create the shared library. I managed to run the unit test by reverting the Makefile before this |
@vegaluisjose Ah, yes. The |
Well, running the Makefile twice only works on OSX but it fails on Linux. I have some takes about having loops on Makefile rules. What do you think about finding a middle ground with the previous version before #3797 and the debugging features you wanted to add (OSX/SDK)? |
@vegaluisjose I think it's fine to have such middle ground. |
Great, do you mind taking a stab at it? I am going to clean up the
`USE_TSIM` macro from the runtime and also add `scalafmt` support for
linting/formatting the codebase.
|
@tmoreau89 I think the memory issue has been fixed. I can run the |
@@ -20,8 +20,9 @@ VTA Installation Guide

We present three installation guides, each extending on the previous one:
1. [Simulator installation](#vta-simulator-installation)
2. [Hardware test setup](#vta-pynq-based-test-setup)
3. [FPGA toolchain installation](#vta-fpga-toolchain-installation)
2. [PYNQ-based test setup](#vta-pynq-based-test-setup)
Perhaps we can rename these categories to "FPGA test setup" and have sub categories for "Pynq board", "DE10Nano board" etc.?
@liangfu I reviewed the PR and overall it looks good. There's a bit of a risk involved in getting this merged when e2e isn't tested, but we need to merge this work in to make sure it doesn't bit rot. There are known issues with the Chisel design that prevent e2e resnet from running successfully on the |
One of the challenges of TVM/VTA is that software is developed incrementally while hardware is asked to be built all at once. This is reflected in how the whole VTA repository is currently organized. One way of dealing with this currently is to create a tuple of The idea is that a person tries to run one of the scattered Python test files in the repo, i.e. Side note, maybe we should follow a hierarchy similar to what TPU has for models and benchmarks |
Having a coverage file is a good idea, at least to provide some indication about the state of tested coverage. I think we can cross reference the coverage file with a Github issue that will track coverage. This will be helpful to a lot of people jumping on board. Also we can track who's working on what in the community. I'll take a stab at it by the e.o.w. |
@liangfu if you want this merged, please remove the WIP label |
@vegaluisjose @tmoreau89 Thanks for your attention and valuable comments on this. I reverted the MXNet version back to v1.3.1, and I could run an end-to-end test on the board. However, the test is not yet successful. My observation is that I can't get sufficient contiguous memory allocated for resnet-18 inference. I would like to have this fixed in a separate PR. @tmoreau89 What do you think? On the test coverage topic, I think it's a good idea to keep track of tested tuples in an issue. In addition, I think it would be more helpful to provide board support packages (BSPs) for the supported platforms, so that users could get a jump start before taking a deep dive. Those BSPs could have precompiled FPGA bitstreams built in and programmed by default when the Linux system is loaded. |
Thanks for the update @liangfu . I propose that we move ahead with this PR and get it merged, while we address the memory allocation issue in a separate PR. Do you know how large of a buffer the CMA library can allocate on the DE10Nano? And do you have an idea if it fails in a data buffer or instruction buffer, or micro-op buffer allocation? |
* rework; * `de10-nano` -> `de10nano`; * fix compilation error; * bug fix; * Update install.md * Update install.md * Update install.md * update with current runtime; * add debug messages; * bug fix in cma kernel module;
@tmoreau89 I think the maximum buffer size that the CMA library can allocate on the DE10Nano board is 16MB for now. This is what I can get on the board.
In addition, when I run end-to-end test with resnet-18, the The debug message shows that it is a kind of |
AFAIK, PYNQ allows up to 128MB for cma_alloc, see its documentation here. To increase the buffer size for contiguous memory allocation, I guess we need to recompile the kernel and set |
Thanks @liangfu; indeed if we want to increase the |
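As a rough way to check the contiguous-memory budget discussed above, the following sketch parses the `CmaTotal` and `CmaFree` fields of `/proc/meminfo`. It assumes the board's kernel is built with `CONFIG_CMA` and exposes those fields; whether the kernel module in this PR draws from that pool is also an assumption. The helper names and the sample numbers (a 16MB pool) are hypothetical.

```python
def parse_cma_kb(meminfo_text):
    """Return the CmaTotal/CmaFree values (in kB) found in meminfo text."""
    result = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("CmaTotal", "CmaFree"):
            result[key] = int(rest.split()[0])
    return result


def fits_in_cma(request_bytes, meminfo_text):
    """Necessary (not sufficient) check: is the request within free CMA?"""
    free_kb = parse_cma_kb(meminfo_text).get("CmaFree", 0)
    return request_bytes <= free_kb * 1024


# Made-up sample resembling a board with a 16MB CMA pool.
SAMPLE = "MemTotal: 1024000 kB\nCmaTotal: 16384 kB\nCmaFree: 12288 kB\n"

print(parse_cma_kb(SAMPLE))
print(fits_in_cma(8 * 1024 * 1024, SAMPLE))
```

On a device, the text would come from `open("/proc/meminfo").read()`. Passing this check does not guarantee that an allocation succeeds, since the free pages may be fragmented.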
Following the compilation script for Intel FPGA ( #3494 ), this PR brings the driver for the DE10-Nano.