Error when executing "example-rnn-regression" juice example on aarch64 target #134
Comments
Thanks for reporting the issue! Much appreciated. Unfortunately the error swallows a bunch of information. @lissahyacinth, I remember vaguely we had a similar issue with the Jetson Nano. Which one do you have, B01 or A02? @lissahyacinth, do you still have access to one? Technically, there are only the limitations imposed by the underlying CUDA/cuDNN support.
It could be #106
Thanks for your answer, I'm using a Jetson Nano B01 with CUDA 10.2. I will try to add the prints tomorrow; I'll keep you up to date!
See #135, that should help with the error printing and alleviate your pain :)
I've got a B01 hanging around, but I haven't used it in a while. Didn't realise not propagating the error would come back to haunt me!
Hello, here is the new backtrace with a better error message:
I wonder about the workspace size? Etienne
That is really helpful! CUDA is pretty difficult to debug around there; as the error says, it can be any of those issues. I can't imagine the memory is causing the issue, but it could be? If you run nvidia-smi while the example is executing, you could check whether the GPU memory fills up. If not, I guess I'll have to debug it here :)
@Etienne-56 this is not related to your cargo or disk workspace, but only to the scratch memory for the cuDNN RNN layer (so a chunk of GPU-mapped memory, IIRC). I think the most likely cause is one of those two:
according to https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRNNForwardTraining. @lissahyacinth this is a hunch, nothing more, nothing less :) I have a feeling the size of our scratch space is just too small depending on the input and output, but I'd have to read up on that.
☝️ which is not what we do 😰
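For illustration, here is a minimal C sketch of the contract those docs describe, written against the legacy v7-style RNN API; the shapes, the LSTM mode and the error handling below are made up for the example and are not taken from juice. The point is that the workspace handed to cudnnRNNForwardTraining has to be at least as large as what cudnnGetRNNWorkspaceSize reports for the exact same descriptors:

```c
#include <cudnn.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_CUDNN(call)                                                  \
    do {                                                                   \
        cudnnStatus_t s_ = (call);                                         \
        if (s_ != CUDNN_STATUS_SUCCESS) {                                  \
            fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s_)); \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

int main(void) {
    /* Illustrative shapes only -- not taken from the juice example. */
    enum { SEQ_LEN = 10, BATCH = 1, INPUT = 8, HIDDEN = 16 };

    cudnnHandle_t handle;
    CHECK_CUDNN(cudnnCreate(&handle));

    /* Legacy (v7-style) RNN API: one 3D tensor descriptor per time step. */
    cudnnTensorDescriptor_t x_desc[SEQ_LEN];
    for (int t = 0; t < SEQ_LEN; ++t) {
        int dims[3]    = { BATCH, INPUT, 1 };
        int strides[3] = { INPUT, 1, 1 };
        CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc[t]));
        CHECK_CUDNN(cudnnSetTensorNdDescriptor(x_desc[t], CUDNN_DATA_FLOAT,
                                               3, dims, strides));
    }

    /* The RNN descriptor needs a dropout descriptor, even with p = 0. */
    cudnnDropoutDescriptor_t dropout;
    size_t state_bytes = 0;
    void *dropout_states = NULL;
    CHECK_CUDNN(cudnnCreateDropoutDescriptor(&dropout));
    CHECK_CUDNN(cudnnDropoutGetStatesSize(handle, &state_bytes));
    cudaMalloc(&dropout_states, state_bytes);
    CHECK_CUDNN(cudnnSetDropoutDescriptor(dropout, handle, 0.0f,
                                          dropout_states, state_bytes, 0));

    cudnnRNNDescriptor_t rnn;
    CHECK_CUDNN(cudnnCreateRNNDescriptor(&rnn));
    CHECK_CUDNN(cudnnSetRNNDescriptor_v6(handle, rnn, HIDDEN, /*numLayers=*/1,
                                         dropout, CUDNN_LINEAR_INPUT,
                                         CUDNN_UNIDIRECTIONAL, CUDNN_LSTM,
                                         CUDNN_RNN_ALGO_STANDARD,
                                         CUDNN_DATA_FLOAT));

    /* This is the size the workspace argument of cudnnRNNForwardTraining
     * must have for these exact descriptors; handing it a smaller buffer
     * is one way to end up with an opaque failure like the one above. */
    size_t workspace_bytes = 0;
    CHECK_CUDNN(cudnnGetRNNWorkspaceSize(handle, rnn, SEQ_LEN,
                                         x_desc, &workspace_bytes));
    printf("required RNN workspace: %zu bytes\n", workspace_bytes);

    void *workspace = NULL;
    cudaMalloc(&workspace, workspace_bytes);
    /* ...cudnnRNNForwardTraining(handle, rnn, SEQ_LEN, x_desc, x, ...,
     *                            workspace, workspace_bytes,
     *                            reserve, reserve_bytes); */
    return 0;
}
```

Because the required size depends on the input and output tensor descriptors, a buffer sized for one shape can silently become too small once the sequence length or batch changes.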
@lissahyacinth Thanks for the suggestion, but the nvidia-smi binary is not on the Jetson. Here is what I get instead:

```
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,1%@102,0%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] [email protected] iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] CPU@31C iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [3%@102,1%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] CPU@31C iwlwifi@41C PMIC@100C [email protected] [email protected] [email protected]
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,0%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] [email protected] iwlwifi@41C PMIC@100C GPU@29C AO@35C [email protected]
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26C CPU@31C iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 598/3964MB (lfb 693x4MB) SWAP 0/1982MB (cached 0MB) CPU [25%@1479,23%@1479,22%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 0% [email protected] CPU@31C iwlwifi@41C PMIC@100C [email protected] [email protected] thermal@30C
RAM 687/3964MB (lfb 659x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1428,5%@1428,29%@1428,29%@1428] EMC_FREQ 0% GR3D_FREQ 21% [email protected] CPU@32C iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 794/3964MB (lfb 620x4MB) SWAP 0/1982MB (cached 0MB) CPU [14%@1479,12%@1479,27%@1479,30%@1479] EMC_FREQ 0% GR3D_FREQ 15% PLL@27C [email protected] iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 866/3964MB (lfb 589x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1132,18%@1132,25%@1132,6%@1132] EMC_FREQ 0% GR3D_FREQ 10% PLL@27C [email protected] iwlwifi@41C PMIC@100C [email protected] [email protected] [email protected]
RAM 981/3964MB (lfb 553x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1479,24%@1479,47%@1479,6%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C [email protected] iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [24%@1479,14%@1479,14%@1479,15%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@32C iwlwifi@41C PMIC@100C [email protected] AO@35C [email protected]
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,2%@102,0%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] [email protected] iwlwifi@41C PMIC@100C GPU@29C [email protected] thermal@30C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] CPU@31C iwlwifi@41C PMIC@100C GPU@29C AO@35C [email protected]
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [7%@102,1%@102,3%@102,1%@102] EMC_FREQ 0% GR3D_FREQ 0% [email protected] [email protected] iwlwifi@41C PMIC@100C GPU@29C AO@35C [email protected]
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C [email protected] iwlwifi@41C PMIC@100C GPU@29C AO@35C thermal@30C
```
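For context: nvidia-smi is not shipped on Jetson boards, so figures like the ones above are typically sampled with tegrastats. This is an illustrative invocation, not necessarily the exact one used here:

```sh
# Sample RAM / CPU / GPU load on a Jetson; stop with Ctrl-C.
sudo tegrastats
```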
This has nothing to do with the available memory; that would yield an allocation error, which would produce a different error message.
@Etienne-56 I got a simple example test case working, see #139 / https://ci.spearow.io/teams/spearow/pipelines/juice/jobs/pr-test-juice-fedora-cuda/builds/41 - it was some hidden layer dimensionality mismatch that caused this, unrelated to the temporary space allocations. Hoping to fix the RNN layer tonight.
Ok, thanks @drahnr for the feedback :)
Quick update: I got a simple layer working now, but it seems the mackey-glass example is either triggering an edge case or using an invalid parameterization, or another issue is triggered when the RNN is embedded within multiple layers.
It's fixed for the most part now, training should be working just fine; there is one tensor size mismatch left in the
@drahnr Thanks for the support. I recently switched from an R&D project to a client project, so I can't rerun the test right now, but a new colleague is going to take my place next week. I will pass on to him what to do and keep you informed of the result!
Describe the bug
Hello everyone,
I'm new to GitHub; this is my first post :)
My objective is to use the juice library on an Nvidia Jetson Nano board. For that I managed to cross-compile (after much struggling) the juice repository using the Rust tool "cross" (https://github.com/rust-embedded/cross) and a Docker container:
Dockerfile.jetson-balena.txt
The Jetson Nano has the following setup: Linux droopi-desktop 4.9.201-tegra #1 SMP PREEMPT Fri Feb 19 08:40:32 PST 2021 aarch64 aarch64 aarch64 GNU/Linux
I built the Docker image with:
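The exact command is not preserved above; a typical build of such an image looks like the following sketch, where the file name is taken from the attachment and the tag juice-jetson-cross is only an illustrative name:

```sh
# Build the cross-compilation image from the attached Dockerfile.
docker build -f Dockerfile.jetson-balena -t juice-jetson-cross .
```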
In the juice repo I created a Cross.toml file containing:
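The original file contents are not preserved above; a Cross.toml that points cross at a custom image typically looks like this, reusing the illustrative tag from the docker build sketch:

```toml
# Tell cross to use the locally built image for the aarch64 target.
[target.aarch64-unknown-linux-gnu]
image = "juice-jetson-cross"
```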
I started cross compilation with:
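Again, the exact invocation is not preserved; a typical cross build for this target looks like the following, with whatever --features or -p flags the juice build needs appended:

```sh
# Cross-compile in release mode for the Jetson's aarch64 target.
cross build --release --target aarch64-unknown-linux-gnu
```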
The example "mnist-image-multiclass-classification" works fine on the jetson but the issue is that example "example-rnn-regression" fails at execution and I want to be shure that everything works fine in the library before going further.
The error message is :
example-rnn-regression.log.txt
Any idea why this happens? Is it even possible to have every juice feature working on an arm64 target?
Thank you very much,
Etienne