
use unmodified (blob-free) U-Boot for the Rock64 board #145506

Closed
wants to merge 1 commit into from

royneary
Contributor

Motivation for this change

I think we can simplify the ubootRock64 package and make it blob-free.

Back then, proprietary blobs were used because of "random memory corruption". I think that has been fixed in the meantime, maybe by these patches. I have been running my two Rock64 boards with a blob-free U-Boot for some months now and haven't encountered any issues.

The extraConfig can be removed too, because it is in mainline U-Boot now.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

royneary force-pushed the uboot-rock64-blobfree branch from eae9654 to fe85a24 on November 11, 2021 16:23
ofborg bot requested review from samueldr, dezgeg and lopsided98 November 11, 2021 16:34
ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Nov 11, 2021
@samueldr
Member

I'd like to hear confirmation from Rock64 board owners, but it sure looks likely to be good given the patch you linked (and your experience).

@mirrexagon
Contributor

mirrexagon commented Nov 20, 2021

I've tested this on hardware and it looks good!

I can boot a lightly modified generic AArch64 image using U-Boot from this branch on a 2 GB and a 4 GB Rock64. I get to the login prompt, automatic login works, and I get a working shell. Nothing seemed off while poking around for a minute.

Steps I took:

  • Built U-Boot from this branch (cross compilation via pkgsCross.aarch64-multiplatform).
  • Took a generic Aarch64 image (https://hydra.nixos.org/build/158402880) and wrote it to an SD card.
  • Used fdisk to remove the first partition because we are putting U-Boot there instead.
  • Wrote idbloader.img to sector 64 and u-boot.itb to sector 16384.
  • Mounted the root partition and edited boot/extlinux/extlinux.conf to get console output - replace console=ttyS0,115200n8 with console=ttyS2,1500000n8.
  • Put the SD card in the Rock64 and power it up.
  • Confirmed from the serial output that U-Boot TPL is running instead of the vendor blob.
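The sector numbers in the steps above assume 512-byte sectors, so idbloader.img starts 32 KiB into the card and u-boot.itb starts 8 MiB in. The following Python sketch converts those sectors to byte offsets, builds matching dd command lines without running them, and applies the console= substitution to extlinux.conf text. The helper names and the /dev/sdX device path are hypothetical. Check the target device carefully before running any dd command yourself.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed by the sector numbers above

# (image file, start sector) pairs taken from the steps above
LAYOUT = [
    ("idbloader.img", 64),
    ("u-boot.itb", 16384),
]


def byte_offset(sector: int) -> int:
    """Convert a 512-byte sector number into a byte offset."""
    return sector * SECTOR_SIZE


def dd_commands(device: str) -> list[str]:
    """Build (but do not execute) one dd command line per image."""
    return [
        f"dd if={image} of={device} bs={SECTOR_SIZE} seek={sector} conv=fsync"
        for image, sector in LAYOUT
    ]


def fix_console(extlinux_conf: str) -> str:
    """Swap the generic serial console for ttyS2 at 1500000 baud, as in the steps above."""
    return extlinux_conf.replace("console=ttyS0,115200n8", "console=ttyS2,1500000n8")


if __name__ == "__main__":
    for image, sector in LAYOUT:
        print(f"{image}: sector {sector} = byte {byte_offset(sector)}")
    for cmd in dd_commands("/dev/sdX"):  # hypothetical device path
        print(cmd)
```

Printing the commands instead of running them keeps a typo in the device path from overwriting the wrong disk.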

@samueldr
Member

(I don't know why I didn't think about it when reviewing initially)

On an RK3399 system, I was able to produce bad results only under stress. Stressors included:

  • Running memtester
  • Running a nix eval
  • Running a zfs scrub

Either all at once, or even (less reliably) independently. nix eval runs would generally fail with random impossible errors (e.g. a nonsensical missing argument to callPackage). memtester issues were self-explanatory. ZFS issues were generally hangs in the kernel ZFS implementation. Add to that some infrequent miscellaneous segfaults and weird crashes under stress.

So maybe run a NixOS system nix eval in a loop and memtester in a loop for a while? I could generally crash that specific platform reliably within a few minutes, so a successful 20-30 minute run could probably be a good indicator?
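An evaluation loop and a memtester loop running side by side for a fixed window, as suggested above, can be scripted. This is a minimal Python sketch rather than an established tool: run_for and stress are hypothetical helpers, and they assume each command signals failure with a non-zero exit status. The deadline is only checked between runs, so a long memtester pass can overrun it. Kernel oopses and resets will not show up in these counts, so the serial console still needs watching.

```python
import subprocess
import threading
import time


def run_for(cmd: list[str], seconds: float) -> tuple[int, int]:
    """Re-run cmd until `seconds` have elapsed; return (passes, failures).

    The deadline is only checked between runs, so one long-running
    command can overrun it.
    """
    deadline = time.monotonic() + seconds
    passes = failures = 0
    while time.monotonic() < deadline:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            passes += 1
        else:
            failures += 1
    return passes, failures


def stress(commands: dict[str, list[str]], seconds: float) -> dict[str, tuple[int, int]]:
    """Run every command loop concurrently over the same wall-clock window."""
    results: dict[str, tuple[int, int]] = {}

    def worker(name: str, cmd: list[str]) -> None:
        results[name] = run_for(cmd, seconds)

    threads = [threading.Thread(target=worker, args=item) for item in commands.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical 30-minute run on the board under test (as root, so memtester can mlock):
#   stress({"eval": ["nixos-rebuild", "dry-build"],
#           "memtester": ["memtester", "2G", "1"]}, seconds=30 * 60)
```

Using threads instead of two terminal panes keeps both loops on the same clock, so "N minutes without failure" means the same thing for each.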

@mirrexagon
Contributor

mirrexagon commented Nov 20, 2021

I've cherry-picked this onto my RockPro64/Rock64 U-Boot PR: #146725

I'll see if I can run some of that testing, using U-Boot built from my branch.

@mirrexagon
Contributor

I had a bit of trouble with this.

TL;DR:

  • Random reset and kernel oops, but it has now been running for 10 minutes or so without issues. I'll leave it going for a bit and see if anything else happens.
  • Wasn't sure exactly how to do the nix eval, the command I tried eventually reports an error. What should I try instead?

Creating a bootable SD card as before but using u-boot-rockchip.bin from my branch with this change, I tried to run these on my 4 GB Rock64:

  • nixos-generate-config to get a NixOS system configuration to evaluate.
  • nix-shell -p tmux memtester
  • In one tmux pane: while true; do nix --extra-experimental-features 'nix-command' eval -f '<nixpkgs/nixos>' system; done
  • In another tmux pane: sudo memtester 2G

I had a random reset shortly after booting into Linux the first time, and a kernel oops the second time:

Kernel output
[  134.892091] ------------[ cut here ]------------
[  134.892104] kernel BUG at lib/radix-tree.c:442!
[  134.892524] kernel BUG at lib/radix-tree.c:442!
[  134.892937] Internal error: Oops - BUG: 0 [#1] SMP
[  134.894158] Modules linked in: cfg80211 ip6_tables rfkill 8021q xt_conntrack nf_conntrack garp mrp nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp snd_soc_hdmi_codec hantro_vpu(C) dw_hdmi_cec dw_hdmi_i2s_audio ip6t_rpfilter v4l2_h264 ipt_rpfilter v4l2_mem2mem gpio_ir_recv videobuf2_vmalloc xt_pkttype crct10dif_ce nft_compat videobuf2_dma_contig snd_soc_spdif_tx videobuf2_memops videobuf2_v4l2 dwmac_rk stmmac_platform stmmac rc_core rtc_rk808 videobuf2_common snd_soc_rk3328 gpio_syscon pcs_xpcs nft_counter phy_rockchip_inno_hdmi lima videodev gpu_sched mc snd_soc_rockchip_i2s snd_soc_simple_card snd_soc_rockchip_pcm snd_soc_simple_card_utils snd_soc_rockchip_spdif uio_pdrv_genirq uio rockchip_thermal nf_tables libcrc32c nfnetlink sch_fq_codel zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) tap macvlan bridge stp llc fuse ip_tables x_tables rockchipdrm dw_hdmi dw_mipi_dsi cec analogix_dp drm_kms_helper drm dm_mod
[  134.901634] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P         C O      5.15.2 #1-NixOS
[  134.902334] Hardware name: Pine64 Rock64 (DT)
[  134.902724] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  134.903343] pc : radix_tree_extend+0x12c/0x1b0
[  134.903750] lr : radix_tree_extend+0xe0/0x1b0
[  134.904141] sp : ffff80001000bdc0
[  134.904437] x29: ffff80001000bdc0 x28: ffff800012bdafa0 x27: ffff0000fe765438
[  134.905073] x26: 0000000000000000 x25: 0000000000000001 x24: ffffffffffffffff
[  134.905710] x23: 00000000fe7653e0 x22: ffff0000fe7658d0 x21: ffff800012bdafa0
[  134.906345] x20: 0000000013a1bc20 x19: 0000000013a1bc3e x18: 0000000000000000
[  134.906982] x17: ffff8000ec833000 x16: ffff80001000c000 x15: 00003d08f7ff2c00
[  134.907618] x14: 0000000000017700 x13: fffffffffff938d8 x12: 003d08f7ff3dc24c
[  134.908255] x11: 0000000000000000 x10: 0000000000000008 x9 : 0000000000000000
[  134.908893] x8 : ffff00000b478488 x7 : 0000000000000000 x6 : 000000000000003f
[  134.909530] x5 : 0000000000000040 x4 : fffffffffffffff0 x3 : 0000000000000000
[  134.910166] x2 : 00000000ffff8000 x1 : 0000000000000001 x0 : ffff00000b478248
[  134.910803] Call trace:
[  134.911027]  radix_tree_extend+0x12c/0x1b0
[  134.911399]  timerqueue_del+0x48/0x70
[  134.911731]  __remove_hrtimer+0x5c/0xa0
[  134.912079]  __hrtimer_run_queues+0xf8/0x2e0
[  134.912466]  hrtimer_interrupt+0x11c/0x308
[  134.912835]  arch_timer_handler_phys+0x3c/0x50
[  134.913237]  handle_percpu_devid_irq+0x90/0x1c8
[  134.913645]  handle_domain_irq+0x68/0x98
[  134.914002]  gic_handle_irq+0x70/0xa0
[  134.914332]  call_on_irq_stack+0x28/0x50
[  134.914687]  do_interrupt_handler+0x5c/0x68
[  134.915065]  el1_interrupt+0x30/0x48
[  134.915390]  el1h_64_irq_handler+0x18/0x28
[  134.915757]  el1h_64_irq+0x78/0x7c
[  134.916065]  cpuidle_enter_state+0xbc/0x398
[  134.916442]  cpuidle_enter+0x40/0x58
[  134.916766]  call_cpuidle+0x24/0x48
[  134.917083]  do_idle+0x200/0x268
[  134.917375]  cpu_startup_entry+0x2c/0x78
[  134.917729]  secondary_start_kernel+0x160/0x170
[  134.918137]  __secondary_switched+0x90/0x94
[  134.918520] Code: 7101029f b2400021 f9011c01 54fffb69 (d4210000)
[  134.919065] ---[ end trace e2e6d933ad4ef4ff ]---
[  134.919478] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[  134.920131] SMP: stopping secondary CPUs
[  136.087151] SMP: failed to stop secondary CPUs 0-1
[  136.087583] Kernel Offset: disabled
[  136.087894] CPU features: 0x00001001,00000846
[  136.088282] Memory Limit: none
[  136.088558] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---

Running nix --extra-experimental-features 'nix-command' eval -f '<nixpkgs/nixos>' system for the system eval, I consistently got this error:

error: anonymous function at /nix/store/cvsr7m4mr31brf7x4j2yqzvgk6hlx24a-nixos-21.11pre332033.715f6341195/nixos/pkgs/build-support/fetchurl/boot.nix:5:1 called with unexpected argument 'meta' 

       at /nix/store/cvsr7m4mr31brf7x4j2yqzvgk6hlx24a-nixos-21.11pre332033.715f6341195/nixos/pkgs/build-support/fetchzip/default.nix:22:2:

           21|
           22| (fetchurl (let 
             |  ^                                                                                                                                        
           23|   tmpFilename =
(use '--show-trace' to show detailed location information)

Running with the --debug flag shows it is definitely going through a whole lot of files though.


Versions:

Rock64 v2.0 4 GB

ubootRock64 from 0a2a148b5613ef49cc5632b226195c0a6569d2c6
/nix/store/xvfh8jygsp070jx1jj89v9wywapk2h55-uboot-rock64-rk3328_defconfig-aarch64-unknown-linux-gnu-2021.10

[nixos@nixos:~]$ nix --version
nix (Nix) 2.4 

[nixos@nixos:~]$ nix-instantiate --eval -E '(import <nixpkgs> {}).lib.version'
"21.11pre332033.715f6341195"

@samueldr
Member

I had a random reset shortly after booting into Linux the first time, and a kernel oops the second time:

This in itself doesn't look good.

As for evaluating, using nixos-rebuild dry-build would be enough.

@mirrexagon
Contributor

I am getting some memtester failures:

[nixos@nixos:~]$ sudo memtester 2G
memtester version 4.5.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 2048MB (2147483648 bytes)
got  2048MB (2147483648 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : testing 256FAILURE: 0x100000000 != 0x00000000 at offset 0x00564ea8.
FAILURE: 0x00000000 != 0x100000000 at offset 0x02b4c678.
FAILURE: 0x100000000 != 0x00000000 at offset 0x05cd0ea8.
FAILURE: 0x00000000 != 0x100000000 at offset 0x0a3679b8.
FAILURE: 0x100000000 != 0x00000000 at offset 0x1d650728.
FAILURE: 0x00000000 != 0x100000000 at offset 0x2256f678.
FAILURE: 0x100000000 != 0x00000000 at offset 0x225cbfe8.
FAILURE: 0x100000000 != 0x00000000 at offset 0x2262dc68.
FAILURE: 0x00000000 != 0x100000000 at offset 0x239cfdf8.
FAILURE: 0x100000000 != 0x00000000 at offset 0x33c307a8.
FAILURE: 0x00000000 != 0x100000000 at offset 0x39f87778.
FAILURE: 0x00000000 != 0x100000000 at offset 0x3d0647b8.
  Walking Ones        : ok
  Walking Zeroes      : ok
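Every FAILURE line in this output differs in a single bit. 0x100000000 is 2**32, so bit 32 of a 64-bit word differs between the two buffers memtester compares, in both directions. The following small Python sketch extracts the two compared values from each line and lists the differing bit positions. flipped_bits is a hypothetical helper, and LOG is an excerpt of the output above. It only characterizes the pattern; it says nothing about the cause.

```python
import re

# Matches lines such as "FAILURE: 0x100000000 != 0x00000000 at offset 0x00564ea8."
FAILURE_RE = re.compile(
    r"FAILURE: (0x[0-9a-fA-F]+) != (0x[0-9a-fA-F]+) at offset (0x[0-9a-fA-F]+)"
)


def flipped_bits(log: str) -> list[tuple[int, list[int]]]:
    """Return (offset, [differing bit positions]) for each memtester FAILURE line."""
    out = []
    for a, b, offset in FAILURE_RE.findall(log):
        diff = int(a, 16) ^ int(b, 16)  # bits that disagree between the two buffers
        bits = [i for i in range(diff.bit_length()) if (diff >> i) & 1]
        out.append((int(offset, 16), bits))
    return out


LOG = """\
  Bit Flip            : testing 256FAILURE: 0x100000000 != 0x00000000 at offset 0x00564ea8.
FAILURE: 0x00000000 != 0x100000000 at offset 0x02b4c678.
FAILURE: 0x100000000 != 0x00000000 at offset 0x3d0647b8.
"""

for offset, bits in flipped_bits(LOG):
    print(f"offset {offset:#010x}: bits {bits}")
```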

I'll let this run for a bit longer, then try the current master U-Boot to rule out my board being bad.

@samueldr
Member

Don't forget to test against the build with the proprietary RAM training!

@mirrexagon
Contributor

Results from more testing:

  • 4 GB board, U-Boot from master (proprietary RAM training), sudo memtester 2G: no memtester errors, but several kernel NULL pointer dereferences and oopses, and eventually encountered a kernel panic and locked up.
  • 2 GB board, U-Boot from my branch (upstream RAM training), sudo memtester 1G: no issues whatsoever after a few hours.

I'll do a run with the 2 GB board and proprietary RAM training to complete the set, but this seems to mean one of these:

  • My 4 GB board has issues.
  • 4 GB support never worked properly (which seems unlikely).

@samueldr
Member

  • My 4 GB board has issues.
  • 4 GB support never worked properly (which seems unlikely).

I wouldn't discount either outright. But considering that the proprietary RAM training fails as well, it's odd. (To be fair, knowing which binary to use is an open question too, which is generally answered through a big dose of cargo-culting.)

I guess it may help to know what issue @lopsided98's Rock64 had, and whether it was a 4GB variant.

@lopsided98
Contributor

Yes, mine is 4GB. It just started locking up more and more often. I haven't had a chance to look into it, but I can probably do some testing in a few days when I come home from school.

@mirrexagon
Contributor

2 GB with proprietary RAM training: no issues.

@royneary
Contributor Author

royneary commented Nov 22, 2021

To add a data point, I ran the following tests on my Rock64 v3 (4GB RAM):

  • sudo memtester 2G (three runs) and
  • while true; do nixos-rebuild dry-build; done in parallel

I ran it with both proprietary RAM training (U-Boot 2021.04 from NixOS 21.05) and upstream RAM training (U-Boot 2021.10 from this PR). No errors or crashes occurred. I ran it on NixOS 21.05 with kernel 5.10.79, though.

I have two v3 boards (4GB and 1GB). @mirrexagon: are both of your boards v2? On the pine64 forum [1], [2] some people say that RAM instability is a known hardware issue on the v2 version.

[1] https://forum.pine64.org/showthread.php?tid=11209
[2] https://forum.pine64.org/showthread.php?tid=14343

@samueldr
Member

samueldr commented Nov 22, 2021

Quoting from one of the threads:

Only distro that dont segfault during compiling is ayufans ubu from his repo

It would be interesting to figure out which RAM training is cargo-culted there.

and patching the image to initialize the clocks at 333 MHz will likely fix things. I'll be giving this a try shortly.

Hmmm... What if I read more...

It seems patching the SD card was not very difficult. I downloaded

rk3328_ddr_333MHz_v1.16.bin
rk3328_miniloader_v2.46.bin

Oooh...

-      ./tools/mkimage -n rk3328 -T rksd -d ${rkbin}/rk33/rk3328_ddr_786MHz_v1.13.bin idbloader.img

That could be it.

Assuming this helps, anything we do won't make matters worse. But maybe there's a way forward with an alternative build. It makes no sense to hurt performance on the good machines.

First "trivial" fix to try would be using the 333MHz training blob.

Then, a "better" fix would be to patch the OSS training to do 333MHz... I looked quickly and it looks tricky... I wouldn't call it an OSS training, but more like a few blobby bits in an OSS harness.

Note that we don't need the miniloader.
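Trying the 333MHz training blob would mean changing only the -d argument of the mkimage call quoted above. The following Python sketch builds that argument list for a chosen DRAM clock without running it. mkimage_argv and DDR_BLOBS are hypothetical names, and only the two blob files named in this thread are listed. The rk33/ directory layout is taken from the quoted command.

```python
from pathlib import PurePosixPath

# DDR training blobs mentioned in this thread, keyed by DRAM clock in MHz.
DDR_BLOBS = {
    786: "rk3328_ddr_786MHz_v1.13.bin",  # the blob the current package uses
    333: "rk3328_ddr_333MHz_v1.16.bin",  # the blob the forum user patched in
}


def mkimage_argv(rkbin: str, mhz: int, output: str = "idbloader.img") -> list[str]:
    """Build the mkimage argument list that wraps a DDR training blob for the RK3328."""
    blob = PurePosixPath(rkbin) / "rk33" / DDR_BLOBS[mhz]
    return ["./tools/mkimage", "-n", "rk3328", "-T", "rksd", "-d", str(blob), output]
```

With the blob choice kept in one table, a separate v2 variant would differ from the default build in a single key.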


To recap:

  • Default build should use whatever mainline does
  • Provide additional build (or params?) for v2 boards to use proprietary 333MHz training. (Assuming it helps)

We already have the first point done (here, rebased in the other PR). The other would need to be tested, if anyone's up to it.

Hopefully that will help @lopsided98's board too, assuming v2 4GB.

If all of this pans out, it might be warranted to add a note about it on the Pine64 wiki. (And the NixOS on ARM wiki pages.)

@mirrexagon
Contributor

[...] @mirrexagon: are both of your boards v2? On the pine64 forum [1], [2] some people say that on the v2 version RAM instabilities are a known hardware issue.

Yep, both are v2.


Quoting from one of the threads:

Only distro that dont segfault during compiling is ayufans ubu from his repo

It would be interesting to figure out which RAM training is cargo-culted there.

On this, I tried quickly poking around Ayufan's U-Boot repos. In the vendor-fork-based U-Boot there is this:

ifeq (rock64,$(BOARD_TARGET))

UBOOT_DEFCONFIG ?= rock64-rk3328_defconfig
UBOOT_TPL ?= tmp/rkbin/rk33/rk3328_ddr_786MHz_v1.13.bin
BL31 ?= tmp/rkbin/rk33/rk3328_bl31_v1.39.bin
BOARD_CHIP ?= rk3328

But the latest upstream-based U-Boot [1], which seems to be used in the latest Linux image builds [2], doesn't specify this. It has a LOADER_BIN ?= tmp/rkbin/rk33/rk3328_loader_ddr333_v1.08.244.bin, but it looks like that's only related to loading things in maskrom mode. So maybe it's using the upstream U-Boot RAM training?

I was looking for a sign that Ayufan was using 333 MHz training, but I haven't seen it yet.

I also noticed that (what I assume is) upstream rkbin only has 333 MHz and 400 MHz binaries for the RK3328 [3], and I can't find any 786 MHz ones in commit history. The RK3328 binaries are also up to v1.17. Ayufan's rkbin repo only has up to v1.13 [4], and the forum user was using v1.16.


[...]
To recap:

  • Default build should use whatever mainline does
  • Provide additional build (or params?) for v2 boards to use proprietary 333MHz training. (Assuming it helps)

We already have the first point done (here, rebased in the other PR). The other would need to be tested, if anyone's up to it.

I'll test with the 333 MHz training when I can. I'll pull the latest v1.17 binary from upstream rkbin, unless there's a reason to try a different version.


[1] https://github.com/ayufan-rock64/linux-mainline-u-boot/releases/tag/2021.07-ayufan-2019-ga679a8ce
[2] https://github.com/ayufan-rock64/linux-build/blob/0.11.2/Makefile.latest.mk#L1
[3] https://github.com/rockchip-linux/rkbin/tree/master/bin/rk33
[4] https://github.com/ayufan-rock64/rkbin/tree/master/rk33

@lopsided98
Contributor

lopsided98 commented Nov 24, 2021

My board is a 4GB v2.0. It was running U-Boot 2020.10 with the proprietary v1.13 786MHz RAM training. I know I have had problems (presumably memory errors) with this setup, but I just ran memtester overnight and it reported no errors. I also tested a few loops of memtester with the upstream memory training (this PR) and had no errors either.

@samueldr
Member

@lopsided98 it's entirely possible that memtester alone is not enough load.

For the presumably similar issue I had with the Helios64, it took additional load to cause issues. E.g. a ZFS scrub was a surefire way to cause issues. CPU load seemed to be enough, e.g. evaluating the NixOS system.

Load was assumed to be a factor, as reducing the CPU speed and locking CPU frequency scaling helped with (but never fixed) the issues.

@lopsided98
Contributor

My Rock64 used to have a ZFS pool connected, which likely helped trigger the issues. Unfortunately, I moved that pool to my RockPro64 since I started experiencing issues with the Rock64, so it isn't easy to test anymore. I've been trying to reproduce the problem using stress-ng alongside memtester, but it has almost finished two successful memtester loops now.

@mirrexagon
Contributor

On my 4 GB board I ran the same test as before (nixos-rebuild dry-build in a loop and memtester):

  • 333 MHz v1.17 proprietary RAM training (https://github.com/mirrexagon/nixpkgs/tree/rock64-uboot-333mhz): No issues after three memtester loops!
  • Retesting with upstream training from this PR to confirm the issue still exists there: As expected, I get memtester errors on the first loop and a kernel panic shortly after.
  • Bonus 400 MHz proprietary training (as with 333 MHz but using the 400 MHz binary from the same repo): Also no issues after three memtester loops!

So, the way forward is to have two Rock64 U-Boot builds, one with upstream RAM training and one with proprietary 333 or 400 MHz RAM training?

In my PR #146725, I'm exposing the combined upstream TPL+SPL+U-Boot Proper image u-boot-rockchip.bin, which is generated by padding the TPL+SPL image to CONFIG_SPL_PAD_TO and placing U-Boot Proper (u-boot.itb) after it. If I'm going to do this in my PR, I'm not sure how to either read that value in postBuild or convince the U-Boot build system to use a different TPL (and that's very similar to what I need to figure out for the SPI image).

So I'm wondering if it might be better to do the change in this PR so this can be merged first, and I can figure out combined images without holding this up.
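Reading CONFIG_SPL_PAD_TO after the build comes down to parsing the generated .config, which holds plain KEY=value lines. The following Python sketch assumes the value is a hex literal such as 0x7f8000. That value is believed to be the Rockchip default, but check the actual .config. The helper names are hypothetical. Because u-boot-rockchip.bin is itself written at sector 64, U-Boot Proper ends up at sector 64 + CONFIG_SPL_PAD_TO / 512. For 0x7f8000, that works out to the same sector 16384 used in the manual SD card steps earlier in this thread.

```python
def read_kconfig(text: str) -> dict[str, str]:
    """Parse KEY=value lines from a Kconfig .config, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# CONFIG_FOO is not set"
        key, sep, value = line.partition("=")
        if sep:
            values[key] = value.strip('"')  # string options are quoted
    return values


def itb_sector(config_text: str, image_start_sector: int = 64) -> int:
    """Sector where u-boot.itb lands when u-boot-rockchip.bin is written at image_start_sector."""
    pad_to = int(read_kconfig(config_text)["CONFIG_SPL_PAD_TO"], 0)  # base 0 accepts "0x..."
    if pad_to % 512:
        raise ValueError("CONFIG_SPL_PAD_TO is not sector aligned")
    return image_start_sector + pad_to // 512
```

In a postBuild hook, the same lookup could probably be done with a grep on .config. The Python version is only meant to make the offset arithmetic explicit.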

@samueldr
Member

So I'm wondering if it might be better to do the change in this PR so this can be merged first, and I can figure out combined images without holding this up.

Considering that the current build uses rk3328_ddr_786MHz_v1.13, merging this as-is means there is no meaningful discernible difference for end users. Those with "bad hardware" are already using bad training data.

So we can merge this PR without thinking about the new variant.

Unless I'm missing something?

@mirrexagon
Contributor

That's a good point, I think you are right.

@samueldr
Member

The main thing is now we know that for the non-marginal hardware this is still good, as it was.

This is the scariest thing to get wrong. The failure mode can be extremely harmful if it ends up corrupting in-memory data instead of outright crashing.

@mirrexagon
Contributor

I added a quick note about the memory corruption to the NixOS wiki here: https://nixos.wiki/wiki/NixOS_on_ARM/PINE64_ROCK64#Status

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 19, 2022
wegank added the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Sep 7, 2023
stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Sep 7, 2023
@lopsided98
Contributor

This PR was superseded by #191538

lopsided98 closed this Feb 24, 2024
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/installing-nixos-on-rock64-2024-experience/40954/1

Labels
10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux 12.approvals: 1 This PR was reviewed and approved by one reputable person
6 participants