Intermittent RISC-V CI failures #6905

Open
janvrany opened this issue Feb 24, 2023 · 11 comments

Comments

@janvrany
Contributor

janvrany commented Feb 24, 2023

For some time, we have been experiencing intermittent failures in the RISC-V CI cross-compiling job; see for example #6706 or #6704.
This issue is created to track progress on stabilising the RISC-V CI cross-compiling job.

@janvrany
Contributor Author

FYI: @AdamBrousseau

@janvrany
Contributor Author

I have built Debian 10 (buster, which is what CI is running as far as I can tell) and Debian 11 (bullseye) images, as similar to the CI build node as I could make them, and ran a few tests there:

  1. When I run the image as a container (using systemd-nspawn), a number of tests fail rather wildly (segfaults, aborts, failures): omrrastest, omrthreadtest, porttest. This happens irrespective of which QEMU version is used (I tried 7.2.0, 7.0.0, 6.0.0 and 5.0.0, all built from source on the image).

  2. When I run the image as a real VM (using KVM as the hypervisor), only PortSignalExtendedTests.sig_ext_test1 fails. Again, I tried QEMU 7.2.0, 7.0.0, 6.0.0 and 5.0.0.

  3. TRIL tests take a huge amount of memory (>8GB), causing the system to swap a lot and eventually causing timeouts. This has been observed on Eclipse CI as well.

@AdamBrousseau Is the CI node (deb10-x64-1) a full VM?
In any case, what I observed above is not consistent with what can be seen on the CI node...

@AdamBrousseau
Contributor

Linux deb10-x64-1 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64

Yes, the Debian machine is a VM running on KVM.

@janvrany
Contributor Author

#6913 has been merged, but it did not help much. Now it hangs in SanityTest, see for example:

When this happened, the build node was not swapping and CPU usage was low; the qemu process simply hung.

I'm running out of ideas. It's hard for me to reproduce: essentially, I see this kind of failure only on Eclipse CI. QEMU does not implement the "extended remote" protocol, so one cannot debug multi-threaded programs running under user-mode emulation with QEMU. I can try to see where in QEMU it hangs, but I'm not sure how useful this would be. If anyone has an idea how to approach this, I'm all ears.
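
One coarse first step toward seeing where a hung process is stuck, without relying on any gdb support, is to list the kernel-reported state of each of its host threads. The following is only a minimal sketch, assuming a Linux host with /proc mounted; `thread_states` is a hypothetical helper name, and the PID passed to it would be that of the hung qemu process.

```shell
# thread_states PID
# Print "<tid> <state>" for every host thread of the given process,
# using the "State:" line of /proc/PID/task/TID/status (Linux only).
thread_states() {
    for task in /proc/"$1"/task/*; do
        # Skip entries that cannot be read (no such process, or a thread
        # that exited between the glob expansion and the read).
        [ -r "$task/status" ] || continue
        state=$(sed -n 's/^State:[[:space:]]*//p' "$task/status")
        echo "${task##*/} $state"
    done
}
```

If every thread is reported as sleeping while CPU usage is low, that would be consistent with all threads waiting on something, although this output alone does not show what they are waiting for.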

@janvrany
Contributor Author

For the record, I ran only threadtest as follows:

QEMU=...
for i in `seq 1 100`; do
   $QEMU "-L" "/home/jenkins/riscv-debian/rootfs" "/tmp/omr/build/fvtest/threadtest/omrthreadtest" "--gtest_output=xml:/tmp/omr/build/fvtest/threadtest/omrthreadtest-results.xml" "--gtest_filter=-PriorityInterrupt.*:RWMutex*" | perl -pe 'use POSIX strftime; print strftime "[%Y-%m-%d %H:%M:%S] ", localtime';
done

and:

  • on the Eclipse CI node it hangs after a couple of iterations.
  • on a freshly built deb10 image it does not hang, at least not once in 100 runs.

I also tried the bit-identical static QEMU binary, copied from my (working) deb10 image, on the Eclipse CI node, to no avail: it still hangs.
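
The comparison above (how many of N runs hang on each machine) can be made mechanical by putting a time limit on each run and counting the runs that exceed it. This is a minimal sketch, assuming GNU coreutils `timeout` is available; `count_hangs` is a hypothetical helper, and the limit is an assumption that would have to be chosen well above the normal test duration.

```shell
# count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
# Run COMMAND the given number of times and report how many runs
# had to be stopped because they exceeded the time limit.
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        # GNU timeout exits with status 124 when the limit is reached.
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs/$runs runs timed out"
}
```

For example, `count_hangs 100 600 $QEMU -L /home/jenkins/riscv-debian/rootfs /tmp/omr/build/fvtest/threadtest/omrthreadtest` would report the hang rate over 100 runs, where the 600-second limit is an arbitrary illustrative value. Test failures that finish within the limit are not counted as hangs.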

@janvrany
Contributor Author

Another observation:
when I replaced the sysroot currently used on the CI node with a "fresh" sysroot, hangs became a lot less frequent (1 in ~25 runs compared to 1 in ~3), but it still hangs.

I also tried running the tests on my freshly built deb10 image with the sysroot copied from the CI node: it didn't hang once in 100 runs.

I also noticed that the uptime of the CI node (deb10-x64-1) is > 300 days.
@AdamBrousseau: how much hassle is it to reboot the whole deb10-x64-1 VM? I know it should not matter, but I have no other ideas and am just trying different things.

Anyway, I'll try to update the sysroot on the CI node to the "fresh" one (with newer versions of libraries, most notably glibc), as it might reduce hangs (but not fix them).

@AdamBrousseau
Contributor

Rebooted. Let me know how it goes.

@janvrany
Contributor Author

@AdamBrousseau: Unfortunately, the reboot did not help, but thanks anyway!

I'm going to update qemu-riscv64 to 7.2.0 (from 5.0.0) and the sysroot on deb10-x64-1, as it seems that hangs are less likely with this combination.

@janvrany
Contributor Author

> I'm going to update qemu-riscv64 to 7.2.0 (from 5.0.0) and sysroot on deb10-x64-1

I did it, but the test build hung just like before. Maybe it was just bad luck, but when I tried to build and run the tests manually, it was far more stable. Anyway, I'm going to keep the new versions there for some time (I left backups on the node, so reverting is a matter of redirecting symlinks back).
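
Reverting by redirecting symlinks, as described above, could look roughly like the following sketch. The directory names are hypothetical (the actual layout on the CI node is not shown in this thread), and `switch_link` is an illustrative helper, assuming GNU `ln`.

```shell
# switch_link TARGET LINK
# Make LINK a symbolic link to TARGET, replacing any existing link.
switch_link() {
    # -s: create a symbolic link
    # -f: remove an existing LINK first
    # -n: treat LINK as a plain file even if it is a symlink to a directory,
    #     so the new link replaces it instead of being created inside it
    ln -sfn "$1" "$2"
}

# Hypothetical layout: versioned directories plus one stable name.
demo=$(mktemp -d)
mkdir "$demo/qemu-5.0.0" "$demo/qemu-7.2.0"
switch_link "$demo/qemu-7.2.0" "$demo/qemu"   # switch to the new version
switch_link "$demo/qemu-5.0.0" "$demo/qemu"   # revert to the backup
readlink "$demo/qemu"
```

Because only the link changes, both versions stay on disk and switching in either direction is a single command.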

@janvrany
Contributor Author

With the new sysroot, PortSignalExtendedTests is failing; here's a fix: #6938

@janvrany
Contributor Author

Just adding a reference to PR #6912, as it might help with this. Maybe.
