Latest kernel shows problems on some hardware #590

Closed
conorsch opened this issue Jul 16, 2020 · 11 comments · Fixed by freedomofpress/securedrop-builder#189

Comments

@conorsch
Contributor

After we shipped 4.14.186 kernels as part of #546, we received a report of distorted graphics after upgrading. The behavior described was quite similar to that documented in #308 (comment)

While we didn't catch the issue during QA, I was able to reproduce on test-only hardware. After reviewing logs, the problem seems to correlate with this event in syslog:

FATAL: Module u2mfn not found in directory /lib/modules/4.14.186-grsec-workstation

It appears that the dkms autoinstall line in the postinst for securedrop-workstation-grsec https://github.com/freedomofpress/securedrop-debian-packaging/blob/e7d5bea3f2eb6bbbc7ad76772ec42b4610830916/securedrop-workstation-grsec/debian/postinst#L42 is failing, but still exiting zero—so apt/dpkg didn't consider it an error. As a workaround, I was able to rebuild the dkms projects after bouncing paxctld and the situation was resolved. Let's try updating the postinst logic to restart the paxctld service before running dkms autoinstall.
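The ordering proposed above could be sketched roughly as follows. This is a minimal sketch, not the actual postinst: the `run` wrapper, the `DRY_RUN` variable, and the `rebuild_modules` name are hypothetical, added only so the sequence can be inspected without root or a real paxctld service.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart paxctld first, then rebuild the DKMS modules.
# Each step returns non-zero on failure so the caller can react.
rebuild_modules() {
    run systemctl restart paxctld || return 1
    run dkms autoinstall || return 1
}
```

With `DRY_RUN=1` set, calling `rebuild_modules` only prints the two commands in order. In a real postinst the wrapper would be unnecessary and the two commands could be called directly.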

Detailed dom0 logs
[user@dom0 ~]$ qvm-run sdw-backup gnome-terminal # cloned from problematic securedrop-workstation-buster
Running 'gnome-terminal' on sdw-backup
[user@dom0 ~]$ # confirmed garbled terminal
[user@dom0 ~]$ qvm-run -p sdw-backup 'sudo dpkg-reconfigure --frontend=noninteractive securedrop-workstation-grsec'

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
make -j2 KERNELRELEASE=4.14.186-grsec-workstation -C /lib/modules/4.14.186-grsec-workstation/build M=/var/lib/dkms/u2mfn/4.0.30/build...
cleaning build area...

DKMS: build completed.

u2mfn.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.14.186-grsec-workstation/updates/dkms/

depmod....
Job for systemd-modules-load.service failed because the control process exited with error code.

DKMS: install completed.
See "systemctl status systemd-modules-load.service" and "journalctl -xe" for details.
Synchronizing state of paxctld.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable paxctld
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.19.0-9-amd64
Found initrd image: /boot/initrd.img-4.19.0-9-amd64
Found linux image: /boot/vmlinuz-4.19.0-8-amd64
Found initrd image: /boot/initrd.img-4.19.0-8-amd64
Found linux image: /boot/vmlinuz-4.14.186-grsec-workstation
Found initrd image: /boot/initrd.img-4.14.186-grsec-workstation
Found linux image: /boot/vmlinuz-4.14.169-grsec-workstation
Found initrd image: /boot/initrd.img-4.14.169-grsec-workstation
done
[user@dom0 ~]$ qvm-run -p sdw-backup 'sudo dpkg-reconfigure --frontend=noninteractive securedrop-workstation-grsec'
Synchronizing state of paxctld.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable paxctld
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.19.0-9-amd64
Found initrd image: /boot/initrd.img-4.19.0-9-amd64
Found linux image: /boot/vmlinuz-4.19.0-8-amd64
Found initrd image: /boot/initrd.img-4.19.0-8-amd64
Found linux image: /boot/vmlinuz-4.14.186-grsec-workstation
Found initrd image: /boot/initrd.img-4.14.186-grsec-workstation
Found linux image: /boot/vmlinuz-4.14.169-grsec-workstation
Found initrd image: /boot/initrd.img-4.14.169-grsec-workstation
done
[user@dom0 ~]$ qvm-shutdown --wait sdw-backup
[user@dom0 ~]$ qvm-run sdw-backup gnome-terminal # let's see if resolved...
Running 'gnome-terminal' on sdw-backup
[user@dom0 ~]$ 

No hypothesis yet on why this change seems to affect only certain hardware. On the test laptop where I reproduced it, all SDW-based templates were affected:

[Screenshot: sdw-kernel-fuzzy-terminal]

Additionally we should investigate whether it's possible to cause dkms autoinstall to fail loudly, which would have notified the user about problems during the updater run.
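One way to make the failure loud, assuming the symptom is always a missing module file, would be to check for the built module after the DKMS step and return non-zero if it is absent. The function names and the optional modules-root argument below are hypothetical; this is a sketch of the idea, not the packaged logic.

```shell
#!/bin/sh
# Hypothetical check: succeed only if a built module file exists for the kernel.
# $1 = module name, $2 = kernel version, $3 = modules root (default /lib/modules).
module_present() {
    module="$1"
    kernel="$2"
    root="${3:-/lib/modules}"
    # Match u2mfn.ko as well as compressed variants such as u2mfn.ko.xz.
    match="$(find "$root/$kernel" -name "${module}.ko*" 2>/dev/null | head -n 1)"
    [ -n "$match" ]
}

# Print an error and return non-zero, so a calling maintainer script
# running under `set -e` would abort instead of exiting zero.
require_module() {
    if ! module_present "$@"; then
        echo "ERROR: module $1 not installed for kernel $2" >&2
        return 1
    fi
}
```

A postinst could call something like `require_module u2mfn 4.14.186-grsec-workstation` right after the DKMS step. A non-zero exit from the maintainer script would then surface as a dpkg error rather than passing silently.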

@conorsch
Contributor Author

In order to increase the metapackage version without changing the required kernel image, based on https://www.debian.org/doc/manuals/maint-guide/first.en.html#namever, it looks like we should update the version in https://github.com/freedomofpress/securedrop-debian-packaging/blob/e7d5bea3f2eb6bbbc7ad76772ec42b4610830916/securedrop-workstation-grsec/debian/changelog-buster#L1

from 4.14.186+buster to 4.14.186+buster1:

$ dpkg --compare-versions '4.14.186+buster1' gt '4.14.186+buster' 
$ echo $?
0
$ dpkg --compare-versions '4.14.186+buster1' lt '4.14.187+buster'
$ echo $?
0

@conorsch
Contributor Author

Test package is live on apt-test: https://apt-test.freedom.press/pool/main/s/securedrop-workstation-grsec/ . Tried to QA on local hardware, but mistakenly used "prod" environment, so the problem persists—because the packages didn't change. To proceed with testing, I will

  1. Edit config.json prod -> staging
  2. Rerun securedrop-admin --apply
  3. Evaluate whether resolved

That's not a perfect test of the updater scenario, since the --apply action pulls in the latest packages, but it should tell us whether the newer packages will unbreak a system as intended.
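Step 1 above could be scripted along these lines. This is a sketch under an assumption: it expects config.json to contain a line of the form `"environment": "prod"`, and both that key name and the `switch_to_staging` function name are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Rewrite the "environment" value in a JSON config from prod to staging.
# Assumption: the file contains a line like  "environment": "prod"
switch_to_staging() {
    config="$1"
    tmpfile="$(mktemp)" || return 1
    if sed 's/"environment"[[:space:]]*:[[:space:]]*"prod"/"environment": "staging"/' \
        "$config" > "$tmpfile"; then
        # Write back through the original file so its permissions are kept.
        cat "$tmpfile" > "$config"
    fi
    rm -f "$tmpfile"
}
```

After running `switch_to_staging config.json`, the `--apply` step from the list would be rerun by hand as before.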

@conorsch
Contributor Author

Tested after converting to staging. The end result is that my VMs are working again, although I had to run the updater twice in order to get full coverage. See screenshots below.

After running --apply (to convert to staging repos)

[Screenshot: sdw-kernel-problem-4]

After running the updater manually

[Screenshot: sdw-kernel-problem-5]

After running the updater manually a second time

[Screenshot: sdw-kernel-problem-6]

At no point did I reboot the host machine. So it looks like we have a resolution, but it didn't resolve for me entirely on the first pass. I recommend proceeding with release, and preparing support language for pilot participants that recommends 1) re-running the updater manually (with --skip-delta 0) and/or 2) running dpkg-reconfigure securedrop-workstation-grsec via qvm-run in the affected VMs if the problem isn't gone.
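For the second option, the support language could include a small dom0 loop like the one below. It reuses the `qvm-run` and `dpkg-reconfigure` invocation from the logs in the first comment. The `run` wrapper, `DRY_RUN`, and the `reconfigure_vms` name are hypothetical, and the VM names are whatever the participant reports as affected.

```shell
#!/bin/sh
# Hypothetical wrapper: print instead of executing when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-run the package's maintainer scripts in each named VM (run from dom0).
# A failure in one VM is reported on stderr and the loop continues.
reconfigure_vms() {
    for vm in "$@"; do
        run qvm-run -p "$vm" \
            'sudo dpkg-reconfigure --frontend=noninteractive securedrop-workstation-grsec' \
            || echo "reconfigure failed in $vm" >&2
    done
}
```

For example, `reconfigure_vms sd-app-buster-template` would target a single template. As in the earlier log, the VM would then be shut down with `qvm-shutdown --wait` and restarted before checking whether the graphics are fixed.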

@emkll
Contributor

emkll commented Jul 20, 2020

I've updated my workstation with the latest metapackage served by apt-test (4.14.186+buster1), and observed the package was updated by the GUI updater.

I was unable to reproduce the underlying issue, but did not observe any screen artifacts, nor any regressions after doing some quick basic client testing (login, export, open-in-dvm, reply)

@eloquence
Member

Same on T480:

  1. Switched prod environment to staging. This laptop was not previously exhibiting the problem, however.
  2. Ran updater
  3. Verified that latest package is installed and running in AppVMs [sic]
  4. Confirmed no graphical issues opening terminal in any of the AppVMs.

So can confirm no regression from new metapackage, cannot confirm whether it resolves the original issue, since this laptop never had it, and the one that I have which did (X1) has already been fixed via dpkg-reconfigure.

@conorsch
Contributor Author

Thanks, folks. I'm going to proceed with preparing a prod artifact and submit for review. After doing so, I'll work on reverting my test hardware to prod, in an attempt to re-break it, so I can dig a bit more deeply on the resolution behavior.

@eloquence
Member

I reinstalled my X1, which did exhibit the issue previously, on latest prod (it finished when the new package was already up). (I did make clean && make prod in this case.) I confirmed that all templates were indeed running the +buster1 version of the metapackage, and all VMs are able to start without issues.

@eloquence
Member

This appears to have been resolved via freedomofpress/securedrop-builder#179 and the associated updated package https://github.com/freedomofpress/securedrop-debian-packages-lfs/pull/30 . We've agreed to do more structured kernel testing next time around, at which time we may also want to investigate this paxctld logic further.

@conorsch
Contributor Author

conorsch commented Aug 4, 2020

We've seen this problem crop up again. I made a quick script in an attempt to repro it locally: https://gist.github.com/conorsch/9c5f4e69798200d069fe43f4d5ab4e76 That script is very naive: it just repeatedly installs the old kernel and the new one, back and forth, checking for module errors in syslog every time. After 1000 iterations, no repro. Given the naive approach, that's not terribly surprising. The next step was to test with a startup/shutdown of the VM each time, to mimic more closely how updates land in prod VMs.

After rebooting the test VM in which the loop had been running, however, I discovered that I did indeed have a repro:

[user@dom0 ~]$ qvm-run -p sdw-kernel-test 'sudo grep -F FATAL /var/log/syslog'
Aug  4 16:21:25 localhost qubes-sysinit.sh[204]: modprobe: FATAL: Module u2mfn not found in directory /lib/modules/4.14.169-grsec-workstation
[user@dom0 ~]$

Note that's the older kernel (4.14.169), not 4.14.186, on which we first observed this problem. I will back up the VM image with the failure in it and investigate further.
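Since the affected kernel version is what matters here, the grep above could be extended to pull the module and kernel out of each matching line. The following is a sketch that assumes the syslog message format shown in the repro; the `missing_modules` name is hypothetical.

```shell
#!/bin/sh
# Print "<module> <kernel>" once for each distinct modprobe "not found" error.
# $1 optionally overrides the log path (default /var/log/syslog).
missing_modules() {
    log="${1:-/var/log/syslog}"
    sed -n 's|.*FATAL: Module \([^ ]*\) not found in directory /lib/modules/\([^ ]*\).*|\1 \2|p' \
        "$log" | sort -u
}
```

Run inside the VM (for example through `qvm-run -p`), this lists each module and kernel pair that failed to load, which makes it easier to see at a glance whether the old or the new kernel is the one missing u2mfn.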

@conorsch conorsch reopened this Aug 4, 2020
@conorsch
Contributor Author

We've received more reports of this issue in the wild. We're certain the problem correlates with a missing u2mfn kernel module. The variable nature of the failure strongly implies a race condition. After exploring in a test VM, it appears that we're inadvertently calling dkms autoinstall twice: once explicitly in the postinst for the securedrop-workstation-grsec package, and the second occurs automatically by virtue of /etc/kernel/postinst.d/dkms (which is provided by the dkms package, and which honors the AUTOINSTALL=yes setting for the u2mfn module as specified in https://github.com/QubesOS/qubes-linux-utils/blob/4219f12d2325b037726250dc6ad4b34b15d0e05c/kernel-modules/u2mfn/dkms.conf.in).
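To check on a given template whether the explicit call would indeed be a second run, one could test for the kernel hook and the AUTOINSTALL setting together. The sketch below makes assumptions: the function names are hypothetical, it only recognizes a literal `yes` value, and the location of the installed dkms.conf must be supplied by the caller.

```shell
#!/bin/sh
# Succeed if the given dkms.conf sets AUTOINSTALL to yes (quoted or unquoted).
autoinstall_enabled() {
    grep -Eq '^[[:space:]]*AUTOINSTALL="?yes"?[[:space:]]*$' "$1"
}

# Succeed if an explicit `dkms autoinstall` in a postinst would duplicate
# what the kernel postinst hook already does for this module.
# $1 = path to dkms.conf, $2 = hook path (default /etc/kernel/postinst.d/dkms).
explicit_run_is_redundant() {
    conf="$1"
    hook="${2:-/etc/kernel/postinst.d/dkms}"
    [ -f "$hook" ] && autoinstall_enabled "$conf"
}
```

For u2mfn, the dkms.conf path would likely be something like `/usr/src/u2mfn-4.0.30/dkms.conf`, going by the version shown in the earlier build log, but that path is an assumption and should be verified on the template.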

So, if updates were only recently applied for the folks reporting the error, then we have a reasonable explanation for why the problem is appearing again. More testing required to determine whether a single run of dkms autoinstall is sufficient for reliable configuration of the modules.

@conorsch
Contributor Author

conorsch commented Sep 8, 2020

Ran through an updater scenario on test hardware, after intentionally breaking the GUI in sd-app via sd-app-buster-template, as described in freedomofpress/securedrop-builder#189. The results: successfully resolved!

There was a small surprise in my testing because I'd inadvertently used a prod setup on hardware, and the metapackage is only available on staging. After a full update run did not resolve the problem, and indeed had not even resulted in the new metapackage being installed, I switched the sd-app-buster-template over to the staging URL & pubkey, and forced another update run. Fully resolved at the end.

Before

[Screenshot: sdwk1]

After

[Screenshot: sdwk3]


3 participants