Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade build chroot and debian containers to buster #464

Merged
merged 22 commits into from
Feb 7, 2020

Conversation

celskeggs
Copy link
Member

@celskeggs celskeggs commented Dec 15, 2019

This PR upgrades us to the latest version of Debian, including working around bug 946783 in debootstrap.

Note that the deploy chroot does not have an automated test, and that I'm not intending to add one in this PR.

Fixes #428.


Checklist:

  • I have split up this change into one or more appropriately-delineated commits.
  • The first line of each commit is of the form "[component]: do something"
  • I have written a complete, multi-line commit message for each commit.
  • I have formatted any Go code that I have changed with gofmt.
  • I have written or updated appropriate documentation to cover this change.
  • I have confirmed that this change is covered by at least one appropriate test run by CI.
  • If my change includes new or modified functionality, I have tested that the changes work as expected.
  • I have assigned this issue to an appropriate reviewer. (Choose @celskeggs if you are not otherwise certain.)
  • I consider my PR complete and ready to be merged without my further input, assuming that it passes CI and code review.
  • My changes have passed CI, including an automatic Jenkins deploy.
  • My changes have passed code review.

@celskeggs celskeggs added this to the Dev Cluster 7 milestone Dec 15, 2019
@celskeggs celskeggs self-assigned this Dec 15, 2019
@celskeggs celskeggs force-pushed the 428-build-chroot branch 3 times, most recently from a8728be to f7bfa96 Compare December 19, 2019 05:43
@celskeggs celskeggs changed the title build-chroot: upgrade to buster Upgrade build chroot and debian containers to buster Dec 19, 2019
When the build chroot fails to install, print out some of the
debootstrap output, so that more information is available when the
build fails only in CI.
The prometheus metric 'node_systemd_system_running' reports the state
of 'systemctl is-system-running', which requires that no units have
failed.

As such, this will only succeed if all of our other checks succeed. It
provides less helpful information than a service-by-service list of
which services are wrong. So move it to the end of the function, after
we do the checks that would give us more detailed reporting.
Previously, we didn't run the command to clean up malformed fakechroot
links after installing debian packages. This meant that installing
iptables for flannel would create a malformed link at the path
/usr/sbin/iptables, which would prevent flannel from resolving it.

This was uncovered by the upgrade to buster, in which iptables is now
routed through /etc/alternatives/iptables.
Because we immediately would jump into issuing commands to the
supervisor, it was possible that we would do so slightly before the
SSH server was actually online, which would cause the install to abort.

Introduce an explicit wait so that we don't proceed until the server is
ready for us to do so.
@celskeggs celskeggs requested a review from krawthekrow January 16, 2020 18:44
@celskeggs celskeggs assigned krawthekrow and unassigned celskeggs Jan 16, 2020
@celskeggs celskeggs marked this pull request as ready for review January 16, 2020 18:45
@@ -23,3 +23,7 @@
continue
os.remove(path)
os.symlink(os.path.join("/", rootrel), path)

if os.path.islink(os.path.join(rootfs, "proc")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a comment or todo in the code as well.

We have a Jenkins-only bug where /proc is a symlink to itself in the
flannel container, which breaks a lot of things. We don't know why,
exactly, this is happening -- but we need to mitigate it.
We can't have flannel monitors until we have OCI launches, which we
can't have until we have flannel working, which we can't have until we
have OCI pulling.

Split up the order so that debugging information is more useful.
@celskeggs
Copy link
Member Author

Current status, so that I remember:

I'm currently running into a problem where the newer qemu available on buster objects to having the same CD inserted into multiple VMs simultaneously:

qemu-system-x86_64: Initialization of device ide-hd failed: Failed to get "write" lock
Is another process using the image [/cluster/cluster-1125.iso]?

This is reasonable, because they really should be read-only, but when I try to set them read-only, I get another error:

qemu-system-x86_64: Initialization of device ide-hd failed: Block node is read-only

docs/build.md Outdated Show resolved Hide resolved
@celskeggs
Copy link
Member Author

It looks like it's impossible to set ide-hd disks to be read-only, because it's hardcoded to only accept read-only options from ide-cd. I think this means that we can't set it read-only. However, we can turn off the file locking. It's not a perfect solution, but it's better than nothing.

@celskeggs
Copy link
Member Author

celskeggs commented Feb 1, 2020

@krawthekrow I believe this is ready for review now whenever you have the chance

@cryslith
Copy link
Member

cryslith commented Feb 1, 2020

It looks like it's impossible to set ide-hd disks to be read-only, because it's hardcoded to only accept read-only options from ide-cd. I think this means that we can't set it read-only. However, we can turn off the file locking. It's not a perfect solution, but it's better than nothing.

Can we see if we can use the snapshot option to avoid writing back to the original disk image? That ought to let us avoid disabling file locking.

tools/debian-checksum.sh Outdated Show resolved Hide resolved
tools/debian-checksum.sh Outdated Show resolved Hide resolved
build-chroot/debootstrap-1.0.106.deb.sha256 Show resolved Hide resolved
platform/spire/src/virt.py Show resolved Hide resolved
platform/spire/src/virt.py Show resolved Hide resolved
tools/debian-checksum.sh Outdated Show resolved Hide resolved
cryslith
cryslith previously approved these changes Feb 7, 2020
We pull down the host key for the supervisor via a process involving
ssh-keyscan and a set of fingerprints scraped from the MOTD of the
supervisor. For some reason, on Buster, ssh isn't always running at the
point the MOTD is printed -- so we don't always find the SSH hostkeys
we need.

Add a loop to wait for the SSH server to be up before continuing with
the pull process.
This updates the script used to validate new versions of the debian ISO
provided by upstream, along with the current version specification.

This update is part of the overall upgrade from stretch to buster.
Without it, the version of CRI-O we compile will crash due to the newer
libc being missing.
 - Remove transitional realpath package (replaced by coreutils)
 - Replace transitional btrfs-tools package with btrfs-progs
 - Replace unsupported openjdk-8-jdk with supported openjdk-11-jdk
 - Add libtinfo5 because it's now needed for container fakeroot setup
 - Add perl-openssl-defaults and dbus-user-session because they seem to
   be required for the chroot to install, though it's not quite clear
   why.
 - Add libbtrfs-dev, because it needs to be pulled in explicitly for
   the CRI-O build now.

Note that this commit will not build correctly at this point, because
using a buster chroot means the fakeroot/fakechroot invocation used to
set up debian containers will not have the correct host libraries. As
well, a buggy version of debootstrap is currently in the debian repo,
which also breaks the debian container scripts. The next two commits
will be required for a successful build.
debootstrap bug #946783 means that the latest version on buster does
not function correctly with both --unpack-tarball and --foreign. Pin an
older version without this bug until it's fixed on stable.

Note that this commit is not expected to build; the next commit will be
required for reasons previously described.
Stretch is an entire release behind -- and we're over a year out of
date. Get us to the latest version of the latest release.

This makes it possible to build again, because the build chroot now
matches the container release closely enough that the fakeroot/
fakechroot combo can have the right libraries available.
As part of the larger upgrade effort.
There are still places where we use Stretch systems, so in some cases
we accept using Stretch instead of Buster.
When we're using a buster deploy-chroot with a non-buster host, it's
likely that the kernel versions might not match, which would lead to a
failure to run modprobe in the chroot. When we actually do need to run
modprobe, there's not much we can do, but when we don't actually need
to run the command, we can easily bypass it.
Buster doesn't come with gpg by default. Make sure it's installed.

I'm not sure if Stretch did, but I assume so.
Add instructions to set up a build on RHOMBI, including local rsync,
but not including two minor issues that will be addressed in the next
commits.
Currently, builds started on RHOMBI cannot upload their binaries to
RHOMBI directly, because the relevant section of the filesystem is not
accessible. Add an optional helper to set up a bind mount, and document
how to use it.
When we have different versions of GPG inside and outside of the build
chroot, which can occur when the host is stretch and the chroot is
buster, the //upload target may fail to sign packages through reprepro.
This is because the keys transferred in by the host might not be
compatible with the new gpg version; this can be fixed by running any
gpg command, so run gpg --list-keys on entry.

This was confusing to debug, because the error was a "key not found"
error, and if you then ran gpg --list-keys to check, you'd see it did
actually exist, and then if you ran the upload again, it would
succeed... but only until the next time you reopened the chroot.
It is frequently useful for debugging to have access to the exact
invocations of the qemu instances that spire executes, but learning
that information usually requires modifying the code to add debugging
statements, or carefully extracting the info from /proc/<pid>/cmdline.

Add an option to print the command lines before the VMs are launched,
in a form that can be immediately executed in the debug shell.
It turns out that, if we send all the characters at once, newer qemus
will end up dropping the characters sent after the first 15-byte buffer
of input. So add a short delay between each character in order to let
the launched VM handle them all.
Previously, the conditionals for user-grant were ordered such that, if
there were the appropriate temporary credentials, it would attempt to
verify that the service worked, even if it was disabled in the
configuration. (This could happen if, for example, you were using spire
with user-grant enabled, and then disabled it.)

Reorder the conditionals so that user-grant is not verified in this
case, because there's no way for verification to succeed.
The newer version of qemu provided by default on buster turns on file
locking by default, which prevents us from installing multiple nodes at
the same time. Turn on snapshot mode, so that none of the VMs will
write back changes to the USB drive -- which is what we want, anyway.
This lets us keep installing multiple VMs at the same time.

We unfortunately can't just set the disk to read-only, because IDE
apparently doesn't support that except for CDROMs, so qemu won't let us
enable the option.
@celskeggs
Copy link
Member Author

Merged according to @cryslith's approval of the non-rebased version.

@celskeggs celskeggs deleted the 428-build-chroot branch February 7, 2020 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update to Debian 10 (Buster)
3 participants