Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate updater performance bottlenecks #459

Closed
eloquence opened this issue Feb 18, 2020 · 10 comments
Closed

Investigate updater performance bottlenecks #459

eloquence opened this issue Feb 18, 2020 · 10 comments

Comments

@eloquence
Copy link
Member

During testing, we have found that preflight updates can take as long as an hour, with the fedora-30 VM in particular often taken a long time to update (@zenmonkeykstop reported 26 minutes just for Fedora updates). That's prohibitively slow and negates much of the speed advantage of using Qubes instead of an air-gap.

We need to understand a bit better what's causing updates to be slow, and in particular, if there are specific bottleneck issues that cause individual updater runs to be dramatically slower than they otherwise would be. For a systematic investigation, we need to:

  • run the updater repeatedly under comparable conditions (i.e., ideally reset an updated VM to a pre-update state after the updater runs);
  • investigate logs, CPU utilization, memory usage, network access, and disk access during update runs;
  • conduct small experiments (e.g., allocate higher memory to VMs that are being updated; change the update servers used for an update; increase concurrency).

This ticket is intended to track findings from an initial investigatory spike, which may then lead to follow-up tickets, or the re-prioritization of existing ones (e.g., #403).

@eloquence
Copy link
Member Author

eloquence commented Feb 18, 2020

@kushaldas noted that the Fedora updater sometimes picks a very slow mirror to download from, which can cause Fedora updates to be prohibitively slow.

@conorsch notes a known upstream issue where VMs are sometimes started with insufficient memory (QubesOS/qubes-issues#4890), and recommends carefully observing the memory assigned to VMs that are being updated / testing whether restarting the qmemman service makes a difference.

@conorsch
Copy link
Contributor

We've got a few options for improving the updater story. Listed in approximate order of preference, heavily prioritizing near-term feasibility:

  1. Replace login background tasks with graphical updater (Start updater at boot, remove existing svs-disp backgound updater #415)
  2. Consider running dpkg --configure -a (or similar) as part of preflight updates (Templates updates sometimes fail, requiring manual dpkg --configure -a run #428)
  3. Ensuring VMs have adequate resources to update quickly
  4. Parallelize updates so multiple VMs are checked concurrently (Perform update checks concurrently #403)
  5. Reducing number of TemplateVMs to update, by consolidating/standardizing

Speaking entirely from my own experience running the updater (as well as the make all dev workflow) repeatedly, the most significant factor is did the qmemmem service fall over. If it did, my updates are likely going to take a very long time.

In terms of diagnosing the problem, the most reliable method I've come up with to determine "has qmemman crashed" is scripted, see below.

check-qmemman.sh
#!/bin/bash
# Utility script to check whether Qubes memory balancing
# service has failed. Compares the timestamps of the last
# success balance operation and the most recent "EOF"
# message available in the log file. If EOF is more
# recent, declare service broken. Recommended invocation:
#
#  watch -n5 ./check-qmemman.sh
#
set -e
set -u
set -o pipefail

# Search for pattern, grab timestamp, convert to epoch time,
# for easy integer comparison
get_last_balance_time() {
  grep -P 'balance_when_enough_memory' /var/log/qubes/qmemman.log \
      | tail -n1 \
      | perl -nE '/^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)/ and say $1' \
      | xargs -d '\n' date +%s -d
}

get_last_eof_time() {
  grep -P 'EOF$' /var/log/qubes/qmemman.log \
      | tail -n1 \
      | perl -nE '/^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)/ and say $1' \
      | xargs -d '\n' date +%s -d
}

if [[ (( $(get_last_eof_time) > $(get_last_balance_time) )) ]]; then
    echo "Looks like qmembalance has failed."
    echo "You should restarted it with:"
    echo "sudo systemctl restart qubes-qmemman"
    exit 1
else
    echo "The qmembalance service appears to be working correctly."
fi

If other team members feel the log-grep is adequate, let's work on analysis to pass info upstream, e.g. QubesOS/qubes-issues#4890

Despite the resource constraints (which still need reproducing), it doesn't seem worthwhilet to hardcode greater allotments of e.g. RAM to each VM during updates, because we'd basically be reinventing the wheel on the qmemman service, while running the risk of overprovisioning RAM
based on what's available to the host. The qmemman service is designed to address those needs.

Running dpkg --configure -a automatically during preflight updates seems like a reasonable precaution, given that we've had multiple folks report the need to run it manually. If our advice to Admins is to run that command if updates break, then let's run it for them.

Parallelizing the updates does not seem like a good idea at present, given the complexity involved. The qubesctl has a --max-concurrency flag that we could make use of. That approach would be preferable to using e.g. multiprocessing ourselves, particularly given the complicated interactions between multiprocessing and logging.

It seems that the single biggest win in terms of efficiency for the updater would be to consolidate the templates. There's significant complexity involved in that change, so best discussed in a separate issue, but worth mentioning that consolidation would sidestep any need to discuss parallelization altogether, as well as enable us to rely on the Qubes-based GUI update tool (#238).

For now, I'm most interested in identifying STR for creating a situation that leads to broken apt state (#428); I suspect that lack of resources could contribute to slow updates, leading a user to cancel/reboot while updates are pending, which could leave apt unhappy. Even so, if running a proactive cleanup task resolves reliably, that should serve us well as a stopgap, particularly if we make sure to log state for later analysis.

@eloquence
Copy link
Member Author

Thanks for that write-up @conorsch! Since 1) just got resolved, 2) and 3) do seem like excellent focus areas for the near term. For the dpkg --configure -a issue I'd really like to understand better if it still happens for routine updates. I've not seen it myself since the big Whonix upgrade, and as far as I know, with the old cron job updater, it was possible to have broken VMs hanging around for weeks, since it would just ignore those. But definitely worth investigating further.

Post-Beta, if we do decide to remove the Whonix dependency (one option discussed in #456), that could further simplify the template story, and give us the option to not do those updates via Tor.

@conorsch
Copy link
Contributor

conorsch commented Mar 3, 2020

Dug around a bit to find some short-term wins for speeding up the updater flow. The Whonix-based VMs are being updated over Tor, which isn't necessary for our purposes. We can disable that behavior, defaulting to clearnet connections for Template updates across the board, by removing the "whonix-updatevm" tag from the Whonix-based Templates. See relevant RPC policy here:

https://github.com/QubesOS/qubes-core-admin/blob/34e6c2ff34985aba84c5d1388e5f9b6d1a5c4385/qubes-rpc-policy/qubes.UpdatesProxy.policy#L9-L10

As a quick test toggling the apt-over-tor connections by removing/re-adding the tag, I installed firefox-esr in the sd-proxy-buster-template TemplateVM. Results:

# Over tor:
real	2m3.245s
user	0m6.251s
sys	0m1.568s

# Over clearnet:
real	0m29.120s
user	0m6.295s
sys	0m1.264s

Hardly bulletproof testing, but there's a 4x speedup there. Optionally we could remove the whonix-updatevm tag from whonix-gw-15, as well, to speed up updates for sd-whonix. As #456 describes, we could switch from Whonix Workstation to the standard SDW base for templates, but switching to clearnet updates may be enough of a win that we can defer template switch for now.

One of the longest portions of the preflight updater run is checking whether updates are required. Rather than poll for each and every VM, we can trust the value of qvm-features <vm> updates-available (1 for true, 0/none for false). At the very least, sd-proxy, sd-whonix, and sys-firewall are all network-connected, so we can skip blocking on checks there by inspecting the features. It appears the same is true of dom0: qvm-features dom0 updates-available.

To determine whether a VM needs to be rebooted in order to apply updates, we leverage the same logic used by upstream Qubes, to draw the convenient "reboot-required" icon:

https://github.com/QubesOS/qubes-manager/blob/da2826db20fa852403240a45b3906a6c54b2fe33/qubesmanager/table_widgets.py#L402-L406

Essentially, check if any of the volumes are out of date (we likely mostly care about the root volume), and recommend/force a reboot based on that result. Unlikely to speed up the preflight updater, but would let us cull some code. Similarly, in order to determine "last updated time", we should consider inspecting /var/log/qubes/mgmt-${vm}.log and inspecting the last successful run of update.qubes-vm, which even the Qubes GUI updater uses (#238).

In summary, the following are suitable for near-term fixes:

  • remove whonix-updatevm tags from sd-proxy-buster-template and whonix-gw-15
  • use QVM "features" flag for net-connected vm templates when checking whether updates are available

@eloquence
Copy link
Member Author

eloquence commented Mar 3, 2020

Thanks much for the investigation & excellent write-up!

use QVM "features" flag for net-connected vm templates when checking whether updates are available

Would that mean we rely on Qubes' own updater to get us information about available VM updates for network-connected VMs? If so, with what frequency does that run?

Similarly, in order to determine "last updated time", we should consider inspecting /var/log/qubes/mgmt-${vm}.log and inspecting the last successful run of update.qubes-vm

sdw-last-updated currently captures the timestamp when all SecureDrop components have most recently been successfully updated, i.e. all VMs and the dom0 state. Unless I'm misunderstanding, I think parsing the logs to try to derive a single aggregate "last successful update" value would be trickier than having the updater make that determination based on its run results.

@emkll
Copy link
Contributor

emkll commented Mar 3, 2020

Under the current updater logic, any updates to fedora-30 will trigger the REBOOT_REQUIRED status, which requires a user to reboot the workstation. In practice, I've been getting updates to that TemplateVM pretty much every day. For the beta , we should consider running qvm-kill, sys-usb, sys-net and sys-firewall as part of the rebooting function to ensure updates are applied to those AppVMs and avoid a full workstation reboot.

@conorsch
Copy link
Contributor

conorsch commented Mar 4, 2020

Regarding forcing Whonix updates over clearnet, simply removing the tag isn't enough to opt out of the apt-over-tor config (although adding the tag is sufficient to opt in to the apt-over-tor config). There's also /etc/uwt.d/40_qubes.conf in the Whonix TemplateVMs, which will cause apt actions to fail if not proxied over Tor. While it's hypothetically possible to remove that file, much more research/consultation required, given that the fail-closed behavior is considered a feature: QubesOS/qubes-issues#1880

Modifying the reboot logic as @emkll describes above will to save a reboot of the entire workstation will definitely help reduce wait time during updates.

@eloquence
Copy link
Member Author

@conorsch Do you think it's still worth investigating the a switch of sd-proxy (not sd-whonix) to debian-10 as a drop-in replacement to reduce its update cost, or is there anything we rely on in whonix-ws that would make that too complex?

Alternatively, should we uninstall Tor Browser from the template, so we at least don't have to download these typically slow updates?

conorsch pushed a commit that referenced this issue Mar 5, 2020
We've been using both Whonix TemplateVMs in the SDW components:

 * whonix-gw-15 -> sd-whonix
 * whonix-ws-15 -> sd-proxy

In order to speed up the time it takes for updates to run (#459),
as well as nudge us toward full template consolidation (#471),
let's use the `securedrop-workstation-buster` Template for `sd-proxy`.
Since `sd-proxy` still has its NetVM set to `sd-whonix`, it's able to
resolve Onion URLs just fine.
@eloquence
Copy link
Member Author

We now have #488 for basing sd-whonix on our standard template, which we may or may not be able to land, and @emkll implemented the improvement for fedora-30 templates described above in #486.

Realistically, we won't have time for further updater performance improvements before the pilot kick-off, so I'm removing the release blocker label and kicking this back to the near-term backlog. Template consolidation (#471) is under continued discussion as a likely near-term change after the pilot launch, which could give us big wins in update performance.

@eloquence
Copy link
Member Author

Template consolidation (#471) is the biggest additional win we've been able to identify, no further investigation required until we've done that, so closing this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants