Investigate updater performance bottlenecks #459
@kushaldas noted that the Fedora updater sometimes picks a very slow mirror to download from, which can make Fedora updates prohibitively slow. @conorsch notes a known upstream issue where VMs are sometimes started with insufficient memory (QubesOS/qubes-issues#4890), and recommends carefully observing the memory assigned to VMs that are being updated, and testing whether restarting the
We've got a few options for improving the updater story. Listed in approximate order of preference, heavily prioritizing near-term feasibility:
Speaking entirely from my own experience running the updater: in terms of diagnosing the problem, the most reliable method I've come up with to determine "has qmemman crashed" is scripted; see check-qmemman.sh below.
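The actual check-qmemman.sh isn't reproduced in this thread, so here's a minimal sketch of the log-grep idea in Python. The journal line format and the failure signature (a traceback attributed to the qmemman daemon) are assumptions for illustration, and the function name is hypothetical:

```python
# Hypothetical sketch of a "has qmemman crashed?" check. Assumes journald-style
# lines where the qmemman unit name prefixes its own messages, so an unhandled
# traceback shows up on a line mentioning both.
QMEMMAN_FAILURE_MARKERS = ("qmemman", "Traceback")

def qmemman_looks_crashed(journal_lines):
    """Return True if any journal line ties a Python traceback to qmemman."""
    for line in journal_lines:
        if all(marker in line for marker in QMEMMAN_FAILURE_MARKERS):
            return True
    return False
```

In dom0 the input would come from something like `journalctl -u qubes-qmemman`; the exact unit name and message format should be verified on a real system before relying on this.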
If other team members feel the log-grep is adequate, let's work on analysis to pass info upstream, e.g. QubesOS/qubes-issues#4890.

Despite the resource constraints (which still need reproducing), it doesn't seem worthwhile to hardcode greater allotments of e.g. RAM to each VM during updates, because we'd basically be reinventing the wheel on the qmemman service, while running the risk of overprovisioning RAM.

Parallelizing the updates does not seem like a good idea at present, given the complexity involved.

It seems that the single biggest win in terms of efficiency for the updater would be to consolidate the templates. There's significant complexity involved in that change, so it's best discussed in a separate issue, but it's worth mentioning that consolidation would sidestep any need to discuss parallelization altogether, as well as enable us to rely on the Qubes-based GUI update tool (#238).

For now, I'm most interested in identifying STR for creating a situation that leads to broken apt state (#428); I suspect that lack of resources could contribute to slow updates, leading a user to cancel/reboot while updates are pending, which could leave apt unhappy. Even so, if running a proactive cleanup task resolves the problem reliably, that should serve us well as a stopgap, particularly if we make sure to log state for later analysis.
Thanks for that write-up @conorsch! Since 1) just got resolved, 2) and 3) do seem like excellent focus areas for the near term. Post-Beta, if we do decide to remove the Whonix dependency (one option discussed in #456), that could further simplify the template story, and give us the option not to do those updates via Tor.
Dug around a bit to find some short-term wins for speeding up the updater flow. The Whonix-based VMs are being updated over Tor, which isn't necessary for our purposes. We can disable that behavior, defaulting to clearnet connections for Template updates across the board, by removing the "whonix-updatevm" tag from the Whonix-based Templates. See the relevant RPC policy here:

As a quick test, toggling the apt-over-tor connections by removing/re-adding the tag, I installed
Hardly bulletproof testing, but there's a 4x speedup there. Optionally we could remove the

One of the longest portions of the preflight updater run is checking whether updates are required. Rather than poll each and every VM, we can trust the value of

To determine whether a VM needs to be rebooted in order to apply updates, we can leverage the same logic upstream Qubes uses to draw the convenient "reboot-required" icon: essentially, check whether any of the volumes are out of date (we likely mostly care about the

In summary, the following are suitable for near-term fixes:
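The two shortcuts above can be sketched as plain logic, using dictionaries in place of the qubesadmin API. The feature key `updates-available` and the helper names are assumptions here (the dictionaries stand in for a VM's qvm-features values and its volumes' "is outdated" flags):

```python
# Hedged sketch of the two near-term checks, decoupled from the Qubes API.

def needs_update(features):
    """Trust the VM's updates-available feature instead of polling inside it.
    `features` stands in for the output of `qvm-features <vm>` (assumed key)."""
    return features.get("updates-available") == "1"

def reboot_required(volume_outdated):
    """Mirror the upstream 'reboot-required' icon logic: if any volume is
    outdated (in practice mostly 'root', refreshed when the Template updates),
    the VM must restart to pick up the changes."""
    return any(volume_outdated.values())
```

On a real system the per-volume flag corresponds to the "is outdated" state reported by `qvm-volume info`, but the exact API calls should be checked against upstream's icon-drawing code.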
Thanks much for the investigation & excellent write-up!
Would that mean we rely on Qubes' own updater to get us information about available VM updates for network-connected VMs? If so, with what frequency does that run?
Under the current updater logic, any updates to
Regarding forcing Whonix updates over clearnet, simply removing the tag isn't enough to opt out of the apt-over-tor config (although adding the tag is sufficient to opt in to the apt-over-tor config). There's also

Modifying the reboot logic as @emkll describes above, to save a reboot of the entire workstation, will definitely help reduce wait time during updates.
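For reference, the tag-based routing discussed here lives in the qubes.UpdatesProxy RPC policy in dom0. A hedged sketch of the relevant lines (Qubes 4.0-era syntax; the exact contents vary by install, so treat this as illustrative only):

```
# /etc/qubes-rpc/policy/qubes.UpdatesProxy (dom0) -- illustrative excerpt.
# Templates tagged whonix-updatevm are routed to the Tor-based proxy;
# untagged Templates fall through to the clearnet proxy.
$tag:whonix-updatevm $default allow,target=sys-whonix
$type:TemplateVM $default allow,target=sys-net
```

This is why removing the tag changes the routing in dom0, but any apt-over-tor configuration inside the Whonix template itself still needs to be unwound separately.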
@conorsch Do you think it's still worth investigating a switch of

Alternatively, should we uninstall Tor Browser from the template, so we at least don't have to download these typically slow updates?
We've been using both Whonix TemplateVMs in the SDW components:

* whonix-gw-15 -> sd-whonix
* whonix-ws-15 -> sd-proxy

In order to speed up the time it takes for updates to run (#459), as well as nudge us toward full template consolidation (#471), let's use the `securedrop-workstation-buster` Template for `sd-proxy`. Since `sd-proxy` still has its NetVM set to `sd-whonix`, it's able to resolve Onion URLs just fine.
We now have #488 for basing

Realistically, we won't have time for further updater performance improvements before the pilot kick-off, so I'm removing the release-blocker label and kicking this back to the near-term backlog. Template consolidation (#471) is under continued discussion as a likely near-term change after the pilot launch, which could give us big wins in update performance.
Template consolidation (#471) is the biggest additional win we've been able to identify; no further investigation is required until we've done that, so I'm closing this for now.
During testing, we have found that preflight updates can take as long as an hour, with the `fedora-30` VM in particular often taking a long time to update (@zenmonkeykstop reported 26 minutes just for Fedora updates). That's prohibitively slow and negates much of the speed advantage of using Qubes instead of an air-gap.

We need to understand a bit better what's causing updates to be slow, and in particular, whether there are specific bottleneck issues that cause individual updater runs to be dramatically slower than they otherwise would be. For a systematic investigation, we need to:
This ticket is intended to track findings from an initial investigatory spike, which may then lead to follow-up tickets, or the re-prioritization of existing ones (e.g., #403).
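One simple way to gather the per-VM timing data this investigation calls for is to wrap each update step with a timer and log the durations for later comparison. This is a generic sketch, not the project's actual instrumentation; the step names and runner shape are hypothetical:

```python
import time

def timed_steps(steps):
    """Run (name, fn) pairs, e.g. one per VM being updated, and return
    {name: seconds_taken} so slow outliers (a bad mirror, a starved VM)
    stand out in the logs."""
    durations = {}
    for name, fn in steps:
        start = time.monotonic()
        fn()
        durations[name] = time.monotonic() - start
    return durations
```

Dropping a dict like this into the updater's log output on every run would make it cheap to spot whether slowness is systemic or concentrated in one VM.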