Sporadic Performance issues with 1.1 #221
Update: After running some profiling with
Here is what I've found so far:
https://gist.github.com/sadsfae/ccd42c1e627befffa674da1b040a6470
Update: It seems this might be our issue, a bug with the nacl / sodium library which explains the
Symptoms / cause:
Why is this an issue for QUADS? We use the paramiko library, which makes calls to the libsodium / nacl bindings.
It's entirely possible that we only hit a lack of entropy because, at this point in testing/using the QUADS 1.1/master code, we're not running the scheduling processes and tasks as we normally would, but we don't want users to experience this on first trying things out and get discouraged.

Switching to (or telling others) to use the

As amusing as it would be to have a valid excuse to run a service that just loops opening/closing a CD-ROM, or some other behavior that makes the bare-metal system appear haunted, to provide more entropy, passing it to the VM guest is probably not the best solution. This would be hilarious though, maybe we'll do this anyway.

Proposed Solution
There seems to be a bug with entropy pool gathering with libsodium in which calls to /dev/random fail. This manifests itself in QEMU because it has no hardware entropy (and as a result containers run in VMs). As /dev/random and /dev/urandom are blocking devices this manifests itself in slowdowns for random, sporadic calls from QUADS. libsodium which is used by paramiko cannot use /dev/urandom (non-blocking entropy) so this moves us to use a software-based entropy service which is optional. #221

More about the nacl bindings / libsodium bug:
PyNACL / Sodium: pyca/pynacl#327

Symptoms / cause:
======
By default libsodium only uses the /dev/random path to determine if the CSPRNG is initialized before it starts using /dev/urandom. It also prefers to use getrandom, which gives the same behavior for free (blocking only until the CSPRNG is ready). On normal systems this happens extremely rapidly, but there exists a weird long tail of systems that can have bad entropy for long periods of time
======

The pragmatic solution here is to instead use the haveged service to provide software entropy.
https://issihosts.com/haveged/

* Make haveged a dependency on QUADS via RPM installation
* For containers include and start it in the Dockerfile
* haveged is optional, just turn it off if you want to use /dev/random if you're on bare-metal or feel you have sufficient entropy.

Included are other fixes/changes below:

* Add python3-ipdb to rpm spec requirements
* Specify version requirements for ipmitool and git
* Remove contactbank wp plugin, we're not using it and it may not even work right.
* Since HTTPD is required for serving visuals/instackenv.json make it also be a dependency of quads-server.service
* Demote docker compose from recommended status
* Correct RPM spec warnings about incorrect date format
* Remove waffle.io badge / references as they are closing up shop. https://waffle.io/closing-its-doors

Fixes: #221
Change-Id: I5c20defe12871e6399cf6b1ada659caf1a5e1b94
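To check whether a host or guest is in the low-entropy state described above, something like the following could be run before and after enabling haveged. This is a minimal sketch, not part of QUADS: the helper names `read_entropy_avail` and `csprng_ready` are made up for illustration, and it assumes a Linux system that exposes `/proc/sys/kernel/random/entropy_avail` and a Python build that provides `os.getrandom`.

```python
import os

# Linux-specific path to the kernel's entropy estimate (assumed to exist on the host).
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path=ENTROPY_PATH):
    """Return the kernel's entropy estimate in bits, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def csprng_ready():
    """Return True if getrandom() answers without blocking, False if it would block,
    or None when the check is not possible on this platform."""
    if not hasattr(os, "getrandom"):
        return None
    try:
        # GRND_NONBLOCK makes the call fail instead of waiting for the pool.
        os.getrandom(1, os.GRND_NONBLOCK)
        return True
    except BlockingIOError:
        return False
    except OSError:
        return None


if __name__ == "__main__":
    print("entropy_avail:", read_entropy_avail())
    print("csprng ready:", csprng_ready())
```

A `False` from `csprng_ready()` would be consistent with the blocking behavior discussed in this issue, though it does not by itself prove that paramiko or libsodium is the caller that stalls.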
This should be resolved with the inclusion of
This is a tracking issue for some unreasonably long `bin/quads-cli` command executions that we've been experiencing at random times. We need to investigate this further with some Python profiling work to isolate where this is happening.

Some preliminary strace runs reveal severe hangups here: via `strace -tt quads-cli --full-summary` for example.

Namely here, note the entire 2 minute delay:
And in particular:
Something is trying to poll a file handle; this could be symptomatic and not causal. We need Python profiling ala `cProfile` or similar to dig deeper here in the code.

https://docs.python.org/3/library/profile.html#introduction-to-the-profilers
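As a rough illustration of the kind of profiling meant here, the sketch below wraps a single call in `cProfile` and returns the report sorted by cumulative time, which tends to surface calls that sit waiting on I/O. The helper `profile_call` and the stand-in `slow_step` are hypothetical and not part of QUADS; in practice the wrapped callable would be whichever code path is suspected of stalling.

```python
import cProfile
import io
import pstats
import time


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sorting by cumulative time puts long blocking calls near the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_step():
    # Hypothetical stand-in for a call that blocks, such as a read that waits on entropy.
    time.sleep(0.05)
    return "done"


if __name__ == "__main__":
    _, report = profile_call(slow_step)
    print(report)
```

The same profiler can also be run over a whole script from the command line with `python3 -m cProfile -s cumulative` followed by the script path and its arguments, which may be the quicker first step for a command-line tool.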
It's important to note here that we initially thought this was due to running containers inside a VM (particularly quads_db / mongo), and although we see it less when testing directly in a VM via RPM against latest master c72d9b2, we still do see it there.