Ceph RBD primary storage fails connection and renders node unusable #2611
Comments
I've marked this as a blocker applicable to Ceph users; we'll try to get this fixed. @giorgiomassar8 can you advise any documentation on how to set up a small Ceph storage pool to test this against KVM?
Can you increase the logging to DEBUG on the Agent and show the XML which the Agent generates to define the storage pool? The error you are seeing comes from libvirt+librados underneath, so I'd need to see the XML it generates.
I have set debugging on one of the agents and this is what I get:
In the generated XML there isn't the secret part; perhaps the agent is trying to use another authentication mechanism which this version of libvirt does not support? This is what I have been using as a workaround, and it works: defining the pool directly in libvirt with the cephx secret included (the pool XML is shown under WORKAROUND at the end of this issue).
Ok, I see. The "Operation not supported" error is odd, but it seems to be missing the cephx part indeed. At the moment I do not have a 4.11 environment to test this with, but I'll look at the code later.
fix issue where kvm / ceph cannot create volumes apache#2611
@wido The issue seems to be that for any KVM host, the disk format is always hard-coded to QCOW2 here: This gets sent along to the agent. I guess this is a reasonable default for KVM, but obviously for RBD it needs to be RAW, and the agent won't know that until it looks up the storage pool. In the short term, I think hard-coding the disk format to RAW on the agent side for RBD is valid, since it can't be anything but that, but it might also be good for the management server to send along the correct disk type.
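To illustrate that short-term fix, here is a minimal sketch of the agent-side fallback being proposed; the enum and class names are made up for the example and are not CloudStack's own types. The idea is simply to force RAW whenever the backing pool is RBD, regardless of the format the management server sent.

```java
// Minimal sketch of the agent-side fallback described above.
// PoolType/DiskFormat are illustrative stand-ins, not CloudStack's own enums.
enum PoolType { NFS, LVM, RBD }
enum DiskFormat { QCOW2, RAW }

class DiskFormatPolicy {

    // RBD volumes can only be RAW, so override whatever was requested;
    // file-based pools keep the format the management server sent.
    static DiskFormat effectiveFormat(PoolType poolType, DiskFormat requested) {
        return poolType == PoolType.RBD ? DiskFormat.RAW : requested;
    }

    public static void main(String[] args) {
        // The management server currently hard-codes QCOW2 for KVM;
        // on an RBD pool the agent would correct that to RAW.
        System.out.println(effectiveFormat(PoolType.RBD, DiskFormat.QCOW2)); // RAW
        System.out.println(effectiveFormat(PoolType.NFS, DiskFormat.QCOW2)); // QCOW2
    }
}
```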
On second thought, this might be a separate bug we were chasing. We were seeing this error in the logs:
@giorgiomassar8 were you seeing similar messages? We were receiving these messages on a freshly created pool when attempting to deploy a VM.
@kiwiflyer Have you seen this one? The Agent is not providing the secret for the RBD storage pool to the libvirt XML definition.
I will see if I can deploy a test environment to reproduce this issue in the meantime.
@nathanejohnson nope, I don't see qcow2/raw mismatch errors here; I do believe the issue is related to the fact that, for some reason, no authentication information is put in the pool's XML sent to libvirt.
@giorgiomassar8 here are the logs from when the rbd pool was configured on our host:
Do you see the key go across before it attempts to create the libvirt storage pool?
So working with a user, it does appear that the original issue outlined by @giorgiomassar8 is different from the one we found. The user in question is running Ubuntu 16.04 and no authentication is being passed to libvirt. This particular installation is a 4.9.3 -> 4.11.1 upgrade.
Are there by any chance slashes (/) in the secret of this RBD pool? |
Funny you should mention that. There is a single slash and I did notice a very old open bug report on Jira for that issue. This user had upgraded a prod system, so I had a very limited window to take a look.
Could you check that for me? I have seen issues with slashes in the secret information and URL parsing inside Java. Otherwise I wouldn't know where this is coming from.
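For context on why a slash could matter, here is a generic sketch (not CloudStack code): cephx keys are base64 and can contain '/', '+' and '=', so if the secret is embedded in a URL/URI-style string without percent-encoding, those characters change how the string is parsed. The key below is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Generic illustration of the concern above: percent-encode the cephx key
// before placing it in any URL/URI-style string, and decode it on the way out.
class SecretEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String secret = "AQAB/Cdef+examplekey==";                  // made-up key containing '/'
        String encoded = URLEncoder.encode(secret, "UTF-8");
        System.out.println(encoded);                               // AQAB%2FCdef%2Bexamplekey%3D%3D
        System.out.println(URLDecoder.decode(encoded, "UTF-8").equals(secret)); // true
    }
}
```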
The pool secret apparently does not have any slashes. The encrypted version in the DB does, but obviously that shouldn't matter. I'm trying to work back through the code to see what might have changed between 4.9.x and 4.11.x.
ok...well it's definitely not decrypting the userInfo before it gets sent to the agent in the upgrade case above. |
In the Ubuntu case above, it looks like the userInfo field contains the encrypted DB payload, rather than the decrypted version with the colon delimiter that is expected in the code here: Line 245 in 4809fe7
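To make the failure mode concrete, here is a hypothetical sketch (made-up names, not the agent code referenced above): the agent expects userInfo in the decrypted "user:secret" form, and when it instead receives the encrypted blob there is no colon to split on, so the cephx auth section is silently skipped.

```java
import java.util.Arrays;

// Hypothetical sketch of the userInfo handling described above; it only
// illustrates why an encrypted payload makes the agent skip cephx auth.
class UserInfoSplit {

    // Expected decrypted form: "<cephx user>:<cephx secret>".
    // Returns null when no colon is found, i.e. no auth gets configured.
    static String[] parse(String userInfo) {
        if (userInfo == null || !userInfo.contains(":")) {
            // An encrypted (base64) blob contains no colon, so it ends up
            // here and no <auth>/<secret> element is added to the pool XML.
            return null;
        }
        return userInfo.split(":", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("cloudstack:AQDexamplekey==")));   // [cloudstack, AQDexamplekey==]
        System.out.println(parse("dGhpcyBpcyBub3QgYSByZWFsIHBheWxvYWQ="));          // made-up blob -> null
    }
}
```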
Note that my test lab is CentOS 7.
@wido or @GabrielBrascher, do either of you have both an older 4.9/4.10 lab as well as a 4.11 lab on Ubuntu 16.04? It would be interesting to try a manual decrypt of the user_info field from the storage_pool table using a 4.9.x payload, but running the decrypt on a 4.11 installation, e.g. java -classpath /usr/share/cloudstack-common/lib/jasypt-1.9.2.jar org.jasypt.intf.cli.JasyptPBEStringDecryptionCLI decrypt.sh input=encrypted_payload_from_user_info password=encrypt_password verbose=true
@kiwiflyer All our test and production systems are 4.11 at the moment, running on Ubuntu 16.04. They were all upgraded from 4.9 to 4.10 and then to 4.11. In none of those cases did we run into this.
@wido @kiwiflyer is this still a bug wrt 4.11.2.0 or the next 4.11.3.0/4.12.0.0?
I have not been able to verify this bug, so I can't tell you.
I saw this problem in an ACS mailing list user's system after attempting an upgrade from 4.9.3 -> 4.11.1. Unfortunately, the bug was in production, so I only had an extremely limited window to take a look before they restored their previous 4.9.3 version. At the present time, I'm not sure what caused it and I haven't been able to reproduce this in a lab environment.
So, @nathanejohnson and I just started some upgrade testing on some lab environments going from our 4.8 production branch to 4.12, and we may have just run into this same bug. We'll triage it and update everyone early next week.
@wido I can second what @kiwiflyer mentioned earlier: the reason pools aren't being created has to do with the user_info being passed along encrypted; the split doesn't find a colon, so it skips the secret section of the pool XML. It looks like the only time it's sent along decrypted is when the pool is initially created. Obviously I'm missing something; I can't find anything significant that's different between 4.8 and 4.12's handling of this logic-wise, but this is failing consistently now in our lab environment. Would love some ideas on troubleshooting.
As a follow-up, I cannot reproduce this in our freshly installed lab, only in the one we upgraded. The userInfo comes across decrypted there.
This seems like a management server thing, where it doesn't send the data properly, because I can't think of a reason why this happens on the Agent/KVM side. I honestly have no idea.
I narrowed down the issue: at some point during the upgrade process the db.properties file had switched from using encryption to not using it, so the user_info field was stored encrypted in the DB but was then being sent to the agent without being decrypted. I see in the spec file that db.properties is listed as config(noreplace), but I think in some situations it will still end up overwritten. At one point during the upgrade I completely removed the RPMs and had to downgrade; even though I left the /etc/cloudstack/management directory in place, I think that when I then reinstalled the RPMs it overwrote db.properties, since the file wasn't marked as owned by a previous installation, if that makes sense. At any rate, I suspect something similar might have occurred with the original bug. I'm going to try upgrading another cluster, paying attention to the db.properties file, and see if this happens again.
@nathanejohnson any update on this one?
We are still unable to reproduce this issue. Thus, I am removing this as a Blocker for 4.12.
Sorry for the late reply; I have not had a chance to upgrade our other development cluster, but I agree this shouldn't be a blocker.
@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido is this still an issue? Should we close it, or is there a PR coming?
@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido kindly discuss and advise a milestone in case a PR is coming soon. I've removed this issue from 4.13.0.0.
@giorgiomassar8 there's been no activity on this for two years. I'm closing the ticket; please reopen it if you feel it still should be open.
Any date? #5741
ISSUE TYPE
Bug Report
COMPONENT NAME
CloudStack agent
CLOUDSTACK VERSION
4.11
CONFIGURATION
KVM cluster with Ceph-backed RBD primary storage
OS / ENVIRONMENT
Ubuntu 16.04 / 14.04
SUMMARY
On a perfectly working 4.10 node with the KVM hypervisor and Ceph RBD primary storage, after upgrading to 4.11 the CloudStack agent is unable to connect the RBD pool in libvirt, giving just a generic "Operation not supported" error in its logs:
2018-04-06 16:27:37,650 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 16:27:37,652 WARN [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Storage pool be80af6a-7201-3410-8da4-9b3b58c4954f was not found running in libvirt. Need to create it.
2018-04-06 16:27:37,653 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Didn't find an existing storage pool be80af6a-7201-3410-8da4-9b3b58c4954f by UUID, checking for pools with duplicate paths
2018-04-06 16:27:37,664 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported
2018-04-06 16:27:42,762 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid Lost connection to the server. Dealing with the remaining commands...
Exactly the same pool was working before the upgrade:
2018-04-06 12:53:52,847 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Found existing defined storage pool be80af6a-7201-3410-8da4-9b3b58c4954f, using it.
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Trying to fetch storage pool be80af6a-7201-3410-8da4-9b3b58c4954f from libvirt
2018-04-06 12:53:53,171 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:14dace5e) Proccess agent ready command, agent id = 46
STEPS TO REPRODUCE
Take an existing, working CloudStack 4.10 cluster with RBD primary storage and Ubuntu 14.04-based agents, and upgrade the agents to version 4.11.
EXPECTED RESULTS
The cluster should keep working: the agents should connect and the RBD pool should be correctly opened in libvirt.
ACTUAL RESULTS
The CloudStack agent fails to start with a generic "Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported" error and loops in a failed state, rendering the machine unusable.
WORKAROUND
To work around the issue I have used the following XML config (dumped from another node where it is correctly running) to define the pool directly in libvirt, and it worked as expected:
(pool definition for be80af6a-7201-3410-8da4-9b3b58c4954f; the full XML was not preserved in this excerpt)
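For reference, the workaround amounts to a libvirt RBD pool definition along the following lines; the pool name/UUID and the monitor host come from the logs above, while the Ceph source pool name, auth username, and secret UUID are placeholders rather than this installation's actual values. The key part is the <auth>/<secret> element, which is exactly what the agent-generated XML was missing.

```xml
<!-- Illustrative libvirt RBD pool definition; the source pool name, auth
     username, and secret UUID below are placeholders. -->
<pool type='rbd'>
  <name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
  <uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
  <source>
    <!-- Ceph monitor, as seen in the agent logs above -->
    <host name='storagepool1' port='6789'/>
    <!-- Ceph pool backing this primary storage (placeholder name) -->
    <name>cloudstack</name>
    <!-- cephx authentication: the element missing from the agent-generated XML -->
    <auth username='cloudstack' type='ceph'>
      <secret uuid='00000000-0000-0000-0000-000000000000'/>
    </auth>
  </source>
</pool>
```

Such a definition can be loaded manually with virsh pool-define and virsh pool-start, which is what defining the pool "directly in libvirt" refers to here; the referenced secret must already exist in libvirt (created with virsh secret-define / secret-set-value) and hold the cephx key.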