
Ceph RBD primary storage fails connection and renders node unusable #2611

Closed
giorgiomassar8 opened this issue Apr 27, 2018 · 36 comments

@giorgiomassar8

ISSUE TYPE
  • Bug Report
COMPONENT NAME

CloudStack agent

CLOUDSTACK VERSION

4.11

CONFIGURATION

KVM cluster with Ceph-backed RBD primary storage

OS / ENVIRONMENT

Ubuntu 16.04 / 14.04

SUMMARY

On a perfectly working 4.10 node with a KVM hypervisor and Ceph RBD primary storage, after upgrading to 4.11 the CloudStack agent is unable to connect to the RBD pool in libvirt, giving just a generic "operation not supported" error in its logs:

2018-04-06 16:27:37,650 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 16:27:37,652 WARN [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Storage pool be80af6a-7201-3410-8da4-9b3b58c4954f was not found running in libvirt. Need to create it.
2018-04-06 16:27:37,653 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Didn't find an existing storage pool be80af6a-7201-3410-8da4-9b3b58c4954f by UUID, checking for pools with duplicate paths
2018-04-06 16:27:37,664 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported
2018-04-06 16:27:42,762 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:) Lost connection to the server. Dealing with the remaining commands...

Exactly the same pool was working before the upgrade:

2018-04-06 12:53:52,847 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Found existing defined storage pool be80af6a-7201-3410-8da4-9b3b58c4954f, using it.
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Trying to fetch storage pool be80af6a-7201-3410-8da4-9b3b58c4954f from libvirt
2018-04-06 12:53:53,171 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:14dace5e) Proccess agent ready command, agent id = 46

STEPS TO REPRODUCE

Take an existing and working CloudStack 4.10 cluster with RBD primary storage and Ubuntu 14.04 based agents, and upgrade the agents to version 4.11.

EXPECTED RESULTS

The cluster should work fine, the agents should connect, and the RBD pool should be correctly opened in libvirt.

ACTUAL RESULTS

The CloudStack agent fails to start with a generic "Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported" error and loops in a failed state, rendering the machine unusable.

WORKAROUND

To work around the issue, I used the following XML config (dumped from another node where the pool is running correctly) to define the pool directly in libvirt, and it worked as expected:

<pool type="rbd">
  <name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
  <uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
  <source>
    <name>cephstor1</name>
    <host name='storagepool1' port='6789'/>
    <auth username='admin' type='ceph'>
      <secret uuid='OMITTED'/>
    </auth>
  </source>
</pool>

root@compute6:~# virsh pool-define test.xml
Pool be80af6a-7201-3410-8da4-9b3b58c4954f defined from test.xml

root@compute6:~# virsh pool-start be80af6a-7201-3410-8da4-9b3b58c4954f
Pool be80af6a-7201-3410-8da4-9b3b58c4954f started

root@compute6:~# virsh pool-info be80af6a-7201-3410-8da4-9b3b58c4954f
Name: be80af6a-7201-3410-8da4-9b3b58c4954f
UUID: be80af6a-7201-3410-8da4-9b3b58c4954f
State: running
Persistent: yes
Autostart: no
Capacity: 10.05 TiB
Allocation: 2.22 TiB
Available: 2.71 TiB

And now the CloudStack agent starts correctly:

2018-04-09 10:29:19,989 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-09 10:29:19,990 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Found existing defined storage pool be80af6a-7201-3410-8da4-9b3b58c4954f, using it.
2018-04-09 10:29:19,991 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Trying to fetch storage pool be80af6a-7201-3410-8da4-9b3b58c4954f from libvirt
2018-04-09 10:29:20,372 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:f0021131) Proccess agent ready command, agent id = 56

@rohityadavcloud
Member

I've marked this as a blocker applicable to Ceph users; we'll try to get this fixed. @giorgiomassar8 can you point to any documentation on how to set up a small Ceph storage pool to test this against KVM?

@wido
Contributor

wido commented May 2, 2018

Can you increase the logging to DEBUG on the Agent and show the XML which the Agent generates to define the storage pool?

The error you are seeing comes from libvirt+librados underneath so I'd need to see the XML it generates.
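
For reference, a common way to do that on the agent side (assuming the stock logging config at /etc/cloudstack/agent/log4j-cloud.xml; adjust paths to your install):

# raise the agent log level to DEBUG, then restart the agent
sed -i 's/INFO/DEBUG/g' /etc/cloudstack/agent/log4j-cloud.xml
service cloudstack-agent restart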

@giorgiomassar8
Author

I have set debugging on one of the agents and this is what I get:

2018-05-02 13:41:44,401 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-05-02 13:41:44,401 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:d06ebad5) Looking for libvirtd connection at: qemu:///system
2018-05-02 13:41:44,403 WARN  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Storage pool be80af6a-7201-3410-8da4-9b3b58c4954f was not found running in libvirt. Need to create it.
2018-05-02 13:41:44,403 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Didn't find an existing storage pool be80af6a-7201-3410-8da4-9b3b58c4954f by UUID, checking for pools with duplicate paths
2018-05-02 13:41:44,404 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Checking path of existing pool 9fb695bd-a47f-45d3-9fe7-f029dff431d8 against pool we want to create
2018-05-02 13:41:44,408 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f
2018-05-02 13:41:44,408 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) <pool type='rbd'>
<name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
<uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
<source>
<host name='storagepool1' port='6789'/>
<name>cephstor1</name>
</source>
</pool>

2018-05-02 13:41:44,415 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported
2018-05-02 13:41:44,415 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:d06ebad5) Seq 1-6775102689425489922:  { Ans: , MgmtId: 29010072043598, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.UnsupportedAnswer":{"result":false,"details":"Unsupported command issued: com.cloud.agent.api.ModifyStoragePoolCommand.  Are you sure you got the right type of server?","wait":0}}] }
2018-05-02 13:41:44,451 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-4:null) (logid:d06ebad5) Execution is successful.
2018-05-02 13:41:44,453 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:d06ebad5) Watch Sent: Seq 1-6775102689425489921:  { Ans: , MgmtId: 29010072043598, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"details":"","wait":0}}] }
2018-05-02 13:41:44,464 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/10.20.0.21,port=8250,localport=45374] closed on read.  Probably -1 returned: Connection closed with -1 on reading size.
2018-05-02 13:41:44,464 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/10.20.0.21,port=8250,localport=45374]
2018-05-02 13:41:44,464 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Clearing watch list: 2
2018-05-02 13:41:46,728 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Connected to the server
2018-05-02 13:41:49,465 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Lost connection to the server. Dealing with the remaining commands...

The generated XML is missing the secret part; perhaps the agent is trying to use another authentication mechanism that this version of libvirt does not support?

This is what I have been using as a workaround, and it works:

<pool type="rbd">
  <name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
  <uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
  <source>
    <name>cephstor1</name>
    <host name='storagepool1' port='6789'/>
    <auth username='admin' type='ceph'>
      <secret uuid='OMITTED'/>
    </auth>
  </source>
</pool>
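
Note that the <secret uuid='OMITTED'/> above refers to a libvirt secret that has to exist before the pool can start. A minimal sketch of defining it by hand, with the UUID, usage name and Ceph client below being placeholders:

cat > secret.xml <<'EOF'
<secret ephemeral='no' private='no'>
  <uuid>OMITTED</uuid>
  <usage type='ceph'>
    <name>admin@storagepool1:6789/cephstor1</name>
  </usage>
</secret>
EOF
virsh secret-define secret.xml
virsh secret-set-value --secret OMITTED --base64 "$(ceph auth get-key client.admin)"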

@wido
Contributor

wido commented May 2, 2018

Ok, I see. The "Operation not supported" error is odd, but the XML does indeed seem to be missing the cephx part.

At the moment I do not have a 4.11 environment to test this with, but I'll look at the code later.

nathanejohnson added a commit to myENA/cloudstack that referenced this issue May 3, 2018
fix issue where kvm / ceph cannot create volumes apache#2611
@nathanejohnson nathanejohnson mentioned this issue May 3, 2018
@nathanejohnson
Member

nathanejohnson commented May 3, 2018

@wido The issue seems to be that for any KVM host, the disk format is always hard-coded to QCOW2 here:

https://github.com/apache/cloudstack/blob/master/engine/orchestration/src/main/java/org/apache/cloudstack/engine/orchestration/VolumeOrchestrator.java#L768

This gets sent along to the agent. I guess this is a reasonable default for KVM, but obviously for RBD it needs to be RAW, and the agent won't know that until it looks up the storage pool. In the short term, I think hard-coding the disk format to RAW on the agent side for RBD is valid, since it can't be anything but that, but it might also be good for the management server to send along the correct disk type.
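
As a quick illustration of that constraint (pool and volume names below are placeholders), libvirt itself will only accept raw volumes in an rbd pool:

virsh vol-create-as <rbd-pool-uuid> testvol 1G --format raw    # accepted
virsh vol-create-as <rbd-pool-uuid> testvol 1G --format qcow2  # rejected: only RAW volumes are supported by this storage pool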

@nathanejohnson
Member

On second thought, this might be a separate bug we were chasing. We were seeing this error in the logs:

2018-05-03 09:58:03,932 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:fb29bd87) Attempting to create volume f2d800a0-fc19-4b41-b886-575af6742425 (RBD) in pool e485c675-896a-303e-be82-c42357c6b4c1 with size 5368709120
2018-05-03 09:58:03,932 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:fb29bd87) <volume>
<name>f2d800a0-fc19-4b41-b886-575af6742425</name>
<capacity>5368709120</capacity>
<target>
<format type='qcow2'/>
<permissions><mode>0744</mode></permissions></target>
</volume>

2018-05-03 09:58:03,932 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) (logid:fb29bd87) Failed to create volume: 
com.cloud.utils.exception.CloudRuntimeException: org.libvirt.LibvirtException: unsupported configuration: only RAW volumes are supported by this storage pool
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createPhysicalDiskByLibVirt(LibvirtStorageAdaptor.java:705)
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createPhysicalDisk(LibvirtStorageAdaptor.java:664)
at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.createPhysicalDisk(LibvirtStoragePool.java:115)
at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.createVolume(KVMStorageProcessor.java:1323)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:119)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:58)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStorageSubSystemCommandWrapper.execute(LibvirtStorageSubSystemCommandWrapper.java:36)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStorageSubSystemCommandWrapper.execute(LibvirtStorageSubSystemCommandWrapper.java:30)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:75)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1438)
at com.cloud.agent.Agent.processRequest(Agent.java:563)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:976)
at com.cloud.utils.nio.Task.call(Task.java:83)
at com.cloud.utils.nio.Task.call(Task.java:29)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

@giorgiomassar8 were you seeing similar messages? We were receiving these messages on a freshly created pool when attempting to deploy a VM.

@wido
Contributor

wido commented May 4, 2018

@kiwiflyer Have you seen this one? The Agent is not providing the secret for the RBD storage pool to the libvirt XML definition.

@GabrielBrascher
Member

In the meantime, I will see if I can deploy a test environment to reproduce this issue.

@giorgiomassar8
Author

@nathanejohnson nope, I don't see qcow2/raw mismatch errors here. I do believe the issue is related to the fact that, for some reason, no authentication information is put in the pool's XML sent to libvirt.

@nathanejohnson
Member

@giorgiomassar8 here are the logs from when the RBD pool was configured on our host:

2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Request:Seq 1-8891231564336201737:  { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CreateStoragePoolCommand":{"_createDatastore":false,"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:supersecretkeygoeshere==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }
2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Processing command: com.cloud.agent.api.CreateStoragePoolCommand
2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Seq 1-8891231564336201737:  { Ans: , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"details":"success","wait":0}}] }
2018-05-02 12:07:01,313 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:328be924) Request:Seq 1-8891231564336201738:  { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.ModifyStoragePoolCommand":{"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:supersecretkeygoeshere==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }
2018-05-02 12:07:01,313 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:328be924) Processing command: com.cloud.agent.api.ModifyStoragePoolCommand
2018-05-02 12:07:01,313 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Attempting to create storage pool e485c675-896a-303e-be82-c42357c6b4c1 (RBD) in libvirt
2018-05-02 12:07:01,313 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:328be924) Looking for libvirtd connection at: qemu:///system
2018-05-02 12:07:01,319 WARN  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Storage pool e485c675-896a-303e-be82-c42357c6b4c1 was not found running in libvirt. Need to create it.
2018-05-02 12:07:01,319 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Didn't find an existing storage pool e485c675-896a-303e-be82-c42357c6b4c1 by UUID, checking for pools with duplicate paths
2018-05-02 12:07:01,319 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Checking path of existing pool d358f5c2-f5fd-43a8-844e-cf01183de220 against pool we want to create
2018-05-02 12:07:01,322 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Attempting to create storage pool e485c675-896a-303e-be82-c42357c6b4c1
2018-05-02 12:07:01,323 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) <secret ephemeral='no' private='no'>
<uuid>e485c675-896a-303e-be82-c42357c6b4c1</uuid>
<usage type='ceph'>
<name>[email protected]:6789/acs411testing</name>
</usage>
</secret>

2018-05-02 12:07:01,388 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) <pool type='rbd'>
<name>e485c675-896a-303e-be82-c42357c6b4c1</name>
<uuid>e485c675-896a-303e-be82-c42357c6b4c1</uuid>
<source>
<host name='192.168.100.100' port='6789'/>
<name>acs411testing</name>
<auth username='acs411testing' type='ceph'>
<secret uuid='e485c675-896a-303e-be82-c42357c6b4c1'/>
</auth>
</source>
</pool>

2018-05-02 12:07:01,434 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Trying to fetch storage pool e485c675-896a-303e-be82-c42357c6b4c1 from libvirt
2018-05-02 12:07:01,434 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:328be924) Looking for libvirtd connection at: qemu:///system
2018-05-02 12:07:01,439 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Succesfully refreshed pool e485c675-896a-303e-be82-c42357c6b4c1 Capacity: 11413921890304 Used: 0 Available: 10496547282944

Do you see the key go across before it attempts to create the libvirt storage pool?

@kiwiflyer
Contributor

So working with a user, it does appear that the original issue outlined by @giorgiomassar8 is different from the one we found. The user in question is running Ubuntu 16.04 and no authentication is being passed to libvirt.


This particular installation is a 4.9.3 -> 4.11.1 upgrade.

@kiwiflyer
Contributor

(screenshot: rbd)

@wido
Contributor

wido commented Nov 18, 2018

Are there by any chance slashes (/) in the secret of this RBD pool?

@kiwiflyer
Contributor

kiwiflyer commented Nov 18, 2018 via email

@wido
Contributor

wido commented Nov 18, 2018

Could you check that for me? I have seen issues with that in the secret information and URL parsing inside Java.

Otherwise I wouldn't know where this is coming from.

@kiwiflyer
Contributor

The pool secret apparently does not have any slashes. The encrypted version in the DB does, but obviously that shouldn't matter. I'm trying to work back through the code to see what might have changed between 4.9.x and 4.11.x.
The secret is set correctly in the 4.9.3 release.

@kiwiflyer
Contributor

kiwiflyer commented Nov 21, 2018

OK... well, it's definitely not decrypting the userInfo before it gets sent to the agent in the upgrade case above.
Here's our 4.11 lab with debug enabled.
I restarted libvirtd to remove the storage pool.
I then restarted the cloudstack-agent, and the agent then tries to add the storage pool back:
2018-11-21 10:54:18,733 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:3144d09f) Request:Seq 1-5622462659795353602: { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.ModifyStoragePoolCommand":{"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:AQDk3ela0NMAMxAA-4BkLN20KPNiEm2aJ8WVAg==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }
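
A quick way to compare what the management server actually sends in the two cases, assuming the default agent log at /var/log/cloudstack/agent/agent.log with DEBUG enabled:

# pull the userInfo values out of the ModifyStoragePoolCommand requests
grep -o '"userInfo":"[^"]*"' /var/log/cloudstack/agent/agent.log | tail -n 5
# a healthy value is colon-separated (user:AQ...==); a long blob with no colon
# means the encrypted DB payload was passed through undecrypted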

@kiwiflyer
Contributor

In the Ubuntu case above, it looks like the encrypted DB payload ends up in the userInfo field, rather than the decrypted version with the colon delimiter that is expected in the code, so null gets returned and libvirt then throws an exception.

@kiwiflyer
Contributor

Note that my test lab is CentOS 7.

@kiwiflyer
Contributor

@wido or @GabrielBrascher , do either of you guys have both an older 4.9/4.10 lab as well as a 4.11 lab on Ubuntu 16.04?

It would be interesting to try a manual decrypt of the user_info field from the storage_pool table using a 4.9.x payload, but running the decrypt on a 4.11 installation.

e.g. java -classpath /usr/share/cloudstack-common/lib/jasypt-1.9.2.jar org.jasypt.intf.cli.JasyptPBEStringDecryptionCLI decrypt.sh input=encrypted_payload_from_user_info password=encrypt_password verbose=true
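
A sketch of the full round trip under typical defaults; the "cloud" database/user, the key file /etc/cloudstack/management/key and the storage_pool.user_info/pool_type column names are assumptions based on a standard install:

# grab an encrypted payload straight from the DB and feed it to the jasypt CLI
ENC=$(mysql -N -u cloud -p cloud -e "SELECT user_info FROM storage_pool WHERE pool_type='RBD' LIMIT 1;")
java -classpath /usr/share/cloudstack-common/lib/jasypt-1.9.2.jar \
  org.jasypt.intf.cli.JasyptPBEStringDecryptionCLI decrypt.sh \
  input="$ENC" password="$(cat /etc/cloudstack/management/key)" verbose=true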

@wido
Contributor

wido commented Nov 23, 2018

@kiwiflyer All our test and production systems are 4.11 at the moment, running on Ubuntu 16.04.

They were all upgraded from 4.9 to 4.10 and then to 4.11. In none of those cases did we run into this.

@rohityadavcloud
Member

@wido @kiwiflyer is this still a bug with respect to 4.11.2.0, or the upcoming 4.11.3.0/4.12.0.0?

@wido
Contributor

wido commented Dec 7, 2018

I have not been able to verify this bug, so I can't tell you.

@kiwiflyer
Contributor

kiwiflyer commented Dec 10, 2018

I saw this problem on an ACS mailing list user's system after an attempted upgrade from 4.9.3 -> 4.11.1. Unfortunately, the bug was in production, so I only had an extremely limited window to take a look before they restored their previous 4.9.3 version. At the present time, I'm not sure what caused it and I haven't been able to reproduce it in a lab environment.
This definitely wasn't a one-off situation, as they attempted this upgrade multiple times with the same result.

@kiwiflyer kiwiflyer modified the milestones: 4.11.1.0, 4.12.0.0 Feb 22, 2019
@kiwiflyer
Contributor

kiwiflyer commented Feb 22, 2019

So, @nathanejohnson and I just started some upgrade testing on some lab environments going from our 4.8 production branch to 4.12 and we may have just run into this same bug. We'll triage it and update everyone early next week.

@nathanejohnson
Member

@wido I can second what @kiwiflyer mentioned earlier: the reason pools aren't being created has to do with the user_info being passed along encrypted; the split doesn't find a colon, so it skips the secret section of the pool XML. It looks like the only time it's sent along decrypted is when the pool is initially created.

Obviously I'm missing something; I can't find anything significant that differs between 4.8 and 4.12 in how this is handled logic-wise, but this is failing consistently now in our lab environment. I would love some ideas on troubleshooting.

@nathanejohnson
Member

As a follow-up, I cannot reproduce this in our freshly installed lab, only in the one we upgraded. The userInfo comes across decrypted there.

@wido
Contributor

wido commented Feb 25, 2019

This seems like a management server issue where it doesn't send the data properly, because I can't think of a reason why this would happen on the Agent/KVM side.

I honestly have no idea.

@nathanejohnson
Member

nathanejohnson commented Feb 25, 2019

I narrowed the issue down to the db.properties file having switched, at some point during the upgrade process, from using encryption to not, so the user_info field was stored encrypted in the DB but was then sent to the agent without being decrypted. I see in the spec file that db.properties is listed as config(noreplace), but I think in some situations it will still end up overwritten. At one point during the upgrade I completely removed the RPMs and had to downgrade; even though I left the /etc/cloudstack/management directory in place, I think that when I reinstalled the RPMs it overwrote db.properties, since the file wasn't marked as owned by a previous installation, if that makes sense.

At any rate, I suspect something similar might have occurred with the original bug. I'm going to try upgrading another cluster, paying attention to the db.properties file and see if this happens again.
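
For anyone checking the same thing, the property to compare before and after the upgrade would be the encryption type in db.properties (path and property name as in a stock install):

grep 'db.cloud.encryption.type' /etc/cloudstack/management/db.properties
# "file" or "web" means encrypted DB values (such as storage_pool.user_info) are
# expected and get decrypted before being sent to the agent; if an upgrade resets
# this to "none", the agent receives the raw encrypted blob instead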

@GabrielBrascher
Member

@nathanejohnson any update on this one?
I am not sure if this issue is a blocker for 4.12, considering that not all contributors were able to reproduce it (@wido was able to upgrade from 4.10 -> 4.11 -> 4.12).

@GabrielBrascher
Member

We are still unable to reproduce this issue. Thus, I am removing this as a Blocker for 4.12.

@nathanejohnson
Member

Sorry for the late reply. I have not had a chance to upgrade our other development cluster, but I agree this shouldn't be a blocker.

@rohityadavcloud
Member

@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido is this still an issue? Should we close it, or is there a PR coming?

@rohityadavcloud rohityadavcloud removed this from the 4.13.0.0 milestone Jun 25, 2019
@rohityadavcloud
Member

@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido kindly discuss and advise a milestone in case a PR is coming soon. I've removed this issue from 4.13.0.0.

@DaanHoogland
Contributor

@giorgiomassar8 there's been no activity on this for two years. I'm closing the ticket; please reopen it if you feel it still should be open.

@xuanyuanaosheng

Any date? #5741
