Ceph: workloads adoption with cinder volume #280

Merged
merged 1 commit into from
Apr 23, 2024

@bogdando bogdando commented Feb 8, 2024

Enable back the cinder volume commands on the source cloud, and
resume testing of the ceph-backed volume attached to the test VM.

Extend volume/backup/snapshot/attachment commands to wait for
the previous step results.

Follow the EDPM Post Ceph steps of HCI VA to prepare adopted
workloads for using Ceph backend on EDPM.

Add Nova discover host command (step 5 of the HCI VA).

Add Nova Ceph custom configs to properly configure ceph
vms pool for libvirt.

Combine nova-ceph related configurations and nova FFU related
ones into a single nova-compute-extraconfig service (by design,
having two dataplane services for Nova in the same node set is
not supported).

Add a note about the available choices of libvirt storage backends for Nova.

Add nova_libvirt_backend to control whether to deploy with the local
or the Ceph storage backend on EDPM.

Depends-On: https://review.rdoproject.org/r/c/rdo-jobs/+/52932

Jira OSPRH-4217
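
The "wait for the previous step results" behaviour described above can be sketched as a polling loop. This is a minimal illustration and not the code from this PR: `wait_for_status` and the `get_status` callable are hypothetical names, and the timeout values are arbitrary. It assumes a resource that reports a status string, such as a Cinder volume moving from `creating` to `available`.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    wanted: str,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns `wanted`, fails, or times out.

    get_status is a hypothetical callable, e.g. a wrapper that returns
    the status field of a volume, backup, or snapshot.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == wanted:
            return status
        if status == "error":
            # Fail fast instead of waiting for the timeout.
            raise RuntimeError("resource entered the error state")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status is still {status!r} after {timeout}s, wanted {wanted!r}"
            )
        sleep(interval)
```

Each step (create volume, back it up, snapshot it, attach it) would call such a helper before the next step starts, so a later command never runs against a resource that is still in transition.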

@bogdando bogdando force-pushed the workloads_cinder_ceph branch 2 times, most recently from 82e83fb to 3187e71 Compare February 8, 2024 13:29

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/bde0113029fd45b895b39e9a08d83512

data-plane-adoption-osp-17-to-extracted-crc FAILURE in 1h 22m 59s
✔️ adoption-docs-preview SUCCESS in 1m 49s

@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 3187e71 to f90876b Compare February 8, 2024 16:26

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/68bba9a4da9f466c8ce96f1810882314

data-plane-adoption-osp-17-to-extracted-crc FAILURE in 1h 18m 40s
✔️ adoption-docs-preview SUCCESS in 1m 46s

@bogdando bogdando force-pushed the workloads_cinder_ceph branch 4 times, most recently from 24d957f to 4b3bfc9 Compare February 29, 2024 16:28

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/70b75dc56cac404e8f5ad9c258763e11

data-plane-adoption-osp-17-to-extracted-crc FAILURE in 2h 24m 52s
✔️ adoption-docs-preview SUCCESS in 2m 03s

fultonj commented Mar 1, 2024

@fultonj @fmount could you please inspect this for missing bits? I believe I followed the documented steps and didn't miss anything? On my local setup, I see ceph-client EDPM service has produced these files:

[root@standalone ~]# cat /etc/ceph/ceph.client.openstack.keyring
[client.openstack]
   key = "xxx snip xxx=="
   caps mgr = allow *
   caps mon = profile rbd
   caps osd = profile rbd pool=volumes, profile rbd pool=images, profile rbd pool=backups
[root@standalone ~]# cat /etc/ceph/ceph.conf 
# minimal ceph.conf for 00d3cde9-0501-5b30-a533-e69040d6dcde
[global]
        fsid = 00d3cde9-0501-5b30-a533-e69040d6dcde
        mon_host = [v2:172.18.0.100:3300/0,v1:172.18.0.100:6789/0]

CI failure in the logs:

ult default] Could not find nvme_core/parameters/multipath: FileNotFoundError: [Errno 2] No such file or directory: '/sys/module/nvme_core/parameters/multipath'
2024-02-29 17:43:46.523 209 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): findmnt / -n -o SOURCE execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2024-02-29 17:43:46.536 209 DEBUG oslo_concurrency.processutils [-] CMD "findmnt / -n -o SOURCE" returned: 0 in 0.012s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2024-02-29 17:43:46.536 209 DEBUG oslo.privsep.daemon [-] privsep: reply[139919505952080]: (4, ('overlay
', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:512
2024-02-29 17:43:46.537 209 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): blkid overlay -s UUID -o value execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2024-02-29 17:43:46.543 209 DEBUG oslo_concurrency.processutils [-] CMD "blkid overlay -s UUID -o value" returned: 2 in 0.006s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2024-02-29 17:43:46.543 209 DEBUG oslo_concurrency.processutils [-] 'blkid overlay -s UUID -o value' failed. Not Retrying. execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:473
2024-02-29 17:43:46.543 209 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[139919505952080]: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2
Stdout: ''
Stderr: '' _process_cmd /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:490
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 487, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 255, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/os_brick/privileged/rootwrap.py", line 197, in execute_root
    return custom_execute(*cmd, shell=False, run_as_root=False, **kwargs)
  File "/usr/lib/python3.9/site-packages/os_brick/privileged/rootwrap.py", line 145, in custom_execute
    return putils.execute(on_execute=on_execute,
  File "/usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py", line 438, in execute
    raise ProcessExecutionError(exit_code=_returncode,
oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2

I don't see how this os_brick error is connected to Ceph. Let's work our way down from the top to make sure our assumptions are correct.

I understand that /etc/ceph/ceph.client.openstack.keyring exists on the container host. Do we know if the nova containers have it (along with the ceph.conf)? Was a nova secret.xml created? For example:

bash-5.1$ cat /etc/nova/secret.xml
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.openstack secret</name>
  </usage>
  <uuid>604c9994-1d82-11ed-8ae5-5254003d6107</uuid>
</secret>
bash-5.1$

Can we use NovaEnableRbdBackend: true?
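
The secret.xml check above can be sketched in a few lines. This is a minimal illustration, assuming the file has the same shape as the example in this comment; `read_ceph_secret` is a hypothetical helper name, not part of Nova or libvirt.

```python
import xml.etree.ElementTree as ET


def read_ceph_secret(xml_text: str) -> dict:
    """Extract the uuid and usage details from a libvirt secret definition."""
    root = ET.fromstring(xml_text)
    usage = root.find("usage")
    return {
        "uuid": (root.findtext("uuid") or "").strip(),
        "usage_type": usage.get("type") if usage is not None else None,
        "name": (usage.findtext("name") or "").strip() if usage is not None else None,
    }
```

The returned `uuid` can then be compared by hand with the `rbd_secret_uuid` value that nova-compute is configured with.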

fultonj commented Mar 4, 2024

update2: I may be missing this https://github.com/openstack-k8s-operators/ci-framework/blob/34de0c6392c1c713af46a8b405e7123eacc89950/ci_framework/roles/hci_prepare/templates/configmap-ceph-nova.yml.j2#L8

Yes, if you're missing the nova.conf override that tells [libvirt] to use rbd, that would explain the problem. Let's see how it works with those parameters.
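
The nova.conf override discussed here can be sketched as follows. This is a minimal illustration, not the ci-framework template itself: `images_type`, `images_rbd_pool`, `images_rbd_ceph_conf`, `rbd_user`, and `rbd_secret_uuid` are Nova `[libvirt]` options, while the helper name `render_libvirt_rbd_conf` and its defaults (the `vms` pool, the `openstack` client, `/etc/ceph/ceph.conf`) are assumptions based on the names used in this thread.

```python
import configparser
import io


def render_libvirt_rbd_conf(
    secret_uuid: str,
    pool: str = "vms",
    user: str = "openstack",
    ceph_conf: str = "/etc/ceph/ceph.conf",
) -> str:
    """Render a nova.conf fragment that points libvirt at an RBD pool."""
    parser = configparser.ConfigParser()
    parser["libvirt"] = {
        "images_type": "rbd",
        "images_rbd_pool": pool,
        "images_rbd_ceph_conf": ceph_conf,
        "rbd_user": user,
        # Must match the uuid of the libvirt secret that holds the Ceph key.
        "rbd_secret_uuid": secret_uuid,
    }
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()
```

Calling `render_libvirt_rbd_conf("<uuid from secret.xml>")` yields a `[libvirt]` section that could be dropped into a nova-compute config snippet; the exact delivery mechanism (config map, extra config service) is outside this sketch.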

@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 4b3bfc9 to 4a5aa2c Compare March 5, 2024 13:11

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 280,4a5aa2c50d7c4997ff10cd58985811ccdb367e69

@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 4a5aa2c to a37e824 Compare March 5, 2024 13:19

fultonj commented Mar 5, 2024

The ceph related changes look good.

@bogdando bogdando force-pushed the workloads_cinder_ceph branch 4 times, most recently from bd5eac1 to 2e0de27 Compare March 5, 2024 16:01

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/2c4fedb2fb344292bfdd2c23101962c1

data-plane-adoption-osp-17-to-extracted-crc FAILURE in 2h 07m 24s
✔️ adoption-docs-preview SUCCESS in 1m 44s

@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 2e0de27 to 93155f8 Compare March 6, 2024 14:21

bogdando commented Mar 6, 2024

@GIBI on my dev setup I get this:

Mar 06 16:04:26 standalone.localdomain nova_compute[509696]: ERROR:__main__:Failed to change ownership of /var/lib/nova/compute_id to 42436:42436
Mar 06 16:04:26 standalone.localdomain nova_compute[509696]: Traceback (most recent call last):
Mar 06 16:04:26 standalone.localdomain nova_compute[509696]:   File "/usr/local/bin/kolla_set_configs", line 343, in set_perms
Mar 06 16:04:26 standalone.localdomain nova_compute[509696]:     os.chown(path, uid, gid)
Mar 06 16:04:26 standalone.localdomain nova_compute[509696]: PermissionError: [Errno 1] Operation not permitted: '/var/lib/nova/compute_id'

A similar failure occurs when I subsequently re-deploy the same EDPM node without a full cleanup:

ssh -i /home/cloud-user/install_yamls/out/edpm/ansibleee-ssh-key-id_rsa [email protected] 'echo 940ea0e1-c87f-4731-a9ee-c96d9d1a712a | sudo tee /var/lib/nova/compute_id && sudo chown 42436:42436 /var/lib/nova/compute_id && sudo chcon -t container_file_t /var/lib/nova/compute_id'
tee: /var/lib/nova/compute_id: Operation not permitted
940ea0e1-c87f-4731-a9ee-c96d9d1a712a

This idempotency issue is related to using the chattr command to lock that file. For the latter, the fix is #333. I am not sure about the kolla part.

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9b3b601c9793423da7853533cd268c11

data-plane-adoption-osp-17-to-extracted-crc FAILURE in 2h 26m 20s
✔️ adoption-docs-preview SUCCESS in 2m 08s

@bogdando

recheck new dep

rdoproject pushed a commit to rdo-infra/rdo-jobs that referenced this pull request Apr 17, 2024
Unset EDPM_COMPUTE_CEPH_NOVA temporarily.

We will enable the setting back, after merging these:
- openstack-k8s-operators/install_yamls#757
- openstack-k8s-operators/data-plane-adoption#280

Until then, such deployments with Ceph backend cannot be covered.

Change-Id: Iaadcbf6a6b1483558ae186cc5f6cc6083f215dc7
Signed-off-by: Bogdan Dobrelya <[email protected]>
rdoproject pushed a commit to rdo-infra/rdo-jobs that referenced this pull request Apr 17, 2024
This reverts commit d113b32.
As we have now
openstack-k8s-operators/data-plane-adoption#280
and
openstack-k8s-operators/data-plane-adoption#280
merged, start testing both deployment paths for Nova Libvirt local
and ceph backend options.

Change-Id: Ia88e3c8f197efe0a4efac7527e1bad0fd617b856
@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 4f284c6 to 9837f39 Compare April 17, 2024 14:24

This change depends on a change that failed to merge.

Changes openstack-k8s-operators/install_yamls#757, https://review.rdoproject.org/r/c/rdo-jobs/+/52932 are needed.

@bogdando bogdando force-pushed the workloads_cinder_ceph branch from 9837f39 to 5da9222 Compare April 17, 2024 14:52
rdoproject pushed a commit to rdo-infra/rdo-jobs that referenced this pull request Apr 19, 2024
Unset EDPM_COMPUTE_CEPH_NOVA temporarily.

We will enable the setting back, after merging these:
- openstack-k8s-operators/install_yamls#757
- openstack-k8s-operators/data-plane-adoption#280

Until then, such deployments with Ceph backend cannot be covered.

Change-Id: Iaadcbf6a6b1483558ae186cc5f6cc6083f215dc7
Signed-off-by: Bogdan Dobrelya <[email protected]>
rdoproject pushed a commit to rdo-infra/rdo-jobs that referenced this pull request Apr 19, 2024
This reverts commit d113b32.
As we have now
openstack-k8s-operators/data-plane-adoption#280
and
openstack-k8s-operators/data-plane-adoption#280
merged, start testing both deployment paths for Nova Libvirt local
and ceph backend options.

Change-Id: Ia88e3c8f197efe0a4efac7527e1bad0fd617b856
@bogdando bogdando force-pushed the workloads_cinder_ceph branch 2 times, most recently from a0ca760 to fc9046e Compare April 19, 2024 13:36

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 280,fc9046ebc775e25f7e226f7eda2832fbf321693f

@bogdando bogdando removed the check-before-merge/depends-on Don't forget to check depends-on before merging label Apr 19, 2024
Enable back the cinder volume commands on the source cloud, and
resume testing of the ceph-backed volume attached to the test VM.

Extend volume/backup/snapshot/attachment commands to wait for
the previous step results.

Follow the EDPM Post Ceph steps of HCI VA to prepare adopted
workloads for using Ceph backend on EDPM.

Add Nova discover host command (step 5 of the HCI VA).

Add Nova Ceph custom configs to properly configure ceph
vms pool for libvirt.

Combine nova-ceph related configurations and nova FFU related
ones into a single nova-compute-extraconfig service (by design,
having two dataplane services for Nova in the same node set is
not supported).

Add a note about the available choices of libvirt storage backends for Nova.

Add nova_libvirt_backend to control whether to deploy with the local
or the Ceph storage backend on EDPM.

Signed-off-by: Bohdan Dobrelia <[email protected]>
@bogdando bogdando force-pushed the workloads_cinder_ceph branch from fc9046e to 140aa59 Compare April 19, 2024 13:56

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/aa388c68941842b1a90b5f82a262894a

✔️ data-plane-adoption-osp-17-to-extracted-crc SUCCESS in 2h 27m 26s
✔️ data-plane-adoption-osp-17-to-extracted-crc-minimal-no-ceph SUCCESS in 2h 13m 48s
adoption-docs-preview RETRY_LIMIT in 1m 35s

@bogdando

/test adoption-docs-preview

@bogdando

recheck adoption-docs-preview

@jistr jistr added the check-before-merge/depends-on Don't forget to check depends-on before merging label Apr 22, 2024
@bogdando bogdando removed the check-before-merge/depends-on Don't forget to check depends-on before merging label Apr 22, 2024
@bogdando

good to go


jistr commented Apr 23, 2024

The dependency hasn't landed yet? Please, let's only remove the dependency label when there is no depends-on or when all of it has landed. Its purpose isn't to block; it is to alert the reviewer who is about to push the merge button to check the dependency order. I created it for myself after merging a few patches in the wrong order because I didn't notice the dependency :).

@jistr jistr added the check-before-merge/depends-on Don't forget to check depends-on before merging label Apr 23, 2024
rdoproject pushed a commit to rdo-infra/rdo-jobs that referenced this pull request Apr 23, 2024
This reverts commit d113b32.
As we have now
openstack-k8s-operators/data-plane-adoption#280
and
openstack-k8s-operators/data-plane-adoption#280
merged, start testing both deployment paths for Nova Libvirt local
and ceph backend options.

Change-Id: Ia88e3c8f197efe0a4efac7527e1bad0fd617b856

jistr commented Apr 23, 2024

The dependency is merged, merging this one too.

@jistr jistr merged commit 1b67ee2 into openstack-k8s-operators:main Apr 23, 2024
3 checks passed
@jistr jistr mentioned this pull request Apr 23, 2024
@bogdando bogdando deleted the workloads_cinder_ceph branch April 23, 2024 12:01
Labels
check-before-merge/depends-on Don't forget to check depends-on before merging
8 participants