Merge pull request #461 from fmount/ceph_doc
Rework Ceph RBD migration documentation
klgill authored Jun 7, 2024
2 parents b73a146 + 04e42d9 commit bb5aef5
Showing 10 changed files with 612 additions and 63 deletions.
44 changes: 44 additions & 0 deletions docs_user/assemblies/assembly_migrating-ceph-cluster.adoc
@@ -0,0 +1,44 @@
ifdef::context[:parent-context: {context}]

[id="ceph-migration_{context}"]

= Migrating the {CephCluster} Cluster

:context: migrating-ceph

:toc: left
:toclevels: 3

ifdef::parent-context[:context: {parent-context}]
ifndef::parent-context[:!context:]

In the context of data plane adoption, where the {rhos_prev_long}
({OpenStackShort}) services are redeployed in {OpenShift}, you migrate a
{OpenStackPreviousInstaller}-deployed {CephCluster} cluster by using a process
called “externalizing” the {CephCluster} cluster.

There are two deployment topologies that include an internal {CephCluster}
cluster:

* {OpenStackShort} includes dedicated {CephCluster} nodes to host object
storage daemons (OSDs)

* Hyperconverged Infrastructure (HCI), where Compute and Storage services are
colocated on hyperconverged nodes

In either scenario, there are some {Ceph} processes that are deployed on
{OpenStackShort} Controller nodes: {Ceph} monitors, Ceph Object Gateway (RGW),
Rados Block Device (RBD), Ceph Metadata Server (MDS), Ceph Dashboard, and NFS
Ganesha. To migrate your {CephCluster} cluster, you must decommission the
Controller nodes and move the {Ceph} daemons to a set of target nodes that are
already part of the {CephCluster} cluster.
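
Before you begin, it can help to inventory which daemons currently run on each node. The following sketch is illustrative only: the here-string stands in for real `sudo cephadm shell -- ceph orch ps` output, and the host names are assumptions.

```shell
#!/bin/sh
# Illustrative: count Ceph daemons per host from `ceph orch ps`-style output.
# The sample variable stands in for `sudo cephadm shell -- ceph orch ps`.
sample='mon.controller-0      controller-0  running
mgr.controller-0      controller-0  running
rgw.rgw.controller-1  controller-1  running
osd.0                 ceph-0        running'

# Field 2 is the host name; tally daemons per host and print sorted.
printf '%s\n' "$sample" \
  | awk '{count[$2]++} END {for (h in count) print h, count[h]}' \
  | sort
```

Hosts with a nonzero count other than the target {Ceph} nodes indicate daemons that still have to be moved off the Controllers.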

include::../modules/con_ceph-daemon-cardinality.adoc[leveloffset=+1]

include::assembly_migrating-ceph-monitoring-stack.adoc[leveloffset=+1]

include::../modules/proc_migrating-ceph-mds.adoc[leveloffset=+1]

include::assembly_migrating-ceph-rgw.adoc[leveloffset=+1]

include::assembly_migrating-ceph-rbd.adoc[leveloffset=+1]
32 changes: 17 additions & 15 deletions docs_user/assemblies/assembly_migrating-ceph-monitoring-stack.adoc
@@ -4,30 +4,32 @@

= Migrating the monitoring stack component to new nodes within an existing {Ceph} cluster

The Ceph Dashboard module adds web-based monitoring and administration to the
Ceph Manager.
With {OpenStackPreviousInstaller}-deployed {Ceph}, this component is enabled as
part of the overcloud deploy and it is composed of the following:

- Ceph Manager module
- Grafana
- Prometheus
- Alertmanager
- Node exporter

The Ceph Dashboard containers are included through
`tripleo-container-image-prepare` parameters and the high availability relies
on `Haproxy` and `Pacemaker` deployed on the {OpenStackShort} front. For an
external {CephCluster} cluster, high availability is not supported. The goal of
this procedure is to migrate and relocate the Ceph Monitoring components to
free Controller nodes.

For this procedure, we assume that we are beginning with a {OpenStackShort}
based on {rhos_prev_ver} and a {Ceph} {CephRelease} deployment managed by
{OpenStackPreviousInstaller}. We assume that:

* {Ceph} has been upgraded to {CephRelease} and is managed by
cephadm/orchestrator
* Both the {Ceph} public and cluster networks are propagated,
through {OpenStackPreviousInstaller}, to the target nodes
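
These assumptions can be spot-checked before you start. The following sketch is illustrative only: the sample JSON stands in for real `sudo cephadm shell -- ceph versions` output, and the release string is an assumption. It checks that every daemon class reports a single {Ceph} release.

```shell
#!/bin/sh
# Illustrative: verify that all daemons report the same Ceph release.
# The sample stands in for `sudo cephadm shell -- ceph versions` output.
versions='{"mon":{"ceph version 18.2.0 (reef)":3},"mgr":{"ceph version 18.2.0 (reef)":3},"osd":{"ceph version 18.2.0 (reef)":12}}'

# Extract every "ceph version X.Y.Z" token and count distinct releases.
distinct=$(printf '%s\n' "$versions" \
  | grep -o 'ceph version [0-9][0-9.]*' \
  | sort -u | wc -l | tr -d ' ')

if [ "$distinct" -eq 1 ]; then
  echo "single release across all daemons"
else
  echo "mixed releases detected" >&2
fi
```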

include::../modules/proc_completing-prerequisites-for-migrating-ceph-monitoring-stack.adoc[leveloffset=+1]

41 changes: 34 additions & 7 deletions docs_user/assemblies/assembly_migrating-ceph-rbd.adoc
@@ -4,13 +4,40 @@

= Migrating Red Hat Ceph Storage RBD to external RHEL nodes

For Hyperconverged Infrastructure (HCI) or dedicated Storage nodes that are
running {Ceph} version 6 or later, you must migrate the daemons that are
included in the {rhos_prev_long} control plane into the existing external Red
Hat Enterprise Linux (RHEL) nodes. The external RHEL nodes typically include
the Compute nodes for an HCI environment or dedicated storage nodes.

To migrate Red Hat Ceph Storage Rados Block Device (RBD), your environment must
meet the following requirements:

* {Ceph} is running version 6 or later and is managed by cephadm.
* NFS Ganesha is migrated from a {OpenStackPreviousInstaller}-based
deployment to cephadm. For more information, see
xref:creating-a-ceph-nfs-cluster_migrating-databases[Creating a NFS Ganesha
cluster].
* Both the {Ceph} public and cluster networks are propagated, with
{OpenStackPreviousInstaller}, to the target nodes.
* The Ceph MDS, Ceph Monitoring stack, Ceph RGW, and other services are
migrated to the target nodes.
ifeval::["{build}" != "upstream"]
* The daemons distribution follows the cardinality constraints that are
described in link:https://access.redhat.com/articles/1548993[Red Hat Ceph
Storage: Supported configurations].
endif::[]
* The {Ceph} cluster is healthy, and the `ceph -s` command returns `HEALTH_OK`.
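
The health requirement can be expressed as a simple pre-flight gate. The following sketch is illustrative only: the variable stands in for real `sudo cephadm shell -- ceph -s` output, and the cluster ID shown is a made-up placeholder.

```shell
#!/bin/sh
# Illustrative pre-flight gate: proceed only when the cluster reports
# HEALTH_OK. The variable stands in for `sudo cephadm shell -- ceph -s`.
status='  cluster:
    id:     00000000-0000-0000-0000-000000000000
    health: HEALTH_OK'

if printf '%s\n' "$status" | grep -q 'health: HEALTH_OK'; then
  echo "cluster healthy: safe to start the migration"
else
  echo "cluster not healthy: resolve warnings first" >&2
  exit 1
fi
```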

During the procedure to migrate the Ceph Mon daemons, the following actions
occur:

* The mon IP addresses are moved to the target {Ceph} nodes.
* The existing Controller nodes are drained and decommissioned.
* Additional monitors are deployed to the target nodes, and they are promoted
as `_admin` nodes that can be used to manage the {CephCluster} cluster and
perform day 2 operations.
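
After the migration you can confirm which hosts carry the `_admin` label. The following sketch is illustrative only: the here-string stands in for real `sudo cephadm shell -- ceph orch host ls` output, and the host names and addresses are assumptions.

```shell
#!/bin/sh
# Illustrative: list hosts carrying the _admin label from
# `ceph orch host ls`-style output (host names are assumptions).
hosts='HOST          ADDR           LABELS
ceph-0        192.168.24.11  mon mgr _admin
ceph-1        192.168.24.12  mon mgr _admin
controller-0  192.168.24.21  mon'

# Skip the header row, keep lines mentioning _admin, print the host name.
printf '%s\n' "$hosts" | awk 'NR > 1 && /_admin/ {print $1}'
```

Any host printed here can be used to run `ceph` commands and perform day 2 operations against the cluster.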

include::../modules/proc_migrating-mgr-from-controller-nodes.adoc[leveloffset=+1]

include::../modules/proc_migrating-mon-from-controller-nodes.adoc[leveloffset=+1]
2 changes: 0 additions & 2 deletions docs_user/assemblies/assembly_migrating-ceph-rgw.adoc
@@ -11,8 +11,6 @@ To migrate Ceph Object Gateway (RGW), your environment must meet the following r
* {Ceph} is running version 6 or later and is managed by cephadm/orchestrator.
* An undercloud is still available, and the nodes and networks are managed by {OpenStackPreviousInstaller}.

include::../modules/con_ceph-daemon-cardinality.adoc[leveloffset=+1]

include::../modules/proc_completing-prerequisites-for-migrating-ceph-rgw.adoc[leveloffset=+1]

include::../modules/proc_migrating-the-rgw-backends.adoc[leveloffset=+1]
15 changes: 0 additions & 15 deletions docs_user/assemblies/ceph_migration.adoc

This file was deleted.

10 changes: 2 additions & 8 deletions docs_user/main.adoc
@@ -22,12 +22,6 @@ include::assemblies/assembly_adopting-openstack-control-plane-services.adoc[leve

include::assemblies/assembly_adopting-the-data-plane.adoc[leveloffset=+1]

include::assemblies/assembly_migrating-ceph-rbd.adoc[leveloffset=+1]

include::assemblies/assembly_migrating-ceph-rgw.adoc[leveloffset=+1]

include::modules/proc_migrating-ceph-mds.adoc[leveloffset=+1]

include::assemblies/assembly_migrating-ceph-monitoring-stack.adoc[leveloffset=+1]

include::assemblies/assembly_migrating-the-object-storage-service.adoc[leveloffset=+1]

include::assemblies/assembly_migrating-ceph-cluster.adoc[leveloffset=+1]
26 changes: 14 additions & 12 deletions docs_user/modules/con_ceph-daemon-cardinality.adoc
@@ -2,19 +2,19 @@

= {Ceph} daemon cardinality

{Ceph} 6 and later applies strict constraints in the way daemons can be
colocated within the same node.
ifeval::["{build}" != "upstream"]
For more information, see link:https://access.redhat.com/articles/1548993[Red Hat Ceph Storage: Supported configurations].
endif::[]
ifeval::["{build}" != "upstream"]
For more information about the procedure that is required to migrate the RGW component and keep an HA model using the Ceph ingress daemon, see link:{defaultCephURL}/object_gateway_guide/index#high-availability-for-the-ceph-object-gateway[High availability for the Ceph Object Gateway] in _Object Gateway Guide_.
endif::[]
ifeval::["{build}" != "downstream"]
The following document describes the procedure required to migrate the RGW
component (and keep an HA model by using the
https://docs.ceph.com/en/latest/cephadm/services/rgw/#high-availability-service-for-rgw[Ceph Ingress daemon])
in a common {OpenStackPreviousInstaller} scenario where the Controller nodes
represent the
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/rgw.yaml#L26-L30[spec placement]
where the service is deployed.
endif::[]
The resulting topology depends on the available hardware, as well as the amount
of {Ceph} services present in the Controller nodes that are going to be
retired.
As a general rule, the number of services that can be migrated depends on the
number of available nodes in the cluster. The following diagrams cover the
distribution of the {Ceph} daemons on the {Ceph} nodes where at least three
nodes are required in a scenario that includes only RGW and RBD, without the
{Ceph} Dashboard:

----
| | | |
| osd | mon/mgr/crash | rgw/ingress |
----
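
The label-driven layout above can be expressed as cephadm placement specs. The following fragment is a hedged sketch under stated assumptions — it is not the exact spec that {OpenStackPreviousInstaller} generates, and the service IDs and label names are illustrative:

```yaml
# Sketch: label-based placement for the minimal three-node RGW + RBD layout.
# Each daemon lands only on hosts carrying the matching label.
service_type: mon
placement:
  label: mon
---
service_type: mgr
placement:
  label: mgr
---
service_type: rgw
service_id: rgw
placement:
  label: rgw
```

With label-based placement, adding or removing a daemon from a node is a matter of adding or removing the corresponding host label.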

With the {dashboard}, and without {rhos_component_storage_file_first_ref}, at
least four nodes are required. The {dashboard} has no failover:

----
| | | |
| osd | rgw/ingress | (free) |
----

With the {dashboard} and the {rhos_component_storage_file}, at least five
nodes are required, and the {dashboard} has no failover:

----
| | | |
----
95 changes: 95 additions & 0 deletions docs_user/modules/proc_migrating-mgr-from-controller-nodes.adoc
@@ -0,0 +1,95 @@
= Migrating Ceph Manager daemons to {Ceph} nodes

The following section describes how to move Ceph Manager daemons from the
{rhos_prev_long} Controller nodes to a set of target nodes. Target nodes might
be pre-existing {Ceph} nodes, or {OpenStackShort} Compute nodes if {Ceph} is
deployed by {OpenStackPreviousInstaller} with an HCI topology.
This procedure assumes that Cephadm and the {Ceph} Orchestrator are the tools
that drive the Ceph Manager migration. As is done with the other Ceph daemons
(MDS, Monitoring, and RGW), the procedure uses the Ceph spec to modify the
placement and reschedule the daemons. Ceph Manager is run in an active/passive
fashion, and it also provides many modules, including the Ceph orchestrator.


.Prerequisites

* Configure the target nodes (CephStorage or ComputeHCI) to have both `storage`
and `storage_mgmt` networks to ensure that you can use both {Ceph} public and
cluster networks from the same node. This step requires you to interact with
{OpenStackPreviousInstaller}. From {rhos_prev_long} {rhos_prev_ver} and later,
you do not need to run a stack update.

.Procedure

. SSH into the target node and enable the firewall rules that are required to
reach the Ceph Manager service:
+
----
dports="6800:7300"
ssh heat-admin@<target_node> sudo iptables -I INPUT \
    -p tcp --match multiport --dports $dports -j ACCEPT
----
+
Repeat this step for each `<target_node>`.
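
Rather than repeating the command by hand, the per-node step can be scripted. The following dry-run sketch only prints the commands it would run; the node names and the `heat-admin` user are assumptions carried over from the example above.

```shell
#!/bin/sh
# Dry run: print the firewall command for each target node instead of
# executing it over SSH. Drop the echo to actually run the commands.
dports="6800:7300"
for node in ceph-0 ceph-1 ceph-2; do
  echo ssh heat-admin@"$node" sudo iptables -I INPUT \
    -p tcp --match multiport --dports "$dports" -j ACCEPT
done
```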

. Check that the rules are properly applied and persist them:
+
----
$ sudo iptables-save > /etc/sysconfig/iptables
$ sudo systemctl restart iptables
----
+
. Prepare the target node to host the new Ceph Manager daemon, and add the `mgr`
label to the target node:
+
----
ceph orch host label add <target_node> mgr
----
+
Replace `<target_node>` with the hostname of a host listed in the output of
the `ceph orch host ls` command.
+
Repeat this step for each `<target_node>` that will host a Ceph Manager
daemon.
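
As with the firewall step, labeling several nodes can be scripted. The following dry-run sketch prints one label command per target node; the node names are assumptions.

```shell
#!/bin/sh
# Dry run: print the `ceph orch host label add` command for each target
# node (names are assumptions). Drop the echo to run the commands inside
# a cephadm shell.
for node in ceph-0 ceph-1 ceph-2; do
  echo "ceph orch host label add $node mgr"
done
```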
. Get the Ceph Manager spec:
+
----
sudo cephadm shell -- ceph orch ls --export mgr > mgr.yaml
----

. Edit the retrieved spec and add the `label: mgr` section to the `placement`
section:
+
[source,yaml]
----
service_type: mgr
service_id: mgr
placement:
label: mgr
----
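
If you prefer to pin explicit hosts instead of relying on a label, the placement section can also enumerate them. This variant is a sketch only; the host names are assumptions:

```yaml
# Sketch: explicit host list as an alternative to the mgr label.
service_type: mgr
service_id: mgr
placement:
  hosts:
    - ceph-0
    - ceph-1
    - ceph-2
```

The label-based form is usually easier to maintain, because adding a Manager host then requires only a label change rather than a spec edit.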

. Save the spec in the `/tmp/mgr.yaml` file.
. Apply the spec with cephadm by using the orchestrator:
+
----
sudo cephadm shell -m /tmp/mgr.yaml -- ceph orch apply -i /mnt/mgr.yaml
----
+
As a result of this procedure, you see a Ceph Manager daemon count that matches
the number of hosts where the `mgr` label is added.

. Verify that the new Ceph Manager daemons are created in the target nodes:
+
----
ceph orch ps | grep -i mgr
ceph -s
----
+
[NOTE]
The procedure does not shrink the number of Ceph Manager daemons. The count
grows by the number of target nodes, and migrating the Ceph Monitor daemons to
{Ceph} nodes decommissions the stand-by Ceph Manager instances. For more
information, see
xref:migrating-mon-from-controller-nodes_migrating-ceph-rbd[Migrating Ceph Monitor
daemons to {Ceph} nodes].