OCM/ACM integration #370

Merged
merged 10 commits on Jan 10, 2023
4 changes: 4 additions & 0 deletions README.md
@@ -69,6 +69,7 @@ Scenario type | Kubernetes | OpenShift
[Application_outages](docs/application_outages.md) | :heavy_check_mark: | :heavy_check_mark: |
[PVC scenario](docs/pvc_scenario.md) | :heavy_check_mark: | :heavy_check_mark: |
[Network_Chaos](docs/network_chaos.md) | :heavy_check_mark: | :heavy_check_mark: |
[ManagedCluster Scenarios](docs/managedcluster_scenarios.md) | :heavy_check_mark: | :question: |


### Kraken scenario pass/fail criteria and report
@@ -96,6 +97,9 @@ Kraken supports capturing metrics for the duration of the scenarios defined in t
### Alerts
In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with Prometheus expressions to validate, alerts on them, and exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or to alert on abnormalities observed in the cluster based on the metrics. Information on enabling and leveraging this feature can be found [here](docs/alerts.md).
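
For reference, a minimal sketch of what an entry in such a profile might look like; the field names (`expr`, `description`, `severity`) are assumptions here, so treat [docs/alerts.md](docs/alerts.md) as the authoritative reference:

```yaml
# Hypothetical alerts-profile entry: evaluate a PromQL expression and
# tag the result with a severity that influences the exit code.
- expr: 'increase(kube_pod_container_status_restarts_total{namespace=~"openshift-.*"}[10m]) > 5'
  description: Pods are restarting frequently in openshift namespaces
  severity: warning
```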

### OCM / ACM integration

Kraken supports injecting faults into [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management) managed clusters through [ManagedCluster Scenarios](docs/managedcluster_scenarios.md).
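
As a rough illustration of how these scenarios are wired in (the linked example scenario file is the authoritative reference, and the surrounding config keys here are assumptions), the scenario files go under the `managedcluster_scenarios` scenario type in the Kraken config:

```yaml
# Illustrative Kraken config excerpt; only the managedcluster_scenarios
# entry is specific to this integration, the rest is assumed boilerplate.
kraken:
  chaos_scenarios:
    - managedcluster_scenarios:
        - scenarios/kube/managedcluster_scenarios_example.yml
```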

### Blogs and other useful resources
- Blog post on introduction to Kraken: https://www.openshift.com/blog/introduction-to-kraken-a-chaos-tool-for-openshift/kubernetes
36 changes: 36 additions & 0 deletions docs/managedcluster_scenarios.md
@@ -0,0 +1,36 @@
### ManagedCluster Scenarios

[ManagedCluster](https://open-cluster-management.io/concepts/managedcluster/) scenarios provide a way to integrate Kraken with [Open Cluster Management (OCM)](https://open-cluster-management.io/) and [Red Hat Advanced Cluster Management for Kubernetes (ACM)](https://www.redhat.com/en/technologies/management/advanced-cluster-management).

ManagedCluster scenarios leverage [ManifestWorks](https://open-cluster-management.io/concepts/manifestwork/) to inject faults into the ManagedClusters.
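
For orientation, a trimmed-down sketch of the kind of ManifestWork applied on the hub (the full template used by these scenarios lives in `kraken/managedcluster_scenarios/manifestwork.j2`):

```yaml
# Minimal ManifestWork sketch: created in the hub namespace that matches the
# ManagedCluster name, it delivers a Job that scales the klusterlet deployments.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: managedcluster-scenarios-template
  namespace: cluster1          # hub namespace == ManagedCluster name
spec:
  workload:
    manifests:
      - apiVersion: batch/v1
        kind: Job
        # ...Job spec (service account, kubectl image, scale commands) as in the full template
```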

The following ManagedCluster chaos scenarios are supported:

1. **managedcluster_start_scenario**: Scenario to start the ManagedCluster instance.
2. **managedcluster_stop_scenario**: Scenario to stop the ManagedCluster instance.
3. **managedcluster_stop_start_scenario**: Scenario to stop and then start the ManagedCluster instance.
4. **start_klusterlet_scenario**: Scenario to start the klusterlet of the ManagedCluster instance.
5. **stop_klusterlet_scenario**: Scenario to stop the klusterlet of the ManagedCluster instance.
6. **stop_start_klusterlet_scenario**: Scenario to stop and start the klusterlet of the ManagedCluster instance.

ManagedCluster scenarios can be injected by placing the ManagedCluster scenario config files under the `managedcluster_scenarios` option in the Kraken config. Refer to the [managedcluster_scenarios_example](https://github.com/redhat-chaos/krkn/blob/main/scenarios/kube/managedcluster_scenarios_example.yml) config file.

```yaml
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
    managedcluster_name: cluster1    # ManagedCluster on which the scenario is to be injected; multiple names can be set, separated by commas
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
    instance_count: 1                # Number of ManagedClusters matching the label selector on which to perform the action
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
    # For OCM to detect a ManagedCluster as unavailable, the hub has to wait 5 * leaseDurationSeconds
    # (default leaseDurationSeconds = 60 sec, i.e. 300 sec), so the timeout should allow for that
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
```
71 changes: 71 additions & 0 deletions kraken/kubernetes/client.py
@@ -175,6 +175,26 @@ def list_killable_nodes(label_selector=None):
return nodes


# List managedclusters attached to the hub that can be killed
def list_killable_managedclusters(label_selector=None):
managedclusters = []
try:
ret = custom_object_client.list_cluster_custom_object(
group="cluster.open-cluster-management.io",
version="v1",
plural="managedclusters",
label_selector=label_selector
)
except ApiException as e:
logging.error("Exception when calling CustomObjectsApi->list_cluster_custom_object: %s\n" % e)
raise e
for managedcluster in ret['items']:
conditions = managedcluster['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if available and available[0]['status'] == 'True':
managedclusters.append(managedcluster['metadata']['name'])
return managedclusters

# List pods in the given namespace
def list_pods(namespace, label_selector=None):
pods = []
@@ -362,6 +382,33 @@ def create_job(body, namespace="default"):
raise


def create_manifestwork(body, namespace):
try:
api_response = custom_object_client.create_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
body=body,
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->create_namespaced_custom_object: %s\n" % e)


def delete_manifestwork(namespace):
try:
api_response = custom_object_client.delete_namespaced_custom_object(
group="work.open-cluster-management.io",
version="v1",
plural="manifestworks",
name="managedcluster-scenarios-template",
namespace=namespace
)
return api_response
except ApiException as e:
print("Exception when calling CustomObjectsApi->delete_namespaced_custom_object: %s\n" % e)

def get_job_status(name, namespace="default"):
try:
return batch_cli.read_namespaced_job_status(
@@ -814,6 +861,30 @@ def watch_node_status(node, status, timeout, resource_version):
watch_resource.stop()


# Watch for a specific managedcluster status
# TODO: Implement this with a watcher instead of polling
def watch_managedcluster_status(managedcluster, status, timeout):
elapsed_time = 0
while True:
conditions = custom_object_client.get_cluster_custom_object_status(
"cluster.open-cluster-management.io", "v1", "managedclusters", managedcluster
)['status']['conditions']
available = list(filter(lambda condition: condition['reason'] == 'ManagedClusterAvailable', conditions))
if status == "True":
if available and available[0]['status'] == "True":
logging.info("Status of managedcluster " + managedcluster + ": Available")
return True
else:
if not available:
logging.info("Status of managedcluster " + managedcluster + ": Unavailable")
return True
time.sleep(2)
elapsed_time += 2
if elapsed_time >= timeout:
logging.info("Timeout waiting for managedcluster " + managedcluster + " to become: " + status)
return False


# Get the resource version for the specified node
def get_node_resource_version(node):
return cli.read_node(name=node).metadata.resource_version
Empty file.
34 changes: 34 additions & 0 deletions kraken/managedcluster_scenarios/common_managedcluster_functions.py
@@ -0,0 +1,34 @@
import random
import logging
import kraken.kubernetes.client as kubecli


# Pick a random managedcluster with specified label selector
def get_managedcluster(managedcluster_name, label_selector, instance_kill_count):
if managedcluster_name in kubecli.list_killable_managedclusters():
return [managedcluster_name]
elif managedcluster_name:
logging.info("managedcluster with provided managedcluster_name does not exist or the managedcluster might " "be in unavailable state.")
managedclusters = kubecli.list_killable_managedclusters(label_selector)
if not managedclusters:
raise Exception("Available managedclusters with the provided label selector do not exist")
logging.info("Available managedclusters with the label selector %s: %s" % (label_selector, managedclusters))
number_of_managedclusters = len(managedclusters)
if instance_kill_count == number_of_managedclusters:
return managedclusters
managedclusters_to_return = []
for i in range(instance_kill_count):
managedcluster_to_add = managedclusters[random.randint(0, len(managedclusters) - 1)]
managedclusters_to_return.append(managedcluster_to_add)
managedclusters.remove(managedcluster_to_add)
return managedclusters_to_return


# Wait until the managedcluster status becomes Available
def wait_for_available_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "True", timeout)


# Wait until the managedcluster status becomes Not Available
def wait_for_unavailable_status(managedcluster, timeout):
kubecli.watch_managedcluster_status(managedcluster, "Unknown", timeout)
140 changes: 140 additions & 0 deletions kraken/managedcluster_scenarios/managedcluster_scenarios.py
@@ -0,0 +1,140 @@
from jinja2 import Environment, FileSystemLoader
import os
import time
import logging
import sys
import yaml
import kraken.kubernetes.client as kubecli
import kraken.managedcluster_scenarios.common_managedcluster_functions as common_managedcluster_functions


class GENERAL:
def __init__(self):
pass


class managedcluster_scenarios():
def __init__(self):
self.general = GENERAL()

# managedcluster scenario to start the managedcluster
def managedcluster_start_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_start_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3 &
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 1 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_start_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_available_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)

# managedcluster scenario to stop the managedcluster
def managedcluster_stop_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting managedcluster_stop_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)),encoding='utf-8')
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0 &&
kubectl scale deployment.apps/klusterlet-registration-agent --replicas 0 -n open-cluster-management-agent""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("managedcluster_stop_scenario has been successfully injected!")
logging.info("Waiting for the specified timeout: %s" % timeout)
common_managedcluster_functions.wait_for_unavailable_status(managedcluster, timeout)
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)

# managedcluster scenario to stop and then start the managedcluster
def managedcluster_stop_start_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting managedcluster_stop_start_scenario injection")
self.managedcluster_stop_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.managedcluster_start_scenario(instance_kill_count, managedcluster, timeout)
logging.info("managedcluster_stop_start_scenario has been successfully injected!")

# managedcluster scenario to terminate the managedcluster
def managedcluster_termination_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster termination is not implemented, " "no action is going to be taken")

# managedcluster scenario to reboot the managedcluster
def managedcluster_reboot_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster reboot is not implemented," " no action is going to be taken")

# managedcluster scenario to start the klusterlet
def start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting start_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 3""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("start_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)

# managedcluster scenario to stop the klusterlet
def stop_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting stop_klusterlet_scenario injection")
file_loader = FileSystemLoader(os.path.abspath(os.path.dirname(__file__)))
env = Environment(loader=file_loader, autoescape=False)
template = env.get_template("manifestwork.j2")
body = yaml.safe_load(
template.render(managedcluster_name=managedcluster,
args="""kubectl scale deployment.apps/klusterlet --replicas 0""")
)
kubecli.create_manifestwork(body, managedcluster)
logging.info("stop_klusterlet_scenario has been successfully injected!")
time.sleep(30) # until https://github.com/open-cluster-management-io/OCM/issues/118 gets solved
except Exception as e:
logging.error("managedcluster scenario exiting due to Exception %s" % e)
sys.exit(1)
finally:
logging.info("Deleting manifestworks")
kubecli.delete_manifestwork(managedcluster)

# managedcluster scenario to stop and start the klusterlet
def stop_start_klusterlet_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("Starting stop_start_klusterlet_scenario injection")
self.stop_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
time.sleep(10)
self.start_klusterlet_scenario(instance_kill_count, managedcluster, timeout)
logging.info("stop_start_klusterlet_scenario has been successfully injected!")

# managedcluster scenario to crash the managedcluster
def managedcluster_crash_scenario(self, instance_kill_count, managedcluster, timeout):
logging.info("managedcluster crash scenario is not implemented, " "no action is going to be taken")

68 changes: 68 additions & 0 deletions kraken/managedcluster_scenarios/manifestwork.j2
@@ -0,0 +1,68 @@
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
namespace: {{managedcluster_name}}
name: managedcluster-scenarios-template
spec:
workload:
manifests:
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scale-deploy
namespace: open-cluster-management
rules:
- apiGroups: ["apps"]
resources: ["deployments/scale"]
verbs: ["patch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get"]
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scale-deploy-to-sa
namespace: open-cluster-management-agent
subjects:
- kind: ServiceAccount
name: internal-kubectl
namespace: open-cluster-management
roleRef:
kind: ClusterRole
name: scale-deploy
apiGroup: rbac.authorization.k8s.io
- apiVersion: v1
kind: ServiceAccount
metadata:
name: internal-kubectl
namespace: open-cluster-management
- apiVersion: batch/v1
kind: Job
metadata:
name: managedcluster-scenarios-template
namespace: open-cluster-management
spec:
template:
spec:
serviceAccountName: internal-kubectl
containers:
- name: kubectl
image: quay.io/sighup/kubectl-kustomize:1.21.6_3.9.1
command: ["/bin/sh", "-c"]
args:
- {{args}}
restartPolicy: Never
backoffLimit: 0