From 34718ccfcedc30ef8370416abe78eda78f96dcbc Mon Sep 17 00:00:00 2001
From: Seshachalam Yerasala Venkata <seshachalam.yerasala.venkata@sap.com>
Date: Wed, 25 Sep 2024 20:55:53 +0530
Subject: [PATCH 1/7] Add DEP-06: Immutable ETCD Backups

---------

Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
---
 docs/proposals/06-immutable-etcd-backups.md | 292 ++++++++++++++++++++
 1 file changed, 292 insertions(+)
 create mode 100644 docs/proposals/06-immutable-etcd-backups.md

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
new file mode 100644
index 000000000..643342951
--- /dev/null
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -0,0 +1,292 @@
+---
+title: Immutable ETCD Backups
+dep-number: 06
+creation-date: 2024-09-25
+status: implementable
+authors:
+- "@seshachalam-yv"
+- "@renormalize"
+- "@ishan16696"
+reviewers:
+- "@unmarshall"
+---
+
+# DEP-06: Immutable ETCD Backups
+
+## Table of Contents
+
+- [DEP-06: Immutable ETCD Backups](#dep-06-immutable-etcd-backups)
+  - [Table of Contents](#table-of-contents)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+    - [Overview](#overview)
+    - [Detailed Design](#detailed-design)
+      - [Bucket Immutability Mechanism](#bucket-immutability-mechanism)
+      - [ETCD Backup Configuration](#etcd-backup-configuration)
+      - [Handling of Hibernated Clusters](#handling-of-hibernated-clusters)
+    - [New Compaction Controller Flag](#new-compaction-controller-flag)
+    - [Garbage Collection during Compaction for Hibernated ETCD Clusters](#garbage-collection-during-compaction-for-hibernated-etcd-clusters)
+    - [Excluding Snapshots Under Specific Circumstances](#excluding-snapshots-under-specific-circumstances)
+  - [Compatibility](#compatibility)
+  - [Implementation Steps](#implementation-steps)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [Operational Considerations](#operational-considerations)
+  - [Alternatives](#alternatives)
+    - [Object-Level Immutability Policies vs. Bucket-Level Immutability Policies](#object-level-immutability-policies-vs-bucket-level-immutability-policies)
+      - [Feasibility Study: Immutable Backups on Cloud Providers](#feasibility-study-immutable-backups-on-cloud-providers)
+      - [Considerations for Object-Level Immutability](#considerations-for-object-level-immutability)
+      - [Conclusion](#conclusion)
+        - [Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)
+  - [References](#references)
+  - [Glossary](#glossary)
+
+## Summary
+
+This proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.
+
+## Motivation
+
+Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster after it has irrecoverably gone down. Making backups immutable protects against unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
+
+### Goals
+
+- Implement immutable backup support for ETCD clusters.
+- Secure backup data against unintended or unauthorized modifications after creation.
+- Ensure backups are consistently available and intact for restoration purposes.
+
+### Non-Goals
+
+- Altering existing backup processes beyond what's necessary for immutability.
+- Implementing object-level immutability policies at this stage.
+- Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).
+
+## Proposal
+
+### Overview
+
+Introduce immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This will prevent data alterations after backup creation, enhancing data integrity and security. The implementation will focus on bucket-level immutability policies, as they are widely supported and easier to manage across different cloud providers.
+
+### Detailed Design
+
+#### Bucket Immutability Mechanism
+
+The Bucket Immutability feature configures an immutability policy for a cloud storage bucket, governing how long objects in the bucket must be retained in an immutable state. It also allows for locking the bucket's immutability policy, permanently preventing the policy from being reduced or removed.
+
+- **Supported by Major Providers:**
+  - **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
+  - **Amazon S3 (S3):** [Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
+  - **Azure Blob Storage (ABS):** [Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
+
+- **Not Supported:**
+  - **OpenStack Swift**
+
+**Implementation Details:**
+
+- Operators are responsible for creating or updating buckets with these immutability settings before ETCD Druid starts uploading snapshots.
+- Once the bucket is configured to be immutable, snapshots uploaded by `etcd-backup-restore` are also immutable and cannot be altered or deleted until the immutability period expires.
+- No additional configuration needs to be passed to etcd-druid.
+
+#### ETCD Backup Configuration
+
+Operators need to ensure that the ETCD backup configuration aligns with the immutability requirements. This includes setting appropriate immutability periods.
+
+
+#### Handling of Hibernated Clusters
+
+In scenarios where an ETCD cluster has been hibernated (scaled down to zero replicas) for a duration that exceeds the immutability period, backups may become mutable again (behaviour depends on cloud provider [refer](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
+
+There are two ways to maintain the immutability of snapshots:
+
+1. **Extend the Immutability of Snapshots**
+
+   Extending the immutability period at the bucket level to cover the extended hibernation period might seem like a solution. However, this approach is **not feasible** due to several significant drawbacks:
+
+   - **Global Impact on All Objects:** Since bucket-level immutability settings apply to all objects within the bucket, extending the immutability period would affect not only the existing snapshots but also all future snapshots uploaded to the bucket.
+   - **Increased Storage Costs:** Prolonging the immutability period means that snapshots cannot be deleted or modified until the extended period expires. This leads to the accumulation of old backups, consuming more storage space and significantly driving up storage costs over time.
+   - **Inefficient Resource Utilization:** Retaining all snapshots for a longer period than necessary is an inefficient use of storage resources, especially when only the latest snapshots are typically needed for restoration purposes.
+
+   Given these considerable disadvantages, extending the bucket-level immutability period is impractical for maintaining the immutability of snapshots during extended hibernation periods.
+
+2. **Take a New Snapshot**
+
+   By creating a new snapshot, immutability is retained for the new snapshot without affecting the immutability settings of the entire bucket or other snapshots.
+
+   Approach 2 is feasible and can be achieved in two ways:
+
+   - **Option A:** Bring the ETCD cluster back up by increasing the replicas, take a snapshot, and then hibernate the ETCD by setting the replicas back to zero **without the operator's intention**.
+     - **Challenges:**
+       - **Unintended State Changes:** Increasing the ETCD replicas to bring the cluster back up modifies the cluster's state without the operator's explicit intention, which could lead to unexpected behavior or conflicts with operational policies.
+       - **Operational Overhead:** Automating the process of scaling the ETCD cluster up and down adds complexity and potential risks, especially if the scaling operations interfere with other scheduled tasks or maintenance windows.
+
+   - **Option B:** Start an embedded ETCD instance and take a snapshot.
+     - **Advantages:**
+       - **Respects Operator Intentions:** Does not alter the state of the actual ETCD cluster, keeping it in hibernation as intended by the operator.
+       - **No Impact on Existing Setup:** Avoids modifying the ETCD cluster's replicas, preventing any unintended consequences.
+       - **Automated Process:** Can be integrated into existing maintenance jobs without manual intervention.
+
+**Proposed Solution:**
+
+*Option B is recommended* because it respects the operator's intention to keep the ETCD cluster hibernated and avoids the complexities and risks associated with modifying the cluster's state.
+
+Leverage the compaction job to take fresh snapshots periodically during hibernation. This ensures that:
+
+- **Immutability is Maintained:** New snapshots have a fresh immutability period, keeping backups protected.
+- **Operator Intentions are Respected:** The ETCD cluster remains in its hibernated state, as per the operator's intention.
+- **Storage Costs are Controlled:** Avoids unnecessary extension of immutability for all snapshots, preventing increased storage costs.
+
+### New Compaction Controller Flag
+
+**Flag:** `--hibernation-snapshot-interval`
+
+- **Type:** Duration
+- **Default:** `24h`
+- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
+
+### Garbage Collection during Compaction for Hibernated ETCD Clusters
+
+If an ETCD cluster is hibernated for a long duration, there is a chance that backup storage will accumulate a large number of snapshots. Since the ETCD cluster is not running, the `etcd-backup-restore` component is also not running to perform garbage collection of the snapshots.
+
+To address this, the compaction job should be enhanced to handle garbage collection during hibernation. This ensures that old snapshots are cleaned up appropriately, considering the immutability constraints.
+
+### Excluding Snapshots Under Specific Circumstances
+
+Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
+
+- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).
+
+## Compatibility
+
+The proposed changes are fully compatible with existing ETCD clusters and backup processes. Operators are responsible for creating or updating the immutability settings on the backup storage buckets, but no changes are required to the ETCD clusters themselves.
+
+- **Backward Compatibility:** Existing clusters without immutable buckets will continue to function without change.
+- **Forward Compatibility:** Clusters can opt-in to use immutable backups by configuring the bucket accordingly.
+
+## Implementation Steps
+
+1. **Enhance the Compaction Job:**
+     - Modify the compaction job in `etcd-backup-restore` to manage backup processes for hibernated clusters, including snapshot creation and garbage collection, while considering the new immutability constraints.
+     - Implement the `--hibernation-snapshot-interval` flag in the compaction controller. Ensure that the compaction job can start an embedded ETCD instance to take snapshots during hibernation.
+
+2. **Update Documentation:**
+     - Revise the documentation to reflect the changes and guide operators in effectively using the new immutability features.
+     - Provide detailed guidelines on configuring buckets with immutability settings.
+     - Document procedures for excluding snapshots when necessary.
+
+## Risks and Mitigations
+
+- **Increased Storage Costs:**
+
+  - **Risk:** Introducing immutability could lead to increased storage costs due to the inability to delete backups before the immutability period ends.
+  - **Mitigation:** Operators should carefully configure immutability periods and monitor storage utilization. Garbage collection will help mitigate long-term storage growth.
+
+- **Backup Gaps During Hibernation:**
+
+  - **Risk:** Hibernated clusters might not receive necessary backups, potentially leading to compliance issues.
+  - **Mitigation:** The compaction job's enhancement ensures that backups are taken during hibernation at configured intervals.
+
+- **Excluding Critical Snapshots:**
+
+  - **Risk:** Excluding snapshots during restoration might be misused or lead to incomplete data restoration.
+  - **Mitigation:** Restrict the ability to tag snapshots for exclusion to authorized personnel. Implement audit logging for actions that tag snapshots.
+
+## Operational Considerations
+
+Operators need to:
+
+- **Bucket Configuration:**
+
+  - Configure buckets with appropriate immutability settings before deploying ETCD clusters.
+  - Ensure that the immutability periods align with organizational policies.
+
+- **Compaction Job Configuration:**
+
+  - Set the `--hibernation-snapshot-interval` flag according to the desired snapshot frequency during hibernation.
+  - Monitor compaction jobs and logs for any issues.
+
+## Alternatives
+
+### Object-Level Immutability Policies vs. Bucket-Level Immutability Policies
+
+An alternative to implementing immutability via bucket-level immutability policies is to use object-level immutability policies. Object-level immutability allows for more granular control over the immutability periods of individual objects within a bucket, whereas bucket-level immutability applies a uniform immutability period to all objects in the bucket.
+
+#### Feasibility Study: Immutable Backups on Cloud Providers
+
+Major cloud storage providers such as Google Cloud Storage (GCS), Amazon S3, and Azure Blob Storage (ABS) support both bucket-level and object-level immutability mechanisms to enforce data immutability.
+
+1. **Bucket-Level Immutability Policies:**
+
+     - **Applies Uniformly:** Applies a uniform immutability period to all objects within a bucket.
+     - **Immutable Objects:** Once set, objects cannot be modified or deleted until the immutability period expires.
+     - **Simplified Management:** Simplifies management by applying the same policy to all objects.
+
+2. **Object-Level Immutability Policies:**
+
+     - **Granular Control:** Allows setting immutability periods on a per-object basis.
+     - **Flexible Immutability Durations:** Offers granular control, enabling different immutability durations for individual objects.
+     - **Varying Requirements:** Can accommodate varying immutability requirements for different types of backups.
+
+#### Considerations for Object-Level Immutability
+
+Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. For example, in the case of hibernated clusters where the ETCD cluster may not be running and backups are not being updated, object-level immutability allows extending the immutability of the latest snapshots without affecting the immutability of older backups.
+
+**Advantages:**
+
+- **Granular Control:** Allows setting different immutability periods for different objects, accommodating varying requirements.
+- **Efficient Resource Utilization:** Prevents unnecessary extension of immutability for all objects, potentially reducing storage costs.
+- **Enhanced Flexibility:** Can adjust immutability periods for specific backups as needed.
+
+**Disadvantages:**
+
+- **Provider Limitations:** Not all providers support enabling object-level immutability on existing buckets without additional steps. For instance, in GCS, enabling object-level immutability on existing buckets is currently not supported. This limitation necessitates creating new buckets or waiting for the feature to become available.
+- **Prerequisite Requirements:** In some providers, object-level immutability requires bucket-level immutability to be set first (e.g., in Amazon S3 and Azure Blob Storage), adding complexity to the configuration.
+- **Increased Complexity:** Managing immutability policies at the object level requires additional logic in backup processes and tooling.
+
+#### Conclusion
+
+While object-level immutability offers greater flexibility and control, current provider limitations and operational complexities make it less practical for immediate implementation. Specifically, the inability to enable object-level immutability on existing buckets in GCS and the prerequisite of bucket-level immutability in some providers are significant factors.
+
+Given these considerations, we propose starting with bucket-level immutability to achieve immediate enhancement of backup immutability with minimal changes to existing processes. This approach allows us to implement immutability features across all providers consistently.
+
+Once provider support for object-level immutability on existing buckets improves and operational complexities are addressed, we can consider adopting object-level immutability in the future to address specific requirements, such as varying immutability periods for different backups.
+
+These are the reasons why we are initially opting for bucket-level immutability, with the possibility of transitioning to object-level immutability when it becomes more feasible.
+
+##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
+
+
+| Feature                                                                 | GCS | AWS | Azure |
+|-------------------------------------------------------------------------|-----|-----|-------|
+| Can bucket-level immutability period be increased?                      | Yes | Yes* | Yes (only 5 times) |
+| Can bucket-level immutability period be decreased?                      | No  | Yes* | No    |
+| Is bucket-level immutability a prerequisite for object-level immutability? | No  | Yes | Yes (for existing buckets), No (for new buckets) |
+| Can object-level immutability period be increased?                      | Yes | Yes | Yes   |
+| Can object-level immutability period be decreased?                      | No  | No  | No    |
+| Support for enabling object-level immutability in existing buckets      | No (planned support soon) | Yes (only new objects will have immutability) | Yes (Azure handles the migration) |
+| Support for enabling object-level immutability in new buckets           | Yes | Yes | Yes   |
+| Precedence between bucket-level and object-level immutability periods   | Maximum of bucket or object-level immutability | Object-level immutability has precedence | Maximum of bucket or object-level immutability |
+
+> **Note:** *In AWS S3, changes to the bucket-level immutability period can be blocked by adding a specific bucket policy.
+
+</details>
+
+## References
+
+- [GCS Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
+- [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
+- [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
+- [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
+
+## Glossary
+
+- **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
+- **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.
+- **Hibernation:** Scaling down a cluster (or ETCD) to zero replicas to save resources.
+- **Immutability Period:** The duration for which data must remain immutable in storage before it can be modified or deleted.
+- **WORM (Write Once, Read Many):** A storage model where data, once written, cannot be modified or deleted until certain conditions are met.
+- **Immutability:** The property of an object being unchangeable after creation.
+- **Garbage Collection:** The process of deleting old or unnecessary data to free up storage space.
+
+---
\ No newline at end of file

From ba411e5a7bfbfdfb0a674c529220049384488447 Mon Sep 17 00:00:00 2001
From: Seshachalam <104052572+seshachalam-yv@users.noreply.github.com>
Date: Thu, 10 Oct 2024 10:27:05 +0530
Subject: [PATCH 2/7] Apply suggestions from @ishan16696 code review

Co-authored-by: Ishan Tyagi <42602577+ishan16696@users.noreply.github.com>
---
 docs/proposals/06-immutable-etcd-backups.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
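
Reviewer note (not part of the patch): the exclusion rule touched by this change, skipping snapshots tagged with `x-etcd-snapshot-exclude` set to `true` during restoration, can be sketched as follows. The `snapshot` type and the `restorable` helper are hypothetical and do not reflect the `etcd-backup-restore` API. How the metadata is read from the object store is provider-specific and omitted here.

```go
package main

import "fmt"

// excludeKey is the custom metadata key named in the proposal.
const excludeKey = "x-etcd-snapshot-exclude"

// snapshot is a hypothetical, simplified view of a stored snapshot object.
type snapshot struct {
	Name     string
	Metadata map[string]string
}

// restorable returns the snapshots that are not tagged for exclusion.
// Snapshots without metadata are kept.
func restorable(snaps []snapshot) []snapshot {
	kept := make([]snapshot, 0, len(snaps))
	for _, s := range snaps {
		if s.Metadata[excludeKey] == "true" {
			continue // tagged as excluded, e.g. a corrupted snapshot
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	snaps := []snapshot{
		{Name: "full-1", Metadata: map[string]string{excludeKey: "true"}},
		{Name: "full-2"},
	}
	for _, s := range restorable(snaps) {
		fmt.Println(s.Name) // prints "full-2"
	}
}
```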

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 643342951..74ad8f960 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -49,7 +49,7 @@ This proposal aims to enhance the reliability and integrity of ETCD backups in E
 
 ## Motivation
 
-Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster after it has irrecoverably gone down. Making backups immutable protects against unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
+Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster after it has irrecoverably gone down. Making backups immutable, protects against unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
 
 ### Goals
 
@@ -85,7 +85,7 @@ The Bucket Immutability feature configures an immutability policy for a cloud st
 
 **Implementation Details:**
 
-- Operators are responsible for creating or updating buckets with these immutability settings before ETCD Druid starts uploading snapshots.
+- Operators are responsible for creating the new buckets or updating the existing buckets with these immutability settings before `etcd-backup-restore` starts uploading snapshots.
 - Once the bucket is configured to be immutable, snapshots uploaded by `etcd-backup-restore` are also immutable and cannot be altered or deleted until the immutability period expires.
 - No additional configuration needs to be passed to etcd-druid.
 
@@ -155,7 +155,7 @@ To address this, the compaction job should be enhanced to handle garbage collect
 
 Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
 
-- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).
+- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776) for storage provider: GCS.
 
 ## Compatibility
 
@@ -166,7 +166,7 @@ The proposed changes are fully compatible with existing ETCD clusters and backup
 
 ## Implementation Steps
 
-1. **Enhance the Compaction Job:**
+1. **Enhance the Trigger of Compaction Job:**
      - Modify the compaction job in `etcd-backup-restore` to manage backup processes for hibernated clusters, including snapshot creation and garbage collection, while considering the new immutability constraints.
      - Implement the `--hibernation-snapshot-interval` flag in the compaction controller. Ensure that the compaction job can start an embedded ETCD instance to take snapshots during hibernation.
 

From f28db1f488994b44b62bf18886820bfda7d10ad2 Mon Sep 17 00:00:00 2001
From: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Date: Thu, 21 Nov 2024 15:28:48 +0530
Subject: [PATCH 3/7] Enhance the proposal to use the operator task framework

* Use the operator task framework in the approach that re-uploads the latest full snapshot to prolong its immutability.

---------

Co-authored-by: Seshachalam Yerasala Venkata <seshachalam.yerasala.venkata@sap.com>
---
 docs/proposals/06-immutable-etcd-backups.md | 295 +++++++++++---------
 1 file changed, 159 insertions(+), 136 deletions(-)

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 74ad8f960..0c2a3b04a 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -13,53 +13,24 @@ reviewers:
 
 # DEP-06: Immutable ETCD Backups
 
-## Table of Contents
-
-- [DEP-06: Immutable ETCD Backups](#dep-06-immutable-etcd-backups)
-  - [Table of Contents](#table-of-contents)
-  - [Summary](#summary)
-  - [Motivation](#motivation)
-    - [Goals](#goals)
-    - [Non-Goals](#non-goals)
-  - [Proposal](#proposal)
-    - [Overview](#overview)
-    - [Detailed Design](#detailed-design)
-      - [Bucket Immutability Mechanism](#bucket-immutability-mechanism)
-      - [ETCD Backup Configuration](#etcd-backup-configuration)
-      - [Handling of Hibernated Clusters](#handling-of-hibernated-clusters)
-    - [New Compaction Controller Flag](#new-compaction-controller-flag)
-    - [Garbage Collection during Compaction for Hibernated ETCD Clusters](#garbage-collection-during-compaction-for-hibernated-etcd-clusters)
-    - [Excluding Snapshots Under Specific Circumstances](#excluding-snapshots-under-specific-circumstances)
-  - [Compatibility](#compatibility)
-  - [Implementation Steps](#implementation-steps)
-  - [Risks and Mitigations](#risks-and-mitigations)
-  - [Operational Considerations](#operational-considerations)
-  - [Alternatives](#alternatives)
-    - [Object-Level Immutability Policies vs. Bucket-Level Immutability Policies](#object-level-immutability-policies-vs-bucket-level-immutability-policies)
-      - [Feasibility Study: Immutable Backups on Cloud Providers](#feasibility-study-immutable-backups-on-cloud-providers)
-      - [Considerations for Object-Level Immutability](#considerations-for-object-level-immutability)
-      - [Conclusion](#conclusion)
-        - [Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)
-  - [References](#references)
-  - [Glossary](#glossary)
-
 ## Summary
 
-This proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.
+This proposal aims to enhance the reliability and integrity of ETCD backups created by `etcd-backup-restore` in ETCD clusters managed by `etcd-druid`, by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, unauthorized modifications to backup data are prevented, ensuring that backups remain intact and accessible for restoration.
+
+The proposed solution relies on `etcd-druid` to manage ETCD backups and handle hibernation processes effectively. It leverages one of the suggested approaches to ensure backups remain immutable over extended periods. It is important to note that using `etcd-backup-restore` standalone may not be sufficient to achieve this functionality end-to-end, as the immutability handling (with respect to hibernation) is specifically managed within `etcd-druid`.
 
 ## Motivation
 
-Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster after it has irrecoverably gone down. Making backups immutable, protects against unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
+Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster when it has become non-functional or inoperable. Making the backups immutable protects against any unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
 
 ### Goals
 
 - Implement immutable backup support for ETCD clusters.
 - Secure backup data against unintended or unauthorized modifications after creation.
-- Ensure backups are consistently available and intact for restoration purposes.
+- Implement changes required in `etcd-backup-restore` and `etcd-druid` to support this proposal.
 
 ### Non-Goals
 
-- Altering existing backup processes beyond what's necessary for immutability.
 - Implementing object-level immutability policies at this stage.
 - Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).
 
@@ -67,13 +38,20 @@ Ensuring the integrity and availability of ETCD backups is crucial for the abili
 
 ### Overview
 
-Introduce immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This will prevent data alterations after backup creation, enhancing data integrity and security. The implementation will focus on bucket-level immutability policies, as they are widely supported and easier to manage across different cloud providers.
+We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
+
+There are two types of immutability options to consider:
+
+1. **Bucket-Level Immutability:** Applies a uniform immutability period to all objects within a bucket. This is widely supported and easier to manage across different cloud providers.
+2. **Object-Level Immutability:** Allows setting immutability periods on a per-object basis, offering more granular control but with increased complexity and varying support across providers.
+
+In the detailed design, we will focus on bucket-level immutability policies due to their broader support and simpler management.
 
 ### Detailed Design
 
 #### Bucket Immutability Mechanism
 
-The Bucket Immutability feature configures an immutability policy for a cloud storage bucket, governing how long objects in the bucket must be retained in an immutable state. It also allows for locking the bucket's immutability policy, permanently preventing the policy from being reduced or removed.
+The bucket immutability feature configures an immutability policy for a cloud storage bucket, dictating how long objects in the bucket must remain immutable. It also allows for locking the bucket's immutability policy, permanently preventing the policy from being reduced or removed.
 
 - **Supported by Major Providers:**
   - **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
@@ -85,126 +63,175 @@ The Bucket Immutability feature configures an immutability policy for a cloud st
 
 **Implementation Details:**
 
-- Operators are responsible for creating the new buckets or updating the existing buckets with these immutability settings before `etcd-backup-restore` starts uploading snapshots.
-- Once the bucket is configured to be immutable, snapshots uploaded by `etcd-backup-restore` are also immutable and cannot be altered or deleted until the immutability period expires.
+- Operators are responsible for configuring new or existing buckets with these immutability settings before `etcd-backup-restore` begins uploading snapshots.
+- Once configured, snapshots uploaded by `etcd-backup-restore` will also be immutable and cannot be altered or deleted until the immutability period expires.
 - No additional configuration needs to be passed to etcd-druid.
 
 #### ETCD Backup Configuration
 
-Operators need to ensure that the ETCD backup configuration aligns with the immutability requirements. This includes setting appropriate immutability periods.
-
+Operators must ensure that the ETCD backup configuration aligns with the immutability requirements, including setting appropriate immutability periods.
 
 #### Handling of Hibernated Clusters
 
-In scenarios where an ETCD cluster has been hibernated (scaled down to zero replicas) for a duration that exceeds the immutability period, backups may become mutable again (behaviour depends on cloud provider [refer](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
+When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
+
+Handling hibernated clusters is the kind of scenario that the etcd operator-tasks framework is well suited to, so all proposed solutions build on the operator-tasks framework as defined [here](./05-etcd-operator-tasks.md).
 
-There are two ways to maintain the immutability of snapshots:
+To maintain snapshot immutability during extended hibernation, we propose two approaches:
 
-1. **Extend the Immutability of Snapshots**
+##### Approach 1: Using the Compaction Job
 
-   Extending the immutability period at the bucket level to cover the extended hibernation period might seem like a solution. However, this approach is **not feasible** due to several significant drawbacks:
+**Proposed Solution:**
+
+Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. A compaction job is triggered for a hibernated ETCD cluster when the time elapsed since `fullLease.Spec.RenewTime.Time` exceeds the interval set by this flag and `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
 
-   - **Global Impact on All Objects:** Since bucket-level immutability settings apply to all objects within the bucket, extending the immutability period would affect not only the existing snapshots but also all future snapshots uploaded to the bucket.
-   - **Increased Storage Costs:** Prolonging the immutability period means that snapshots cannot be deleted or modified until the extended period expires. This leads to the accumulation of old backups, consuming more storage space and significantly driving up storage costs over time.
-   - **Inefficient Resource Utilization:** Retaining all snapshots for a longer period than necessary is an inefficient use of storage resources, especially when only the latest snapshots are typically needed for restoration purposes.
+**Implementation Details:**
 
-   Given these considerable disadvantages, extending the bucket-level immutability period is impractical for maintaining the immutability of snapshots during extended hibernation periods.
+- **Etcd Druid:**
+  - **Compaction Controller:**
+    - Introduce a new flag:
+      - **Flag:** `--hibernation-snapshot-interval`
+        - **Type:** Duration
+        - **Default:** `24h`
+        - **Description:** Interval after which a new snapshot is taken during hibernation.
+    - The compaction job starts an embedded ETCD instance to take snapshots during hibernation.
+    - This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
 
-2. **Take a New Snapshot**
+###### Advantages
 
-   By creating a new snapshot, immutability is retained for the new snapshot without affecting the immutability settings of the entire bucket or other snapshots.
+- **No Change to ETCD Cluster State:** Does not alter the actual ETCD cluster, keeping it in hibernation.
+- **Automated Snapshot Creation:** Periodically creates new snapshots to extend immutability by triggering the compaction job.
+- **Leverages Existing Mechanism:** Utilizes the compaction job, which is already part of the system.
 
-   Approach 2 is feasible and can be achieved in two ways:
+###### Disadvantages
 
-   - **Option A:** Bring the ETCD cluster back up by increasing the replicas, take a snapshot, and then hibernate the ETCD by setting the replicas back to zero **without the operator's intention**.
-     - **Challenges:**
-       - **Unintended State Changes:** Increasing the ETCD replicas to bring the cluster back up modifies the cluster's state without the operator's explicit intention, which could lead to unexpected behavior or conflicts with operational policies.
-       - **Operational Overhead:** Automating the process of scaling the ETCD cluster up and down adds complexity and potential risks, especially if the scaling operations interfere with other scheduled tasks or maintenance windows.
+- **Resource Consumption:** Starting an embedded ETCD instance periodically consumes resources.
 
-   - **Option B:** Start an embedded ETCD instance and take a snapshot.
-     - **Advantages:**
-       - **Respects Operator Intentions:** Does not alter the state of the actual ETCD cluster, keeping it in hibernation as intended by the operator.
-       - **No Impact on Existing Setup:** Avoids modifying the ETCD cluster's replicas, preventing any unintended consequences.
-       - **Automated Process:** Can be integrated into existing maintenance jobs without manual intervention.
+##### Approach 2: Re-upload of the latest snapshot
 
 **Proposed Solution:**
 
-*Option B is recommended* because it respects the operator's intention to keep the ETCD cluster hibernated and avoids the complexities and risks associated with modifying the cluster's state.
+A new `EtcdOperatorTask` called `EtcdSnapshotImmutabilityExtension` will be created as defined in the operator tasks framework. This new `EtcdOperatorTask` extends the immutability period by deploying a job which uploads another copy of the latest snapshot to the object store.
 
-Leverage the compaction job to take fresh snapshots periodically during hibernation. This ensures that:
+A full snapshot is taken before hibernating the ETCD cluster. This is to ensure that no state maintained in the etcd cluster is lost before hibernation.
 
-- **Immutability is Maintained:** New snapshots have a fresh immutability period, keeping backups protected.
-- **Operator Intentions are Respected:** The ETCD cluster remains in its hibernated state, as per the operator's intention.
-- **Storage Costs are Controlled:** Avoids unnecessary extension of immutability for all snapshots, preventing increased storage costs.
+**Implementation Details:**
 
-### New Compaction Controller Flag
+- **Introduce the `extend-immutability` command to etcdbrctl**:
+  - etcd-backup-restore will be enhanced to support a new command `extend-immutability` which does the following:
+    - Downloads the latest full snapshot from the object store.
+    - Replaces the Unix epoch timestamp in the file name of the downloaded snapshot with the time at which the download completed.
+    - Uploads this newly renamed snapshot to the same object store.
+    - Renews the full snapshot lease after the upload is successful.
 
-**Flag:** `--hibernation-snapshot-interval`
+    The immutability period of an object begins at the moment of upload, so the re-uploaded copy carries a fresh immutability period, which effectively extends the immutability of the latest snapshot. Renaming the snapshot is necessary because re-uploading it under the same name would be an attempt to modify an existing snapshot, which is disallowed.
 
-- **Type:** Duration
-- **Default:** `24h`
-- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
+    This command could either be implemented standalone, or could be implemented as a wrapper over the `copy` command of `etcdbrctl` by extending the functionality of the `copy` command accordingly.
+- **Introduce the `garbage-collect` command to etcdbrctl**:
+  - etcd-backup-restore will be enhanced to support a new command `garbage-collect` which does the following:
+    - Performs garbage collection of the snapshots in the object store according to the policy specified with the `--garbage-collection-policy` flag.
 
-### Garbage Collection during Compaction for Hibernated ETCD Clusters
+    This command is needed to garbage collect the identical copies of the final snapshot that are re-uploaded to ensure that an immutable snapshot always exists.
+- **Update `Etcd` CRD:**
+  - Add `etcd.spec.hibernation`:  
+    Since there are situations outside of hibernation where the statefulset has to be scaled to zero replicas, there needs to be an explicit way to convey to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD with a new field in the `spec` called `hibernation`.
 
-If an ETCD cluster is hibernated for a long duration, there is a chance that backup storage will accumulate a large number of snapshots. Since the ETCD cluster is not running, the `etcd-backup-restore` component is also not running to perform garbage collection of the snapshots.
+    ```yaml
+    hibernation:
+      enabled: <bool>
+    ```
 
-To address this, the compaction job should be enhanced to handle garbage collection during hibernation. This ensures that old snapshots are cleaned up appropriately, considering the immutability constraints.
+  - Add `etcd.status.hibernatedAt`:  
+    This field conveys information about whether the cluster has been successfully hibernated after a reconciliation, and the time at which it entered hibernation. This field is cleared when the cluster is woken up from hibernation.
 
-### Excluding Snapshots Under Specific Circumstances
+    ```yaml
+    hibernatedAt: <hibernation-time>
+    ```
 
-Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
+  - Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
+- **ETCD Controller Logic:**
+  - When hibernation is requested by changing `etcd.spec.hibernation.enabled` to `true`:
+    - The controller removes the ETCD peer port `2380` and client port `2379` from the `etcd-client` service, leaving only the etcd-backup-restore port `8080`. This stops ETCD client traffic.
+    - The controller creates an `EtcdOperatorTask` to trigger an on-demand full snapshot.
+      - If the on-demand full snapshot succeeds, the controller does no additional handling.
+      - If the on-demand full snapshot fails, the controller creates an `EtcdOperatorTask` for an on-demand compaction job that compacts the latest base full snapshot and the corresponding deltas.
+    - The controller scales in the ETCD cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
+    - The controller periodically creates an `EtcdSnapshotImmutabilityExtension` task if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
 
-- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776) for storage provider: GCS.
+- **`EtcdSnapshotImmutabilityExtension` specification:**
+  - Run `etcdbrctl extend-immutability --bucket-level-immutability` to extend the immutability of the latest snapshot.
+  - Run `etcdbrctl garbage-collect --garbage-collection-policy <garbage-collection-policy>` to garbage collect the snapshots that are created during the extension.
 
-## Compatibility
+- **EtcdOperatorTask Controller Logic:**
+  - The operator-tasks controller will react to the creation of the custom resource, and will deploy a job named `<etcd-name>-extend-immutability` which is the `EtcdSnapshotImmutabilityExtension` job.
+  - The controller also reports metrics for the `EtcdSnapshotImmutabilityExtension` job, which can be used to alert operators when immutability has not been extended.
+
+###### Advantages
 
-The proposed changes are fully compatible with existing ETCD clusters and backup processes. Operators are responsible for creating or updating the immutability settings on the backup storage buckets, but no changes are required to the ETCD clusters themselves.
+- **Minimal Operational Impact:** Does not alter the ETCD cluster's state during hibernation and respects the operator's intention to hibernate the cluster without unintended changes.
+- **Efficient Resource Utilization:** Copies only the latest snapshot, limiting additional storage usage, and avoids the need to start an embedded ETCD instance.
+- **Automated Process:** The process of taking a full snapshot before hibernation and creating the `EtcdSnapshotImmutabilityExtension` is automated within the controller.
 
-- **Backward Compatibility:** Existing clusters without immutable buckets will continue to function without change.
-- **Forward Compatibility:** Clusters can opt-in to use immutable backups by configuring the bucket accordingly.
+###### Disadvantages
 
-## Implementation Steps
+- **Additional Complexity:** Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.
+- **Prerequisite Requirement:** Relies on successfully taking a full snapshot before hibernation, which may introduce delays or require handling snapshot failures.
 
-1. **Enhance the Trigger of Compaction Job:**
-     - Modify the compaction job in `etcd-backup-restore` to manage backup processes for hibernated clusters, including snapshot creation and garbage collection, while considering the new immutability constraints.
-     - Implement the `--hibernation-snapshot-interval` flag in the compaction controller. Ensure that the compaction job can start an embedded ETCD instance to take snapshots during hibernation.
+##### Recommendation
 
-2. **Update Documentation:**
-     - Revise the documentation to reflect the changes and guide operators in effectively using the new immutability features.
-     - Provide detailed guidelines on configuring buckets with immutability settings.
-     - Document procedures for excluding snapshots when necessary.
+After evaluating both approaches, **Approach 2: Re-upload of the latest snapshot** is recommended due to its minimal operational impact and efficient resource utilization. By ensuring that a full snapshot is taken before hibernation, we maintain data consistency and extend the immutability period effectively. This approach respects the operator's intention to keep the ETCD cluster hibernated without introducing significant resource consumption or complexity.
+
+## Compatibility
+
+The proposed changes are fully compatible with existing ETCD clusters and backup processes.
+
+- **Backward Compatibility:**
+  - Existing clusters without immutable buckets will continue to function without change.
+  - The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
+- **Forward Compatibility:**
+  - Clusters can opt-in to use immutable backups by configuring the bucket accordingly.
+  - The controller's logic to handle hibernation is additive and does not interfere with existing workflows.
 
 ## Risks and Mitigations
 
 - **Increased Storage Costs:**
 
-  - **Risk:** Introducing immutability could lead to increased storage costs due to the inability to delete backups before the immutability period ends.
-  - **Mitigation:** Operators should carefully configure immutability periods and monitor storage utilization. Garbage collection will help mitigate long-term storage growth.
+  - **Risk:** Copying snapshots or frequent snapshots may lead to increased storage usage.
+  - **Mitigation:** Since only the latest full snapshot is copied in Approach 2, the additional storage usage is minimal. Garbage collection helps manage storage utilization.
+
+- **Operational Complexity:**
 
-- **Backup Gaps During Hibernation:**
+  - **Risk:** Introducing new resources and processes might add complexity.
+  - **Mitigation:** The processes are automated within the controller, requiring minimal operator intervention. Clear documentation and tooling support will help manage complexity.
 
-  - **Risk:** Hibernated clusters might not receive necessary backups, potentially leading to compliance issues.
-  - **Mitigation:** The compaction job's enhancement ensures that backups are taken during hibernation at configured intervals.
+- **Failed Snapshot Before Hibernation:**
 
-- **Excluding Critical Snapshots:**
+  - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
+  - **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.
 
-  - **Risk:** Excluding snapshots during restoration might be misused or lead to incomplete data restoration.
-  - **Mitigation:** Restrict the ability to tag snapshots for exclusion to authorized personnel. Implement audit logging for actions that tag snapshots.
+- **Failed Operations:**
+
+  - **Risk:** Errors during copy or snapshot operations could lead to incomplete backups.
+  - **Mitigation:** Implement robust error handling and retries in the copier and compaction job logic. Ensure proper logging and alerting.
 
 ## Operational Considerations
 
 Operators need to:
 
-- **Bucket Configuration:**
+- **Configure Buckets:**
+
+  - Set up buckets with appropriate immutability settings before deploying ETCD clusters.
+  - Ensure immutability periods align with organizational policies.
+
+- **Monitor Hibernation Processes:**
 
-  - Configure buckets with appropriate immutability settings before deploying ETCD clusters.
-  - Ensure that the immutability periods align with organizational policies.
+  - Keep track of hibernated clusters and ensure that full snapshots are taken before hibernation.
+  - Verify that `EtcdCopyBackupsTask` resources are created and executed as expected.
 
-- **Compaction Job Configuration:**
+- **Review Retention Policies:**
 
-  - Set the `--hibernation-snapshot-interval` flag according to the desired snapshot frequency during hibernation.
-  - Monitor compaction jobs and logs for any issues.
+  - Set `maxBackups` and `maxBackupAge` in the `EtcdCopyBackupsTask` to manage storage utilization effectively.
+  - Configure the `--hibernation-snapshot-interval` for the compaction job if using Approach 1.
 
 ## Alternatives
 
@@ -218,59 +245,52 @@ Major cloud storage providers such as Google Cloud Storage (GCS), Amazon S3, and
 
 1. **Bucket-Level Immutability Policies:**
 
-     - **Applies Uniformly:** Applies a uniform immutability period to all objects within a bucket.
-     - **Immutable Objects:** Once set, objects cannot be modified or deleted until the immutability period expires.
-     - **Simplified Management:** Simplifies management by applying the same policy to all objects.
+   - **Applies Uniformly:** Applies a uniform immutability period to all objects within a bucket.
+   - **Immutable Objects:** Once set, objects cannot be modified or deleted until the immutability period expires.
+   - **Simplified Management:** Simplifies management by applying the same policy to all objects.
 
 2. **Object-Level Immutability Policies:**
 
-     - **Granular Control:** Allows setting immutability periods on a per-object basis.
-     - **Flexible Immutability Durations:** Offers granular control, enabling different immutability durations for individual objects.
-     - **Varying Requirements:** Can accommodate varying immutability requirements for different types of backups.
+   - **Granular Control:** Allows setting immutability periods on a per-object basis.
+   - **Flexible Immutability Durations:** Offers granular control, enabling different immutability durations for individual objects.
+   - **Varying Requirements:** Can accommodate varying immutability requirements for different types of backups.
 
 #### Considerations for Object-Level Immutability
 
-Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. For example, in the case of hibernated clusters where the ETCD cluster may not be running and backups are not being updated, object-level immutability allows extending the immutability of the latest snapshots without affecting the immutability of older backups.
+Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.
 
 **Advantages:**
 
-- **Granular Control:** Allows setting different immutability periods for different objects, accommodating varying requirements.
-- **Efficient Resource Utilization:** Prevents unnecessary extension of immutability for all objects, potentially reducing storage costs.
-- **Enhanced Flexibility:** Can adjust immutability periods for specific backups as needed.
+- **Granular Control:** Allows setting different immutability periods for different objects.
+- **Efficient Resource Utilization:** Prevents unnecessary extension of immutability for all objects.
+- **Enhanced Flexibility:** Adjust immutability periods as needed.
 
 **Disadvantages:**
 
-- **Provider Limitations:** Not all providers support enabling object-level immutability on existing buckets without additional steps. For instance, in GCS, enabling object-level immutability on existing buckets is currently not supported. This limitation necessitates creating new buckets or waiting for the feature to become available.
-- **Prerequisite Requirements:** In some providers, object-level immutability requires bucket-level immutability to be set first (e.g., in Amazon S3 and Azure Blob Storage), adding complexity to the configuration.
-- **Increased Complexity:** Managing immutability policies at the object level requires additional logic in backup processes and tooling.
+- **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
+- **Increased Complexity:** Requires additional logic in backup processes and tooling.
+- **Prerequisites:** Some providers require bucket-level immutability to be set first.
 
 #### Conclusion
 
-While object-level immutability offers greater flexibility and control, current provider limitations and operational complexities make it less practical for immediate implementation. Specifically, the inability to enable object-level immutability on existing buckets in GCS and the prerequisite of bucket-level immutability in some providers are significant factors.
-
-Given these considerations, we propose starting with bucket-level immutability to achieve immediate enhancement of backup immutability with minimal changes to existing processes. This approach allows us to implement immutability features across all providers consistently.
-
-Once provider support for object-level immutability on existing buckets improves and operational complexities are addressed, we can consider adopting object-level immutability in the future to address specific requirements, such as varying immutability periods for different backups.
-
-These are the reasons why we are initially opting for bucket-level immutability, with the possibility of transitioning to object-level immutability when it becomes more feasible.
+Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
 
 ##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
 
-
-| Feature                                                                 | GCS | AWS | Azure |
-|-------------------------------------------------------------------------|-----|-----|-------|
-| Can bucket-level immutability period be increased?                      | Yes | Yes* | Yes (only 5 times) |
-| Can bucket-level immutability period be decreased?                      | No  | Yes* | No    |
-| Is bucket-level immutability a prerequisite for object-level immutability? | No  | Yes | Yes (for existing buckets), No (for new buckets) |
-| Can object-level immutability period be increased?                      | Yes | Yes | Yes   |
-| Can object-level immutability period be decreased?                      | No  | No  | No    |
-| Support for enabling object-level immutability in existing buckets      | No (planned support soon) | Yes (only new objects will have immutability) | Yes (Azure handles the migration) |
-| Support for enabling object-level immutability in new buckets           | Yes | Yes | Yes   |
-| Precedence between bucket-level and object-level immutability periods   | Maximum of bucket or object-level immutability | Object-level immutability has precedence | Maximum of bucket or object-level immutability |
+| Feature                                                                   | GCS | AWS | Azure                         |
+|---------------------------------------------------------------------------|-----|-----|-------------------------------|
+| Can bucket-level immutability period be increased?                        | Yes | Yes*| Yes (only 5 times)            |
+| Can bucket-level immutability period be decreased?                        | No  | Yes*| No                            |
+| Is bucket-level immutability a prerequisite for object-level immutability?| No  | Yes | Yes (existing buckets)        |
+| Can object-level immutability period be increased?                        | Yes | Yes | Yes                           |
+| Can object-level immutability period be decreased?                        | No  | No  | No                            |
+| Support for enabling object-level immutability in existing buckets        | No  | Yes | Yes                           |
+| Support for enabling object-level immutability in new buckets             | Yes | Yes | Yes                           |
+| Precedence between bucket-level and object-level immutability periods     | Max(bucket, object)| Object-level| Max(bucket, object) |
 
 > **Note:** *In AWS S3, changes to the bucket-level immutability period can be blocked by adding a specific bucket policy.
 
-</details>
+---
 
 ## References
 
@@ -278,15 +298,18 @@ These are the reasons why we are initially opting for bucket-level immutability,
 - [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
 - [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
 - [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
+- [EtcdCopyBackupsTask Implementation](https://github.com/gardener/etcd-druid/pull/544)
 
 ## Glossary
 
 - **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
+- **Etcd Druid:** A Kubernetes operator that manages ETCD clusters for Gardener.
+- **EtcdCopyBackupsTask:** A custom resource that defines a task to copy ETCD backups.
 - **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.
-- **Hibernation:** Scaling down a cluster (or ETCD) to zero replicas to save resources.
+- **Hibernation:** Shutting down all the processes that correspond to an etcd cluster, while persisting information which can later be used to restart the etcd cluster with the same state.
 - **Immutability Period:** The duration for which data must remain immutable in storage before it can be modified or deleted.
 - **WORM (Write Once, Read Many):** A storage model where data, once written, cannot be modified or deleted until certain conditions are met.
 - **Immutability:** The property of an object being unchangeable after creation.
 - **Garbage Collection:** The process of deleting old or unnecessary data to free up storage space.
 
----
\ No newline at end of file
+---

From 8a90dca07f51077813ce4c9e0e18de0123f8dd0d Mon Sep 17 00:00:00 2001
From: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Date: Tue, 3 Dec 2024 13:29:33 +0530
Subject: [PATCH 4/7] Expand Summary and Motivation, add Hibernation, reword
 other sections.

---
 docs/proposals/06-immutable-etcd-backups.md | 207 +++++++++++++-------
 1 file changed, 131 insertions(+), 76 deletions(-)

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 0c2a3b04a..9170859eb 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -1,5 +1,5 @@
 ---
-title: Immutable ETCD Backups
+title: Immutable etcd cluster backups
 dep-number: 06
 creation-date: 2024-09-25
 status: implementable
@@ -8,48 +8,65 @@ authors:
 - "@renormalize"
 - "@ishan16696"
 reviewers:
-- "@unmarshall"
+- "@etcd-druid-maintainers"
 ---
 
-# DEP-06: Immutable ETCD Backups
+# DEP-06: Immutable etcd cluster backups
 
 ## Summary
 
-This proposal aims to enhance the reliability and integrity of ETCD backups created by `etcd-backup-restore` in ETCD clusters managed by `etcd-druid`, by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, unauthorized modifications to backup data are prevented, ensuring that backups remain intact and accessible for restoration.
+Currently, along with being able to provision etcd clusters and handle cluster lifecycle, `etcd-druid` also enables regular backups of the etcd cluster state to be taken, through the side-car container `etcd-backup-restore` that is deployed in each etcd pod running a member of the etcd cluster.
+This functionality is toggled on when `spec.backup` is enabled with appropriate values for an etcd cluster.
 
-The proposed solution relies on `etcd-druid` to manage ETCD backups and handle hibernation processes effectively. It leverages one of the suggested approaches to ensure backups remain immutable over extended periods. It is important to note that using `etcd-backup-restore` standalone may not be sufficient to achieve this functionality end-to-end, as the immutability handling (with respect to hibernation) is specifically managed within `etcd-druid`.
+All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and the etcd clusters it provisions, have access to the `Secret` which holds the credentials that are used to upload snapshots of the etcd cluster state. These credentials are used by actors running in the system and by human operators, typically to perform various maintenance and recovery operations.
+
+To ensure that erroneous operations by human operators during such maintenance and recovery operations, or by misbehaving actors in the cluster, cannot lead to an unrecoverable restoration failure, the authors propose using the write-once-read-many (WORM) features offered by several cloud providers, wherever available.
+This WORM model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid`, by ensuring that the backups are [*immutable*](#terminology) for a specific period of time from the time they are uploaded, thereby preventing all unintended modifications.
+
+`etcd-druid` and `etcd-backup-restore` will be enhanced to provide the functionality that is currently achieved by modifying or deleting backups, without actually modifying or deleting them, since backups will be immutable for a set duration. This eliminates the possibility of data loss through such operations. `etcd-druid` will be the end-to-end solution for achieving this functionality, as relying on `etcd-backup-restore` alone will not be sufficient given the scope and the possible approaches to achieving this.
+
+## Terminology
+
+- **WORM (Write Once, Read Many):** A storage model where data, once written, cannot be modified or deleted until certain conditions are met.
+- **Immutability:** The property of an object being unmodifiable after creation.
+- **Immutability Period:** The duration for which data must remain immutable in object storage before it can be modified or deleted.
+- **Garbage Collection:** The process of deleting old or unnecessary snapshot data to free up storage space.
 
 ## Motivation
 
-Ensuring the integrity and availability of ETCD backups is crucial for the ability to restore an ETCD cluster when it has become non-functional or inoperable. Making the backups immutable protects against any unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
+Backups are stored in object storage, which is accessible to both `etcd-backup-restore` and human operators of these clusters who have access to the credentials stored in `Secret`s.
+This, however, is a double-edged sword. On one hand, it offers operators the capability to intervene in situations where restoration of the etcd cluster fails for any of a multitude of reasons, such as a potential bug in the side-car `etcd-backup-restore`, an unlikely bug in etcd's Snapshot API, and so on.
+There have been instances previously where such human operator intervention was necessary, as reported in https://github.com/gardener/etcd-backup-restore/issues/763. Such situations can be resolved by human operators through manual intervention, by either modifying or deleting erroneous snapshots.
+
+Manual intervention is quite helpful in cases where restoration fails, but there is a *glaring flaw* with this method: operators have full access to all backups through `GET`, `PUT`, and `DELETE` calls.
+This leaves backups vulnerable to erroneous operations by human operators, which could lead to a disastrous and unrecoverable loss of backup data.
+
+Ensuring the integrity and availability of etcd cluster state backups is crucial for the ability to restore an etcd cluster when it has become non-functional or inoperable. Making the backups immutable protects against any unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
 
 ### Goals
 
-- Implement immutable backup support for ETCD clusters.
-- Secure backup data against unintended or unauthorized modifications after creation.
-- Implement changes required in `etcd-backup-restore` and `etcd-druid` to support this proposal.
+- Secure backup data against unintended modifications after creation through bucket-level immutability policies with the storage providers that support such features.
+- Ensure a one-to-one map of functionality exists for recovery operations in special circumstances which require human operator intervention, where recovery involved direct manipulation of data in the object store.
 
 ### Non-Goals
 
-- Implementing object-level immutability policies at this stage.
+- Secure backup data through object-level immutability policies.
 - Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).
 
 ## Proposal
 
 ### Overview
 
-We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
+The authors propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
 
 There are two types of immutability options to consider:
 
 1. **Bucket-Level Immutability:** Applies a uniform immutability period to all objects within a bucket. This is widely supported and easier to manage across different cloud providers.
-2. **Object-Level Immutability:** Allows setting immutability periods on a per-object basis, offering more granular control but with increased complexity and varying support across providers.
+2. **Object-Level Immutability:** Allows immutability periods to be set on a per-object basis rather than uniformly for the whole bucket, offering more granular control but with increased complexity and varying support across providers.
 
-In the detailed design, we will focus on bucket-level immutability policies due to their broader support and simpler management.
+This proposal focuses on bucket-level immutability policies due to their broader support and simpler management, as mentioned in the [Non-Goals](#non-goals).
 
-### Detailed Design
-
-#### Bucket Immutability Mechanism
+### Configuring Immutable Backups
 
 The bucket immutability feature configures an immutability policy for a cloud storage bucket, dictating how long objects in the bucket must remain immutable. It also allows for locking the bucket's immutability policy, permanently preventing the policy from being reduced or removed.
 
@@ -61,59 +78,111 @@ The bucket immutability feature configures an immutability policy for a cloud st
 - **Not Supported:**
   - **OpenStack Swift**
 
-**Implementation Details:**
+To configure immutability for your etcd cluster backups, no changes are needed in your etcd `spec`. `etcd-backup-restore` is designed to handle immutable backup buckets inherently.
+Controllers/operators are responsible for configuring existing/new buckets with the necessary immutability settings for `etcd-backup-restore`'s snapshots to be immutable.
 
-- Operators are responsible for configuring new or existing buckets with these immutability settings before `etcd-backup-restore` begins uploading snapshots.
-- Once configured, snapshots uploaded by `etcd-backup-restore` will also be immutable and cannot be altered or deleted until the immutability period expires.
-- No additional configuration needs to be passed to etcd-druid.
+Immutable buckets are configured in the following way for each type of consumer of `etcd-druid`:
 
-#### ETCD Backup Configuration
+- For consumers using `gardener-extension-provider-<provider>` to configure their buckets, the following fields have to be added in the `backup.providerConfig` section of your `extensions.gardener.cloud` `BackupBucket` resource.
 
-Operators must ensure that the ETCD backup configuration aligns with the immutability requirements, including setting appropriate immutability periods.
+  ```yaml
+  backup:
+    providerConfig:
+      immutability:
+        retentionType: "bucket"
+        retentionPeriod: "<time>"
+        locked: false|true
+  ```
 
-#### Handling of Hibernated Clusters
+  The extension will act on the changes in the spec of the `BackupBucket` resource in the next reconciliation, and make the corresponding API calls to the cloud provider to make the `BackupBucket` immutable. Once this succeeds, your backups are now immutable and `etcd-backup-restore` reacts to this without any configuration changes or restarts.
 
-When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
+- For consumers using `etcd-druid` standalone, the necessary API calls have to be made to make your buckets immutable. This can be done either by your controllers which provision and handle infrastructure like buckets or, if the buckets are provisioned by human operators, by the operators using the corresponding cloud provider CLIs. Please check the `etcd-backup-restore` [docs](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/immutable_snapshots.md) for example CLI commands for the currently supported providers.
 
-Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.
+### etcd Backup Configuration
 
-To maintain snapshot immutability during extended hibernation, we propose two approaches:
+Before configuring backup buckets to be immutable, it is the responsibility of the operators to configure the snapshot and snapshot garbage collection schedules to be meaningful in the context of the immutability period of the bucket.
 
-##### Approach 1: Using the Compaction Job
+It is recommended that your full snapshot schedule triggers a full snapshot before the previous full snapshot's immutability period expires. This ensures that all delta snapshots taken on top of a full snapshot always have an intact full snapshot to build on.
+
+It is recommended to configure your snapshot garbage collection policies to begin garbage collection only after the bucket immutability period, to avoid unnecessary API calls to the cloud provider.
+
+### Hibernation
+
+A new state for an etcd cluster will be introduced, called `Hibernated`. This state is inspired by a Gardener [Shoot's Hibernation](https://github.com/gardener/gardener/blob/master/docs/usage/shoot/shoot_hibernate.md).
+An etcd cluster is hibernated when the intent is for the etcd cluster to stop serving traffic for the foreseeable future.
+
+Since `etcd-druid` enables hosted control planes, if the intent is to bring down the control plane of a cluster completely, then the corresponding etcd cluster is also to be brought down. In such cases, the etcd cluster can be `Hibernated`.
+
+An explicit effort is made to differentiate between hibernating an etcd cluster and scaling-in the number of the replicas of the `StatefulSet` to zero. The replicas being scaled to zero *does not* mean that the cluster is hibernated. Therefore, to make it clear to all entities interacting with the etcd cluster, i.e. `etcd-druid` and human operators, the new field is introduced.
+
+To enable hibernating etcd clusters by `etcd-druid`, the following fields are proposed to be added to the `spec` and `status` of an etcd cluster respectively:
+
+- `spec.hibernation`:
+
+  ```yaml
+  spec:
+    hibernation:
+      enabled: <bool>
+  ```
+
+  When the `spec` of the etcd cluster is changed to contain `spec.hibernation.enabled: true`, in the next reconciliation, `etcd-druid` will stop traffic from being served by the etcd cluster by removing the corresponding client `Service`s, perform the necessary maintenance operations, and then scale in the cluster.  
+  Similarly, when the `spec` has `spec.hibernation.enabled: true` removed, or set to `spec.hibernation.enabled: false`, the next reconciliation will scale out the etcd cluster to `spec.replicas`.
+
+- `status.hibernationTime`:
+
+  ```yaml
+  status:
+    hibernationTime: <hibernation-UTC-time>
+  ```
+
+  This field conveys information about whether the cluster has been successfully hibernated after a reconciliation, and the time at which it entered hibernation. This field is cleared when the cluster is woken up from hibernation.
+
+The following are the implications:
+
+- Gardener consumers: When a Gardener Shoot cluster is hibernated, then the corresponding etcd cluster is also hibernated by `etcd-druid`, and vice versa.
+- Standalone: When an etcd cluster is to be hibernated, the `spec` of the `Etcd` resource must be changed by an operator.
+
+### Handling of Hibernated Clusters
+
+When an etcd cluster is hibernated for a duration exceeding the duration for which a backup is immutable, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
+
+Such handling of hibernated clusters is the type of scenario which the [etcd operator-tasks](./05-etcd-operator-tasks.md) framework lends itself to quite well, and thus for all proposed solutions, the operator tasks framework will be made use of for the design of the solution.
+
+To maintain snapshot immutability during extended hibernation, the authors propose two approaches:
+
+#### Approach 1: Using the Compaction Job
 
 **Proposed Solution:**
 
-Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
+Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated etcd cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
 
 **Implementation Details:**
 
-- **Etcd Druid:**
-  - **Compaction Controller:**
-    - Introduce a new flag:
-      - **Flag:** `--hibernation-snapshot-interval`
-        - **Type:** Duration
-        - **Default:** `24h`
-        - **Description:** Interval after which a new snapshot is taken during hibernation.
-    - The compaction job starts an embedded ETCD instance to take snapshots during hibernation.
-    - This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
+- **Compaction Controller:**
+  - Introduce a new flag:
+    - **Flag:** `--hibernation-snapshot-interval`
+      - **Type:** Duration
+      - **Default:** `24h`
+      - **Description:** Interval after which a new snapshot is taken during hibernation.
+  - The compaction job starts an embedded etcd instance to take snapshots during hibernation.
 
-###### Advantages
+##### Advantages
 
-- **No Change to ETCD Cluster State:** Does not alter the actual ETCD cluster, keeping it in hibernation.
+- **No Change to etcd Cluster State:** Does not alter the actual etcd cluster, keeping it in hibernation.
 - **Automated Snapshot Creation:** Periodically creates new snapshots to extend immutability by triggering the compaction job.
 - **Leverages Existing Mechanism:** Utilizes the compaction job, which is already part of the system.
 
-###### Disadvantages
+##### Disadvantages
 
-- **Resource Consumption:** Starting an embedded ETCD instance periodically consumes resources.
+- **Resource Consumption:** Starting an embedded etcd instance periodically consumes resources.
 
-##### Approach 2: Re-upload of the latest snapshot
+#### Approach 2: Re-upload of the latest snapshot
 
 **Proposed Solution:**
 
-A new `EtcdOperatorTask` called `EtcdSnapshotImmutabilityExtension` will be created as defined in the operator tasks framework. This new `EtcdOperatorTask` extends the immutability period by deploying a job which uploads another copy of the latest snapshot to the object store.
+A new `EtcdOperatorTask` called `ExtendEtcdSnapshotImmutabilityTask` will be created as defined in the operator tasks framework. This new `EtcdOperatorTask` extends the immutability period by deploying a job which uploads another copy of the latest snapshot to the object store.
 
-A full snapshot is taken before hibernating the ETCD cluster. This is to ensure that no state maintained in the etcd cluster is lost before hibernation.
+A full snapshot is taken before hibernating the etcd cluster. This is to ensure that no state maintained in the etcd cluster is lost before hibernation.
 
 **Implementation Details:**
 
@@ -149,45 +218,45 @@ A full snapshot is taken before hibernating the ETCD cluster. This is to ensure
     ```
 
   - Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
-- **ETCD Controller Logic:**
+- **etcd Controller Logic:**
   - When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
-    - The controller removes the ETCD client ports `2380` and `2379` from the `etcd-client` service, leaving only the etcd-backup-restore port `8080`. This stops ETCD client traffic.
+    - The controller removes the etcd client ports `2380` and `2379` from the `etcd-client` service, leaving only the etcd-backup-restore port `8080`. This stops etcd client traffic.
     - The controller creates a `EtcdOperatorTask` to trigger an on-demand full snapshot.
       - On-demand full snapshot is successful: the controller does no additional handling.
       - On-demand full snapshot fails: the controller triggers the creation of a `EtcdOperatorTask` for an on-demand compaction job that compacts the latest base full snapshot and the corresponding deltas.
-    - The controller scales in the ETCD cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
-    - The controller creates the `EtcdSnapshotImmutabilityExtension` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
+    - The controller scales in the etcd cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
+    - The controller creates the `ExtendEtcdSnapshotImmutabilityTask` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
 
-- **`EtcdSnapshotImmutabilityExtension` specification:**
+- **`ExtendEtcdSnapshotImmutabilityTask` specification:**
   - Run `etcdbrctl extend-immutability --bucket-level-immutability` to extend the immutability of the latest snapshot.
   - Run `etcdbrctl garbage-collect --garbage-collection-policy <garbage-collection-policy>` to garbage collect the snapshots that are created during the extension.
 
 - **EtcdOperatorTask Controller Logic:**
-  - The operator-tasks controller will react to the creation of the custom resource, and will deploy a job named `<etcd-name>-extend-immutability` which is the `EtcdSnapshotImmutabilityExtension` job.
-  - The controller also reports metrics regarding the `EtcdSnapshotImmutabilityExtension` job, which can be used to raise alerts for operators that immutability has not been extended.
+  - The operator-tasks controller will react to the creation of the custom resource, and will deploy a job named `<etcd-name>-extend-immutability` which is the `ExtendEtcdSnapshotImmutabilityTask` job.
+  - The controller also reports metrics regarding the `ExtendEtcdSnapshotImmutabilityTask` job, which can be used to raise alerts for operators that immutability has not been extended.
 
-###### Advantages
+##### Advantages
 
-- **Minimal Operational Impact:** Does not alter the ETCD cluster's state during hibernation and respects the operator's intention to hibernate the cluster without unintended changes.
-- **Efficient Resource Utilization:** Only the latest snapshot is copied, limiting additional storage usage, and avoids the need to start an embedded ETCD instance.
-- **Automated Process:** The process of taking a full snapshot before hibernation and creating the `EtcdSnapshotImmutabilityExtension` is automated within the controller.
+- **Minimal Operational Impact:** Does not alter the etcd cluster's state during hibernation and respects the operator's intention to hibernate the cluster without unintended changes.
+- **Efficient Resource Utilization:** Only the latest snapshot is copied, limiting additional storage usage, and avoids the need to start an embedded etcd instance.
+- **Automated Process:** The process of taking a full snapshot before hibernation and creating the `ExtendEtcdSnapshotImmutabilityTask` is automated within the controller.
 
-###### Disadvantages
+##### Disadvantages
 
 - **Additional Complexity:** Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.
 - **Prerequisite Requirement:** Relies on successfully taking a full snapshot before hibernation, which may introduce delays or require handling snapshot failures.
 
-##### Recommendation
+#### Recommendation
 
-After evaluating both approaches, **Approach 2: Re-upload of the latest snapshot** is recommended due to its minimal operational impact and efficient resource utilization. By ensuring that a full snapshot is taken before hibernation, we maintain data consistency and extend the immutability period effectively. This approach respects the operator's intention to keep the ETCD cluster hibernated without introducing significant resource consumption or complexity.
+After evaluating both approaches, **Approach 2: Re-upload of the latest snapshot** is recommended due to its minimal operational impact and efficient resource utilization. By ensuring that a full snapshot is taken before hibernation, data consistency is maintained, and the immutability period is extended effectively. This approach respects the operator's intention to keep the etcd cluster hibernated without introducing significant resource consumption or complexity.
 
 ## Compatibility
 
-The proposed changes are fully compatible with existing ETCD clusters and backup processes.
+The proposed changes are fully compatible with existing etcd clusters and backup processes.
 
 - **Backward Compatibility:**
   - Existing clusters without immutable buckets will continue to function without change.
-  - The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
+  - The introduction of the `ExtendEtcdSnapshotImmutabilityTask` does not affect clusters that are not hibernated.
 - **Forward Compatibility:**
   - Clusters can opt-in to use immutable backups by configuring the bucket accordingly.
   - The controller's logic to handle hibernation is additive and does not interfere with existing workflows.
@@ -220,7 +289,7 @@ Operators need to:
 
 - **Configure Buckets:**
 
-  - Set up buckets with appropriate immutability settings before deploying ETCD clusters.
+  - Set up buckets with appropriate immutability settings before deploying etcd clusters.
   - Ensure immutability periods align with organizational policies.
 
 - **Monitor Hibernation Processes:**
@@ -273,7 +342,7 @@ Using object-level immutability provides flexibility in scenarios where certain
 
 #### Conclusion
 
-Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
+Given the complexities and limitations, the authors recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
 
 ##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
 
@@ -297,19 +366,5 @@ Given the complexities and limitations, we recommend using bucket-level immutabi
 - [GCS Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
 - [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
 - [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
-- [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
-- [EtcdCopyBackupsTask Implementation](https://github.com/gardener/etcd-druid/pull/544)
-
-## Glossary
-
-- **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
-- **Etcd Druid:** A Kubernetes operator that manages ETCD clusters for Gardener.
-- **EtcdCopyBackupsTask:** A custom resource that defines a task to copy ETCD backups.
-- **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.
-- **Hibernation:** Shutting down all the processes that correspond to an etcd cluster, while persisting information which can later be used to restart the etcd cluster with the same state.
-- **Immutability Period:** The duration for which data must remain immutable in storage before it can be modified or deleted.
-- **WORM (Write Once, Read Many):** A storage model where data, once written, cannot be modified or deleted until certain conditions are met.
-- **Immutability:** The property of an object being unchangeable after creation.
-- **Garbage Collection:** The process of deleting old or unnecessary data to free up storage space.
 
 ---

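Note on the snapshot scheduling recommendations added in this patch: the two rules (trigger the next full snapshot before the previous one's immutability period expires, and begin garbage collection only after that period) can be sanity-checked mechanically. The Python sketch below is illustrative only. `check_backup_schedule` and its parameter names are hypothetical and are not part of `etcd-druid` or `etcd-backup-restore`; the values are assumed to have already been parsed into durations.

```python
from datetime import timedelta


def check_backup_schedule(full_snapshot_interval, gc_retention, immutability_period):
    """Return warnings for a schedule that conflicts with bucket immutability.

    All three arguments are timedelta values (hypothetical inputs):
      full_snapshot_interval: time between two scheduled full snapshots
      gc_retention:           age after which garbage collection deletes a snapshot
      immutability_period:    bucket-level immutability period
    """
    warnings = []
    # A new full snapshot should exist before the previous one becomes mutable.
    if full_snapshot_interval >= immutability_period:
        warnings.append(
            "full snapshot interval is not shorter than the immutability period"
        )
    # Deleting objects that are still immutable only produces failed API calls.
    if gc_retention < immutability_period:
        warnings.append(
            "garbage collection would start before the immutability period ends"
        )
    return warnings


if __name__ == "__main__":
    # Daily full snapshots, 5-day retention, 4-day immutability: no warnings.
    print(check_backup_schedule(timedelta(hours=24), timedelta(days=5), timedelta(days=4)))
```

A check like this could run wherever bucket and backup settings are validated together; where exactly that happens is left open by the proposal.
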
From 57e91a04db82a8e1645393c3ecfef5a2ab8f75a6 Mon Sep 17 00:00:00 2001
From: Seshachalam Yerasala Venkata <seshachalam.yerasala.venkata@sap.com>
Date: Wed, 4 Dec 2024 14:36:00 +0530
Subject: [PATCH 5/7] Rename `extend-immutability` command to `renew-snapshot`

---
 docs/proposals/06-immutable-etcd-backups.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 9170859eb..201dc6fab 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -186,21 +186,21 @@ A full snapshot is taken before hibernating the etcd cluster. This is to ensure
 
 **Implementation Details:**
 
-- **Introduce the `extend-immutability` command to etcdbrctl**:
-  - etcd-backup-restore will be enhanced to support a new command `extend-immutability` which does the following:
-    - Downloads the latest full snapshot from the object store.
+- **Introduce the `renew-snapshot` command to etcdbrctl**:
+  - etcd-backup-restore will be enhanced to support a new command `renew-snapshot` which does the following:
+    - Downloads the latest full snapshot from the object store.
     - Replaces the Unix epoch in the file-name of the downloaded snapshot to contain the time at which the file completes downloading.
     - Uploads this newly renamed snapshot to the same object store.
     - Renews the full snapshot lease after the upload is successful.
 
-    The immutability period of an object begins from the moment of upload, thus extending the immutability period of the latest snapshot. Renaming the snapshot is necessary since the downloaded snapshot can not simply be re-uploaded as uploading with the same name would be an attempt at modifying an already existing snapshot, which is disallowed.
+    The immutability period of an object begins from the moment of upload, so uploading the renamed copy renews the immutability period of the snapshot. Renaming the snapshot is necessary since the downloaded snapshot cannot simply be re-uploaded: uploading with the same name would be an attempt at modifying an already existing snapshot, which is disallowed.
 
     This command could either be implemented standalone, or could be implemented as a wrapper over the `copy` command of `etcdbrctl` by extending the functionality of the `copy` command accordingly.
 - **Introduce the `garbage-collect` command to etcdbrctl**:
   - etcd-backup-restore will be enhanced to support a new command `garbage-collect` which does the following:
     - Perform garbage collection of the snapshots in the object store according to the policy specified with the `--garbage-collection-policy` flag.  
 
-    This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
+    This functionality is needed to garbage collect the old snapshots whose immutability period has expired and which have been renewed as fresh snapshots by the `renew-snapshot` command described above.
 - **Update `Etcd` CRD:**
   - Add `etcd.spec.hibernation`:  
     Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
@@ -228,7 +228,7 @@ A full snapshot is taken before hibernating the etcd cluster. This is to ensure
     - The controller creates the `ExtendEtcdSnapshotImmutabilityTask` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
 
 - **`ExtendEtcdSnapshotImmutabilityTask` specification:**
-  - Run `etcdbrctl extend-immutability --bucket-level-immutability` to extend the immutability of the latest snapshot.
+  - Run `etcdbrctl renew-snapshot` to extend the immutability of the latest snapshot.
   - Run `etcdbrctl garbage-collect --garbage-collection-policy <garbage-collection-policy>` to garbage collect the snapshots that are created during the extension.
 
 - **EtcdOperatorTask Controller Logic:**

From b68574ded863638096bd422c1cbc0b5b55fe7748 Mon Sep 17 00:00:00 2001
From: Seshachalam Yerasala Venkata <seshachalam.yerasala.venkata@sap.com>
Date: Thu, 5 Dec 2024 16:32:57 +0530
Subject: [PATCH 6/7] docs: Update proposal for immutable etcd cluster backups

- Improve readability and clarity of the summary and motivation sections.
- Add detailed terminology definitions.
- Refine the proposal to focus on bucket-level immutability.
---
 docs/proposals/06-immutable-etcd-backups.md | 260 ++++++--------------
 1 file changed, 71 insertions(+), 189 deletions(-)

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 201dc6fab..6fb14d150 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -1,5 +1,5 @@
 ---
-title: Immutable etcd cluster backups
+title: Immutable etcd Cluster Backups
 dep-number: 06
 creation-date: 2024-09-25
 status: implementable
@@ -11,46 +11,51 @@ reviewers:
 - "@etcd-druid-maintainers"
 ---
 
-# DEP-06: Immutable etcd cluster backups
+# DEP-06: Immutable etcd Cluster Backups
 
 ## Summary
 
-Currently, along with being able to provision etcd clusters and handle cluster lifecycle, `etcd-druid` also enables regular backups of the etcd cluster state to be taken, through the side-car container `etcd-backup-restore` that is deployed in each etcd pod running a member of the etcd cluster.
-This functionality is toggled on when `spec.backup` is enabled with appropriate values for an etcd cluster.
+Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.
 
-All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and the etcd clusters it provisions have access to the `Secret` which holds the credentials that are used to upload snapshots of the etcd cluster state. These credentials are used by actors running in the system, and human operators - typically to perform various maintenance and recovery operations.
+All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations.
 
-To ensure erroneous operations do not occur by human operators during such maintenance and recovery operations, or by misbehaving actors in the cluster, and potentially create the scope for a complete restoration failure which can not be recovered from, the authors propose the usage of write-once-read-many (WORM) features offered by several cloud providers, wherever available.
-This WORM model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid`; by ensuring that the backups are [*immutable*](#terminology) for a specific period of time from the time they are uploaded, thereby preventing all unintended modifications.
+To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster, either of which could lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.
 
-`etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality that currently is being achieved by modifying/deleting backups, without actually modifying/deleting these backups henceforth as the backups will be immutable for a set duration, thereby eliminating the possibility of a potential loss of data. `etcd-druid` will be the end-to-end solution for achieving this functionality, as relying just on `etcd-backup-restore` for such behavior will not be sufficient given the scope and all possible approaches to achieving this.
+This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [*immutable*](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications.
+
+`etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality currently achieved by modifying or deleting backups, but without actually modifying or deleting these backups, since they will now be immutable for a set duration. This approach eliminates the possibility of data loss. `etcd-druid` will provide an end-to-end solution for achieving this functionality, as relying solely on `etcd-backup-restore` is insufficient given the scope and the possible approaches to achieving this.
 
 ## Terminology
 
+- **etcd-druid:** A controller that manages etcd clusters, including provisioning and lifecycle handling.
+- **etcd-backup-restore:** A sidecar container that manages backups and restores of etcd cluster state.
 - **WORM (Write Once, Read Many):** A storage model where data, once written, cannot be modified or deleted until certain conditions are met.
 - **Immutability:** The property of an object being unmodifiable after creation.
 - **Immutability Period:** The duration for which data must remain immutable in object storage before it can be modified or deleted.
+- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
+- **Object-Level Immutability:** A policy that allows setting non-uniform immutability periods for individual objects within a bucket, providing more granular control.
 - **Garbage Collection:** The process of deleting old or unnecessary snapshot data to free up storage space.
+- **Hibernation:** The state in which an etcd cluster is scaled down to zero replicas, effectively pausing its operations. This is typically done to save resources when the cluster is not needed for an extended period. During hibernation, the cluster's data remains intact, and it can be resumed to its previous state when required.
 
 ## Motivation
 
-Backups are stored in object storage, which are accessible to both `etcd-backup-restore`, and human operators of these clusters with access to these credentials stored in `Secret`s.
-This however, is a double-edged sword. On one hand, it offers operators the capability to intervene in situations where restoration of the etcd cluster fails due to a multitude of reasons, like a potential bug in the side-car `etcd-backup-restore`, an unlikely bug in etcd's Snapshot API, and so on.
-There have been instances previously where such human operator intervention was necessary, as reported in https://github.com/gardener/etcd-backup-restore/issues/763. Such situations can be resolved by human operators through manual intervention, by either modifying or deleting erroneous snapshots.
+Backups are stored in object storage, which is accessible to both `etcd-backup-restore` and human operators of these clusters who have access to the credentials stored in `Secret`s. This, however, is a double-edged sword. On one hand, it offers operators the capability to intervene in situations where restoration of the etcd cluster fails due to a multitude of reasons, like a potential bug in the sidecar `etcd-backup-restore`, an unlikely bug in etcd's Snapshot API, and so on.
+
+There have been instances previously where such human operator intervention was necessary, as reported in [gardener/etcd-backup-restore#763](https://github.com/gardener/etcd-backup-restore/issues/763). Such situations can be resolved by human operators through manual intervention, by either modifying or deleting erroneous snapshots.
 
-Manual intervention is quite helpful in cases where restoration fails, but there is *glaring flaw* with this method - operators have full access to all backups: `GET`, `PUT`, and `DELETE` calls.
-This leaves backups vulnerable to potential erroneous operations from human operators, which could lead to a disastrous loss of backup data, which can not be recovered from.
+Manual intervention is quite helpful in cases where restoration fails, but there is a *glaring flaw* with this method: operators have full access to all backups through `GET`, `PUT`, and `DELETE` calls. This leaves backups vulnerable to erroneous operations by human operators, which could lead to a disastrous and unrecoverable loss of backup data.
 
 Ensuring the integrity and availability of etcd cluster state backups is crucial for the ability to restore an etcd cluster when it has become non-functional or inoperable. Making the backups immutable protects against any unintended or malicious modifications post-creation, thereby enhancing the overall security posture.
 
 ### Goals
 
 - Secure backup data against unintended modifications after creation through bucket-level immutability policies with the storage providers that support such features.
-- Ensure a one-to-one map of functionality exists for recovery operations in special circumstances which require human operator intervention, where recovery involved direct manipulation of data in the object store.
+- Ensure a one-to-one mapping of functionality exists for recovery operations in special circumstances which require human operator intervention, where recovery involved direct manipulation of data in the object store.
 
 ### Non-Goals
 
-- Secure backup data through object-level immutability policies.
+- Implementing hibernation support via `etcd.spec` or annotations on the `Etcd` CR (i.e., specifying an intent for hibernation) as mentioned in [gardener/etcd-druid#922](https://github.com/gardener/etcd-druid/issues/922).
+- Securing backup data through object-level immutability policies.
 - Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).
 
 ## Proposal
@@ -64,191 +69,95 @@ There are two types of immutability options to consider:
 1. **Bucket-Level Immutability:** Applies a uniform immutability period to all objects within a bucket. This is widely supported and easier to manage across different cloud providers.
 2. **Object-Level Immutability:** Applies a non-uniform immutability period to the objects in the bucket, allowing setting immutability periods on a per-object basis, offering more granular control but with increased complexity and varying support across providers.
 
-Bucket-level immutability policies will be focused on in this proposal due to their broader support and simpler management, as mentioned in the [Non-Goals](#non-goals).
+Bucket-level immutability policies will be the focus of this proposal due to their broader support and simpler management, as mentioned in the [Non-Goals](#non-goals).
 
 ### Configuring Immutable Backups
 
 The bucket immutability feature configures an immutability policy for a cloud storage bucket, dictating how long objects in the bucket must remain immutable. It also allows for locking the bucket's immutability policy, permanently preventing the policy from being reduced or removed.
 
 - **Supported by Major Providers:**
+
   - **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
   - **Amazon S3 (S3):** [Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
   - **Azure Blob Storage (ABS):** [Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
 
 - **Not Supported:**
-  - **OpenStack Swift**
-
-To configure immutability with your etcd cluster backups, there will be no changes needed to be performed in your etcd `spec`. `etcd-backup-restore` is designed to handle immutable backup buckets inherently.
-It is the responsibility of controllers/operators for configuring existing/new buckets with the necessary immutability settings for `etcd-backup-restore`'s snapshots to be immutable.
-
-Immutable buckets are configured in the following way for each type of consumer of `etcd-druid`:
-
-- For consumers using `gardener-extension-provider-<provider>` to configure their buckets, the following fields have to be added in the `backup.providerConfig` section of your `extensions.gardener.cloud` `BackupBucket` resource.
-
-  ```yaml
-  backup:
-    providerConfig:
-      immutability:
-        retentionType: "bucket"
-        retentionPeriod: "<time>"
-        locked: false|true
-  ```
-
-  The extension will act on the changes in the spec of the `BackupBucket` resource in the next reconciliation, and make the corresponding API calls to the cloud provider to make the `BackupBucket` immutable. Once this succeeds, your backups are now immutable and `etcd-backup-restore` reacts to this without any configuration changes or restarts.
-
-- For consumers using `etcd-druid` standalone, the necessary API calls has to be made to make your buckets immutable, either by your controllers which provision and handle infrastructure like buckets, or if the buckets are provisioned by human operators, then the operators can use the corresponding cloud provider CLIs to run the necessary commands to make the buckets immutable. Please check `etcd-backup-restore` [docs](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/immutable_snapshots.md) to find example CLI commands for the currently supported providers.
-
-### etcd Backup Configuration
-
-Before configuring backup buckets to be immutable, it is the responsibility of the operators to configure the snapshot and snapshot garbage collection schedules to be meaningful with the context of the immutability duration of the bucket.
-
-It is recommended that your full snapshot schedule enables the triggering of a full snapshot before the previous full snapshot's immutability period expires. This is to ensure that all the corresponding delta snapshots triggered on top of this full snapshot are ensured to use an intact full snapshot.
 
-It is recommended to configure your snapshot garbage collection policies to begin garbage collection only after the bucket immutability period, to avoid unnecessary API calls to the cloud provider.
-
-### Hibernation
-
-A new state for an etcd cluster will be introduced, called `Hibernated`. This state is inspired from a Gardener [Shoot's Hibernation](https://github.com/gardener/gardener/blob/master/docs/usage/shoot/shoot_hibernate.md).
-An etcd cluster is hibernated when the intent is for the etcd cluster to stop serving traffic for the foreseeable future.
-
-Since `etcd-druid` enables hosted control planes, if the intent is to bring down the control plane of a cluster completely, then the corresponding etcd cluster is also to be brought down. In such cases, the etcd cluster can be `Hibernated`.
-
-An explicit effort is made to differentiate between hibernating an etcd cluster and scaling-in the number of the replicas of the `StatefulSet` to zero. The replicas being scaled to zero *does not* mean that the cluster is hibernated. Therefore, to make it clear to all entities interacting with the etcd cluster, i.e. `etcd-druid` and human operators, the new field is introduced.
-
-To enable hibernating etcd clusters by `etcd-druid`, the following fields are proposed to be added to the `spec` and `status` of an etcd cluster respectively:
-
-- `spec.hibernation`:
-
-  ```yaml
-  spec:
-    hibernation:
-      enabled: <bool>
-  ```
+  - **OpenStack Swift**
 
-  When the `spec` of the etcd cluster is changed to contain `spec.hibernation.enabled: true`, in the next reconciliation, `etcd-druid` will stop traffic to be served from the etcd cluster by removing the corresponding client `Service`s, perform the necessary maintenance operations, and then scale-in the cluster.  
-  Similarly, when the `spec` has `spec.hibernation.enabled: true` removed, or set to `spec.hibernation.enabled: false`, the next reconciliation will scale-out the etcd cluster to `spec.replicas`.
+Currently, configuring immutable buckets is not handled directly within `etcd-druid`.
 
-- `status.hibernationTime`:
+By configuring an immutability policy on your storage bucket/container, you ensure that all snapshots are stored in an immutable (WORM) state for a specified duration. This prevents snapshots from being modified or deleted until they reach the end of the immutability period.
 
-  ```yaml
-  status:
-    hibernationTime: <hibernation-UTC-time>
-  ```
+The immutability policy can be applied via cloud provider consoles, APIs, or CLI tools. Once the policy is set, it acts as an inherent protection mechanism for etcd backups, safeguarding them against accidental or malicious modifications for the configured duration.
 
-  This field conveys information about whether the cluster has been successfully hibernated after a reconciliation, and the time at which it entered hibernation. This field is cleared when the cluster is woken up from hibernation.
+Immutable buckets are configured differently depending on the consumer:
 
-The following are the implications:
+- **Large-Scale Consumers (e.g., Gardener):**
+  - Have automated the configuration of immutability for existing or new buckets, as discussed [here](https://github.com/gardener/gardener/issues/10866).
+- **Standalone Consumers of `etcd-druid`:**
+  - Need to handle the immutability configuration manually using the respective cloud provider's CLI tools or console.
 
-- Gardener consumers: When a Gardener Shoot cluster is hibernated, then the corresponding etcd cluster is also hibernated by `etcd-druid`, and vice versa.
-- Standalone: When an etcd cluster is to be hibernated, the spec of the etcd is to be changed by an operator.
+Detailed documentation on configuring the backup bucket for immutable etcd snapshots uploaded by `etcd-backup-restore` can be found [here](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md).
 
 ### Handling of Hibernated Clusters
 
-When an etcd cluster is hibernated for a duration exceeding the duration for which a backup is immutable, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
-
-Such handling of hibernated clusters is the type of scenario which the [etcd operator-tasks](./05-etcd-operator-tasks.md) framework lends itself to quite well, and thus for all proposed solutions, the operator tasks framework will be made use of for the design of the solution.
-
-To maintain snapshot immutability during extended hibernation, the authors propose two approaches:
-
-#### Approach 1: Using the Compaction Job
-
-**Proposed Solution:**
-
-Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated etcd cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
+#### What is Hibernation?
 
-**Implementation Details:**
-
-- **Compaction Controller:**
-  - Introduce a new flag:
-    - **Flag:** `--hibernation-snapshot-interval`
-      - **Type:** Duration
-      - **Default:** `24h`
-      - **Description:** Interval after which a new snapshot is taken during hibernation.
-  - The compaction job starts an embedded etcd instance to take snapshots during hibernation.
-
-##### Advantages
-
-- **No Change to etcd Cluster State:** Does not alter the actual etcd cluster, keeping it in hibernation.
-- **Automated Snapshot Creation:** Periodically creates new snapshots to extend immutability by triggering the compaction job.
-- **Leverages Existing Mechanism:** Utilizes the compaction job, which is already part of the system.
+Hibernation typically involves scaling down an etcd cluster or other resources to reduce costs or free up resources during inactive periods. The specific implementation and processes can vary based on the setup and tools used. However, the primary objective is to pause operations while maintaining the state for future resumption.
 
-##### Disadvantages
+When an etcd cluster is hibernated for a duration exceeding the immutability period of its backups, the backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)). This compromises the intended immutability guarantees and may expose the backups to accidental or malicious modifications.
 
-- **Resource Consumption:** Starting an embedded etcd instance periodically consumes resources.
+To address this issue, the authors propose a solution that maintains the immutability of snapshots during extended hibernation periods.
 
-#### Approach 2: Re-upload of the latest snapshot
+#### Proposed Solution: Re-uploading the Latest Snapshot
 
-**Proposed Solution:**
+The authors propose a mechanism to periodically extend the immutability period of the latest snapshot during hibernation. This is achieved by re-uploading the latest snapshot, which resets the immutability period because the immutability countdown starts from the time of upload.
 
-A new `EtcdOperatorTask` called `ExtendEtcdSnapshotImmutabilityTask` will be created as defined in the operator tasks framework. This new `EtcdOperatorTask` extends the immutability period by deploying a job which uploads another copy of the latest snapshot to the object store.
-
-A full snapshot is taken before hibernating the etcd cluster. This is to ensure that no state maintained in the etcd cluster is lost before hibernation.
+This handling of hibernated clusters is a scenario where the [etcd operator-tasks](./05-etcd-operator-tasks.md) framework can be effectively utilized. Therefore, the authors will leverage the operator tasks framework in the design of this solution.
 
 **Implementation Details:**
 
-- **Introduce the `renew-snapshot` command to etcdbrctl**:
-  - etcd-backup-restore will be enhanced to support a new command `renew-snapshot` which does the following:
-    - - Downloads the latest full snapshot from the object store.
-    - Replaces the Unix epoch in the file-name of the downloaded snapshot to contain the time at which the file completes downloading.
-    - Uploads this newly renamed snapshot to the same object store.
-    - Renews the full snapshot lease after the upload is successful.
+- **Introduce the `renew-snapshot` Command to `etcdbrctl`:**
 
-    The immutability period of an object begins from the moment of upload, thus renewing the immutability period of the snapshot. Renaming the snapshot is necessary since the downloaded snapshot can not simply be re-uploaded as uploading with the same name would be an attempt at modifying an already existing snapshot, which is disallowed.
+  - `etcd-backup-restore` will be enhanced to support a new command `renew-snapshot`, which performs the following steps:
+    - Downloads the latest full snapshot from the object store.
+    - Renames the snapshot by updating the Unix epoch in the filename to reflect the time at which the download completed.
+    - Uploads the renamed snapshot back to the object store.
+    - Updates the full snapshot lease after the successful upload.
 
-    This command could either be implemented standalone, or could be implemented as a wrapper over the `copy` command of `etcdbrctl` by extending the functionality of the `copy` command accordingly.
-- **Introduce the `garbage-collect` command to etcdbrctl**:
-  - etcd-backup-restore will be enhanced to support a new command `garbage-collect` which does the following:
-    - Perform garbage collection of the snapshots in the object store according to the policy specified with the `--garbage-collection-policy` flag.  
+  The immutability period of an object begins from the moment of upload, so re-uploading the snapshot renews its immutability period. Renaming the snapshot is necessary because uploading with the same name would be considered an attempt to modify an existing snapshot, which is disallowed under immutability policies.
 
-    This functionality is needed to garbage collect the old snapshots whose immutability has expired but have been renewed as a fresh snapshot via the approach mentioned above.
-- **Update `Etcd` CRD:**
-  - Add `etcd.spec.hibernation`:  
-    Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
+- **Introduce the `garbage-collect` Command to `etcdbrctl`:**
 
-    ```yaml
-    hibernation:
-      enabled: <bool>
-    ```
+  - `etcd-backup-restore` will be enhanced with a new command `garbage-collect`, which:
+    - Performs garbage collection of snapshots in the object store according to `etcd-backup-restore`'s [garbage collection policy](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md#gc-policies).
 
-  - Add `etcd.status.hibernatedAt`:  
-    This field conveys information about whether the cluster has been successfully hibernated after a reconciliation, and the time at which it entered hibernation. This field is cleared when the cluster is woken up from hibernation.
+  This functionality is needed to remove old snapshots whose immutability has expired, preventing storage from growing indefinitely.
 
-    ```yaml
-    hibernatedAt: <hibernation-time>
-    ```
+- **Update the `Etcd` CRD:**
 
-  - Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
-- **etcd Controller Logic:**
-  - When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
-    - The controller removes the etcd client ports `2380` and `2379` from the `etcd-client` service, leaving only the etcd-backup-restore port `8080`. This stops etcd client traffic.
-    - The controller creates a `EtcdOperatorTask` to trigger an on-demand full snapshot.
-      - On-demand full snapshot is successful: the controller does no additional handling.
-      - On-demand full snapshot fails: the controller triggers the creation of a `EtcdOperatorTask` for an on-demand compaction job that compacts the latest base full snapshot and the corresponding deltas.
-    - The controller scales in the etcd cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
-    - The controller creates the `ExtendEtcdSnapshotImmutabilityTask` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
-
-- **`ExtendEtcdSnapshotImmutabilityTask` specification:**
-  - Run `etcdbrctl renew-snapshot` to extend the immutability of the latest snapshot.
-  - Run `etcdbrctl garbage-collect --garbage-collection-policy <garbage-collection-policy>` to garbage collect the snapshots that are created during the extension.
+  - Add a new field `immutability.retentionType` under `etcd.spec.backup.store` to specify the type of immutability configuration.
 
-- **EtcdOperatorTask Controller Logic:**
-  - The operator-tasks controller will react to the creation of the custom resource, and will deploy a job named `<etcd-name>-extend-immutability` which is the `ExtendEtcdSnapshotImmutabilityTask` job.
-  - The controller also reports metrics regarding the `ExtendEtcdSnapshotImmutabilityTask` job, which can be used to raise alerts for operators that immutability has not been extended.
+- **etcd Controller Logic:**
 
-##### Advantages
+  - When hibernation is requested:
+    - The controller removes the etcd client port `2379` and peer port `2380` from the `etcd-client` service, leaving only the `etcd-backup-restore` port `8080`, effectively stopping etcd client traffic.
+    - The controller creates an `EtcdOperatorTask` to trigger an on-demand full snapshot.
+    - The controller scales down the etcd cluster by setting `StatefulSet.spec.replicas` to zero.
+    - The controller periodically creates the `ExtendEtcdSnapshotImmutabilityTask`, on a schedule based on `etcd.spec.backup.fullSnapshotSchedule`, if `etcd.spec.backup.store.immutability.retentionType` is set to `"Bucket"`.
 
-- **Minimal Operational Impact:** Does not alter the etcd cluster's state during hibernation and respects the operator's intention to hibernate the cluster without unintended changes.
-- **Efficient Resource Utilization:** Only the latest snapshot is copied, limiting additional storage usage, and avoids the need to start an embedded etcd instance.
-- **Automated Process:** The process of taking a full snapshot before hibernation and creating the `ExtendEtcdSnapshotImmutabilityTask` is automated within the controller.
+- **`ExtendEtcdSnapshotImmutabilityTask` Specification:**
 
-##### Disadvantages
+  - Runs `etcdbrctl renew-snapshot` to extend the immutability of the latest snapshot.
+  - Runs `etcdbrctl garbage-collect --garbage-collection-policy <policy>` to remove old snapshots.
 
-- **Additional Complexity:** Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.
-- **Prerequisite Requirement:** Relies on successfully taking a full snapshot before hibernation, which may introduce delays or require handling snapshot failures.
+- **EtcdOperatorTask Controller Logic:**
 
-#### Recommendation
+  - The operator-tasks controller reacts to the creation of the custom resource and deploys a job named `<etcd-name>-extend-immutability`, which is the `ExtendEtcdSnapshotImmutabilityTask` job.
+  - The controller reports metrics regarding the `ExtendEtcdSnapshotImmutabilityTask` job, which can be used to raise alerts for operators if immutability has not been extended.
 
-After evaluating both approaches, **Approach 2: Re-upload of the latest snapshot** is recommended due to its minimal operational impact and efficient resource utilization. By ensuring that a full snapshot is taken before hibernation, data consistency is maintained, and the immutability period is extended effectively. This approach respects the operator's intention to keep the etcd cluster hibernated without introducing significant resource consumption or complexity.
+Periodically re-uploading the latest snapshot during hibernation extends the immutability period, so the backups remain **protected throughout the hibernation period**.
 
 ## Compatibility
 
@@ -264,44 +173,17 @@ The proposed changes are fully compatible with existing etcd clusters and backup
 ## Risks and Mitigations
 
 - **Increased Storage Costs:**
-
   - **Risk:** Copying snapshots or frequent snapshots may lead to increased storage usage.
-  - **Mitigation:** Since only the latest full snapshot is copied in Approach 2, the additional storage usage is minimal. Garbage collection helps manage storage utilization.
+  - **Mitigation:** Since only the latest full snapshot is copied, the additional storage usage is minimal. Garbage collection helps manage storage utilization.
 
 - **Operational Complexity:**
-
   - **Risk:** Introducing new resources and processes might add complexity.
   - **Mitigation:** The processes are automated within the controller, requiring minimal operator intervention. Clear documentation and tooling support will help manage complexity.
 
 - **Failed Snapshot Before Hibernation:**
-
   - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
   - **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.
 
-- **Failed Operations:**
-
-  - **Risk:** Errors during copy or snapshot operations could lead to incomplete backups.
-  - **Mitigation:** Implement robust error handling and retries in the copier and compaction job logic. Ensure proper logging and alerting.
-
-## Operational Considerations
-
-Operators need to:
-
-- **Configure Buckets:**
-
-  - Set up buckets with appropriate immutability settings before deploying etcd clusters.
-  - Ensure immutability periods align with organizational policies.
-
-- **Monitor Hibernation Processes:**
-
-  - Keep track of hibernated clusters and ensure that full snapshots are taken before hibernation.
-  - Verify that `EtcdCopyBackupsTask` resources are created and executed as expected.
-
-- **Review Retention Policies:**
-
-  - Set `maxBackups` and `maxBackupAge` in the `EtcdCopyBackupsTask` to manage storage utilization effectively.
-  - Configure the `--hibernation-snapshot-interval` for the compaction job if using Approach 1.
-
 ## Alternatives
 
 ### Object-Level Immutability Policies vs. Bucket-Level Immutability Policies
@@ -313,13 +195,11 @@ An alternative to implementing immutability via bucket-level immutability polici
 Major cloud storage providers such as Google Cloud Storage (GCS), Amazon S3, and Azure Blob Storage (ABS) support both bucket-level and object-level immutability mechanisms to enforce data immutability.
 
 1. **Bucket-Level Immutability Policies:**
-
    - **Applies Uniformly:** Applies a uniform immutability period to all objects within a bucket.
    - **Immutable Objects:** Once set, objects cannot be modified or deleted until the immutability period expires.
    - **Simplified Management:** Simplifies management by applying the same policy to all objects.
 
 2. **Object-Level Immutability Policies:**
-
    - **Granular Control:** Allows setting immutability periods on a per-object basis.
    - **Flexible Immutability Durations:** Offers granular control, enabling different immutability durations for individual objects.
    - **Varying Requirements:** Can accommodate varying immutability requirements for different types of backups.
@@ -328,6 +208,8 @@ Major cloud storage providers such as Google Cloud Storage (GCS), Amazon S3, and
 
 Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.
 
+- Enabling object-level immutability requires bucket-level immutability to be set first (applicable to S3 and ABS). In GCS, object-level immutability cannot be enabled on an existing bucket.
+
 **Advantages:**
 
 - **Granular Control:** Allows setting different immutability periods for different objects.
@@ -342,7 +224,7 @@ Using object-level immutability provides flexibility in scenarios where certain
 
 #### Conclusion
 
-Given the complexities and limitations, the authors recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
+Given the complexities and limitations, the authors recommend using bucket-level immutability in conjunction with the `ExtendEtcdSnapshotImmutabilityTask` approach to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements.
 
 ##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
 
@@ -357,7 +239,7 @@ Given the complexities and limitations, the authors recommend using bucket-level
 | Support for enabling object-level immutability in new buckets             | Yes | Yes | Yes                           |
 | Precedence between bucket-level and object-level immutability periods     | Max(bucket, object)| Object-level| Max(bucket, object) |
 
-> **Note:** *In AWS S3, changes to the bucket-level immutability period can be blocked by adding a specific bucket policy.
+> **Note:** In AWS S3, changes to the bucket-level immutability period can be blocked by adding a specific bucket policy.
 
 ---
 
@@ -367,4 +249,4 @@ Given the complexities and limitations, the authors recommend using bucket-level
 - [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
 - [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
 
----
+---
\ No newline at end of file

From 39255f7aa6ef0ceb8001cce2d12a54b1411433cb Mon Sep 17 00:00:00 2001
From: Seshachalam Yerasala Venkata <seshachalam.yerasala.venkata@sap.com>
Date: Thu, 5 Dec 2024 18:23:18 +0530
Subject: [PATCH 7/7] addressed feedback

---
 docs/proposals/06-immutable-etcd-backups.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/proposals/06-immutable-etcd-backups.md b/docs/proposals/06-immutable-etcd-backups.md
index 6fb14d150..08b0eb47c 100644
--- a/docs/proposals/06-immutable-etcd-backups.md
+++ b/docs/proposals/06-immutable-etcd-backups.md
@@ -25,6 +25,8 @@ This [WORM](#terminology) model will enhance the reliability and integrity of et
 
 `etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality currently achieved by modifying or deleting backups, but without actually modifying or deleting these backups, since they will now be immutable for a set duration. This approach eliminates the possibility of potential data loss. `etcd-druid` will provide an end-to-end solution for achieving this functionality, as relying solely on `etcd-backup-restore` is insufficient given the scope and possible approaches to achieving this.
 
+Additionally, handling [hibernation](#terminology) for immutable backups presents a unique challenge. When an `etcd` cluster is hibernated for a duration exceeding the immutability period of its backups, the backups may become mutable again, compromising the intended immutability guarantees and exposing the backups to accidental or malicious modifications. To address this, the authors propose a solution to maintain the immutability of snapshots during extended [hibernation](#terminology) periods.
+
 ## Terminology
 
 - **etcd-druid:** A controller that manages etcd clusters, including provisioning and lifecycle handling.
@@ -181,8 +183,8 @@ The proposed changes are fully compatible with existing etcd clusters and backup
   - **Mitigation:** The processes are automated within the controller, requiring minimal operator intervention. Clear documentation and tooling support will help manage complexity.
 
 - **Failed Snapshot Before Hibernation:**
-  - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
-  - **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.
+  - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process and potentially compromise data integrity.
+  - **Mitigation:** Implement robust error handling and retry mechanisms to ensure snapshots are taken successfully before hibernation. Notify operators of any failures by updating the `etcd.status.lastErrors` and `etcd.status.lastOperation` fields. Additionally, operators can leverage the metrics provided by `ExtendEtcdSnapshotImmutabilityTask`, which follows the [operator task framework](https://github.com/gardener/etcd-druid/blob/master/docs/proposals/05-etcd-operator-tasks.md#metrics), to trigger alerts. This ensures timely intervention and resolution of issues, maintaining the integrity and availability of the etcd cluster state.
 
 ## Alternatives
 
@@ -224,7 +226,7 @@ Using object-level immutability provides flexibility in scenarios where certain
 
 #### Conclusion
 
-Given the complexities and limitations, the authors recommend using bucket-level immutability in conjunction with the `ExtendEtcdSnapshotImmutabilityTask` approach to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements.
+**Given the complexities and limitations, the authors recommend using bucket-level immutability in conjunction with the `ExtendEtcdSnapshotImmutabilityTask` approach to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements.**
 
 ##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
 
@@ -249,4 +251,4 @@ Given the complexities and limitations, the authors recommend using bucket-level
 - [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
 - [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
 
----
\ No newline at end of file
+---