[Feature] Enhanced snapshot compaction based on events size #591
Labels
area/control-plane: Control plane related
area/performance: Performance (across all domains, such as control plane, networking, storage, etc.) related
area/robustness: Robustness, reliability, resilience related
area/storage: Storage related
kind/enhancement: Enhancement, improvement, extension
lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
priority/1: Priority (lower number equals higher priority)
Feature (What you would like to be added):
Smarter snapshot compaction that takes the size of the recorded events into account, not just their count.
Motivation (Why is this needed?):
Snapshot compaction today is based purely on the number of etcd events recorded in the delta snapshots; druid uses this count to decide when to trigger a compaction job that compacts the latest set of delta snapshots into the latest full snapshot. This approach does not consider the size of the events recorded in the delta snapshots. Events can be quite large, up to 1.5 MB, which is the default event size threshold defined by etcd.
To put things in perspective, druid's default etcd-events threshold is 1 million events. In the worst case, if each event is 1.5 MB, the accumulated delta snapshots can grow to 1.5 million MB, i.e., 1.5 TB, before compaction is triggered. Compacting a backup of this size is practically infeasible, so we cannot rely on the number of etcd events alone to trigger snapshot compactions.
Approach/Hint to implement the solution (optional):
Druid should also consider the total size of the accumulated events and trigger compaction once a certain (configurable) threshold is reached. To obtain this information, druid must rely on the EtcdMember[State] resource, through which the backup-restore sidecar publishes snapshot information for druid to consume and act upon.
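As a rough illustration, the resulting trigger could check both thresholds, so that a small number of very large events can no longer delay compaction indefinitely. This is only a minimal Go sketch of the decision logic the proposal implies; the names (`CompactionConfig`, `SnapshotStats`, `shouldCompact`) and the example size threshold are hypothetical and not part of druid's actual API:

```go
// Package compaction sketches a size-aware compaction trigger.
// All identifiers here are illustrative, not druid's real types.
package compaction

// CompactionConfig holds the two (hypothetical) thresholds: druid's
// existing event-count threshold and the proposed accumulated-size one.
type CompactionConfig struct {
	EventsThreshold int64 // e.g. 1_000_000 events (druid's current default)
	SizeThreshold   int64 // e.g. some configurable byte count (assumed)
}

// SnapshotStats represents the information the backup-restore sidecar
// would publish via the EtcdMember[State] resource for druid to consume.
type SnapshotStats struct {
	AccumulatedEvents int64 // events recorded across the latest delta snapshots
	AccumulatedSize   int64 // total size in bytes of those delta snapshots
}

// shouldCompact triggers a compaction job when either threshold is
// crossed, instead of relying on the event count alone.
func shouldCompact(cfg CompactionConfig, stats SnapshotStats) bool {
	return stats.AccumulatedEvents >= cfg.EventsThreshold ||
		stats.AccumulatedSize >= cfg.SizeThreshold
}
```

Keeping both conditions preserves today's behavior for workloads with many small events while bounding the worst case for workloads with few but large events.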