[Feature] Enhanced snapshot compaction based on events size #591

shreyas-s-rao · 2023-05-03T08:55:33Z

Feature (What you would like to be added):
Smarter snapshot compaction based on events size.

Motivation (Why is this needed?):
Snapshot compaction today is based purely on the number of etcd events recorded in the delta snapshots: druid uses this count to decide when to trigger a compaction job that compacts the latest set of delta snapshots into the latest full snapshot. This approach does not consider the size of the events recorded in the delta snapshots. Events can be quite large, up to 1.5 MB, which is the default limit on event size defined by etcd.

To put things in perspective, the default etcd events threshold for druid is 1 million events. In the worst case, if each event is 1.5 MB, the total size of the accumulated delta snapshots can reach 1.5 million MB, i.e., 1.5 TB. Compacting a backup of this size is practically impossible for the following reasons:

- During the restoration step of compaction, etcd cannot handle this scale of data, since the DB is not defragmented during restoration.
- Even if the DB size is kept in check during restoration by defragmenting multiple times while applying the events to etcd, the time it takes to compact such a DB is impractically long.
- Letting the backup size blow up means a very long restoration time when an actual restoration is needed after a PV corruption.

We cannot rely on the number of etcd events alone to trigger snapshot compactions.
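
The worst-case figure above can be reproduced with a small calculation. This is only an illustrative sketch: the constant and function names are hypothetical and not part of druid, and it uses decimal units (1 TB = 1,000,000 MB) as the text does.

```go
package main

import "fmt"

const (
	// Default delta-event threshold for druid, as stated in this issue.
	defaultEventsThreshold = 1_000_000
	// Worst-case size of a single event in MB, per the etcd default cited in this issue.
	maxEventSizeMB = 1.5
)

// worstCaseDeltaSizeTB (hypothetical helper) returns the worst-case accumulated
// size of delta snapshots in TB, assuming every event has the given size in MB
// and 1 TB = 1,000,000 MB.
func worstCaseDeltaSizeTB(events int, eventSizeMB float64) float64 {
	return float64(events) * eventSizeMB / 1_000_000
}

func main() {
	fmt.Printf("%.1f TB\n", worstCaseDeltaSizeTB(defaultEventsThreshold, maxEventSizeMB))
}
```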

Approach/Hint to implement the solution (optional):
Druid should also consider the total size of the accumulated events and trigger compaction once a configurable threshold is reached. To get this information, druid must depend on the EtcdMember[State] resource, through which the backup-restore sidecar publishes snapshot information for druid to consume and act upon.
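
A minimal sketch of such a decision, assuming druid can obtain both the event count and the accumulated byte size of the delta snapshots. All type, field, and function names here are hypothetical; they do not reflect the actual druid code or the EtcdMember[State] schema, and the 2 GiB size threshold is purely illustrative.

```go
package main

import "fmt"

// CompactionThresholds is a hypothetical configuration holding both limits.
type CompactionThresholds struct {
	MaxEvents int64 // existing trigger: number of accumulated delta events
	MaxBytes  int64 // proposed trigger: total size of accumulated delta events
}

// DeltaStats is a hypothetical summary of the delta snapshots accumulated
// since the latest full snapshot, as the sidecar might publish it.
type DeltaStats struct {
	Events int64
	Bytes  int64
}

// shouldCompact reports whether a compaction job should be triggered.
// Compaction fires as soon as either threshold is reached, so a small
// number of very large events can no longer go unnoticed.
func shouldCompact(s DeltaStats, t CompactionThresholds) bool {
	return s.Events >= t.MaxEvents || s.Bytes >= t.MaxBytes
}

func main() {
	limits := CompactionThresholds{
		MaxEvents: 1_000_000,
		MaxBytes:  2 << 30, // 2 GiB, an illustrative value only
	}

	// Few events, but each close to 1.5 MiB: the size threshold decides.
	fewLarge := DeltaStats{Events: 2000, Bytes: 2000 * 1572864}
	// Many small events, well under both limits.
	manySmall := DeltaStats{Events: 500_000, Bytes: 100 << 20}

	fmt.Println("few large events:", shouldCompact(fewLarge, limits))
	fmt.Println("many small events:", shouldCompact(manySmall, limits))
}
```

Combining the two conditions with a logical OR keeps the current count-based behaviour intact while adding the size-based trigger on top of it.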

shreyas-s-rao · 2023-05-03T09:18:43Z

/assign @abdasgupta

gardener-robot assigned abdasgupta May 3, 2023

unmarshall changed the title ~~[Feature] Smarter snapshot compaction based on events size~~ [Feature] Improved snapshot compaction based on events size Jul 21, 2023

unmarshall changed the title ~~[Feature] Improved snapshot compaction based on events size~~ [Feature] Enhanced snapshot compaction based on events size Jul 21, 2023

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 1, 2024

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 9, 2024
