We'll be truncating SDF Horizon's history retention to 1 year later this year. To our knowledge, most partners that enable history retention use 1-3 months of history, so it's possible that there could be issues that only present with this data profile (not retaining full history, but large retention window) that we simply haven't seen or heard of yet.
We should dry-run the truncation and mirror traffic to it for some amount of time, observing the performance impact and resolving any issues that arise from this process. Given the timing and the need to continue using staging to test/issue releases of Horizon prior to the truncation, we should not be doing this on the staging cluster and will need to spin up a new/independent one.
At a minimum:

- Upgrade PostgreSQL 12 ➡️ 16 (services/horizon: upgrade psql support to most recent versions #4831) (see the pg_upgrade sketch after this list)
- Enable reaping on that instance and set the retention to 1 year
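For the first item, a minimal sketch of what an in-place 12 ➡️ 16 upgrade could look like with stock pg_upgrade; the bin/data directories are illustrative, and a managed cluster would use its own upgrade mechanism instead:

```bash
# Hypothetical paths; adjust to however the new/independent cluster is provisioned.
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/12/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-datadir=/var/lib/postgresql/12/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --check   # run the compatibility check first, then repeat without --check
```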
There are different ways the reaping step can be accomplished, and we need time to evaluate them. For example, we could turn on reaping on the whole DB and see what happens (which may result in a lockup due to the massive amount of data that needs to be reaped, plus a possible full vacuum), or we could start from scratch, ingest a year+ of data, and then enable reaping, or there may be other options.
After discussion, it appears we must approach this by reaping the whole DB, because reingestion may take on the order of months. The hope 🤞 is that because the database will be much smaller, the full vacuum will be feasible without any extra operational concerns.
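For reference, a minimal sketch of what enabling reaping with a one-year window might look like on the test instance, assuming Horizon's `HISTORY_RETENTION_COUNT` setting, the `db reap` subcommand, and roughly 5-second ledger close times; the exact ledger count and the post-reap vacuum strategy would need to be confirmed before the real truncation:

```bash
# Assuming ~5s/ledger: one year is on the order of 365 * 24 * 3600 / 5 ≈ 6.3M ledgers.
export HISTORY_RETENTION_COUNT=6307200

# With a non-zero retention count the reaper trims history as ingestion advances;
# it can also be run on demand against the test DB:
stellar-horizon db reap

# Reclaiming disk space after the bulk delete likely needs a full vacuum
# (or an online alternative such as pg_repack); VACUUM FULL takes exclusive
# locks, which is the operational concern noted above.
psql "$DATABASE_URL" -c 'VACUUM FULL VERBOSE;'
```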
@tamirms, what is the latest status on spinning up the test db cluster for this reaping test effort? I think you mentioned it was in progress but blocked on PG16 issues?
I ask because @aditya1702 and I are triaging reports of the reaper SQL becoming non-performant in pubnet db ingestion deployments of Horizon, such as the reaper timeouts a community member reported in #5299 and #5320.
Triaging the reported problem is very similar to doing the dry-run validation effort; could we converge on this and join efforts to obtain reaper results in a staging environment, since it helps both cases?