Safekeeper peer recovery RFC #4875
base: main
Conversation
1264 tests run: 1214 passed, 0 failed, 50 skipped (full report)
Branch updated from bda28b5 to 5ac3f86.
> missing part to pg_wal and holds WAL from this horizon onwards. This is
> problematic, because
>
> 1. If one safekeeper is down and/or lagging, pg_wal eventually explodes -- we intentionally don't have much space on computes.
Explanation of "explodes" for an external reader: the project becomes stuck and cannot start in this case. We download WAL before starting the active phase of walproposer and doing everything else, and because of that walproposer cannot finish sync-safekeepers (compute disk space is limited).
> Proposed solution is to teach safekeepers to fetch WAL directly from peers,
> respecting consensus rules. Namely,
> - On start, walproposer won't download WAL at all -- it will have it only since
How will walproposer sync-safekeepers work in this case? Usually it determines `propEpochStartLsn` and waits until WAL is streamed and committed up to that point. Here it's almost the same, but walproposer doesn't control WAL streaming and should somehow wait until `flushLsn` reaches `propEpochStartLsn` everywhere. Only then can it commit that LSN and finish the work.

The main inconvenience here is that WAL fetching from peers happens in the background, and it will be harder to reason about WAL streaming inside walproposer:

- Currently, if something goes wrong, we usually see it in the compute logs. After switching to peer recovery, we should implement something like this, to at least distinguish the case where everything is stuck from the case where we are just waiting for a huge lag to be fetched. This will help debugging.
- How will walproposer wait for `flushLsn`? Will it just reconnect every second until it's `>= propEpochStartLsn`, or will the safekeeper send updates to walproposer?
Not much changes here: sync-safekeepers would wait for `commit_lsn` to reach `propEpochStartLsn` in the same way. There is a question of how walproposer should handle the case when it doesn't have WAL for a particular safekeeper. The simplest way seems to be to check for this during each read and send an ERROR to the safekeeper if WAL is missing, logging and closing the connection. Then, during recovery, walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs, etc., rinse and repeat each second.
Alternatively we could add a special 'waiting' state to the protocol, but it is harder and isn't very useful.
So an addition here is a 'lowest available LSN' variable tracked in walproposer; it seems we don't have it currently.
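To make the 'lowest available LSN' idea concrete, here is a minimal sketch of the per-read check that would produce the 'no WAL' ERROR. All names (`WalStore`, `lowest_available_lsn`, `check_available`) are hypothetical; the real walproposer is C code and its structures look different.

```rust
// Hypothetical sketch of the "no WAL below this point" check discussed above.
// Not the actual walproposer code; names and types are illustrative.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

struct WalStore {
    /// Oldest LSN still available in local pg_wal; everything below it was
    /// never downloaded (or has already been removed).
    lowest_available_lsn: Lsn,
}

impl WalStore {
    /// Called on each read before sending WAL to a safekeeper. Returns an error
    /// (which the caller would turn into an ERROR message and a closed
    /// connection) when the requested position is older than what we keep.
    fn check_available(&self, start_pos: Lsn) -> Result<(), String> {
        if start_pos < self.lowest_available_lsn {
            return Err(format!(
                "no WAL available at {:?}, lowest available is {:?}; \
                 expecting peer recovery to catch the safekeeper up",
                start_pos, self.lowest_available_lsn
            ));
        }
        Ok(())
    }
}

fn main() {
    let store = WalStore { lowest_available_lsn: Lsn(0x2000_0000) };
    // A lagging safekeeper asks for WAL we no longer (or never) had:
    println!("{:?}", store.check_available(Lsn(0x1000_0000)));
    // A caught-up safekeeper is fine:
    println!("{:?}", store.check_available(Lsn(0x3000_0000)));
}
```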
> Then, during recovery, walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs, etc., rinse and repeat each second.

Then it means that sync-safekeepers will work like this, retrying the connection every second until 2/3 of the safekeepers reach `propEpochStartLsn`. If peer recovery aligns `flushLsn` on the safekeepers, it will work fine, but otherwise (if the peer recovery condition is not met, for example), sync-safekeepers will take at least a second.
Yes. But practically it should almost never happen, since aligning happens in the background even in the compute's absence.
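For illustration, a simplified sketch of the quorum rule this exchange relies on: `commit_lsn` can advance to `propEpochStartLsn` as soon as 2/3 of the safekeepers report a flush position at or above it. The function name and layout are assumptions, not the actual sync-safekeepers code.

```rust
// Illustrative only: commit_lsn derived from reported flush positions.
// With n = 3 and quorum = 2, commit_lsn is the 2nd highest flush_lsn, so it
// advances as soon as 2/3 of the safekeepers reach propEpochStartLsn.

fn quorum_commit_lsn(mut flush_lsns: Vec<u64>, quorum: usize) -> u64 {
    assert!(quorum >= 1 && quorum <= flush_lsns.len());
    // Sort descending; the quorum-th highest position is replicated on at
    // least `quorum` safekeepers and hence committable.
    flush_lsns.sort_unstable_by(|a, b| b.cmp(a));
    flush_lsns[quorum - 1]
}

fn main() {
    let prop_epoch_start_lsn = 0x500;
    // One safekeeper lags: commit still reaches propEpochStartLsn via 2/3.
    let commit = quorum_commit_lsn(vec![0x600, 0x550, 0x100], 2);
    assert!(commit >= prop_epoch_start_lsn);
    println!("commit_lsn = {:#x}", commit);
}
```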
> == donor's term leader. That is, sk A will fetch WAL from sk B if
> 1) B's (last_log_term, LSN) is higher than A's (last_log_term, LSN) *and*
> 2) A's term <= B's term -- otherwise append request can't be accepted.
> 3) B's term == B's last_log_term -- to ensure that such a leader was ever elected in
Mostly a note to myself: `last_log_term` is `self.term_history.up_to(flush_lsn).last()`. Then we know that `B's term == B's last_log_term` means that the safekeeper got a `ProposerElected` message and `flush_lsn >= epoch_start_lsn`.
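A minimal sketch of conditions 1-3 as a single predicate, assuming illustrative field and type names (`SkState`, `last_log_term`, `flush_lsn`, `term`) rather than the safekeeper's actual structures:

```rust
// Sketch of the donor eligibility check from conditions 1-3 above.
// Names are illustrative, not the safekeeper's actual API.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

#[derive(Clone, Copy)]
struct SkState {
    term: u64,          // current term the safekeeper has voted in
    last_log_term: u64, // term of the last WAL record: term_history.up_to(flush_lsn).last()
    flush_lsn: Lsn,
}

/// Should safekeeper `a` recover from potential donor `b`?
fn can_recover_from(a: &SkState, b: &SkState) -> bool {
    // 1) Donor's log is strictly ahead by (last_log_term, flush_lsn).
    let donor_ahead = (b.last_log_term, b.flush_lsn) > (a.last_log_term, a.flush_lsn);
    // 2) Our term must not exceed the donor's, otherwise its append requests
    //    couldn't be accepted anyway.
    let term_ok = a.term <= b.term;
    // 3) Donor's term equals its last_log_term, i.e. a leader with that term
    //    was actually elected and the donor saw its ProposerElected message.
    let donor_is_leader_epoch = b.term == b.last_log_term;
    donor_ahead && term_ok && donor_is_leader_epoch
}

fn main() {
    let lagging = SkState { term: 5, last_log_term: 4, flush_lsn: Lsn(0x100) };
    let donor = SkState { term: 5, last_log_term: 5, flush_lsn: Lsn(0x900) };
    assert!(can_recover_from(&lagging, &donor));
    println!("lagging safekeeper would recover from donor");
}
```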
> ### Start/stop criterion
>
> Recovery shouldn't prevent actively streaming compute -- we don't skip records,
What does "actively streaming compute" mean exactly? Is it 1) a compute connected to at least one safekeeper, 2) a compute connected to the current safekeeper, or 3) an elected compute that pushes WAL to the current safekeeper?
It is 3), an elected compute that pushes WAL to the current safekeeper.
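To make that start criterion concrete, a hedged sketch under the interpretation above: recovery starts only when no elected compute is pushing WAL to this safekeeper and some peer is ahead of it. All names (`last_append_from_leader_at`, the 5-second activity window) are invented for illustration.

```rust
// Sketch of a "should recovery start?" check; not the actual implementation.
use std::time::{Duration, Instant};

struct TimelineView {
    my_flush_lsn: u64,
    best_peer_flush_lsn: u64,
    // When we last received an AppendRequest from a walproposer whose term
    // equals our current term (i.e. the elected leader is streaming to us).
    last_append_from_leader_at: Option<Instant>,
}

fn should_start_recovery(tl: &TimelineView, now: Instant) -> bool {
    let leader_streaming = tl
        .last_append_from_leader_at
        .map(|t| now.duration_since(t) < Duration::from_secs(5))
        .unwrap_or(false);
    let something_to_fetch = tl.best_peer_flush_lsn > tl.my_flush_lsn;
    !leader_streaming && something_to_fetch
}

fn main() {
    let tl = TimelineView {
        my_flush_lsn: 0x100,
        best_peer_flush_lsn: 0x900,
        last_append_from_leader_at: None,
    };
    assert!(should_start_recovery(&tl, Instant::now()));
}
```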
> respecting consensus rules. Namely,
> - On start, walproposer won't download WAL at all -- it will have it only since
>   writing position. As WAL grows it should also keep some fixed number of
>   latest segments (~20) to provide gradual switch from peer recovery to walproposer
How will this switch work? It's not clear what the walproposer<->safekeeper communication will look like to make the switch.
See above, hopefully another comment clarifies this.
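As a back-of-the-envelope illustration of the "~20 latest segments" rule quoted above: the compute only retains WAL from roughly `flush_lsn - 20 segments`, segment-aligned, so peer recovery must bring a lagging safekeeper within that window before walproposer can take over streaming. The constants and names below are assumptions, not the actual implementation.

```rust
// Illustrative computation of the lowest LSN a compute would keep on disk.
const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024; // default 16 MiB segments
const KEPT_SEGMENTS: u64 = 20;

/// Lowest LSN the compute still has in pg_wal for a given write position.
fn lowest_kept_lsn(flush_lsn: u64) -> u64 {
    let horizon = flush_lsn.saturating_sub(KEPT_SEGMENTS * WAL_SEGMENT_SIZE);
    // Align down to a segment boundary, since WAL is removed per segment.
    horizon - horizon % WAL_SEGMENT_SIZE
}

fn main() {
    let flush_lsn = 0x5_4321_0000u64;
    println!("compute keeps WAL from {:#x}", lowest_kept_lsn(flush_lsn));
}
```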
Also add both walsenders and walreceivers to TimelineStatus (available under v1/tenant/xxx/timeline/yyy). Prepares for #4875
Slightly refactors init: now load_tenant_timelines is also async to properly init the timeline, but to keep global map lock sync we just acquire it anew for each timeline. Recovery task itself is just a stub here. part of #4875
Add derive Ord for easy comparison of <term, lsn> pairs. part of #4875
We need them for safekeeper peer recovery #4875
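For readers unfamiliar with what the `derive Ord` commit buys: with the derive, `<term, lsn>` pairs compare lexicographically, which is exactly the ordering used when picking the most advanced safekeeper. A tiny illustrative example; the type name is made up.

```rust
// derive(Ord) compares fields in declaration order, so term dominates and
// lsn breaks ties -- the standard <term, lsn> ordering.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct TermLsn {
    term: u64,
    lsn: u64,
}

fn main() {
    let a = TermLsn { term: 4, lsn: 0x900 }; // older term, more WAL
    let b = TermLsn { term: 5, lsn: 0x100 }; // newer term, less WAL
    assert!(b > a); // term wins over lsn
    println!("{:?} > {:?}", b, a);
}
```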
Implements fetching of WAL by a safekeeper from another safekeeper by imitating the behaviour of the last elected leader. This avoids WAL accumulation on the compute and facilitates faster compute startup, as it doesn't need to download any WAL. Actually removing the WAL download in walproposer is a matter of another patch, though. There is a per-timeline task which always runs, regularly checking whether it should start recovery from someone, meaning there is something to fetch and there is no streaming compute. It then proceeds with fetching, finishing when there is nothing more to receive. Implements #4875
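A rough, illustrative shape of the per-timeline recovery task described in this commit message; all names and the donor-streaming stub are placeholders, not the actual safekeeper code.

```rust
// Always-running per-timeline loop: periodically check whether recovery should
// start, stream from the chosen donor, and return to waiting when caught up.
use std::thread;
use std::time::Duration;

#[derive(Clone, Copy)]
struct Donor {
    flush_lsn: u64,
}

struct Timeline {
    flush_lsn: u64,
}

impl Timeline {
    /// Returns a donor if recovery should start (peer ahead, no active compute).
    fn recovery_needed(&self) -> Option<Donor> {
        // Placeholder: real logic would apply the donor conditions from the RFC.
        if self.flush_lsn < 0x400 { Some(Donor { flush_lsn: 0x400 }) } else { None }
    }

    /// Fetch WAL from the donor until we've caught up to its flush_lsn.
    fn recover_from(&mut self, donor: Donor) {
        while self.flush_lsn < donor.flush_lsn {
            // Placeholder for receiving and applying one batch of WAL.
            self.flush_lsn += 0x100;
        }
    }
}

fn recovery_task(mut tl: Timeline, check_period: Duration, iterations: usize) {
    for _ in 0..iterations {
        if let Some(donor) = tl.recovery_needed() {
            tl.recover_from(donor);
        }
        thread::sleep(check_period);
    }
    println!("caught up to {:#x}", tl.flush_lsn);
}

fn main() {
    recovery_task(Timeline { flush_lsn: 0x100 }, Duration::from_millis(10), 3);
}
```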
https://github.com/neondatabase/neon/blob/bda28b53a0dcfbaa5b363aa2451c513062b8afc3/docs/rfcs/025-sk-peer-recovery.md