
Safekeeper peer recovery rfc. #4875

Open · wants to merge 1 commit into main

Conversation

arssher (Contributor) commented Aug 2, 2023

github-actions bot commented Aug 2, 2023

1264 tests run: 1214 passed, 0 failed, 50 skipped (full report)


missing part to pg_wal and keeps holding WAL from this horizon onwards. This is
problematic, because

1. If one safekeeper is down and/or lagging, pg_wal eventually explodes -- we intentionally don't have much space on computes.
Member

Explanation of "explodes" for an external reader: the project becomes stuck and cannot start in this case.

We download WAL before starting the active phase of walproposer and doing everything else, and because of that walproposer cannot finish sync-safekeepers (compute disk is limited).


Proposed solution is to teach safekeepers to fetch WAL directly from peers,
respecting consensus rules. Namely,
- On start, walproposer won't download WAL at all -- it will have it only since
Member

How will walproposer sync-safekeepers work in this case? Usually it determines propEpochStartLsn and waits until WAL is streamed and committed up to that point. Here it's almost the same, but walproposer doesn't control WAL streaming and should somehow wait until flushLsn reaches propEpochStartLsn everywhere. Only then can it commit that LSN and finish the work.

The main inconvenience here is that WAL fetching from peers happens in the background, so it will be harder to know about WAL streaming inside walproposer:

  • Currently, if something goes wrong, we usually see it in the compute logs. After switching to peer recovery, we should implement something like this, to at least distinguish the case when everything is stuck from the case when we are just waiting for a huge lag to be fetched. This will help debugging.
  • How will walproposer wait for flushLsn? Will it just reconnect every second until it's >= propEpochStartLsn, or will the safekeeper send updates to walproposer?

Contributor Author

Not much changes here: sync-safekeepers would wait for commit_lsn to reach propEpochStartLsn in the same way. There is a question of how walproposer should handle the case when it doesn't have WAL for a particular safekeeper. The simplest way seems to be to check for this during each read and send an ERROR to the safekeeper if WAL is missing, logging and closing the connection. Then during recovery the walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs etc, rinsing and repeating each second.

Alternatively we could add a special 'waiting' state to the protocol, but that is harder and not very useful.

So the addition here is a 'lowest available LSN' variable tracked in walproposer; it seems we don't have it currently.
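
For illustration, a minimal sketch of that check, assuming a hypothetical `lowest_available_lsn` field (walproposer itself is C code inside Postgres; the Rust below and all names in it are made up for the sketch, not the actual implementation):

```rust
/// Illustrative stand-in for walproposer state.
struct Lsn(u64);

struct WalProposer {
    /// Lowest LSN for which WAL is still present locally in pg_wal.
    lowest_available_lsn: Lsn,
}

impl WalProposer {
    /// Called when a safekeeper asks us to stream WAL starting at `start_lsn`.
    fn check_wal_available(&self, start_lsn: &Lsn) -> Result<(), String> {
        if start_lsn.0 < self.lowest_available_lsn.0 {
            // No WAL for this safekeeper: report ERROR, log and close the
            // connection; the safekeeper keeps recovering from peers and the
            // walproposer retries on the next connection (~every second).
            return Err(format!(
                "WAL at {:X} is not available, lowest available LSN is {:X}",
                start_lsn.0, self.lowest_available_lsn.0
            ));
        }
        Ok(())
    }
}
```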

Member

> Then during recovery the walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs etc, rinsing and repeating each second.

Then it means that sync-safekeepers will work like this, retrying the connection every second until 2/3 of safekeepers reach propEpochStartLsn. If peer recovery aligns flushLsn on the safekeepers, it will work fine, but otherwise (if the peer recovery condition is not met, for example) sync-safekeepers will take at least a second.

Contributor Author

Yes. But in practice it should almost never happen, since aligning should happen in the background even when the compute is absent.

== donor's term leader. That is, sk A will fetch WAL from sk B if
1) B's (last_log_term, LSN) is higher than A's (last_log_term, LSN) *and*
2) A's term <= B's term -- otherwise append request can't be accepted.
3) B's term == B's last_log_term -- to ensure that such a leader was ever elected in
Member

Mostly a note to myself: last_log_term is self.term_history.up_to(flush_lsn).last(). So B's term == B's last_log_term means that the safekeeper got a ProposerElected message and flush_lsn >= epoch_start_lsn.
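
For an external reader, the three donor conditions quoted above can be summarized as a small predicate; this is just an illustrative sketch (`TermLsn`, `PeerState` and `is_donor` are hypothetical names, not the actual safekeeper structs):

```rust
/// Lexicographic (term, lsn) position; derived Ord compares term first, then lsn.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct TermLsn {
    term: u64,
    lsn: u64,
}

struct PeerState {
    term: u64,
    last_log_term: u64,
    flush_lsn: u64,
}

/// Should safekeeper `a` fetch WAL from peer `b`?
fn is_donor(a: &PeerState, b: &PeerState) -> bool {
    let a_pos = TermLsn { term: a.last_log_term, lsn: a.flush_lsn };
    let b_pos = TermLsn { term: b.last_log_term, lsn: b.flush_lsn };
    // 1) B is ahead in (last_log_term, LSN) order.
    b_pos > a_pos
        // 2) A's term is not higher than B's, otherwise B's append requests
        //    could not be accepted by A.
        && a.term <= b.term
        // 3) B's term == B's last_log_term: B has processed ProposerElected
        //    from an actually elected leader and flushed past epoch_start_lsn.
        && b.term == b.last_log_term
}
```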


### Start/stop criterion

Recovery shouldn't prevent actively streaming compute -- we don't skip records,
Member

What does "actively streaming compute" mean exactly? Is it a compute connected to at least one safekeeper, a compute connected to the current safekeeper, or an elected compute that pushes WAL to the current safekeeper?

Contributor Author

It is the elected compute that pushes WAL to the current safekeeper, i.e. option 3).
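
A rough sketch of how that could be checked, assuming a hypothetical `last_append` timestamp and an arbitrary 5 second timeout (none of this is the actual safekeeper code):

```rust
use std::time::{Duration, Instant};

struct TimelineState {
    /// Our current term (bumped when we accept a newly elected walproposer).
    term: u64,
    /// Term of the walproposer whose AppendRequest we last accepted, and when.
    last_append: Option<(u64, Instant)>,
}

/// True if an elected compute (a walproposer in our current term) has pushed
/// WAL to this safekeeper recently; peer recovery should pause in that case.
fn compute_is_streaming(tl: &TimelineState, now: Instant) -> bool {
    match tl.last_append {
        Some((term, at)) => term == tl.term && now.duration_since(at) < Duration::from_secs(5),
        None => false,
    }
}
```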

respecting consensus rules. Namely,
- On start, walproposer won't download WAL at all -- it will have it only since
writing position. As WAL grows it should also keep some fixed number of
latest segments (~20) to provide gradual switch from peer recovery to walproposer
Member

How will this switch work? It's not clear what the walproposer<->safekeeper communication will look like to make the switch.

Contributor Author

See above, hopefully another comment clarifies this.
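
For illustration, the retention rule from the RFC text above ("keep some fixed number of latest segments") boils down to something like the following; the 16 MiB segment size and the exact count are assumptions, not final numbers:

```rust
// Hypothetical constants; the real values are a tuning decision.
const KEEP_SEGMENTS: u64 = 20;
const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// WAL below this LSN may be recycled from the compute's pg_wal: keeping the
/// last ~KEEP_SEGMENTS segments behind the write position lets a safekeeper
/// that finishes peer recovery switch to walproposer streaming without a gap.
fn wal_keep_horizon(write_lsn: u64) -> u64 {
    write_lsn.saturating_sub(KEEP_SEGMENTS * WAL_SEGMENT_SIZE)
}
```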

arssher added a commit that referenced this pull request Aug 11, 2023
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
#4875
arssher added a commit that referenced this pull request Aug 28, 2023
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep the global map lock sync we just acquire it anew
for each timeline.

Recovery task itself is just a stub here.

part of
#4875
arssher added a commit that referenced this pull request Aug 28, 2023
Add derive Ord for easy comparison of <term, lsn> pairs.

part of #4875
arssher added a commit that referenced this pull request Sep 5, 2023
Implements fetching of WAL by a safekeeper from another safekeeper by imitating
the behaviour of the last elected leader. This allows avoiding WAL accumulation
on the compute and facilitates faster compute startup, as it doesn't need to
download any WAL. Actually removing the WAL download in walproposer is a matter
of another patch, though.

There is a per-timeline task which always runs, checking regularly whether it
should start recovery from someone, meaning there is something to fetch and
there is no streaming compute. It then proceeds with fetching, finishing when
there is nothing more to receive.

Implements #4875
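
Schematically, the per-timeline task described here could look like the following outline; `Timeline`, `Donor`, `recovery_needed_from` and `recover_from` are stand-ins, not the actual code from the patch:

```rust
use std::time::Duration;

struct Donor {
    id: u64,
}
struct Timeline;

impl Timeline {
    /// Returns a donor if there is WAL to fetch and no elected compute is
    /// currently streaming to us (see the donor conditions above).
    async fn recovery_needed_from(&self) -> Option<Donor> {
        None
    }
    /// Stream WAL from the donor, imitating the last elected leader, until
    /// there is nothing more to receive.
    async fn recover_from(&self, _donor: &Donor) -> Result<(), String> {
        Ok(())
    }
}

/// Background task that always runs for the timeline.
async fn recovery_main_loop(tli: Timeline) {
    loop {
        if let Some(donor) = tli.recovery_needed_from().await {
            if let Err(e) = tli.recover_from(&donor).await {
                eprintln!("recovery from safekeeper {} failed: {e}", donor.id);
            }
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```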