
Safekeeper peer recovery rfc. #4875

Open · wants to merge 1 commit into main

Conversation

arssher (Contributor) commented Aug 2, 2023

github-actions bot commented Aug 2, 2023

1264 tests run: 1214 passed, 0 failed, 50 skipped (full report)


missing part to pg_wal and keeps holding WAL from this horizon onwards. This is
problematic, because

1. If one safekeeper is down and/or lagging, pg_wal eventually explodes -- we intentionally don't have much space on computes.
Member

Explanation of "explodes" for an external reader: the project becomes stuck and cannot start in this case.

We download WAL before starting the active phase of walproposer and doing everything else, and because of that walproposer cannot finish sync-safekeepers (compute disk is limited).


Proposed solution is to teach safekeepers to fetch WAL directly from peers,
respecting consensus rules. Namely,
- On start, walproposer won't download WAL at all -- it will have it only since
Member

How will walproposer sync-safekeepers work in this case? Usually it determines propEpochStartLsn and waits until WAL is streamed and committed up to that point. Here it's almost the same, but walproposer doesn't control WAL streaming and should somehow wait until flushLsn reaches propEpochStartLsn everywhere. Only then can it commit that LSN and finish the work.

The main inconvenience here is that WAL fetching from peers happens in the background, so it will be harder to know about WAL streaming inside walproposer:

  • Currently, if something goes wrong, we usually see it in the compute logs. After switching to peer recovery, we should implement something like this, to at least distinguish the case when everything is stuck from the case when we are just waiting for a huge lag to be fetched. This will help debugging.
  • How will walproposer wait for flushLsn? Will it just reconnect every second until it's >= propEpochStartLsn, or will the safekeeper send updates to walproposer?

Contributor Author

Not much changes here: sync-safekeepers would wait for commit_lsn to reach propEpochStartLsn in the same way. There is a question of how walproposer should handle the case when it doesn't have WAL for a particular safekeeper. The simplest way seems to be to check for this during each read and send an ERROR to the safekeeper if WAL is missing, logging and closing the connection. Then during recovery the walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs etc, rinsing and repeating each second.

Alternatively we could add a special 'waiting' state to the protocol, but that is harder and not very useful.

So the addition here is a 'lowest available LSN' variable tracked in walproposer; it seems we don't have it currently.
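
For illustration, a minimal sketch of that check, assuming a hypothetical `lowest_available_lsn` field (walproposer itself is C code inside Postgres; the Rust below and all names in it are made up for the sketch, not the actual implementation):

```rust
/// Illustrative stand-in for walproposer state.
struct Lsn(u64);

struct WalProposer {
    /// Lowest LSN for which WAL is still present locally in pg_wal.
    lowest_available_lsn: Lsn,
}

impl WalProposer {
    /// Called when a safekeeper asks us to stream WAL starting at `start_lsn`.
    fn check_wal_available(&self, start_lsn: &Lsn) -> Result<(), String> {
        if start_lsn.0 < self.lowest_available_lsn.0 {
            // No WAL for this safekeeper: report ERROR, log and close the
            // connection; the safekeeper keeps recovering from peers and the
            // walproposer retries on the next connection (~every second).
            return Err(format!(
                "WAL at {:X} is not available, lowest available LSN is {:X}",
                start_lsn.0, self.lowest_available_lsn.0
            ));
        }
        Ok(())
    }
}
```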

Member

> Then during recovery the walproposer -> sk communication would look like: connect, vote, send ProposerElected, log error 'no WAL' with LSNs etc, rinsing and repeating each second.

Then it means that sync-safekeepers will work like this, retrying the connection every second until 2/3 of safekeepers reach propEpochStartLsn. If peer recovery aligns flushLsn on the safekeepers, it will work fine, but otherwise (if the peer recovery condition is not met, for example) sync-safekeepers will take at least a second.

Contributor Author

Yes. But in practice it should almost never happen, since aligning should happen in the background even when the compute is absent.

== donor's term leader. That is, sk A will fetch WAL from sk B if
1) B's (last_log_term, LSN) is higher than A's (last_log_term, LSN) *and*
2) A's term <= B's term -- otherwise append request can't be accepted.
3) B's term == B's last_log_term -- to ensure that such a leader was ever elected in
Member

Mostly a note to myself: last_log_term is self.term_history.up_to(flush_lsn).last(). So B's term == B's last_log_term means that the safekeeper got a ProposerElected message and flush_lsn >= epoch_start_lsn.
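
For an external reader, the three donor conditions quoted above can be summarized as a small predicate; this is just an illustrative sketch (`TermLsn`, `PeerState` and `is_donor` are hypothetical names, not the actual safekeeper structs):

```rust
/// Lexicographic (term, lsn) position; derived Ord compares term first, then lsn.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct TermLsn {
    term: u64,
    lsn: u64,
}

struct PeerState {
    term: u64,
    last_log_term: u64,
    flush_lsn: u64,
}

/// Should safekeeper `a` fetch WAL from peer `b`?
fn is_donor(a: &PeerState, b: &PeerState) -> bool {
    let a_pos = TermLsn { term: a.last_log_term, lsn: a.flush_lsn };
    let b_pos = TermLsn { term: b.last_log_term, lsn: b.flush_lsn };
    // 1) B is ahead in (last_log_term, LSN) order.
    b_pos > a_pos
        // 2) A's term is not higher than B's, otherwise B's append requests
        //    could not be accepted by A.
        && a.term <= b.term
        // 3) B's term == B's last_log_term: B has processed ProposerElected
        //    from an actually elected leader and flushed past epoch_start_lsn.
        && b.term == b.last_log_term
}
```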


### Start/stop criterion

Recovery shouldn't prevent actively streaming compute -- we don't skip records,
Member

What does "actively streaming compute" mean exactly? Is it a compute connected to at least one safekeeper, a compute connected to the current safekeeper, or an elected compute that pushes WAL to the current safekeeper?

Contributor Author

It is the elected compute that pushes WAL to the current safekeeper, i.e. option 3).
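
A rough sketch of how that could be checked, assuming a hypothetical `last_append` timestamp and an arbitrary 5 second timeout (none of this is the actual safekeeper code):

```rust
use std::time::{Duration, Instant};

struct TimelineState {
    /// Our current term (bumped when we accept a newly elected walproposer).
    term: u64,
    /// Term of the walproposer whose AppendRequest we last accepted, and when.
    last_append: Option<(u64, Instant)>,
}

/// True if an elected compute (a walproposer in our current term) has pushed
/// WAL to this safekeeper recently; peer recovery should pause in that case.
fn compute_is_streaming(tl: &TimelineState, now: Instant) -> bool {
    match tl.last_append {
        Some((term, at)) => term == tl.term && now.duration_since(at) < Duration::from_secs(5),
        None => false,
    }
}
```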

respecting consensus rules. Namely,
- On start, walproposer won't download WAL at all -- it will have it only since
writing position. As WAL grows it should also keep some fixed number of
latest segments (~20) to provide gradual switch from peer recovery to walproposer
Member

How will this switch work? It's not clear what the walproposer<->safekeeper communication will look like to make the switch.

Contributor Author

See above, hopefully another comment clarifies this.
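
For illustration, the retention rule from the RFC text above ("keep some fixed number of latest segments") boils down to something like the following; the 16 MiB segment size and the exact count are assumptions, not final numbers:

```rust
// Hypothetical constants; the real values are a tuning decision.
const KEEP_SEGMENTS: u64 = 20;
const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// WAL below this LSN may be recycled from the compute's pg_wal: keeping the
/// last ~KEEP_SEGMENTS segments behind the write position lets a safekeeper
/// that finishes peer recovery switch to walproposer streaming without a gap.
fn wal_keep_horizon(write_lsn: u64) -> u64 {
    write_lsn.saturating_sub(KEEP_SEGMENTS * WAL_SEGMENT_SIZE)
}
```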

arssher added a commit that referenced this pull request Aug 11, 2023
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
#4875
arssher added a commit that referenced this pull request Aug 28, 2023
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep the global map lock sync we just acquire it anew
for each timeline.

Recovery task itself is just a stub here.

part of
#4875
arssher added a commit that referenced this pull request Aug 28, 2023
Add derive Ord for easy comparison of <term, lsn> pairs.

part of #4875
arssher added a commit that referenced this pull request Sep 5, 2023
Implements fetching of WAL by a safekeeper from another safekeeper by imitating
the behaviour of the last elected leader. This allows avoiding WAL accumulation
on the compute and facilitates faster compute startup, as it doesn't need to
download any WAL. Actually removing the WAL download in walproposer is a matter
of another patch, though.

There is a per-timeline task which always runs, checking regularly whether it
should start recovery from someone, meaning there is something to fetch and
there is no streaming compute. It then proceeds with fetching, finishing when
there is nothing more to receive.

Implements #4875
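
Schematically, the per-timeline task described here could look like the following outline; `Timeline`, `Donor`, `recovery_needed_from` and `recover_from` are stand-ins, not the actual code from the patch:

```rust
use std::time::Duration;

struct Donor {
    id: u64,
}
struct Timeline;

impl Timeline {
    /// Returns a donor if there is WAL to fetch and no elected compute is
    /// currently streaming to us (see the donor conditions above).
    async fn recovery_needed_from(&self) -> Option<Donor> {
        None
    }
    /// Stream WAL from the donor, imitating the last elected leader, until
    /// there is nothing more to receive.
    async fn recover_from(&self, _donor: &Donor) -> Result<(), String> {
        Ok(())
    }
}

/// Background task that always runs for the timeline.
async fn recovery_main_loop(tli: Timeline) {
    loop {
        if let Some(donor) = tli.recovery_needed_from().await {
            if let Err(e) = tli.recover_from(&donor).await {
                eprintln!("recovery from safekeeper {} failed: {e}", donor.id);
            }
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```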