Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read only compute pinned to an lsn requests page at a future lsn #3532

Closed
Tracked by #882
LizardWizzard opened this issue Feb 3, 2023 · 4 comments
Closed
Tracked by #882
Assignees
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@LizardWizzard
Copy link
Contributor

Steps to reproduce

Run pgdump on pinned node while primary is active (or not all the time active, but is active rather frequently)

Expected result

No errors

Actual result

page server returned error: Timed out while waiting for WAL record at LSN F/464B4A48 to arrive,

Environment

prod

Details

While migrating projects into different aws account I've found an issue thats quite puzzling. I spin up compute node in docker locally to check that it works. Compare dump output, etc.
I follow this instruction from @kelvich with slightly modified script:

#!/bin/sh
set -e

cd /var/db/postgres
rm -rf bb_export
mkdir bb_export
chmod 0700 bb_export
cd bb_export

psql -h$PSHOST -p$PSPORT -c "basebackup $TNT $TLI" > backup.tar
tar xvf backup.tar

LSN=`pg_controldata . | grep 'REDO location' | awk '{print $5}'`

echo "Basebackup LSN: $LSN"
echo "zenith.signal: `cat zenith.signal`"

touch standby.signal

cat << EOF > postgresql.conf
neon.pageserver_connstring = 'host=$PSHOST port=$PSPORT'
neon.tenant_id = '$TNT' neon.timeline_id = '$TLI'
shared_preload_libraries = 'neon'
listen_addresses = '127.0.0.1'
port = 5555
recovery_target_lsn = '$LSN'
EOF

cd ..
mkdir -p $BATCH
pg_ctl -D bb_export -l logfile start
pg_dumpall -h127.0.0.1 -p5555 -U cloud_admin > $BATCH/dump-$TNT-$TLI-$PSHOST-$SUFFIX
pg_ctl -D bb_export -l logfile stop
rm -rf bb_export

The problem is that pg_dump fails with:

pg_dump: error: Dumping the contents of table "x" failed: PQgetResult() failed.
pg_dump: detail: Error message from server: ERROR: could not read block 62753 in rel 1663/32768/41055.0 from page server at lsn F/464B4A48
DETAIL: page server returned error: Timed out while waiting for WAL record at LSN F/464B4A48 to arrive, last\_record\_lsn F/4613EF28 disk consistent LSN=F/4610FF58
pg_dump: detail: Command was: COPY ... stdout;
pg\_dumpall: error: pg\_dump failed on database "y", exiting

Prior to the error script outputs this:

Basebackup LSN: F/4610D030
zenith.signal: PREV LSN: F/4610D008

The error looks strange to me, because basebackup lsn (and thus recovery_target_lsn) in this case are both equal to F/4610D030. And F/464B4A48 - the lsn at which the page was requested is greater. In my understanding in order to reconstruct the state at some lsn it shouldnt require future changes.
Additional info, this setup worked well for many projects before this one, the error happened for a project that is bigger in size and is quite active.

Full Postgres log:

2023-01-31 19:23:25.820 GMT [24] LOG:  starting PostgreSQL 15.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-01-31 19:23:25.820 GMT [24] LOG:  listening on IPv4 address "127.0.0.1", port 5555
2023-01-31 19:23:25.825 GMT [24] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5555"
2023-01-31 19:23:25.830 GMT [27] LOG:  [ZENITH] found 'zenith.signal' file. setting prev LSN to F/4610D008
2023-01-31 19:23:25.830 GMT [27] LOG:  database system was shut down at 2022-11-25 01:24:05 GMT
2023-01-31 19:23:25.830 GMT [27] LOG:  starting with zenith basebackup at LSN F/4610D030, prev F/4610D008
2023-01-31 19:23:25.830 GMT [27] LOG:  Continue writing WAL at F/4610D030
2023-01-31 19:23:25.838 GMT [24] LOG:  database system is ready to accept connections
2023-01-31 19:23:25.843 GMT [30] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:23:25.843 GMT [29] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:23:25.922 GMT [34] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:23:26.004 GMT [37] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:23:26.250 GMT [40] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:23:26.493 GMT [43] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:24:25.909 GMT [44] LOG:  [NEON_SMGR] libpagestore: connected to 'host=zenith-1-ps-4.local port=6400'
2023-01-31 19:24:36.369 GMT [43] ERROR:  could not read block 62753 in rel 1663/32768/41055.0 from page server at lsn F/464B4A48
2023-01-31 19:24:36.369 GMT [43] DETAIL:  page server returned error: Timed out while waiting for WAL record at LSN F/464B4A48 to arrive, last_record_lsn F/4613EF28 disk consistent LSN=F/4610FF58
2023-01-31 19:24:36.369 GMT [43] STATEMENT:  COPY ... TO stdout;

The correct log during startup should be:

2023-01-26 15:01:25.514 GMT [24] LOG:  entering standby mode
2023-01-26 15:01:25.514 GMT [24] LOG:  starting with zenith basebackup at LSN 0/172D680, prev 0/172D658
2023-01-26 15:01:25.515 GMT [24] LOG:  database system was not properly shut down; automatic recovery in progress
2023-01-26 15:01:25.519 GMT [24] LOG:  consistent recovery state reached at 0/172D680

But for some reason postgres doesnt go into standby mode despite having standby.signal. See https://github.com/neondatabase/postgres/blob/REL_15_STABLE_neon/src/backend/access/transam/xlogrecovery.c#L524

This reproduces only for certain projects. Bigger and active ones. I've managed to run this sequence earlier and later for other projects without errors. On project I faced this error it reproduced every time

This might be relevant for read replica efforts cc @lubennikovaav

Originally posted in slack here

@LizardWizzard LizardWizzard added t/bug Issue Type: Bug c/storage Component: storage c/compute Component: compute, excluding postgres itself labels Feb 3, 2023
vadim2404 added a commit that referenced this issue Feb 7, 2023
Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.7.0 to
3.7.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/aio-libs/aiohttp/releases">aiohttp's
releases</a>.</em></p>
<blockquote>
<h2>aiohttp 3.7.3 release</h2>
<h2>Features</h2>
<ul>
<li>Use Brotli instead of brotlipy
<code>[#3803](aio-libs/aiohttp#3803)
&lt;https://github.com/aio-libs/aiohttp/issues/3803&gt;</code>_</li>
<li>Made exceptions pickleable. Also changed the repr of some
exceptions.
<code>[#4077](aio-libs/aiohttp#4077)
&lt;https://github.com/aio-libs/aiohttp/issues/4077&gt;</code>_</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>Raise a ClientResponseError instead of an AssertionError for a blank
HTTP Reason Phrase.
<code>[#3532](aio-libs/aiohttp#3532)
&lt;https://github.com/aio-libs/aiohttp/issues/3532&gt;</code>_</li>
<li>Fix <code>web_middlewares.normalize_path_middleware</code> behavior
for patch without slash.
<code>[#3669](aio-libs/aiohttp#3669)
&lt;https://github.com/aio-libs/aiohttp/issues/3669&gt;</code>_</li>
<li>Fix overshadowing of overlapped sub-applications prefixes.
<code>[#3701](aio-libs/aiohttp#3701)
&lt;https://github.com/aio-libs/aiohttp/issues/3701&gt;</code>_</li>
<li>Make <code>BaseConnector.close()</code> a coroutine and wait until
the client closes all connections. Drop deprecated &quot;with
Connector():&quot; syntax.
<code>[#3736](aio-libs/aiohttp#3736)
&lt;https://github.com/aio-libs/aiohttp/issues/3736&gt;</code>_</li>
<li>Reset the <code>sock_read</code> timeout each time data is received
for a <code>aiohttp.client</code> response.
<code>[#3808](aio-libs/aiohttp#3808)
&lt;https://github.com/aio-libs/aiohttp/issues/3808&gt;</code>_</li>
<li>Fixed type annotation for add_view method of UrlDispatcher to accept
any subclass of View
<code>[#3880](aio-libs/aiohttp#3880)
&lt;https://github.com/aio-libs/aiohttp/issues/3880&gt;</code>_</li>
<li>Fixed querying the address families from DNS that the current host
supports.
<code>[#5156](aio-libs/aiohttp#5156)
&lt;https://github.com/aio-libs/aiohttp/issues/5156&gt;</code>_</li>
<li>Change return type of MultipartReader.<strong>aiter</strong>() and
BodyPartReader.<strong>aiter</strong>() to AsyncIterator.
<code>[#5163](aio-libs/aiohttp#5163)
&lt;https://github.com/aio-libs/aiohttp/issues/5163&gt;</code>_</li>
<li>Provide x86 Windows wheels.
<code>[#5230](aio-libs/aiohttp#5230)
&lt;https://github.com/aio-libs/aiohttp/issues/5230&gt;</code>_</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Add documentation for <code>aiohttp.web.FileResponse</code>.
<code>[#3958](aio-libs/aiohttp#3958)
&lt;https://github.com/aio-libs/aiohttp/issues/3958&gt;</code>_</li>
<li>Removed deprecation warning in tracing example docs
<code>[#3964](aio-libs/aiohttp#3964)
&lt;https://github.com/aio-libs/aiohttp/issues/3964&gt;</code>_</li>
<li>Fixed wrong &quot;Usage&quot; docstring of
<code>aiohttp.client.request</code>.
<code>[#4603](aio-libs/aiohttp#4603)
&lt;https://github.com/aio-libs/aiohttp/issues/4603&gt;</code>_</li>
<li>Add aiohttp-pydantic to third party libraries
<code>[#5228](aio-libs/aiohttp#5228)
&lt;https://github.com/aio-libs/aiohttp/issues/5228&gt;</code>_</li>
</ul>
<h2>Misc</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst">aiohttp's
changelog</a>.</em></p>
<blockquote>
<h1>3.7.4 (2021-02-25)</h1>
<h2>Bugfixes</h2>
<ul>
<li>
<p><strong>(SECURITY BUG)</strong> Started preventing open redirects in
the
<code>aiohttp.web.normalize_path_middleware</code> middleware. For
more details, see
<a
href="https://github.com/aio-libs/aiohttp/security/advisories/GHSA-v6wp-4m6f-gcjg">https://github.com/aio-libs/aiohttp/security/advisories/GHSA-v6wp-4m6f-gcjg</a>.</p>
<p>Thanks to <code>Beast Glatisant
&lt;https://github.com/g147&gt;</code>__ for
finding the first instance of this issue and <code>Jelmer Vernooij
&lt;https://jelmer.uk/&gt;</code>__ for reporting and tracking it down
in aiohttp.
<code>[#5497](aio-libs/aiohttp#5497)
&lt;https://github.com/aio-libs/aiohttp/issues/5497&gt;</code>_</p>
</li>
<li>
<p>Fix interpretation difference of the pure-Python and the Cython-based
HTTP parsers construct a <code>yarl.URL</code> object for HTTP
request-target.</p>
<p>Before this fix, the Python parser would turn the URI's absolute-path
for <code>//some-path</code> into <code>/</code> while the Cython code
preserved it as
<code>//some-path</code>. Now, both do the latter.
<code>[#5498](aio-libs/aiohttp#5498)
&lt;https://github.com/aio-libs/aiohttp/issues/5498&gt;</code>_</p>
</li>
</ul>
<hr />
<h1>3.7.3 (2020-11-18)</h1>
<h2>Features</h2>
<ul>
<li>Use Brotli instead of brotlipy
<code>[#3803](aio-libs/aiohttp#3803)
&lt;https://github.com/aio-libs/aiohttp/issues/3803&gt;</code>_</li>
<li>Made exceptions pickleable. Also changed the repr of some
exceptions.
<code>[#4077](aio-libs/aiohttp#4077)
&lt;https://github.com/aio-libs/aiohttp/issues/4077&gt;</code>_</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>Raise a ClientResponseError instead of an AssertionError for a blank
HTTP Reason Phrase.
<code>[#3532](aio-libs/aiohttp#3532)
&lt;https://github.com/aio-libs/aiohttp/issues/3532&gt;</code>_</li>
<li>Fix <code>web_middlewares.normalize_path_middleware</code> behavior
for patch without slash.
<code>[#3669](aio-libs/aiohttp#3669)
&lt;https://github.com/aio-libs/aiohttp/issues/3669&gt;</code>_</li>
<li>Fix overshadowing of overlapped sub-applications prefixes.
<code>[#3701](aio-libs/aiohttp#3701)
&lt;https://github.com/aio-libs/aiohttp/issues/3701&gt;</code>_</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/0a26acc1de9e1b0244456b7881ec16ba8bb64fc3"><code>0a26acc</code></a>
Bump aiohttp to v3.7.4 for a security release</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/021c416c18392a111225bc7326063dc4a99a5138"><code>021c416</code></a>
Merge branch 'GHSA-v6wp-4m6f-gcjg' into master</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/4ed7c25b537f71c6245bb74d6b20e5867db243ab"><code>4ed7c25</code></a>
Bump chardet from 3.0.4 to 4.0.0 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5333">#5333</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/b61f0fdffc887df24244ba7bdfe8567c580240ff"><code>b61f0fd</code></a>
Fix how pure-Python HTTP parser interprets <code>//</code></li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/5c1efbc32c46820250bd25440bb7ea96cb05abe9"><code>5c1efbc</code></a>
Bump pre-commit from 2.9.2 to 2.9.3 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5322">#5322</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/007507580137efcc0a20391a0792f39b337d9c1a"><code>0075075</code></a>
Bump pygments from 2.7.2 to 2.7.3 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5318">#5318</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/5085173d947e6cc01b6daf1aa48fe7698834c569"><code>5085173</code></a>
Bump multidict from 5.0.2 to 5.1.0 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5308">#5308</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/5d1a75e68d278c641c90021409f4eb5de1810e5e"><code>5d1a75e</code></a>
Bump pre-commit from 2.9.0 to 2.9.2 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5290">#5290</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/6724d0e7a944fd7e3a710dc292d785fa8fe424fd"><code>6724d0e</code></a>
Bump pre-commit from 2.8.2 to 2.9.0 (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5273">#5273</a>)</li>
<li><a
href="https://github.com/aio-libs/aiohttp/commit/c688451ce31b914c71b11d2ac6c326b0c87e6d1f"><code>c688451</code></a>
Removed duplicate timeout parameter in ClientSession reference docs. (<a
href="https://github-redirect.dependabot.com/aio-libs/aiohttp/issues/5262">#5262</a>)
...</li>
<li>Additional commits viewable in <a
href="https://github.com/aio-libs/aiohttp/compare/v3.7.0...v3.7.4">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=aiohttp&package-manager=pip&previous-version=3.7.0&new-version=3.7.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
- `@dependabot use these labels` will set the current labels as the
default for future PRs for this repo and language
- `@dependabot use these reviewers` will set the current reviewers as
the default for future PRs for this repo and language
- `@dependabot use these assignees` will set the current assignees as
the default for future PRs for this repo and language
- `@dependabot use this milestone` will set the current milestone as the
default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/neondatabase/neon/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Vadim Kharitonov <[email protected]>
@shanyp shanyp removed the c/storage Component: storage label Mar 22, 2023
@lubennikovaav
Copy link
Contributor

https://github.com/neondatabase/postgres/blob/REL_15_STABLE_neon/src/backend/access/transam/xlogrecovery.c#L1059

if (standby_signal_file_found)
{
StandbyModeRequested = true;
ArchiveRecoveryRequested = XLogRecPtrIsInvalid(zenithLastRec); /* no need to perform WAL recovery in Neon */
}

where zenithLastRec is PREV LSN from zenith.signal file
https://github.com/neondatabase/postgres/blob/REL_15_STABLE_neon/src/backend/access/transam/xlog.c#L5016

If it is valid, we won't enter standby mode.

PREV LSN can be sent as "none" or "invalid", if prev_record_lsn is unknown in pageserver
https://github.com/neondatabase/neon/blob/main/pageserver/src/basebackup.rs#L422

This explains, why script worked in other cases. Though, I would expect that this is an exception, rather than norm.
Will dig a bit more into this.

@LizardWizzard , can you please confirm that zenith.signal contains PREV LSN "none" or PREV LSN "invalid" for working cases?

@LizardWizzard
Copy link
Contributor Author

See the attached thread in slack:

Prior to the error script outputs this:
Basebackup LSN: F/4610D030
zenith.signal: PREV LSN: F/4610D008

@lubennikovaav
Copy link
Contributor

See the attached thread in slack:

Prior to the error script outputs this:
Basebackup LSN: F/4610D030
zenith.signal: PREV LSN: F/4610D008

Yep, I've read the thread. These are values for error case, right?

Is it the same for projects that pass this pg_dump check without failure?

@LizardWizzard
Copy link
Contributor Author

These are values for error case, right?

Yes

Is it the same for projects that pass this pg_dump check without failure?

I believe so, but cannot say for sure, because unfortunately I no longer have access to the setup that was used when the error occurred (we've discontinued these hosts).

The error should be reproducible. The script can be tried in any environment (even when compute is running).

@jcsp jcsp closed this as completed Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants