core/rawdb: fsync the index file after each freezer write #28483

rjl493456442 · 2023-11-08T04:03:17Z

This pull request tries to fix an issue in freezer, raised by #28105

Let's briefly recap the issue: If a power failure occurs during the execution in path model, Geth may not be able to recover in the subsequent restart.

After investigating a bit further, I discovered that the underlying state freezer is corrupted. Specifically, the freezer has experienced corruption in its index files, which now contain indexes with the values {file_num = 0; offset = 0} at the end.

Upon accessing the freezer table, it will initiate a repair process by truncating the excess bytes in the data file. However, due to the "zero" index at the end, the freezer table truncates the entire data file while retaining the complete index file.

Therefore, freezer table enters a very weird status that items indexes are present but their corresponding data is missing.

The root cause is why index file will contain zero indexes at the end after the power failure? It turns out it's relevant with OS file management.

According to Linux [manual][(https://man7.org/linux/man-pages/man2/write.2.html), it said that the file size is expanded when we try to write some items at the end of the file. In another word, the file offset will be moved even without flushing the associated data from OS cache from underlying disk.

If the file was [open(2)](https://man7.org/linux/man-pages/man2/open.2.html)ed with O_APPEND, the
file offset is first set to the end of the file before writing.
The adjustment of the file offset and the write operation are
performed as an atomic step.

This feature is used to ensure concurrent file writers won't overwrite with each other by atomically increasing the file offset.

Among the APIs subsequently listed are write() and [writev(2)](https://man7.org/linux/man-pages/man2/writev.2.html).
And among the effects that should be atomic across threads (and
processes) are updates of the file offset.  However, before Linux
3.14, this was not the case: if two processes that share an open
file description (see [open(2)](https://man7.org/linux/man-pages/man2/open.2.html)) perform a write() (or [writev(2)](https://man7.org/linux/man-pages/man2/writev.2.html))
at the same time, then the I/O operations were not atomic with
respect to updating the file offset, with the result that the
blocks of data output by the two processes might (incorrectly)
overlap.  This problem was fixed in Linux 3.14.

Also from the documentation of sqlite, it also points out this issue at 6.2

When data is appended to the end of the rollback journal, SQLite normally makes the pessimistic assumption
that the file is first extended with invalid "garbage" data and that afterwards the correct data replaces the garbage.

In other words, SQLite assumes that the file size is increased first and then afterwards the content is written 
into the file. If a power failure occurs after the file size has been increased but before the file content has been
written, the rollback journal can be left containing garbage data. If after power is restored, another SQLite process 
sees the rollback journal containing the garbage data and tries to roll it back into the original database file, 
it might copy some of the garbage into the database file and thus corrupt the database file.

We can conclude that the issue is relevant with the async file write.

How to fix the issue in order to survive the power failure?

Detect the garbage index data and truncate them in the repair stage

It's a straightforward idea that we can blindly truncate the index items with {0,0} values at the end of the file. However, due to the fact that freezer might store 0-size items, {0, 0} might be a valid index if all previous items + itself are all zero size.

So the opening question is left here how can we avoid this special scenario and correctly detect the garbage data?

Sync the file after each freezer write

We can minimize the possibility of having "garbage" zero value index items by SYNC index file after each write.
Once the index file is sync'd, it will ensure the stored index item is complete. But what if the power failure occurs
between the file.Write() and file.Sync()?

Apparently, sqlite uses another marker to track the item number in the index file, and use a two step SYNC approach
to address it. Essentially, it flushes the index file as the first step and then update this marker as the second step.
In this approach, the marker will be the indicator how many index items we expect and truncate all other above.

This approach is a bit over-complicated and might degrade the performance.

rjl493456442 · 2023-11-09T08:53:38Z

The additional file sync operation indeed takes more time, roughly 0.5ms, kind of acceptable.

Wrt overall performance, this pull request is even faster than master branch(no idea why it's faster).

holiman · 2023-11-09T09:45:24Z

triage discussion: let's benchmark fsyncing the head-file too, since otherwise we might up with corrupted data (all zeroes) in the actual data content, even if the index is correct.

rjl493456442 · 2023-11-10T01:59:14Z

Run the benchmark again with sync both data file and index file

Snap sync

This pull request is ~20 minutes faster than master. There are many factors can affect the snap sync performance, but at least we can have this conclusion that this pull request won't degrade performance significantly.

Full Sync

I swapped the machine, to deploy pr on machine 8 and master on machine 7 this time, the switch time is 19:06. We can see that 08 starts to catch up with better performance.

Obviously, the EVM execution is faster with this pull request.

Although it's also true that pr will spend +1ms time to build the state history and it's expected. The overhead is totally acceptable.

All in all, we can summarize that this pull request won't affect the performance and safe to merge it.

holiman

LGTM 💯

…8483) * core/rawdb: fsync the index and data file after each freezer write * core/rawdb: fsync the data file in freezer after write

…hereum#28483)" This reverts commit aee1721.

…8483) * core/rawdb: fsync the index and data file after each freezer write * core/rawdb: fsync the data file in freezer after write

rjl493456442 requested review from karalabe and holiman as code owners November 8, 2023 04:03

core/rawdb: fsync the index and data file after each freezer write

7072af7

rjl493456442 force-pushed the freezer-poweroff-fix branch from 8b19fe9 to 7072af7 Compare November 8, 2023 05:49

rjl493456442 changed the title ~~core/rawdb: fsync the index and data file after each freezer write~~ core/rawdb: fsync the index file after each freezer write Nov 9, 2023

core/rawdb: fsync the data file in freezer after write

95ffac5

holiman approved these changes Nov 10, 2023

View reviewed changes

holiman added this to the 1.13.5 milestone Nov 10, 2023

karalabe merged commit 326fa00 into ethereum:master Nov 10, 2023
1 of 2 checks passed

rjl493456442 mentioned this pull request Nov 10, 2023

CRIT Failed to truncate extra state histories err=EOF #28105

Closed

devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023

Revert "core/rawdb: fsync the index file after each freezer write (et…

fc3efae

…hereum#28483)" This reverts commit aee1721.

devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023

Revert "core/rawdb: fsync the index file after each freezer write (et…

b006c0b

…hereum#28483)" This reverts commit aee1721.

roberto-bayardo mentioned this pull request Nov 12, 2023

Support for reducing disk space requirements base-org/node#146

Merged

BrewTestBot mentioned this pull request Nov 14, 2023

ethereum 1.13.5 Homebrew/homebrew-core#154261

Merged

Francesco4203 mentioned this pull request Oct 29, 2024

trie/triedb/pathdb, core/rawdb: pbss fix release v1.13.5 (corner-cases in path scheme state management) axieinfinity/ronin#619

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core/rawdb: fsync the index file after each freezer write #28483

core/rawdb: fsync the index file after each freezer write #28483

rjl493456442 commented Nov 8, 2023 •

edited

Loading

rjl493456442 commented Nov 9, 2023

holiman commented Nov 9, 2023

rjl493456442 commented Nov 10, 2023

holiman left a comment

core/rawdb: fsync the index file after each freezer write #28483

core/rawdb: fsync the index file after each freezer write #28483

Conversation

rjl493456442 commented Nov 8, 2023 • edited Loading

rjl493456442 commented Nov 9, 2023

holiman commented Nov 9, 2023

rjl493456442 commented Nov 10, 2023

holiman left a comment

Choose a reason for hiding this comment

rjl493456442 commented Nov 8, 2023 •

edited

Loading