implement unclean-shutdown marker #21893

holiman · 2020-11-23T14:39:38Z

Replaces #21133

Some todos:

Instead of a single marker, use a list of markers and store every unclean shutdown

https://github.com/ethereum/go-ethereum/pull/21133/files#r437253799:

The PR uses a single unsafe-shutdown key to store a single timestamp. That's a good start, but we need a bit more information, otherwise quirky scenarios get lost. Notably, it's important to retain past crashes too, so a user can see if there was a fault recently, not just in the last restart (users tend to restart their nodes a few times before attempting to investigate an issue, so a restart cannot delete the cause).

A simplistic suggestion would be to use a slice of timestamps instead of a single timestamp as the value of the entry. When Geth starts, you can append the current time, and on shutdown, you can pop off the tail. That way on startup you can actually list all the recent crashes, not just the last one.
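The push-on-start / pop-on-shutdown idea can be sketched roughly as follows. This is only an illustration: the `markers` type and the `pushStart` / `popStop` helpers are hypothetical names, and the list is kept in memory here, whereas the PR persists it in the database.

```go
package main

import (
	"fmt"
	"time"
)

// markers is a hypothetical in-memory stand-in for the value stored under
// the unclean-shutdown key. Each entry is the Unix time of a start.
type markers []uint64

// pushStart records a new start. It returns the updated list and a copy of
// the entries that were already present, i.e. starts that were never
// followed by a clean stop.
func pushStart(m markers, now uint64) (updated, unclean markers) {
	unclean = append(markers(nil), m...)
	updated = append(append(markers(nil), m...), now)
	return updated, unclean
}

// popStop removes the most recent entry; call it on a clean shutdown.
func popStop(m markers) markers {
	if len(m) == 0 {
		return m
	}
	return m[:len(m)-1]
}

func main() {
	var db markers
	now := uint64(time.Now().Unix())

	db, unclean := pushStart(db, now) // first start
	fmt.Println("leftover markers:", len(unclean))
	db = popStop(db) // clean shutdown removes the marker again

	db, _ = pushStart(db, now+60)       // second start, then a crash (no popStop)
	db, unclean = pushStart(db, now+120) // third start sees the leftover entry
	fmt.Println("leftover markers:", len(unclean))
}
```

Because a crash skips `popStop`, every leftover entry seen at startup corresponds to one run that did not shut down cleanly.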


holiman · 2020-11-23T18:01:48Z

Example execution

[user@work go-ethereum]$ ./build/bin/geth --goerli --nodiscover --maxpeers 0
...
INFO [11-23|18:59:36.189] Loaded most recent local header          number=3798046 hash="6bbd9e…118ec3" td=5559455 age=1d6h55m
INFO [11-23|18:59:36.189] Loaded most recent local full block      number=0       hash="bf7e33…b88c1a" td=1       age=1y10mo3d
INFO [11-23|18:59:36.189] Loaded most recent local fast block      number=3798046 hash="6bbd9e…118ec3" td=5559455 age=1d6h55m
INFO [11-23|18:59:36.189] Loaded last fast-sync pivot marker       number=3800031
INFO [11-23|18:59:36.193] Loaded local transaction journal         transactions=0 dropped=0
INFO [11-23|18:59:36.193] Regenerated local transaction journal    transactions=0 accounts=0
INFO [11-23|18:59:36.207] Allocated fast sync bloom                size=512.00MiB
WARN [11-23|18:59:36.207] Unclean shutdown(s) detected             latest="1h23m26s ago" count=3
INFO [11-23|18:59:36.207] Starting peer-to-peer node               instance=Geth/v1.9.25-unstable-bf2aedd9-20201123/linux-amd64/go1.15.5
...

And after yet another SIGKILL

WARN [11-23|19:01:20.512] Unclean shutdown(s) detected             latest="16s ago" count=4

karalabe · 2020-11-23T18:33:02Z

I think in general it would be more useful to log each shutdown separately.

E.g.

WARN [..] Unclean shutdown detected    time=timestamp age=1m ago

You can also use PrettyDuration to avoid too big of an age string there, and I'd also use age=... vs latest=... ago because it allows you to get rid of a quotation mark :)
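A rough sketch of the one-line-per-shutdown format suggested above. It uses only the Go standard library as a stand-in: `formatUnclean` is a hypothetical helper, and the plain `time.Duration` string takes the place of PrettyDuration and of geth's logger.

```go
package main

import (
	"fmt"
	"time"
)

// formatUnclean renders one line per recorded marker, with an age relative
// to now. Both stamps and now are Unix seconds. Hypothetical helper, not
// part of the PR.
func formatUnclean(stamps []uint64, now uint64) []string {
	lines := make([]string, 0, len(stamps))
	for _, ts := range stamps {
		var age time.Duration
		if now > ts { // guard against clock skew underflowing the uint64
			age = time.Duration(now-ts) * time.Second
		}
		when := time.Unix(int64(ts), 0).UTC().Format(time.RFC3339)
		lines = append(lines, fmt.Sprintf("Unclean shutdown detected time=%s age=%s", when, age))
	}
	return lines
}

func main() {
	for _, line := range formatUnclean([]uint64{1606168563, 1606168575}, 1606169135) {
		fmt.Println(line)
	}
}
```

Since the age is a bare duration such as `9m32s`, it contains no space and needs no quoting in a key=value log line.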

karalabe · 2020-11-23T18:33:51Z

Otherwise we only know that there were 5 shutdowns, but we don't know if they are recent or old, so it's not particularly useful.

We might also add some form of cleanup so we don't keep too many old events?

holiman · 2020-11-23T19:03:02Z

Yes, if we show many we definitely need to clean them up. So, keep the most recent 5? 10?

karalabe · 2020-11-23T19:04:45Z

10 recent + a counter to how many older ones there were?

holiman · 2020-11-23T22:11:44Z

Ok!
Now it looks like this:

WARN [11-23|23:05:35.803] Old unclean shutdowns found              count=1
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T22:56:03+0100 age=9m32s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T22:56:15+0100 age=9m20s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:00:49+0100 age=4m46s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:12+0100 age=4m23s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:23+0100 age=4m12s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:27+0100 age=4m8s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:31+0100 age=4m4s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:36+0100 age=3m59s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:42+0100 age=3m53s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:47+0100 age=3m48s
INFO [11-23|23:05:35.803] Starting peer-to-peer node               instance=Geth/v1.9.25-unstable-bf2aedd9-20201123/linux-amd64/go1.15.5

holiman · 2020-11-23T22:12:55Z

The very first time it runs, it prints out

WARN [11-23|23:12:10.152] Error reading USM                        error="leveldb: not found"

rjl493456442

nitpicks

core/rawdb/accessors_metadata.go

rjl493456442 · 2020-11-24T05:42:11Z

core/rawdb/accessors_metadata.go

+	var uncleanShutdowns ucmList
+	// Read old data
+	if data, err := db.Get(uncleanShutdownKey); err != nil {
+		log.Warn("Error reading USM", "error", err)


Perhaps: Failed to read the unclean shutdown markers?

core/rawdb/accessors_metadata.go

core/rawdb/schema.go

core/rawdb/accessors_metadata.go

rjl493456442 · 2020-11-24T05:55:03Z

eth/backend.go

+		if discards > 0 {
+			log.Warn("Old unclean shutdowns found", "count", discards)
+		}
+		for _, tstamp := range uncleanShutdowns {


It's inaccurate actually. It's the timestamp of the START, not the CRASH. Usually geth will run for a very long time.

Can we have a background thread to update the shutdown marker, e.g. every 1 minute? Cost-wise it's basically no overhead, but at least we can decrease the bias.

Maybe we could do that in a follow-up PR?
For now I can replace time with booted, to clarify that it's the startup timestamp, not the crash timestamp
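For the follow-up idea, a background updater could look roughly like the sketch below. Everything in it is an assumption for illustration: the `marker` type is an in-memory stand-in for the stored value, and `keepAlive` is a hypothetical name. The PR itself does not include such a loop.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// marker is a hypothetical in-memory stand-in for the persisted marker.
type marker struct {
	mu   sync.Mutex
	last uint64 // Unix seconds of the most recent refresh
}

func (m *marker) refresh(ts uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.last = ts
}

func (m *marker) get() uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.last
}

// keepAlive refreshes the marker on every tick until quit is closed, then
// closes done so the caller can wait for the loop to exit.
func keepAlive(m *marker, interval time.Duration, quit <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case t := <-ticker.C:
			m.refresh(uint64(t.Unix()))
		case <-quit:
			return
		}
	}
}

func main() {
	m := &marker{}
	quit, done := make(chan struct{}), make(chan struct{})
	// A short interval keeps the demo quick; the comment above suggests one minute.
	go keepAlive(m, 10*time.Millisecond, quit, done)
	time.Sleep(50 * time.Millisecond)
	close(quit)
	<-done
	fmt.Println("marker refreshed:", m.get() > 0)
}
```

With a loop like this, the stored timestamp would lag the actual crash by at most one interval instead of by the whole uptime of the node.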

MariusVanDerWijden · 2020-11-30T13:12:05Z

core/rawdb/accessors_metadata.go

+	// Add a new (but cap it)
+	uncleanShutdowns.Recent = append(uncleanShutdowns.Recent, uint64(time.Now().Unix()))
+	if l := len(uncleanShutdowns.Recent); l > crashesToKeep+1 {
+		uncleanShutdowns.Recent = uncleanShutdowns.Recent[l-crashesToKeep-1:]


That's a pretty fancy way to discard the first element, wouldn't Recent[1:] suffice?
(I assume that only one element can be added/discarded, since Discarded++ is not Discarded += len(Recents)-crashesToKeep)

Myeah, I intended for it to handle the case where we discard more than one, e.g. if we decide to change the limit to 5 or there's some mishap.
So the proper thing to do is rather to increase uncleanShutdowns.Discarded appropriately. Good catch!
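A minimal sketch of the capping logic with the discard counter increased by the number of entries actually dropped. The `Recent`, `Discarded` and `crashesToKeep` names come from the diff above; the `crashList` type name, the `push` method and the value 10 for the limit (taken from the "10 recent" suggestion earlier in the thread) are assumptions for illustration.

```go
package main

import "fmt"

// crashesToKeep is assumed to be 10, following the discussion in this thread.
const crashesToKeep = 10

// crashList is a hypothetical type holding the fields seen in the diff.
type crashList struct {
	Discarded uint64   // how many older entries have been dropped
	Recent    []uint64 // Unix timestamps of the most recent starts
}

// push appends a start timestamp and trims the list to crashesToKeep+1
// entries (the extra slot is the current run), counting every dropped
// entry in Discarded.
func (c *crashList) push(ts uint64) {
	c.Recent = append(c.Recent, ts)
	if count := len(c.Recent); count > crashesToKeep+1 {
		numDel := count - (crashesToKeep + 1)
		c.Recent = c.Recent[numDel:]
		c.Discarded += uint64(numDel)
	}
}

func main() {
	// Simulate a list that is already longer than the limit, e.g. after
	// the limit was lowered.
	c := &crashList{}
	for i := uint64(0); i < 15; i++ {
		c.Recent = append(c.Recent, i)
	}
	c.push(15)
	fmt.Println(len(c.Recent), c.Discarded)
}
```

Counting `numDel` rather than incrementing by one keeps `Discarded` correct even when more than one entry is trimmed in a single call.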

MariusVanDerWijden

LGTM

muhammednagy and others added 6 commits November 23, 2020 15:07

cmd added unsafe shutdown detection

b1579ab

cmd added unsafe shutdown detection to ethereum and lightchain

e5f3a0a

core/rawdb, eth, les: move unclean shutdown key to rawdb + fix some c…

2e3bc39

…oncerns

core/rawdb: move usm to meta

a69f54a

core/rawdb, eth, les: use multple shutdown markers

fd300c1

cmd/geth, core: linter nits

bf2aedd

holiman marked this pull request as ready for review November 23, 2020 17:03

holiman requested review from karalabe, rjl493456442 and zsfelfoldi as code owners November 23, 2020 17:03

holiman mentioned this pull request Nov 23, 2020

Fixes #20859 added unsafe shutdown detector #21133

Closed

core, eth, les: fixes to the unclean-shutdown detection and reporting

79a943f

holiman force-pushed the unclean branch from b206d8b to 79a943f Compare November 23, 2020 22:26

rjl493456442 reviewed Nov 24, 2020

core/rawdb, eth, les: fix review concerns

2787328

holiman force-pushed the unclean branch from 7fe99ab to 2787328 Compare November 24, 2020 12:28

holiman mentioned this pull request Nov 27, 2020

eth/tracers: testcase for crashing duktape ref: #21879 #21913

Closed

MariusVanDerWijden reviewed Nov 30, 2020

core/rawdb: unclean shutdown marker, count discarded better

2c567d0

MariusVanDerWijden approved these changes Nov 30, 2020

FridayOrtiz mentioned this pull request Dec 5, 2020

Detect unclean shutdowns #20859

Closed

holiman merged commit 4d48980 into ethereum:master Dec 11, 2020

holiman added this to the 1.10.0 milestone Dec 11, 2020

quorumbot mentioned this pull request Sep 3, 2021

[Upgrade] Go-Ethereum release v1.10.0 Consensys/quorum#1249

Merged

9 tasks

baptiste-b-pegasys mentioned this pull request Sep 3, 2021

[Upgrade] Go-Ethereum release v1.10.0 baptiste-b-pegasys/quorum#16

Closed

9 tasks
