crashes with "database is locked" #4857
Comments
Is the HDD accessed locally or over NFS? If over NFS, that might cause problems, as NFS has classically not followed some Unix conventions for locking on filesystems (though those are being worked on over time and my information may be out of date). Locally there should not be problems; ZFS should work properly as a Unix filesystem locally. Are you running ZFS on a Linux box with ZFS as a module? Ubuntu or non-Ubuntu? Are you running some kind of backup process in the background that also backs up the database?
There is no NFS involved. It is ZFS as a kernel module, precisely: zfs-linux-lts-git on Arch Linux; it produced this build: … As backup I run the SQLite Litestream replication just as described (I guess).
Can something create threads you don't intend, such as a smart compiler? I build on my own with …
No, I think there is nothing like that under the hood. However, do you use some particular plugin?
Could you try disabling litestream temporarily? If the problem disappears, you could try modifying this to something larger (say 60 seconds):

Lines 44 to 47 in cd7d87f

Then re-enable litestream. (Also: the comment there is wrong; newer versions of …)
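For context, the knob being discussed is SQLite's busy timeout, set via `sqlite3_busy_timeout()`. A minimal sketch of such a patch against the plain SQLite C API (the helper name and the 60-second value are illustrative, not the actual lightningd source):

```c
#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>

/* How long SQLite should keep retrying when another process (e.g. a
 * litestream replica) holds a lock, before giving up with SQLITE_BUSY. */
#define DB_BUSY_TIMEOUT_MS (60 * 1000)

static sqlite3 *open_wallet_db(const char *path)
{
	sqlite3 *sql;

	if (sqlite3_open_v2(path, &sql,
			    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
			    NULL) != SQLITE_OK) {
		fprintf(stderr, "could not open %s\n", path);
		exit(1);
	}

	/* With the default busy timeout of 0, any lock held by another
	 * writer makes statements fail immediately with SQLITE_BUSY
	 * ("database is locked"). */
	sqlite3_busy_timeout(sql, DB_BUSY_TIMEOUT_MS);
	return sql;
}
```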
@ZmnSCPxj: thanks, I'll go straight with patching the timeout - sounds safer to me than running without litestream. It can take some days until I have a result ...
An epiphany occurred and I understand why the patch always fails in my actual build procedure: the lines you quoted do not exist in the v0.10.1 tag ... Unless there is a default, it runs with no timeout at all on my machine. Is it safe to run the master branch in production?
IMO yes, I run it on my server: https://bruce.bublina.eu.org
Yes, I'm running … As for the v0.10.2 release, we're really close: just some minor discussions on how to address the dust limit issue for HTLCs, and then we should be good to start with a release candidate.
The default is 0 seconds (;;^^)v Maybe try inserting the new code then? Nevertheless it seems to me a good idea to make this tweakable via an option, just in case some user has even slower media (7200 RPM is pretty much high-end for HDDs).
Running 3 1/2 days now and got one crash; before, it was at least 2 crashes a day. For the one crash there is probably some other job accountable, which put high load on the disks. That was all with the 5-second timeout, before your fix. From experience I can say that 5 seconds should cover pretty much all HDDs under normal circumstances, so, yes, the 60 seconds you made now are more than sufficient. I hope it does not risk any of …
Oh, we depend on … It is not particularly obvious, but the LN protocol is specifically designed so that there is a specific point at which implementations must ensure that the information they are about to transmit is first written to disk. Thus, from the spec by itself, it is safe for an implementation to suddenly crash: the data is either on disk, or it was never sent to the counterparty at all (and thus you are safe; the counterparty has no data that can be used to hurt you). In any case, do check out the updated …
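A schematic sketch of that ordering, with stub functions standing in for the real database and wire code (nothing below is actual lightningd internals):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stubs standing in for the real durable commit and peer transmission. */
static bool db_commit_new_state(const char *state)
{
	printf("committed: %s\n", state);
	return true;
}

static void peer_send_message(const char *msg)
{
	printf("sent: %s\n", msg);
}

/* The ordering the spec relies on: persist first, transmit second.
 * Crashing after the commit but before the send means the peer simply
 * never receives the message; crashing before the commit means we never
 * sent it either.  Either way, the peer holds nothing that can hurt us. */
static void update_channel_state(const char *state, const char *msg)
{
	if (!db_commit_new_state(state)) {
		fprintf(stderr, "db write failed, aborting\n");
		exit(1);
	}
	peer_send_message(msg);
}

int main(void)
{
	update_channel_state("new commitment", "commitment_signed");
	return 0;
}
```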
Thanks for explaining. With the 60-second timeout there has been no crash so far. I'll look into the backup plugin. The litestream method has another downside: using a content-based incremental backup on its snapshots is very inefficient on the backup volume (backup size vs. data actually changed, that is).
IIRC the … It could be written so that only the most recent set of db entries is kept separately, and when a new set is saved the previous set is written to a replica db, so that the backup is only a small amount larger than the live data, but I do not know if it was written that way.
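A toy sketch of that "fold the previous batch into a replica" idea, with invented names and a stubbed replica write (this is not the backup plugin's actual code):

```c
#include <stdio.h>

#define MAX_STMTS 64
#define STMT_LEN  256

static char pending[MAX_STMTS][STMT_LEN]; /* most recent batch, kept separately */
static size_t n_pending;

/* Stub: apply one SQL statement to the replica database. */
static void replica_apply(const char *stmt)
{
	printf("replica <- %s\n", stmt);
}

/* When a new batch of queries arrives, first fold the previous batch
 * into the replica, then keep only the new batch.  The backup then
 * stays roughly the size of the replica plus one batch, instead of an
 * ever-growing statement log. */
static void on_new_batch(const char *stmts[], size_t n)
{
	for (size_t i = 0; i < n_pending; i++)
		replica_apply(pending[i]);

	n_pending = 0;
	for (size_t i = 0; i < n && i < MAX_STMTS; i++) {
		snprintf(pending[i], STMT_LEN, "%s", stmts[i]);
		n_pending++;
	}
}

int main(void)
{
	const char *batch1[] = { "INSERT INTO channels VALUES (1)" };
	const char *batch2[] = { "UPDATE channels SET state = 2 WHERE id = 1" };

	on_new_batch(batch1, 1);
	on_new_batch(batch2, 1); /* batch1 is now folded into the replica */
	return 0;
}
```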
Thanks, I'll have a look. Ideal for my subsequent backup process would be to just overwrite a file, even more ideally an uncompressed text file such as a SQL dump ... because the backup does a data-block-based diff on its own ... but I read that dumping active sqlite files is no good. BAD news on this issue here: got another crash last night, while it was inserting what I think are bitcoin blocks:
This is 09459a9 with the 60-second timeout; if that first number is seconds, I don't see that this took place ...
Well, that is not good... Supposedly as soon as we get a response from …
That is definitely one way of doing it, and we're also working on a variant that will work this way. The reason it was not the first implementation is that with this continuous replication into a replica DB we cannot encrypt the queries, since the backup server could not apply them. However, you're right: keeping a separate table or file with the in-flight query and applying it once it's been confirmed is definitely something we should do asap.
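A rough sketch of that in-flight-query idea on the replica side, using a hypothetical in_flight table (the table name and flow are assumptions, not the eventual implementation):

```c
#include <sqlite3.h>

/* Remember the query we were just handed, but do not touch the replica
 * yet; after a crash, the stored row tells us what was still in flight. */
static void store_in_flight(sqlite3 *replica, const char *query)
{
	sqlite3_stmt *stmt;

	sqlite3_prepare_v2(replica,
			   "INSERT OR REPLACE INTO in_flight(id, sql) "
			   "VALUES (1, ?)", -1, &stmt, NULL);
	sqlite3_bind_text(stmt, 1, query, -1, SQLITE_TRANSIENT);
	sqlite3_step(stmt);
	sqlite3_finalize(stmt);
}

/* Once the main database confirms the commit, apply the stored query
 * to the replica and clear the journal row. */
static void apply_in_flight(sqlite3 *replica)
{
	sqlite3_stmt *stmt;

	sqlite3_prepare_v2(replica, "SELECT sql FROM in_flight WHERE id = 1",
			   -1, &stmt, NULL);
	if (sqlite3_step(stmt) == SQLITE_ROW)
		sqlite3_exec(replica,
			     (const char *)sqlite3_column_text(stmt, 0),
			     NULL, NULL, NULL);
	sqlite3_finalize(stmt);
	sqlite3_exec(replica, "DELETE FROM in_flight", NULL, NULL, NULL);
}
```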
That crash happened again, same version:

- The first one was right in the middle of the day -> no backup I/O involved here, except for the attached …
- The second one happened a few minutes after restarting -> I assume it was catching up with the missed blocks from …
- The third one was yet another ~2 days later, same as the first -> this time it did not miss that much from …

Should I reopen this issue or is this a new one?
Looks like it would be better to just remove the litestream section from the docs; #4890 puts similar functionality right into lightningd itself.
The sections on SQLite Litestream, sqlite3 .dump and VACUUM INTO commands were to be removed six months after 0.10.3, due to issues observed in #4857. We now recommend using --wallet=sqlite3://${main}:${backup} instead.

Changelog-None
Issue and Steps to Reproduce
version is 0.10.1
but 0.9.3 did the same, maybe more often.
lightningd emergency-exits itself because the database says it is locked during INSERT or UPDATE statements:

Don't know how to reproduce. It happens a few times a day, presumably related to how busy the node is, but that is a guess already - I failed to trigger it.
My Digging
Looking around generally, I find SQLite has a history of issues with locks, but I cannot say whether any of that applies here.
Naturally I suspect a "race condition because of threads" - but I can't say anything there either, not knowing lightningd's threading model...
On the hardware I can say it's rather slow: 2 x 7200 RPM HDDs combined as a zpool mirror. That is high latency compared to an SSD. So, if lightningd is rarely used / never stress-tested on such disks, I guess such a thing can go unnoticed.