crashes with "database is locked" #4857
Comments
Is the HDD accessed locally or over NFS? If over NFS, that might cause problems, as NFS has classically not followed some Unix conventions for locking on filesystems (though those are being worked on over time and my information may be out of date). Locally there should not be problems; ZFS should work properly as a Unix filesystem locally. Are you running ZFS on a Linux box with ZFS as a module? Ubuntu or non-Ubuntu? Are you running some kind of backup process in the background that also backs up the database?
There is no NFS involved. It is ZFS as a kernel module, precisely: zfs-linux-lts-git on Arch Linux; it produced this build: … As backup I run the SQLite Litestream replication just as described (I guess).
Can something create threads you don't intend, such as a smart compiler? I build on my own with …
No, I think there is nothing like that under the hood. However, do you use some particular plugin?
Could you try disabling litestream temporarily? If the problem disappears, you could try modifying this to something larger (say 60 seconds):

Lines 44 to 47 in cd7d87f

Then re-enable litestream. (Also: the comment there is wrong; newer versions of …)
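For context, the knob being discussed is SQLite's busy timeout, set via `sqlite3_busy_timeout()`. A minimal sketch of such a patch against the plain SQLite C API (the helper name and the 60-second value are illustrative, not the actual lightningd source):

```c
#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>

/* How long SQLite should keep retrying when another process (e.g. a
 * litestream replica) holds a lock, before giving up with SQLITE_BUSY. */
#define DB_BUSY_TIMEOUT_MS (60 * 1000)

static sqlite3 *open_wallet_db(const char *path)
{
	sqlite3 *sql;

	if (sqlite3_open_v2(path, &sql,
			    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
			    NULL) != SQLITE_OK) {
		fprintf(stderr, "could not open %s\n", path);
		exit(1);
	}

	/* With the default busy timeout of 0, any lock held by another
	 * writer makes statements fail immediately with SQLITE_BUSY
	 * ("database is locked"). */
	sqlite3_busy_timeout(sql, DB_BUSY_TIMEOUT_MS);
	return sql;
}
```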
@ZmnSCPxj: thanks, I'll go straight with patching the timeout - sounds safer to me than running without litestream. It can take some days until I have a result ...
An epiphany occurred and I understand why the patch always fails in my actual build procedure: the lines you quoted do not exist in the v0.10.1 tag ... Unless there is a default, it runs with no timeout at all on my machine. Is it safe to run the master branch in production?
IMO yes, I run it on my server: https://bruce.bublina.eu.org
Yes, I'm running … As for the v0.10.2 release, we're really close: just some minor discussions on how to address the dust limit issue for HTLCs, and then we should be good to start with a release candidate.
The default is 0 seconds (;;^^)v Maybe try inserting the new code then? Nevertheless it seems to me a good idea to make this tweakable via an option, just in case some user has even slower media (7200 RPM is pretty much high-end for HDDs).
Running 3 1/2 days now and got one crash; before, it was at least 2 crashes a day. For the one crash there is probably some other job accountable, which put high load on the disks. That was all with the 5-second timeout, before your fix. From experience I can say that 5 seconds should cover pretty much all HDDs under normal circumstances, so, yes, the 60 seconds you made now are more than sufficient. I hope it does not risk any of …
Oh, we depend on … It is not particularly obvious, but the LN protocol is specifically designed so that there is a specific point at which implementations must ensure that the information they are about to transmit is first written to disk. Thus, from the spec by itself, it is safe for an implementation to suddenly crash: the data is either on disk, or it was never sent to the counterparty at all (and thus you are safe; the counterparty has no data that can be used to hurt you). In any case, do check out the updated …
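A schematic sketch of that ordering, with stub functions standing in for the real database and wire code (nothing below is actual lightningd internals):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stubs standing in for the real durable commit and peer transmission. */
static bool db_commit_new_state(const char *state)
{
	printf("committed: %s\n", state);
	return true;
}

static void peer_send_message(const char *msg)
{
	printf("sent: %s\n", msg);
}

/* The ordering the spec relies on: persist first, transmit second.
 * Crashing after the commit but before the send means the peer simply
 * never receives the message; crashing before the commit means we never
 * sent it either.  Either way, the peer holds nothing that can hurt us. */
static void update_channel_state(const char *state, const char *msg)
{
	if (!db_commit_new_state(state)) {
		fprintf(stderr, "db write failed, aborting\n");
		exit(1);
	}
	peer_send_message(msg);
}

int main(void)
{
	update_channel_state("new commitment", "commitment_signed");
	return 0;
}
```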
Thanks for explaining. With the 60-second timeout there has been no crash so far. I'll look into the backup plugin. The litestream method has another downside: using a content-based incremental backup on its snapshots is very inefficient on the backup volume (backup size vs. data actually changed, that is).
IIRC the … It could be written so that only the most recent set of db entries is kept separately, and when a new set is saved the previous set is written to a replica db, so that the backup is only a small amount larger than the live data, but I do not know if it was written that way.
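A toy sketch of that "fold the previous batch into a replica" idea, with invented names and a stubbed replica write (this is not the backup plugin's actual code):

```c
#include <stdio.h>

#define MAX_STMTS 64
#define STMT_LEN  256

static char pending[MAX_STMTS][STMT_LEN]; /* most recent batch, kept separately */
static size_t n_pending;

/* Stub: apply one SQL statement to the replica database. */
static void replica_apply(const char *stmt)
{
	printf("replica <- %s\n", stmt);
}

/* When a new batch of queries arrives, first fold the previous batch
 * into the replica, then keep only the new batch.  The backup then
 * stays roughly the size of the replica plus one batch, instead of an
 * ever-growing statement log. */
static void on_new_batch(const char *stmts[], size_t n)
{
	for (size_t i = 0; i < n_pending; i++)
		replica_apply(pending[i]);

	n_pending = 0;
	for (size_t i = 0; i < n && i < MAX_STMTS; i++) {
		snprintf(pending[i], STMT_LEN, "%s", stmts[i]);
		n_pending++;
	}
}

int main(void)
{
	const char *batch1[] = { "INSERT INTO channels VALUES (1)" };
	const char *batch2[] = { "UPDATE channels SET state = 2 WHERE id = 1" };

	on_new_batch(batch1, 1);
	on_new_batch(batch2, 1); /* batch1 is now folded into the replica */
	return 0;
}
```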
Thanks, I'll have a look. Ideal for my subsequent backup process would be to just overwrite a file, even more ideally an uncompressed text file such as a SQL dump ... because the backup does a data-block-based diff on its own ... but I read that dumping active sqlite files is no good. BAD news on this issue here: got another crash last night, while it was inserting what I think are bitcoin blocks:
This is 09459a9 with the 60-second timeout; if that first number is seconds, I don't see that this took place ...
Well, that is not good... Supposedly as soon as we get a response from …
That is definitely one way of doing it, and we're also working on a variant that will work this way. The reason it was not the first implementation is that with this continuous replication into a replica DB we cannot encrypt the queries, since the backup server could not apply them. However, you're right: keeping a separate table or file with the in-flight query and applying it once it's been confirmed is definitely something we should do asap.
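A rough sketch of that in-flight-query idea on the replica side, using a hypothetical in_flight table (the table name and flow are assumptions, not the eventual implementation):

```c
#include <sqlite3.h>

/* Remember the query we were just handed, but do not touch the replica
 * yet; after a crash, the stored row tells us what was still in flight. */
static void store_in_flight(sqlite3 *replica, const char *query)
{
	sqlite3_stmt *stmt;

	sqlite3_prepare_v2(replica,
			   "INSERT OR REPLACE INTO in_flight(id, sql) "
			   "VALUES (1, ?)", -1, &stmt, NULL);
	sqlite3_bind_text(stmt, 1, query, -1, SQLITE_TRANSIENT);
	sqlite3_step(stmt);
	sqlite3_finalize(stmt);
}

/* Once the main database confirms the commit, apply the stored query
 * to the replica and clear the journal row. */
static void apply_in_flight(sqlite3 *replica)
{
	sqlite3_stmt *stmt;

	sqlite3_prepare_v2(replica, "SELECT sql FROM in_flight WHERE id = 1",
			   -1, &stmt, NULL);
	if (sqlite3_step(stmt) == SQLITE_ROW)
		sqlite3_exec(replica,
			     (const char *)sqlite3_column_text(stmt, 0),
			     NULL, NULL, NULL);
	sqlite3_finalize(stmt);
	sqlite3_exec(replica, "DELETE FROM in_flight", NULL, NULL, NULL);
}
```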
That crash happened again, same version:

- The first one was right in the middle of the day -> no backup I/O involved here, except for the attached …
- The second one happened a few minutes after restarting -> I assume it was catching up with the missed blocks from …
- The third one was yet another ~2 days later, same as the first -> this time it did not miss that much from …

Should I reopen this issue or is this a new one?
Looks like it would be better to just remove the litestream section from the docs; #4890 puts similar functionality right into lightningd itself.
The sections on SQLite Litestream, sqlite3 .dump and VACUUM INTO commands were to be removed six months after 0.10.3, due to issues observed in #4857. We now recommend using --wallet=sqlite3://${main}:${backup} instead.

Changelog-None
Issue and Steps to Reproduce
version is 0.10.1
but 0.9.3 did the same, maybe more often.
lightningd emergency-exits itself because the database says it is locked during INSERT or UPDATE statements:

Don't know how to reproduce. It happens a few times a day, presumably related to how busy the node is, but that is a guess already - I failed to trigger it.
My Digging
Looking around generally, I find SQLite has a history of issues with locks, but I cannot say whether any of that applies here.
Naturally I suspect a "race condition because of threads" - but I can't say anything there either, not knowing lightningd's threading model...
On the hardware I can say it's rather slow: 2 x 7200 RPM HDDs combined as a zpool mirror. That is high latency compared to an SSD. So, if lightningd is rarely used / never stress-tested on such disks, I guess such a thing can go unnoticed.