-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tapdb: use the serializable isolation by default for postgres #334
Conversation
itest failure show we do indeed have an issue here:
|
Alternative path for the last commit: https://github.com/powerman/pqx/blob/a6a4bb664f620589441e7b28d20464134af550c2/serialize.go#L16 |
Fixed with the latest commit: we now retry (forever). We should log, and also consider an upper limit for retry as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with #336, and confirmed these change fix the proof upload issues we observed before.
Need to apply the fixup commits but otherwise LGTM, nice catch on the default postgres modes 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This should indeed make things safer.
Though with the current implementation I think we might get into weird situations with the combinations of goto
and defer()
(see inline comments).
tapdb/interfaces.go
Outdated
// Roll back the transaction, then pop back up to try | ||
// once again. | ||
// | ||
// TODO(roasbeef): needs retry limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fix this TODO to avoid endless re-tries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think so, open question: what's a good limit for retries? My strategy was going to be deploy the modified version in staging/testnet, then check the logs to see how often we retry. We have @jharveyb's mega mint script, so that can be used to calibrate.
tapdb/interfaces.go
Outdated
@@ -100,6 +101,7 @@ func NewTransactionExecutor[Querier any](db BatchedQuerier, | |||
func (t *TransactionExecutor[Q]) ExecTx(ctx context.Context, | |||
txOptions TxOptions, txBody func(Q) error) error { | |||
|
|||
txStart: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause a new defer
with a rollback to be caused for each retry. And IIUC then Rollback()
will return an error if it is called twice. So I think we need to roll this out into an actual loop, otherwise we might get weird side effects or errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to rollback after each attempt, will do some doc/code digging.
beb672a
to
782c21b
Compare
Tacked on a proper retry loop PTAL. Also has logging now so we can see how many times things need to be retried in practice. |
0b70655
to
28bad1f
Compare
In this commit, we start to use the serializable snapshot isolation by default for postgres. For sqlite, this is a noop as only a single writer is allowed at any time. With this mode, we should guard against all types of anomalies, at the cost of extra locking and potential perf hits. One thing we'll need to test out is if the DB driver will automatically do retries or not if a transaction cannot be serialized. This is an attempt at fixing lightninglabs#333 at a global level.
In this commit, we introduce a new postgres specific error returned when a transaction cannot be serialized. We then catch this error in order to implement retry logic for a transaction that failed to be serialized. We'll retry up to 10 times, waiting a random amount up to 50 ms between each retry.
28bad1f
to
5f11328
Compare
Re calling rollback multiple times, looks like any second call will be a noop here due to the CAS: https://cs.opensource.google/go/go/+/refs/tags/go1.20.4:src/database/sql/sql.go;l=2279-2284 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, LGTM 🎉
Ah, yes. I overlooked that we just ignore the |
In this commit, we start to use the serializable snapshot isolation by default for postgres. For sqlite, this is a noop as only a single writer is allowed at any time. With this mode, we should guard against all types of anomalies, at the cost of extra locking and potential perf hits.
One thing we'll need to test out is if the DB driver will automatically do retries or not if a transaction cannot be serialized.
This is an attempt at fixing #333 at a global level.