Use Replace-by-Fee if the previous sent checkpoint has not been included for long time #176

gitferry · 2023-06-19T10:11:41Z

This PR fixes #175.

A previous PR #170 enables the RBF feature but is not used. This PR allows the use of RBF if a previous checkpoint has been sent but not included on BTC for a long period of time. Particularly, we save the meta information (i.e., timestamp, BTC txs) of each submitted checkpoint. Whenever a submitted-but-not-included-for-long checkpoint is found, we will resend the checkpoint using the same BTC txs used in the last submission. Note that we will re-estimate the tx fee and change the value part in the second output of each BTC tx.

Please note that this PR does not include end-to-end tests. @KonradStaniec could you please take a look and see how we can simulate some rough scenarios (e.g., BTC txs stuck in the mempool) and add relevant e2e test?

vitsalis · 2023-06-19T12:31:37Z

submitter/relayer/relayer.go

-	tag              btctxformatter.BabylonTag
-	version          btctxformatter.FormatVersion
-	submitterAddress sdk.AccAddress
+	submittedCheckpoints map[uint64]*types.CheckpointInfo


Do we need to keep submission information about all the past checkpoints? I feel this is just using extra memory.

From my understanding, we only need to keep information about the last submission that we sent and the epoch number it corresponds to. Then,

If the last SEALED epoch remains the same as our last submission and the resend interval has not passed: Do nothing

If the last SEALED epoch remains the same as our last submission and the resend interval has passed: Bump the fee of the last submission

If the last SEALED epoch changed, then create a new checkpoint and reset the variables to the new checkpoints and epochs.

Unless there's another utility of having a list of the submitted checkpoints.

Agreed. After a second thought, indeed there's no need to keep a list of the submitted checkpoints as the checkpoint is submitted in the monotonically increasing order. Further, the next checkpoint will not be submitted if the previous one is still Sealed.

vitsalis · 2023-06-19T12:32:36Z

submitter/relayer/relayer.go

-		log.Logger.Debugf("Skip submitting the raw checkpoint for epoch %v", ckpt.Ckpt.EpochNum)
+
+	epoch := ckpt.Ckpt.EpochNum
+	ckptInfo, hasSubmitted := rl.submittedCheckpoints[epoch]


Here, we only care about the last epoch, so I guess we can remove information about any epoch' < epoch etc. Overall, I suggest we do not keep a mapping, but only the last submitted checkpoint.

The logic can be converted to:

Get the last checkpoint submitted.

If the epoch of the last checkpoint submitted does not match the latest sealed checkpoint of Babylon, we have a new checkpoint to submit and we submit it. Set the last checkpoint submitted to the new one.

If the epoch matches, check if the resend interval has passed. If yes, bump the fee of the checkpoint and set the last checkpoint submitted to the new one.

vitsalis · 2023-06-19T12:36:36Z

types/ckpt_info.go

+	Tx            *wire.MsgTx
+	ChangeAddress btcutil.Address
+	Size          uint64 // the size of the BTC tx
+	UtxoAmount    uint64 // the amount of the UTXO used in the BTC tx


Can't those values be calculated by the transaction itself?

For calculating the Size, we need the UTXO and the priv key etc., see here. So I thought it would be easier to just save size and utxoAmount in the BtcTxInfo

vitsalis · 2023-06-19T12:38:45Z

submitter/relayer/relayer.go

+	tag                  btctxformatter.BabylonTag
+	version              btctxformatter.FormatVersion
+	submitterAddress     sdk.AccAddress
+	resendIntervals      uint


Suggested change

resendIntervals uint

resendIntervalSeconds uint

Another way (that might be a bit cleaner) is

Suggested change

resendIntervals uint

resendIntervals time.Duration

I agree that changing the interval to time.Duration is cleaner and more flexible, but it increases the chance of having invalid configurations as it requires the time unit specified, for example, 100s vs. 100.

I don't think that's a big issue, we already do that in other places as well (e.g. the API).

vitsalis · 2023-06-19T12:38:58Z

submitter/relayer/relayer.go

@@ -37,11 +39,12 @@ func New(
 	resendIntervals uint,


Suggested change

resendIntervals uint,

resendIntervalSeconds uint,

gitferry · 2023-06-20T07:33:51Z

Thanks, @vitsalis and @SebastianElvis for your comments. I have replaced the submitted checkpoint list with the last submitted checkpoint as we agree that there's no need to keep all the historical checkpoints.

vitsalis

Other than some minor stuff, looks good! How did you test this? Once this is merged is it ready for deployment?

submitter/relayer/relayer.go

vitsalis · 2023-06-20T08:07:01Z

submitter/relayer/relayer.go

+		log.Logger.Debugf("The checkpoint for epoch %v was sent more than %v seconds ago but not included on BTC, resending the checkpoint",
+			ckptEpoch, rl.resendIntervalSeconds)
+
+		err := rl.resendCheckpointToBTC(rl.lastSubmittedCheckpoint)


I suggest that this method returns the checkpoint and we set rl.lastSubmittedCheckpoint here.

vitsalis · 2023-06-20T08:07:20Z

submitter/relayer/relayer.go

+	if rl.lastSubmittedCheckpoint == nil || rl.lastSubmittedCheckpoint.Epoch < ckptEpoch {
+		log.Logger.Debugf("Submitting a raw checkpoint for epoch %v for the first time", ckptEpoch)
+
+		err := rl.convertCkptToTwoTxAndSubmit(ckpt)


I suggest this method returns the checkpoint and we set rl.lastSubmittedCheckpoint here.

gitferry · 2023-06-20T10:36:05Z

Thanks, @vitsalis for your suggestions! I tested this PR in local deployment using babylon-deployment. It ran well in the happy path, but the resending is not triggered. I'm not sure adding unit tests on this would be meaningful as other components are all mocked. So I think maybe we can add some e2e tests where BTC txs are stuck in the mempool for a long time. Wdyt @KonradStaniec?

Anyway, I think we can deploy it in the devnet first and see how it goes. Is it what the devnet is for? Can we create scenarios where the BTC mempool is busy?

KonradStaniec · 2023-06-20T10:40:05Z

submitter/relayer/relayer.go

+	// resend tx1 of the checkpoint
+	tx1Fee := rl.GetTxFee(ckptInfo.Tx1.Size)
+	tx1 := ckptInfo.Tx1
+	tx1.Tx.TxOut[1].Value = int64(ckptInfo.Tx1.UtxoAmount - tx1Fee)


how does this work ? Doesn't changing output value require signing transaction once again before sending?

You are right, this is a nice catch. We definitely need to sign the tx again

We also need to check if the new fee is larger than the previous fee. If it is not, then the transaction won't be accepted as RBF only works if we set a higher fee.

KonradStaniec · 2023-06-20T10:44:48Z

Thanks, @vitsalis for your suggestions! I tested this PR in local deployment using babylon-deployment. It ran well in the happy path, but the resending is not triggered. I'm not sure adding unit tests on this would be meaningful as other components are all mocked. So I think maybe we can add some e2e tests where BTC txs are stuck in the mempool for a long time. Wdyt @KonradStaniec?

Probably the easiest way to test (at least partially) it would be with e2e_test.go. In those tests mining of the block is triggered manualy with mineBlockWithTxes so most obvious test would be:

set up resendIntervalSeconds to some low value like 5s
start test
wait for submitter initial submission to be included in mempool
do not mine this transaction and wait for re-submission
check that old transaction is evicted from mempool and the new one has taken its place

KonradStaniec · 2023-06-20T10:55:08Z

did a quick experiment with this and got logs in form of:

time="2023-06-20T12:53:34+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: already have transaction 4cfff6e42a15a7f7be10f1176749c15a3fc552f1c60ef9810e2dbb66315f5b47" module=submitter

So maybe there is some node config option which needs to be set up ?

KonradStaniec · 2023-06-20T11:04:14Z

submitter/relayer/relayer.go

+	// if the resend interval has passed, should resend
+	durSeconds := uint(time.Since(rl.lastSubmittedCheckpoint.Ts).Seconds())
+	if durSeconds >= rl.resendIntervalSeconds {
+		log.Logger.Debugf("The checkpoint for epoch %v was sent more than %v seconds ago but not included on BTC, resending the checkpoint",


How do we known that at this point the checkpoint was really not included in btc ? Is this implicit, due too the fact that babylon still perceive this checkpoint as SEALED ?

If yes, I would add a comment explaining that and also mentioning that it is not perfect way, but it is also possible that submission succeeded and was yet not reported to Babylon.

The only foolproof way would be to have separate go routine which checks wheter sent checkpoint transactions have been included into the block. And tracking those submission from k blocks to defend against rollbacks.

You are right, we can only be sure that the checkpoint has not been reported without knowing whether the BTC txs have been included on BTC. Maybe for now it's not an issue because we run both the reporter and the submitter by ourselves (we will know which part is broken).

Yep, but given that typically the resend interval is in the range of hours, this will not be an issue. Even if the resend interval was low, we would only bump the fee (to a probably similar value or one that more closely relates to the latest reality). If the transaction has been included and we're not aware of it yet (because Babylon still considers it as SEALED), we will bump the fee of an existing checkpoint so our transaction will be rejected.

KonradStaniec · 2023-06-20T11:20:36Z

Another set of experiment results:

initial error was due to the fact that in tests fee returned by wallet is 0 (there is not much transactionc), so we effectively re-sent the same transaction which is treated as duplicate and rejected by mempool. We need to make sure that when re-sending that fee is larger that in replaced transaction.
after tweaking fee, the error I got is:

time="2023-06-20T13:10:32+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: replacement transaction d0dbe6046a8520e8e91a9bd2ce849351a0b1bb09df582a7bf3e58bbaf94baeee has an insufficient fee rate: needs more than 3773, has 3594" module=submitter

Conclusion: We need to somehow make sure that we provide enough fee for replacement.

After tweaking fee once again, the error I got is:

time="2023-06-20T13:12:21+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -25: TX rejected: failed to validate input a9ece0b726cf4e82000a6ab3745786137aac7672e9fa95ce80e1460663300347:0 which references output 1cf7c0cfde61f1c8a98a74d8de6db5c24a7058c46436758ea9907bfca27f2177:0 - signature not empty on failed checksig (input witness [], input script bytes 473044022029289cf04781a7d8344d367b49d204b0641bf196255795baf61fe640ee6ea0df022067033b25236bcf53645a6e2d6ff8ecabf0c8cb641b3b263c8ae55284cc9a30b3012103d1610375df79aaa608d9484b6540239477f9bfc6ad4593d6717a231e4b2576f1, prev output script bytes 76a914ecc89e8bdccf753ee4ed82106ab2d9f821a7aeda88ac)" module=submitter

Conclusion: It seems the we need to re-sign transaction after changing fee. I kinda anticipated that, as it would be weird if one could just change tx in mempool without proper signature.

vitsalis · 2023-06-20T11:22:32Z

did a quick experiment with this and got logs in form of:
time="2023-06-20T12:53:34+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: already have transaction 4cfff6e42a15a7f7be10f1176749c15a3fc552f1c60ef9810e2dbb66315f5b47" module=submitter
So maybe there is some node config option which needs to be set up ?

How did you trigger this experiment? Given that we have replace by fee enabled we should be able to trigger the replacement I think, so this needs some more investigation.

Overall, I suggest we update the e2e tests as well as verify that this works based on some local experiments (e.g. through the local deployment?) as we want to deploy this on the mainnet node soon. We will of course initially try it on the devnet with a resend-interval of a few seconds to trigger the resending often and check that this works as expected.

KonradStaniec · 2023-06-20T11:26:21Z

So the experiment I did was exeactly that:

set up resendIntervalSeconds to some low value like 5s
start test
wait for submitter initial submission to be included in mempool
do not mine this transaction and wait for re-submission
check that old transaction is evicted from mempool and the new one has taken its place

I setup up tm.Config.Submitter.ResendIntervalSeconds = 5 and in test after mempool inclusion added sleep time.Sleep(10 * time.Second)

It is enough to trigger re-submission.

As mentioned in next comment this error was kinda red herring due to test enviroment:

initial error was due to the fact that in tests fee returned by wallet is 0 (there is not much transactionc), so we effectively re-sent the same transaction which is treated as duplicate and rejected by mempool.

gitferry · 2023-06-20T13:48:58Z

Thanks @KonradStaniec for the quick experiments and conclusion. The error is much likely caused by not signing the changed tx. Will fix this and test it locally to make sure the re-sending works

KonradStaniec · 2023-06-20T14:13:09Z

@gitferry I have pushed commit which shows test which I would expect to work. I saw you wanted to have an example before you eddited the comment 😅

gitferry · 2023-06-21T14:34:22Z

Hi @vitsalis and @KonradStaniec, thanks for your previous feedback. Now the PR passed the e2e tests and is ready for review again. Some changes I made:

resign the tx after changing the value of the output (adjusting the tx fee)
some refactoring on composing the BTC tx to make the code cleaner and less duplicate
only resend the second tx of the checkpoint other than both the two txs increasing the fee of the tx2 is sufficient for our purpose and can save another tx
in our tests, the fee estimation is always the same, so a hack in this PR is to double the previous fee as the new fee to ensure that the new fee is higher. we can adjust this before deploying it to the dev net

vitsalis

Only bumping the second transaction needs to take care of the effective fee, not only the fee of the second transaction.

vitsalis · 2023-06-21T14:59:39Z

submitter/relayer/relayer.go

-		return errors.New("checkpoint is not Sealed")
+		log.Logger.Errorf("The checkpoint for epoch %v is not sealed", ckptEpoch)
+		// we do not consider this case as a failed submission but a software bug
+		// TODO: add metrics for alerting


Maybe we could add this alert metric as part of this PR given that metrics infrastructure is already in place?

Maybe I missed something, what do you mean by the metrics infrastructure already in place? I suppose we still need another PR to export BTC usage metrics to the dashboard

Yep sure we can do it in another PR

vitsalis · 2023-06-21T15:00:17Z

submitter/relayer/relayer.go

+		log.Logger.Errorf("The checkpoint for epoch %v is lower than the last submission for epoch %v",
+			ckptEpoch, lastSubmittedEpoch)
+		// we do not consider this case as a failed submission but a software bug
+		// TODO: add metrics for alerting


vitsalis · 2023-06-21T15:04:39Z

submitter/relayer/relayer.go

+func (rl *Relayer) resendSecondTxOfCheckpointToBTC(btcTxInfo *types.BtcTxInfo) (*types.BtcTxInfo, error) {
+	// re-estimate the tx fee based on the current load
+	// TODO: re-estimate the fee for the real network
+	// fee := rl.GetTxFee(btcTxInfo.Size)


If we resend only the second transaction, then this does not bump the fee as we want it to.

When having chained transactions that are not included, the miners see their fee as not their individual fee, but as their effective fee.

For example if we have transactions A->B and A has fee a and B has fee b, then the effective fee of both transactions is (a + b) / (size(A) + size(B)). Therefore, if we want to bump the fee of only the second transaction, the fee should be bumped as a' - a + b' - b, where a' and b' are the new fees that were calculated. Now we only bump the fee by b' - b.

We could either resend both transactions or implement the more complex logic I described above. I'm ok with both.

This is a very nice explanation, thanks! I think only bumping the second tx can also be effective if we estimate the fee of the second tx by taking both size(tx1) and size(tx2) into account. For example, bumpedFee := rl.GetTxFee(tx1_size + tx2_size).

Further, I think bumping only the second tx has several benefits compared with bumping the two:

the logic would be simpler because if we bump the two we need to let tx2' reference tx1' not tx1

we can have less cost in the case where tx1 is already included in a BTC block but tx2 isn't. In this case, we cannot replace either of them if we bump the two txs because tx2' references tx1', not tx1. But, if we only bump the second tx, we could replace tx2 and save tx1.

vitsalis · 2023-06-21T15:04:41Z

submitter/relayer/relayer.go

+	// TODO: re-estimate the fee for the real network
+	// fee := rl.GetTxFee(btcTxInfo.Size)
+	// hack: make sure the fee higher than the previous fee for testing purpose
+	fee := btcTxInfo.Fee * 2


Seems that the above is leftover.

vitsalis · 2023-06-21T15:06:58Z

submitter/relayer/relayer.go

+		// ensure that the fee is not greater than the output value
+		fee = outputValue
+	}
+	tx.TxOut[1].Value = int64(outputValue - fee)


If fee == outputValue we will have an output with 0 satoshi. Is this allowed?

Yep, it is allowed. Recall that we have set TxOut[0].Value=0 already for storing data in OP_RETURN. Also, see this

vitsalis · 2023-06-21T15:07:25Z

submitter/relayer/relayer.go

+	// if not segwit, add signature; otherwise, add witness
+	segwit, err := isSegWit(utxo.Addr)
+	if err != nil {
+		panic(err)


Why do we panic instead of returning the error?

Returning the error indicates an unsupported address type in the UTXO. Yep, returning the error is better. Thanks!

vitsalis · 2023-06-21T15:08:20Z

submitter/relayer/utils.go

+	// if not segwit, add signature; otherwise, add witness
+	segwit, err := isSegWit(utxo.Addr)
+	if err != nil {
+		panic(err)


Why panic instead of returning error?

vitsalis

Looks good! Only a slight concern about the precision. Let's wait for @KonradStaniec input as well as this is an important change.

vitsalis · 2023-06-22T07:58:23Z

submitter/relayer/relayer.go

-		return errors.New("checkpoint is not Sealed")
+		log.Logger.Errorf("The checkpoint for epoch %v is not sealed", ckptEpoch)
+		// we do not consider this case as a failed submission but a software bug
+		// TODO: add metrics for alerting


Yep sure we can do it in another PR

vitsalis · 2023-06-22T08:00:34Z

submitter/relayer/relayer.go

+func (rl *Relayer) resendSecondTxOfCheckpointToBTC(btcTxInfo *types.BtcTxInfo) (*types.BtcTxInfo, error) {
+	// re-estimate the tx fee based on the current load considering the size of both tx1 and tx2
+	// here we double the fee for simplicity as the size of the two txs is roughly the same
+	fee := rl.GetTxFee(btcTxInfo.Size) * 2


I suggest we be more precise and not just double the fee. We can do rl.GetTxFee(btcTxInfo1.Size + btcTxInfo2.Size)

vitsalis · 2023-06-22T08:02:16Z

submitter/relayer/relayer.go

+	outputValue := uint64(utxo.Amount.ToUnit(btcutil.AmountSatoshi))
+	if outputValue < fee {
+		// ensure that the fee is not greater than the output value
+		fee = outputValue


A concern might be that in times of congestion we might get stuck trying to resubmit a low UTXO transaction because we won't have enough funds to pay for the fees. But not sure if we can do much about it without introducing complex logic.

I don't think it will happen often because when we pick UTXO we pick the highest one. Maybe this can be addressed if we periodically aggregate UTXOs as you suggested earlier. We'll see whether it's a real concern from forthcoming metrics in our BTC dashboard.

KonradStaniec

In general looks good ! Few general remarks for the future:

SendCheckpointToBTC is starting becoming a bit too complex for my taste and handles too much. Imo resendSecondTxOfCheckpointToBTC should be extracted to separate function and handled in main event loop.
before doing rbf, we could check if previous tx is in mempool still. This way we could make sure we are really sending tx with larger fee
All this things can be done in next pr-s imo

gitferry · 2023-06-22T10:08:19Z

Thanks @vitsalis and @KonradStaniec for your final reviews. As I'm merging the PR, is it ready for testing in our dev net? What should I do?

gitferry added 3 commits June 16, 2023 11:02

enable rbf

979581e

fix lint

c013363

add more comments

45af63e

gitferry requested review from vitsalis and KonradStaniec June 19, 2023 10:11

vitsalis reviewed Jun 19, 2023

View reviewed changes

gitferry added 3 commits June 20, 2023 15:07

replace submitted checkpoint list with the last submitted checkpoint

2fc82aa

rename

55ff88e

fix

997751c

gitferry force-pushed the use-rbf branch from 85bc40e to 997751c Compare June 20, 2023 07:28

vitsalis approved these changes Jun 20, 2023

View reviewed changes

minor

e53e8fb

KonradStaniec reviewed Jun 20, 2023

View reviewed changes

Test example

5209b07

gitferry added 2 commits June 21, 2023 10:33

resign tx after changing the fee

7d3fb5c

solve resigning issue and refactoring

c3151ea

gitferry requested review from vitsalis and KonradStaniec June 21, 2023 14:34

vitsalis reviewed Jun 21, 2023

View reviewed changes

minor

1a58754

vitsalis approved these changes Jun 22, 2023

View reviewed changes

make fee estimation precise

fea7da9

KonradStaniec approved these changes Jun 22, 2023

View reviewed changes

gitferry merged commit 108024b into dev Jun 22, 2023

gitferry deleted the use-rbf branch June 22, 2023 10:08

Use Replace-by-Fee if the previous sent checkpoint has not been included for long time #176

Use Replace-by-Fee if the previous sent checkpoint has not been included for long time #176

Conversation

gitferry commented Jun 19, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gitferry commented Jun 20, 2023

vitsalis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gitferry commented Jun 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KonradStaniec commented Jun 20, 2023

KonradStaniec commented Jun 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KonradStaniec commented Jun 20, 2023

vitsalis commented Jun 20, 2023

KonradStaniec commented Jun 20, 2023

gitferry commented Jun 20, 2023 • edited Loading

KonradStaniec commented Jun 20, 2023

gitferry commented Jun 21, 2023

vitsalis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitsalis left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KonradStaniec left a comment

Choose a reason for hiding this comment

gitferry commented Jun 22, 2023

gitferry commented Jun 20, 2023 •

edited

Loading

gitferry commented Jun 20, 2023 •

edited

Loading

vitsalis left a comment •

edited

Loading