Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Replace-by-Fee if the previous sent checkpoint has not been included for long time #176

Merged
merged 12 commits into from
Jun 22, 2023

Conversation

gitferry
Copy link
Collaborator

This PR fixes #175.

A previous PR #170 enables the RBF feature but is not used. This PR allows the use of RBF if a previous checkpoint has been sent but not included on BTC for a long period of time. Particularly, we save the meta information (i.e., timestamp, BTC txs) of each submitted checkpoint. Whenever a submitted-but-not-included-for-long checkpoint is found, we will resend the checkpoint using the same BTC txs used in the last submission. Note that we will re-estimate the tx fee and change the value part in the second output of each BTC tx.

Please note that this PR does not include end-to-end tests. @KonradStaniec could you please take a look and see how we can simulate some rough scenarios (e.g., BTC txs stuck in the mempool) and add relevant e2e test?

tag btctxformatter.BabylonTag
version btctxformatter.FormatVersion
submitterAddress sdk.AccAddress
submittedCheckpoints map[uint64]*types.CheckpointInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep submission information about all the past checkpoints? I feel this is just using extra memory.

From my understanding, we only need to keep information about the last submission that we sent and the epoch number it corresponds to. Then,

  • If the last SEALED epoch remains the same as our last submission and the resend interval has not passed: Do nothing
  • If the last SEALED epoch remains the same as our last submission and the resend interval has passed: Bump the fee of the last submission
  • If the last SEALED epoch changed, then create a new checkpoint and reset the variables to the new checkpoints and epochs.

Unless there's another utility of having a list of the submitted checkpoints.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. After a second thought, indeed there's no need to keep a list of the submitted checkpoints as the checkpoint is submitted in the monotonically increasing order. Further, the next checkpoint will not be submitted if the previous one is still Sealed.

log.Logger.Debugf("Skip submitting the raw checkpoint for epoch %v", ckpt.Ckpt.EpochNum)

epoch := ckpt.Ckpt.EpochNum
ckptInfo, hasSubmitted := rl.submittedCheckpoints[epoch]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we only care about the last epoch, so I guess we can remove information about any epoch' < epoch etc. Overall, I suggest we do not keep a mapping, but only the last submitted checkpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic can be converted to:

  1. Get the last checkpoint submitted.
  2. If the epoch of the last checkpoint submitted does not match the latest sealed checkpoint of Babylon, we have a new checkpoint to submit and we submit it. Set the last checkpoint submitted to the new one.
  3. If the epoch matches, check if the resend interval has passed. If yes, bump the fee of the checkpoint and set the last checkpoint submitted to the new one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Tx *wire.MsgTx
ChangeAddress btcutil.Address
Size uint64 // the size of the BTC tx
UtxoAmount uint64 // the amount of the UTXO used in the BTC tx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't those values be calculated by the transaction itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For calculating the Size, we need the UTXO and the priv key etc., see here. So I thought it would be easier to just save size and utxoAmount in the BtcTxInfo

tag btctxformatter.BabylonTag
version btctxformatter.FormatVersion
submitterAddress sdk.AccAddress
resendIntervals uint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resendIntervals uint
resendIntervalSeconds uint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way (that might be a bit cleaner) is

Suggested change
resendIntervals uint
resendIntervals time.Duration

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that changing the interval to time.Duration is cleaner and more flexible, but it increases the chance of having invalid configurations as it requires the time unit specified, for example, 100s vs. 100.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a big issue, we already do that in other places as well (e.g. the API).

@@ -37,11 +39,12 @@ func New(
resendIntervals uint,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resendIntervals uint,
resendIntervalSeconds uint,

@gitferry
Copy link
Collaborator Author

Thanks, @vitsalis and @SebastianElvis for your comments. I have replaced the submitted checkpoint list with the last submitted checkpoint as we agree that there's no need to keep all the historical checkpoints.

Copy link
Member

@vitsalis vitsalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than some minor stuff, looks good! How did you test this? Once this is merged is it ready for deployment?

submitter/relayer/relayer.go Show resolved Hide resolved
submitter/relayer/relayer.go Show resolved Hide resolved
log.Logger.Debugf("The checkpoint for epoch %v was sent more than %v seconds ago but not included on BTC, resending the checkpoint",
ckptEpoch, rl.resendIntervalSeconds)

err := rl.resendCheckpointToBTC(rl.lastSubmittedCheckpoint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that this method returns the checkpoint and we set rl.lastSubmittedCheckpoint here.

if rl.lastSubmittedCheckpoint == nil || rl.lastSubmittedCheckpoint.Epoch < ckptEpoch {
log.Logger.Debugf("Submitting a raw checkpoint for epoch %v for the first time", ckptEpoch)

err := rl.convertCkptToTwoTxAndSubmit(ckpt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest this method returns the checkpoint and we set rl.lastSubmittedCheckpoint here.

@gitferry
Copy link
Collaborator Author

gitferry commented Jun 20, 2023

Thanks, @vitsalis for your suggestions! I tested this PR in local deployment using babylon-deployment. It ran well in the happy path, but the resending is not triggered. I'm not sure adding unit tests on this would be meaningful as other components are all mocked. So I think maybe we can add some e2e tests where BTC txs are stuck in the mempool for a long time. Wdyt @KonradStaniec?

Anyway, I think we can deploy it in the devnet first and see how it goes. Is it what the devnet is for? Can we create scenarios where the BTC mempool is busy?

// resend tx1 of the checkpoint
tx1Fee := rl.GetTxFee(ckptInfo.Tx1.Size)
tx1 := ckptInfo.Tx1
tx1.Tx.TxOut[1].Value = int64(ckptInfo.Tx1.UtxoAmount - tx1Fee)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work ? Doesn't changing output value require signing transaction once again before sending?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is a nice catch. We definitely need to sign the tx again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to check if the new fee is larger than the previous fee. If it is not, then the transaction won't be accepted as RBF only works if we set a higher fee.

@KonradStaniec
Copy link
Contributor

Thanks, @vitsalis for your suggestions! I tested this PR in local deployment using babylon-deployment. It ran well in the happy path, but the resending is not triggered. I'm not sure adding unit tests on this would be meaningful as other components are all mocked. So I think maybe we can add some e2e tests where BTC txs are stuck in the mempool for a long time. Wdyt @KonradStaniec?

Probably the easiest way to test (at least partially) it would be with e2e_test.go. In those tests mining of the block is triggered manualy with mineBlockWithTxes so most obvious test would be:

  • set up resendIntervalSeconds to some low value like 5s
  • start test
  • wait for submitter initial submission to be included in mempool
  • do not mine this transaction and wait for re-submission
  • check that old transaction is evicted from mempool and the new one has taken its place

@KonradStaniec
Copy link
Contributor

did a quick experiment with this and got logs in form of:

time="2023-06-20T12:53:34+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: already have transaction 4cfff6e42a15a7f7be10f1176749c15a3fc552f1c60ef9810e2dbb66315f5b47" module=submitter

So maybe there is some node config option which needs to be set up ?

// if the resend interval has passed, should resend
durSeconds := uint(time.Since(rl.lastSubmittedCheckpoint.Ts).Seconds())
if durSeconds >= rl.resendIntervalSeconds {
log.Logger.Debugf("The checkpoint for epoch %v was sent more than %v seconds ago but not included on BTC, resending the checkpoint",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we known that at this point the checkpoint was really not included in btc ? Is this implicit, due too the fact that babylon still perceive this checkpoint as SEALED ?

If yes, I would add a comment explaining that and also mentioning that it is not perfect way, but it is also possible that submission succeeded and was yet not reported to Babylon.

The only foolproof way would be to have separate go routine which checks wheter sent checkpoint transactions have been included into the block. And tracking those submission from k blocks to defend against rollbacks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we can only be sure that the checkpoint has not been reported without knowing whether the BTC txs have been included on BTC. Maybe for now it's not an issue because we run both the reporter and the submitter by ourselves (we will know which part is broken).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but given that typically the resend interval is in the range of hours, this will not be an issue. Even if the resend interval was low, we would only bump the fee (to a probably similar value or one that more closely relates to the latest reality). If the transaction has been included and we're not aware of it yet (because Babylon still considers it as SEALED), we will bump the fee of an existing checkpoint so our transaction will be rejected.

@KonradStaniec
Copy link
Contributor

Another set of experiment results:

  • initial error was due to the fact that in tests fee returned by wallet is 0 (there is not much transactionc), so we effectively re-sent the same transaction which is treated as duplicate and rejected by mempool. We need to make sure that when re-sending that fee is larger that in replaced transaction.
  • after tweaking fee, the error I got is:
time="2023-06-20T13:10:32+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: replacement transaction d0dbe6046a8520e8e91a9bd2ce849351a0b1bb09df582a7bf3e58bbaf94baeee has an insufficient fee rate: needs more than 3773, has 3594" module=submitter

Conclusion: We need to somehow make sure that we provide enough fee for replacement.

  • After tweaking fee once again, the error I got is:
time="2023-06-20T13:12:21+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -25: TX rejected: failed to validate input a9ece0b726cf4e82000a6ab3745786137aac7672e9fa95ce80e1460663300347:0 which references output 1cf7c0cfde61f1c8a98a74d8de6db5c24a7058c46436758ea9907bfca27f2177:0 - signature not empty on failed checksig (input witness [], input script bytes 473044022029289cf04781a7d8344d367b49d204b0641bf196255795baf61fe640ee6ea0df022067033b25236bcf53645a6e2d6ff8ecabf0c8cb641b3b263c8ae55284cc9a30b3012103d1610375df79aaa608d9484b6540239477f9bfc6ad4593d6717a231e4b2576f1, prev output script bytes 76a914ecc89e8bdccf753ee4ed82106ab2d9f821a7aeda88ac)" module=submitter

Conclusion: It seems the we need to re-sign transaction after changing fee. I kinda anticipated that, as it would be weird if one could just change tx in mempool without proper signature.

@vitsalis
Copy link
Member

did a quick experiment with this and got logs in form of:

time="2023-06-20T12:53:34+02:00" level=error msg="Failed to submit the raw checkpoint for 1: failed to re-send tx1 of the checkpoint 1: -26: TX rejected: already have transaction 4cfff6e42a15a7f7be10f1176749c15a3fc552f1c60ef9810e2dbb66315f5b47" module=submitter

So maybe there is some node config option which needs to be set up ?

How did you trigger this experiment? Given that we have replace by fee enabled we should be able to trigger the replacement I think, so this needs some more investigation.

Overall, I suggest we update the e2e tests as well as verify that this works based on some local experiments (e.g. through the local deployment?) as we want to deploy this on the mainnet node soon. We will of course initially try it on the devnet with a resend-interval of a few seconds to trigger the resending often and check that this works as expected.

@KonradStaniec
Copy link
Contributor

So the experiment I did was exeactly that:

set up resendIntervalSeconds to some low value like 5s
start test
wait for submitter initial submission to be included in mempool
do not mine this transaction and wait for re-submission
check that old transaction is evicted from mempool and the new one has taken its place

I setup up tm.Config.Submitter.ResendIntervalSeconds = 5 and in test after mempool inclusion added sleep time.Sleep(10 * time.Second)

It is enough to trigger re-submission.

As mentioned in next comment this error was kinda red herring due to test enviroment:

initial error was due to the fact that in tests fee returned by wallet is 0 (there is not much transactionc), so we effectively re-sent the same transaction which is treated as duplicate and rejected by mempool. 

@gitferry
Copy link
Collaborator Author

gitferry commented Jun 20, 2023

Thanks @KonradStaniec for the quick experiments and conclusion. The error is much likely caused by not signing the changed tx. Will fix this and test it locally to make sure the re-sending works

@KonradStaniec
Copy link
Contributor

@gitferry I have pushed commit which shows test which I would expect to work. I saw you wanted to have an example before you eddited the comment 😅

@gitferry
Copy link
Collaborator Author

Hi @vitsalis and @KonradStaniec, thanks for your previous feedback. Now the PR passed the e2e tests and is ready for review again. Some changes I made:

  1. resign the tx after changing the value of the output (adjusting the tx fee)
  2. some refactoring on composing the BTC tx to make the code cleaner and less duplicate
  3. only resend the second tx of the checkpoint other than both the two txs increasing the fee of the tx2 is sufficient for our purpose and can save another tx
  4. in our tests, the fee estimation is always the same, so a hack in this PR is to double the previous fee as the new fee to ensure that the new fee is higher. we can adjust this before deploying it to the dev net

Copy link
Member

@vitsalis vitsalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only bumping the second transaction needs to take care of the effective fee, not only the fee of the second transaction.

return errors.New("checkpoint is not Sealed")
log.Logger.Errorf("The checkpoint for epoch %v is not sealed", ckptEpoch)
// we do not consider this case as a failed submission but a software bug
// TODO: add metrics for alerting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add this alert metric as part of this PR given that metrics infrastructure is already in place?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed something, what do you mean by the metrics infrastructure already in place? I suppose we still need another PR to export BTC usage metrics to the dashboard

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep sure we can do it in another PR

log.Logger.Errorf("The checkpoint for epoch %v is lower than the last submission for epoch %v",
ckptEpoch, lastSubmittedEpoch)
// we do not consider this case as a failed submission but a software bug
// TODO: add metrics for alerting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

func (rl *Relayer) resendSecondTxOfCheckpointToBTC(btcTxInfo *types.BtcTxInfo) (*types.BtcTxInfo, error) {
// re-estimate the tx fee based on the current load
// TODO: re-estimate the fee for the real network
// fee := rl.GetTxFee(btcTxInfo.Size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we resend only the second transaction, then this does not bump the fee as we want it to.

When having chained transactions that are not included, the miners see their fee as not their individual fee, but as their effective fee.

For example if we have transactions A->B and A has fee a and B has fee b, then the effective fee of both transactions is (a + b) / (size(A) + size(B)). Therefore, if we want to bump the fee of only the second transaction, the fee should be bumped as a' - a + b' - b, where a' and b' are the new fees that were calculated. Now we only bump the fee by b' - b.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could either resend both transactions or implement the more complex logic I described above. I'm ok with both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice explanation, thanks! I think only bumping the second tx can also be effective if we estimate the fee of the second tx by taking both size(tx1) and size(tx2) into account. For example, bumpedFee := rl.GetTxFee(tx1_size + tx2_size).

Further, I think bumping only the second tx has several benefits compared with bumping the two:

  1. the logic would be simpler because if we bump the two we need to let tx2' reference tx1' not tx1
  2. we can have less cost in the case where tx1 is already included in a BTC block but tx2 isn't. In this case, we cannot replace either of them if we bump the two txs because tx2' references tx1', not tx1. But, if we only bump the second tx, we could replace tx2 and save tx1.

// TODO: re-estimate the fee for the real network
// fee := rl.GetTxFee(btcTxInfo.Size)
// hack: make sure the fee higher than the previous fee for testing purpose
fee := btcTxInfo.Fee * 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that the above is leftover.

// ensure that the fee is not greater than the output value
fee = outputValue
}
tx.TxOut[1].Value = int64(outputValue - fee)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fee == outputValue we will have an output with 0 satoshi. Is this allowed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it is allowed. Recall that we have set TxOut[0].Value=0 already for storing data in OP_RETURN. Also, see this

// if not segwit, add signature; otherwise, add witness
segwit, err := isSegWit(utxo.Addr)
if err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we panic instead of returning the error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning the error indicates an unsupported address type in the UTXO. Yep, returning the error is better. Thanks!

// if not segwit, add signature; otherwise, add witness
segwit, err := isSegWit(utxo.Addr)
if err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why panic instead of returning error?

Copy link
Member

@vitsalis vitsalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Only a slight concern about the precision. Let's wait for @KonradStaniec input as well as this is an important change.

return errors.New("checkpoint is not Sealed")
log.Logger.Errorf("The checkpoint for epoch %v is not sealed", ckptEpoch)
// we do not consider this case as a failed submission but a software bug
// TODO: add metrics for alerting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep sure we can do it in another PR

func (rl *Relayer) resendSecondTxOfCheckpointToBTC(btcTxInfo *types.BtcTxInfo) (*types.BtcTxInfo, error) {
// re-estimate the tx fee based on the current load considering the size of both tx1 and tx2
// here we double the fee for simplicity as the size of the two txs is roughly the same
fee := rl.GetTxFee(btcTxInfo.Size) * 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we be more precise and not just double the fee. We can do rl.GetTxFee(btcTxInfo1.Size + btcTxInfo2.Size)

outputValue := uint64(utxo.Amount.ToUnit(btcutil.AmountSatoshi))
if outputValue < fee {
// ensure that the fee is not greater than the output value
fee = outputValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concern might be that in times of congestion we might get stuck trying to resubmit a low UTXO transaction because we won't have enough funds to pay for the fees. But not sure if we can do much about it without introducing complex logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will happen often because when we pick UTXO we pick the highest one. Maybe this can be addressed if we periodically aggregate UTXOs as you suggested earlier. We'll see whether it's a real concern from forthcoming metrics in our BTC dashboard.

Copy link
Contributor

@KonradStaniec KonradStaniec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good ! Few general remarks for the future:

  • SendCheckpointToBTC is starting becoming a bit too complex for my taste and handles too much. Imo resendSecondTxOfCheckpointToBTC should be extracted to separate function and handled in main event loop.
  • before doing rbf, we could check if previous tx is in mempool still. This way we could make sure we are really sending tx with larger fee
    All this things can be done in next pr-s imo

@gitferry
Copy link
Collaborator Author

Thanks @vitsalis and @KonradStaniec for your final reviews. As I'm merging the PR, is it ready for testing in our dev net? What should I do?

@gitferry gitferry merged commit 108024b into dev Jun 22, 2023
@gitferry gitferry deleted the use-rbf branch June 22, 2023 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Waste of previous BTC checkpoint transactions when new UTXOs are spent
4 participants