Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The solana-web3.js transaction confirmation logic is very broken #23949

Closed
mvines opened this issue Mar 25, 2022 · 49 comments · Fixed by #25227
Closed

The solana-web3.js transaction confirmation logic is very broken #23949

mvines opened this issue Mar 25, 2022 · 49 comments · Fixed by #25227
Assignees

Comments

@mvines
Copy link
Member

mvines commented Mar 25, 2022

The correct approach to confirming a normal transaction is to fetch the last valid block height for the transaction blockhash, then wait until the chain has advanced beyond that block height before declaring the transaction expired.

A transaction that uses a nonce is a little different, since those transactions are technically valid until the nonce is advanced. For nonce-based transactions, the right approach is to check the corresponding nonce account to see if the nonce has been advanced while periodically retrying the transaction (although some form of timeout or mechanism for the caller to abort seems nice)

The big offenders:

1. confirmTransaction

This method simply waits for up to 60 seconds. This is wrong because a transaction's blockhash may be valid for longer than 60 seconds, so it's very possible for the transaction to execute after this method returns a failure. The signature of confirmTransaction does not include the blockhash of the transaction so it's literally impossible to do the right thing without a signature change. This method probably just should be deprecated.

2. sendAndConfirmTransaction and sendAndConfirmRawTransaction

These two just need to be reimplemented to not use confirmTransaction and instead follow the approach outlined at the start of this issue.

@steveluscher steveluscher self-assigned this Mar 25, 2022
@steveluscher
Copy link
Contributor

@marcnjaramillo
Copy link
Contributor

@stellaw1 and I spent some time last night talking about this issue and we have an idea of what we should do and wanted to get some feedback and clarification on a few things. Based on what we've been able to research, it seems like we're going to need to use getLatestBlockhash because it returns a blockhash and its lastValidBlockHeight along with getBlockHeight to compare with lastValidBlockHeight. We're assuming that lastValidBlockHeight is essentially the cut-off for how long a transaction has to be finalized before it expires (based on the wording in the docs).

We also took a look at confirmTransaction to understand it better, and it looks like it does something with the various commitment statuses and sets the response accordingly, throwing an error if 60 seconds pass and the response is null. Based on all this, we're thinking we need to use the methods mentioned above to track blocks rather than using a 60-second timeout.

Does this seem reasonable?

@t-nelson
Copy link
Contributor

t-nelson commented Apr 1, 2022

it seems like we're going to need to use getLatestBlockhash

There's also isBlockhashValid

We're assuming that lastValidBlockHeight is essentially the cut-off for how long a transaction has to be finalized before it expires (based on the wording in the docs).

Not finalized, just included in a block (processed commitment) that's eventually finalized. Once the blockhash is expired, the transaction can never be included in a later block, the block it's included in can continue to gain stronger commitment though

Based on all this, we're thinking we need to use the methods mentioned above to track blocks rather than using a 60-second timeout.

Precisely. Human time is the enemy

@marcnjaramillo
Copy link
Contributor

There's also isBlockhashValid

We saw this one too and wondered if it would work in this case.

Not finalized, just included in a block (processed commitment) that's eventually finalized. Once the blockhash is expired, the transaction can never be included in a later block, the block it's included in can continue to gain stronger commitment though

Thanks for clarifying this. I'm still trying to better understand what commitment means, so this helps. Still a lot to learn!

@mvines
Copy link
Member Author

mvines commented Apr 1, 2022

isBlockhashValid is weaker than determining the lastValidBlockHeight and just waiting for that block height. isBlockhashValid may return false if you’re talking to a load balanced server and hit a bad backend, even though the blockhash is still valid (= true). I’ve had this happen to me in the wild.

@t-nelson
Copy link
Contributor

t-nelson commented Apr 1, 2022

isBlockhashValid is weaker than determining the lastValidBlockHeight and just waiting for that block height. isBlockhashValid may return false if you’re talking to a load balanced server and hit a bad backend, even though the blockhash is still valid. I’ve had this happen to me in the wild.

polling with getBlockHeight suffers the same fate, no?

getBlockHeight would be an optimization when tracking multiple transactions (with mixed recent blockhashes) simultaneously though

@mvines
Copy link
Member Author

mvines commented Apr 1, 2022

No, polling getBlockHeight won’t fail if the RPC endpoint backend you hit is behind

@marcnjaramillo
Copy link
Contributor

@mvines you also mentioned nonce-based transactions in the issue, but neither of us has any experience with nonce accounts. Would getNonceAndContext give us enough information to determine if the nonce has been advanced?

@elliott-w
Copy link
Contributor

Based on what we've been able to research, it seems like we're going to need to use getLatestBlockhash because it returns a blockhash and its lastValidBlockHeight along with getBlockHeight to compare with lastValidBlockHeight.

Could somebody kindly provide a code sample of how this might work in practice? Cheers

@jacobcreech
Copy link
Contributor

@mvines you also mentioned nonce-based transactions in the issue, but neither of us has any experience with nonce accounts. Would getNonceAndContext give us enough information to determine if the nonce has been advanced?

You can check for reference how the Rust client sends and confirms transactions.

The logic is as follows in psuedocode:

If nonce based transaction {
  use latest processed blockhash
} else {
  use blockhash stated in transaction
}
while isBlockhashValid {
  getSignatureStatus
  If signatureStatus == commitment level {
    return signature
  }
}

Using isBlockhashValid and getSignatureStatus

There is some logic about retries in there that you can add as well.

No, polling getBlockHeight won’t fail if the RPC endpoint backend you hit is behind

Shouldn't this fail the same way if you hit a bad backend?

@marcnjaramillo
Copy link
Contributor

Thanks @jacobcreech this makes sense. I was just checking the web3.js source code and right now we can't use isBlockhashValid since it hasn't been added. We would have to get that added first to use it here.

@mvines
Copy link
Member Author

mvines commented Apr 3, 2022

No, polling getBlockHeight won’t fail if the RPC endpoint backend you hit is behind
Shouldn't this fail the same way if you hit a bad backend?

Nope. If the RPC endpoint backend is behind:

  • getBlockHeight returns a lower block height. No issue. ✔️
  • isBlockhashValid may return false which causes the client to incorrectly consider the transaction to be expired. ✖️

This is why isBlockhashValid can be harmful.

you also mentioned nonce-based transactions in the issue, but neither of us has any experience with nonce accounts.
Would getNonceAndContext give us enough information to determine if the nonce has been advanced?

Yes I think so. The NonceAccount.nonce field will change when the nonce is advanced and that's your only indication save indication that the previous nonce transaction is now invalid. The nonce can be advanced explicitly to expire a previous unexecuted transaction that was signed with that nonce as well, but this is probably best left as a client-level decision.

@marcnjaramillo
Copy link
Contributor

You can check for reference how the Rust client sends and confirms transactions.

The logic is as follows in psuedocode:

If nonce based transaction {
  use latest processed blockhash
} else {
  use blockhash stated in transaction
}
while isBlockhashValid {
  getSignatureStatus
  If signatureStatus == commitment level {
    return signature
  }
}

Using isBlockhashValid and getSignatureStatus

There is some logic about retries in there that you can add as well.

I've been looking at the Rust client code and this is what I understand so far. If I've got anything wrong, please let me know:

I see where send_and_confirm_transaction uses nonce if it's available. It looks like get_latest_blockhash_with_commitment is only used if there is a nonce; otherwise it uses the recentBlockhash that was included in the transaction's message (though from what I know about creating a message we use getLatestBlockhash anyway, so I guess it ends up doing the same thing - correct me if I'm wrong). It looks like get_signature_statuses to get the status, and if it's None it uses is_blockhash_valid when retrying (this part is fuzzy for me, so I would appreciate any additional insight).

@mvines
Copy link
Member Author

mvines commented Apr 4, 2022

Please note that the Rust client logic is also broken in a similar way, in that it uses isBlockhashValid instead of waiting for the last valid block height

@marcnjaramillo
Copy link
Contributor

I'm curious about lastValidBlockHeight and how we can use it. When a transaction is created with a message, that message has to have a recentBlockhash, and I know we use getLatestBlockhash to set that; however, we don't include the lastValidBlockHeight value in that message. When we were working on this, we used getLatestBlockhash to determine the lastValidBlockHeight inside sendAndConfirmTransaction but now I'm wondering if that's correct because it's different from the value associated with the transaction blockhash. I've been testing things in the console, and the values differ only slightly so does it not really matter?

@marcnjaramillo
Copy link
Contributor

Please note that the Rust client logic is also broken in a similar way, in that it uses isBlockhashValid instead of waiting for the last valid block height

I was curious about that. I also noticed that the Rust client sendAndConfirmTransaction doesn't use confirmTransaction like the web3.js logic does, and the Rust client confirmTransaction uses getSignatureStatuses while web3.js doesn't. Would it make a difference if the web3.js logic was updated to mirror the Rust logic? Or is the Rust logic also broken here like it is for sendAndConfirmTransaction?

@mvines
Copy link
Member Author

mvines commented Apr 4, 2022

The JS and Rust clients are both broken in different ways I'm afraid.

@elliott-w
Copy link
Contributor

elliott-w commented Apr 4, 2022

@mvines
Is this then what we should be aiming for in the rust and js client code? (excluding nonces because I don't understand them yet)

latest = getLatestBlockHash() // rpc call
transaction = createTransaction(latest.blockhash)
signature = sendTransaction(transaction) // rpc call

function hasTransactionSucceeded(signature, lastValidBlockHeight, commitmentLevel) {
    while true {
        currentBlockHeight = getBlockHeight() // rpc call
        isBlockhashValid = currentBlockHeight < lastValidBlockHeight
        If isBlockhashValid {
            signatureStatus = getSignatureStatus() // rpc call
            If isError(signatureStatus) {
                return false
            }
            If signatureStatus == commitmentLevel {
                return true
            }
        } else {
            return false
        }
    }
}

result = hasTransactionSucceeded(signature, latest.lastValidBlockHeight, "finalized")

@stellaw1
Copy link
Contributor

stellaw1 commented Apr 5, 2022

We came up with two solutions for getting an accurate lastValidBlockHeight for a transaction's recentBlockhash:

  1. Naive approach of calling getLatestBlockhash again in sendAndConfirmTransaction and subtracting 10-15 (what we found the average discrepancy between the lastValidBlockHeight value at the time of creating the transaction and confirming the transaction was)
  2. More correct approach of adding an RPC call to allow calling get_blockhash_last_valid_block_height with a passed in blockhash instead of the latest blockhash like in getLatestBlockhash

@jacobcreech
Copy link
Contributor

  1. Naive approach of calling getLatestBlockhash again in sendAndConfirmTransaction and subtracting 10-15 (what we found the average discrepancy between the lastValidBlockHeight value at the time of creating the transaction and confirming the transaction was)

Don't do any magic guessing on numbers. A lot can happen between transaction creation and sending.

  1. More correct approach of adding an RPC call to allow calling get_blockhash_last_valid_block_height with a passed in blockhash instead of the latest blockhash like in getLatestBlockhash

This requires RPC changes. You'll still run into an issue where the blockhash doesn't exist on a bad backend and you get an error.

I believe the solution lies in changing TransactionCtorFields to accept

recentBlockhash?: Blockhash | LatestBlockhashResult | null;

Where LatestBlockhashResult is

{
    blockhash: string(),
    lastValidBlockHeight: number(),
}

And then use that to determine a transaction's validity

@marcnjaramillo
Copy link
Contributor

marcnjaramillo commented Apr 5, 2022

@jacobcreech sounds good, we'll get this done and update here.

UPDATE: looking into this more, if we edit TransactionCtorFields to accept

recentBlockhash?: Blockhash | LatestBlockhashResult | null;

we would also have to make changes to Message because right now MessageArgs only accepts a string for recentBlockhash.

@marcnjaramillo
Copy link
Contributor

@steveluscher we wanted to get your thoughts on our approach for this issue. I spent some time on this last night and this is what I ended up with so far:

Updating TransactionCtorFields to accept the full result from getLatestBlockHash would require making changes to Message since it's structured right now to only use blockhash. What I ended up doing was adding an optional argument to the populate method to accept lastValidBlockHeight. This isn't the most elegant solution, but it might serve as a sufficient band-aid.

We spent a lot of time looking into this and figuring out a way to do this correctly, but it seems like the only way to do this right is to introduce breaking changes.

@steveluscher
Copy link
Contributor

Sure thing! I’ll catch up on this thread tomorrow.

@steveluscher
Copy link
Contributor

steveluscher commented Apr 6, 2022

On the topic of breaking changes

Rather than to patch in lastValidBlockhash here and there, I think it's reasonable to make the breaking changes that involve adding it to the Transaction and Message constructors but support the old signature for a while. There's a way to use TypeScript to deprecate the old arguments gracefully, while signalling loudly to devs that the removal is on the way!

  1. Write the code defensively to presume that it's either going to receive the new-style or the old-style constructor options.
  2. Write a TypeScript definition that permits both values, but shows you that one is deprecated in your IDE. Check it out: (playground link)
  3. Warn, on the console, when people use the old one.
  4. Eventually remove the old one in a future version.

image

On the topic of block height / nonce tracking

I think we've come up with some good strategies for tracking the current block height and nonce status, but all of them require a lot of logic on the client and a lot of chatter over the wire.

Crazy idea here: @t-nelson, @mvines, what would you think about adding an argument to the signatureSubscribe API to indicate the block height or nonce upon which the transaction with that signature depends. Essentially add a dimension to SubscriptionsTracker to express ‘these subscribers are interested to know when the block height exceeds X’ and ‘these people are interested to know when these nonces are advanced.’ The subscription would resolve with err when the block height is exceeded or the nonce is advanced.

  1. ✅ The only client change we would need would be to add that arg to the existing signature subscription, here.
  2. ✅ No chatter over the wire; no timers on the client. The client just patiently waits for the subscription to resolve.
  3. ✅ Pushes all of the logic down into the RPC, so that any change to that logic only needs to be made in one place to be deployed to the whole world.

@marcnjaramillo
Copy link
Contributor

@steveluscher This is really helpful. Thank you! Do you think for now we should go ahead with the current strategies for tracking block height and nonce status until there can be more discussion about changing the signatureSubscribe API?

@steveluscher
Copy link
Contributor

Let's wait just a bit to hear from the RPC/validator folks. The best code is the code you didn't have to write :)

@steveluscher
Copy link
Contributor

steveluscher commented Apr 6, 2022

On the topic of cancellation

Oh, I forgot to address one note above. @mvines mentioned that if confirmTransaction() is going to live even longer than 60s, then there should be some way for an app to abort it.

The way is to make confirmTransaction() accept and make use of an AbortController. Check it out! https://developer.mozilla.org/en-US/docs/Web/API/AbortController

@mvines
Copy link
Member Author

mvines commented Apr 6, 2022

Just note that an abort doesn’t mean that the transaction won’t execute after the front-end aborts it

@steveluscher
Copy link
Contributor

Please note that the Rust client logic is also broken in a similar way, in that it uses isBlockhashValid instead of waiting for the last valid block height

What change should we make to the Rust client @mvines? Is the idea to add a last_valid_blockheight argument to the Rust API's send_and_confirm_transaction, rather than to let it use is_blockhash_valid internally?

@steveluscher
Copy link
Contributor

Seems like we could build a helper around existing stuff plus a subscribeBlock (with all details disabled).

Yeah, I looked at blockSubscribe as a way to track the current block height, @t-nelson. I was scared off though by this though:

blockSubscribe - Unstable, disabled by default

This subscription is unstable and only available if the validator was started with the `--rpc-pubsub-enable-block-subscription` flag.

Should we not be scared off by that? Is it good to start relying upon?

@steveluscher
Copy link
Contributor

Thanks for that, @t-nelson. You've given me a lot to think about. I've written and delete this response multiple times as I've been diving in and out of both implementations trying to map them out in my head.

In general, what I've found is an imperative API for building and sending transactions – in both Rust and JS – that give devs lots of opportunities to ‘get it wrong.’

  • The Rust API makes it harder to hold on to a stale signature because there's no way to sign a transaction without supplying a recent blockhash right at the sign() callsite. You can still hold on to that transaction for long enough before sending it for it to expire, though.
  • The JS API creates ample opportunity for time to elapse between pairing a recent blockhash with a Transaction and signing that transaction. If you don't sign and send it shortly after creating it, it will expire.

I want to create pits of success for Solana developers, rather than to leave them footguns. The combination of a stateful Transaction class that may or may not be signed, the inability to know if it's expired just by inspecting it (ie. it's missing last valid blockheight metadata), and the ability to supply invalid combinations of options (eg. there's nothing stopping you from supplying both a nonce and a recent blockhash) leaves lots of opportunity for astonishment.

I'm going to continue to think on this.


In the meantime, let's make the smallest changes possible to materially improve confirmation successes with the current API. I propose the following.

First PR: Update the Transaction constructor

The goal here is to make the TypeScript types clear that you should either supply a nonce when you create a Transaction or a (recentBlockhash, lastValidBlockheight), but never both. Leave in the deprecated signature, but deprecate.

Example TypeScript playground link

Second PR: Teach confirmTransaction to confirm using latestBlockhash

  1. Deprecate method calls that only supply TransactionSignature
  2. Add a method signature where the first argument is {signature, latestBlockhash}
  3. When someone calls the method using the deprecated signature, continue to run the broken 60 timeout logic
  4. When someone calls the method using the new signature, write a new implementation that fails when latestBlockhash.lastValidBlockheight passes us by

Third PR: Teach confirmTransaction to confirm using nonceInfo

  1. Add a method signature where the first argument is {signature, nonceInfo}
  2. When someone calls the method using the new signature, write a new implementation that fails when the nonce is advanced.

Endgame

The endgame would look like this:

// Recent blockhashes.
const latestBlockhash = await connection.getLatestBlockhash();
const transaction = new Transaction({latestBlockhash}).add(/* ... */);
const signature = connection.sendTransaction(transaction, /* ... */);
await confirmTransaction({
  latestBlockhash,
  signature,
});

// Nonces
const nonceInfo = /* ... */;
const transaction = new Transaction({nonceInfo}).add(/* ... */);
const signature = connection.sendTransaction(transaction, /* ... */);
await confirmTransaction({
  nonceInfo,
  signature,
});

cc/ @jordansexton, @joncinque, @jstarry for a dummy check on these incremental steps.

@mvines
Copy link
Member Author

mvines commented Apr 7, 2022

A couple other things to consider:

  • When/where do transaction retries occur. We’re trying to get away from RPC-level retries, so this burden will fall on the client.
  • The client should retain their keypair, so signing ideally happens before handing off to sendTransaction/etc for transmission and retries

@marcnjaramillo
Copy link
Contributor

marcnjaramillo commented Apr 8, 2022

@steveluscher and @mvines - @stellaw1 and I have started working on the Transaction constructor to update this, and we have updated Message as well to accept the new arguments. This does work, although we haven't quite figured out how to get the nonce part working so we can take another look at that. We can submit a draft PR to give everyone an opportunity to see what has been done so far and how it can be improved.

@marcnjaramillo
Copy link
Contributor

Draft PR has been created. Looking forward to your thoughts on how we can make this work!
cc/ @steveluscher, @mvines, @t-nelson

@steveluscher
Copy link
Contributor

Can’t wait to look this over; sorry for the delay! I’ll be there soon.

@marcnjaramillo
Copy link
Contributor

No worries! So far, we've only made changes to sendAndConfirmTransaction but not sendAndConfirmRawTransaction. We wanted to focus on one first, make sure we were headed in the right direction and get it polished, then move on to the next function. Looking forward to your feedback!

@ryoqun
Copy link
Member

ryoqun commented Apr 13, 2022

haha, it looks like tx expiration and retries needs more love... ;)

cc: according to @jstarry, it looks like we should get back to use slot height instead of block height? ref: #24242

@ryoqun
Copy link
Member

ryoqun commented Apr 13, 2022

No, polling getBlockHeight won’t fail if the RPC endpoint backend you hit is behind
Shouldn't this fail the same way if you hit a bad backend?

Nope. If the RPC endpoint backend is behind:

* `getBlockHeight` returns a lower block height. No issue. heavy_check_mark

* `isBlockhashValid` may return `false` which causes the client to incorrectly consider the transaction to be expired. heavy_multiplication_x

This is why isBlockhashValid can be harmful.

oh, poor isBlockhashValid. so, that's why i want to get rid of these hard-to-reason read consistency thing at our generic jsonrpc subsystem level... (but couldn't find time still...) ref: #15705

@jstarry
Copy link
Member

jstarry commented Apr 13, 2022

I think we should also consider switching the runtime to actually expire transactions using block heights as we all assumed it did. That approach seems better because then skipped slots won't impact how many blocks a transaction could be committed to

@marcnjaramillo
Copy link
Contributor

marcnjaramillo commented Apr 13, 2022

cc: according to @jstarry, it looks like we should get back to use slot height instead of block height? ref: #24242

This is interesting. I took a look at the conversation on the issue. Learning something new every day. Thanks for referencing this.

I think we should also consider switching to expiring transactions using block heights. It seems better because then skipped slots won't impact how many blocks a transaction could be committed to

cc: @jstarry just curious, did you mean using slots? So rather than polling getBlockHeight we would be using getSlot?

@jstarry
Copy link
Member

jstarry commented Apr 13, 2022

cc: @jstarry just curious, did you mean using slots? So rather than polling getBlockHeight we would be using getSlot?

My comment wasn't very clear so I just updated it. I meant that we can keep the planned RPC and client behavior and update the runtime to use the behavior we thought it was using.

@marcnjaramillo
Copy link
Contributor

Got it. Thanks for the update!

@mschneider
Copy link
Contributor

Endgame

The endgame would look like this:

// Recent blockhashes.
const latestBlockhash = await connection.getLatestBlockhash();
const transaction = new Transaction({latestBlockhash}).add(/* ... */);
const signature = connection.sendTransaction(transaction, /* ... */);
await confirmTransaction({
  latestBlockhash,
  signature,
});

// Nonces
const nonceInfo = /* ... */;
const transaction = new Transaction({nonceInfo}).add(/* ... */);
const signature = connection.sendTransaction(transaction, /* ... */);
await confirmTransaction({
  nonceInfo,
  signature,
});

cc/ @jordansexton, @joncinque, @jstarry for a dummy check on these incremental steps.

It would be really important to expand the documentation to show good examples how to handle the different outcomes confirmTransaction can have. Important to consider:

  • http error codes / connection timeouts / parsing errors - loose grouped as "bad endpoint"
  • blockhash expiry - aka "bad tps"
  • confirmation (including commitment level & slot) "success"

@steveluscher
Copy link
Contributor

Opening a separate issue to track the development of the nonce-based confirmation strategy: #25303.

@AndonMitev
Copy link

// Recent blockhashes.
const latestBlockhash = await connection.getLatestBlockhash();
const transaction = new Transaction({latestBlockhash}).add(/* ... */);
const signature = connection.sendTransaction(transaction, /* ... */);
await confirmTransaction({
  latestBlockhash,
  signature,
});

From where this confirmTransaction not able to find it in 1.44

@steveluscher
Copy link
Contributor

steveluscher commented Jun 14, 2022

That call signature needs one correction:

const transaction = new Transaction({...latestBlockhash}).add(/* ... */);
/* ... */
await confirmTransaction({
  ...latestBlockhash,
  signature,
});

@frankandrobot
Copy link

Just confirming that this has been resolved because i'm hitting related errors so want to confirm it's an issue w/ my code and not the underlying library

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet