EVM-800 nonce too low from eth_getTransactionCount #1853
Merged
Description
The BlockNumber structure supports these strings as special kinds of block number:
"pending" - returns information from the block that is currently being mined, which might not yet be part of the blockchain.
"latest" - refers to the latest block in state.
"earliest" - refers to the genesis block.

We should allow pending blocks for the eth_getTransactionCount endpoint only. All the other parts of the current system should treat pending as latest.

When we do the check in validate, we assume that the txpool does not have any transaction for that specific address, so we check whether the tx nonce is lower than the expected nonce from state. Later on, we load (or create) the object where we store all the txpool information for that specific address (including the nonce). Then, if there was already some tx for that address in the pool, we additionally check the correctness of the nonce. There is no bug here. I just added the metrics counter, which was missing.

The problem was only that getTransactionCount always returned the nonce from the state (even for pending). The nonce from the state and the nonce from the txpool account can differ when there is already more than one tx for the same address in the pool.

This fix will also fix https://polygon.atlassian.net/browse/EVM-801.
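The intended behaviour can be illustrated with a small toy model. This is a sketch in Python, not the Polygon Edge Go code: the class `ToyNode`, its two dictionaries, and the helper `normalize_block_tag` are hypothetical names introduced only for illustration, and the assumption that the pool nonce is used whenever the address has an account in the pool is mine.

```python
from dataclasses import dataclass, field
from typing import Dict


def normalize_block_tag(tag: str) -> str:
    """For endpoints other than eth_getTransactionCount, treat "pending" as "latest"."""
    return "latest" if tag == "pending" else tag


@dataclass
class ToyNode:
    """Toy stand-in for a node; NOT the real implementation."""

    # address -> nonce according to the latest state
    state_nonce: Dict[str, int] = field(default_factory=dict)
    # address -> next nonce tracked by the txpool account (only if one exists)
    pool_nonce: Dict[str, int] = field(default_factory=dict)

    def get_transaction_count(self, address: str, block: str = "latest") -> int:
        state = self.state_nonce.get(address, 0)
        if block == "pending":
            # Only this endpoint honours "pending": use the txpool's view
            # when the address has an account there, else fall back to state.
            return self.pool_nonce.get(address, state)
        return state


node = ToyNode(state_nonce={"0xabc": 5}, pool_nonce={"0xabc": 8})
print(node.get_transaction_count("0xabc", "latest"))
print(node.get_transaction_count("0xabc", "pending"))
```

With three transactions already in the pool for `0xabc`, the two tags return different values, which is the situation that previously led clients to reuse a nonce.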
Important note!: two or more consecutive calls to eth_getTransactionCount(address, "pending") can return the same value, simply because the nonce update in the txpool is not immediate. Users of this endpoint must handle this case.

Question @jp have you tried configuring --max-enqueued or --max-slots on the rpc? When I ran your scripts, I saw a lot of "maximum number of enqueued transactions reached" errors, and afterward the RPC would reject txs with the correct nonce. I changed the setup-rpc.sh script to look like this:

./polygon-edge server --data-dir ./data-rpc/data-1 --chain ./data/genesis.json --grpc-address :50000 --libp2p :30305 --jsonrpc :50002 --price-limit 0 --log-level DEBUG --max-enqueued 100000 --max-slots 10000 > rpc-005.log 2>&1 &

The submit_multiple.sh script will still print out some errors, but no more errors about max enqueued tx. After cast gives up trying to send the 500 tx, the rpc will work as expected and accept transactions with the correct nonce. So a few thoughts:
I agree this is a bug of some kind in the node. The key features might be
It seems to only show up in the RPC / non-validating nodes
It occurs after exceeding the configured queue size
The nonce for that particular account will not be accepted by the node anymore
Mitigations
Requires restarting the node to clear the state
Configure a large queue size
Even with a large queue, I assume the bug could still occur, but it becomes less likely
I didn’t try increasing the account or pool limits. I agree it might help but, as you point out, it probably doesn’t resolve the issue, it just hides it a bit more. I’m also concerned about the memory requirements associated with increasing those limits.

I believe the issue is that Polygon Edge checks the nonce in two different places: in txpool.validateTx (using p.store.GetNonce(stateRoot, tx.From)) and in txpool.addTx (using account.getNonce()). Not sure why it’s tracked in two different places.

If you enable prometheus in the RPC node you’ll notice that the nonce_too_low_tx metric is not reported, which means that our issue comes from the addTx check (which probably should update the prometheus gauge, but that’s another issue). The addTx check uses the nonce value it finds in account.getNonce(). Calls to account.setNonce happen in account.reset, account.promote and txpool.Drop.

My bet is the reproduction steps I shared are causing the issue in txpool.Drop, but all three should be properly reviewed.
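The two-stage nonce check discussed above can be sketched as a toy model. This is Python written for illustration only, not the Go txpool: `ToyTxPool`, `NonceTooLow` and `nonce_too_low_count` are hypothetical names, promotion is simplified to "advance the account nonce when the tx carries exactly the expected nonce", and the counter merely stands in for the Prometheus metric.

```python
class NonceTooLow(Exception):
    """Raised when a tx nonce is below the expected nonce."""


class ToyTxPool:
    """Toy model of the two nonce checks; NOT the real txpool."""

    def __init__(self, state_nonces):
        self.state_nonces = dict(state_nonces)  # address -> nonce from state
        self.accounts = {}                      # address -> next nonce in the pool
        self.nonce_too_low_count = 0            # stand-in for the metric

    def validate_tx(self, sender, nonce):
        # Check 1: compare against the nonce from state only.
        if nonce < self.state_nonces.get(sender, 0):
            self.nonce_too_low_count += 1
            raise NonceTooLow("validate: nonce below state nonce")

    def add_tx(self, sender, nonce):
        self.validate_tx(sender, nonce)
        # Load (or create) the per-address account, seeded from state.
        next_nonce = self.accounts.setdefault(
            sender, self.state_nonces.get(sender, 0)
        )
        # Check 2: compare against the nonce tracked by the pool account.
        if nonce < next_nonce:
            # Counting this rejection too is what keeps the metric complete.
            self.nonce_too_low_count += 1
            raise NonceTooLow("add: nonce below pool account nonce")
        if nonce == next_nonce:
            # Simplified promotion: the account now expects the next nonce.
            self.accounts[sender] = nonce + 1


pool = ToyTxPool({"0xabc": 5})
pool.add_tx("0xabc", 5)
pool.add_tx("0xabc", 6)
try:
    pool.add_tx("0xabc", 5)  # passes check 1 (5 >= 5), fails check 2 (5 < 7)
except NonceTooLow as err:
    print(err)
```

In this model a client that reads the state nonce (5) while two txs are already pooled is rejected only by the second check, which matches the observation that the first-stage metric stayed silent.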
Changes include
Breaking changes
It should not be
Checklist
Testing
Manual tests
premine address
eth_getTransactionCount(address, 'pending')
eth_getTransactionCount calls can return the same value; in that case, just wait some time and continue (skip this iteration) with the loop. This is fine, because updating the nonce in the txpool is not an immediate operation; it takes a small amount of time.
eth_getTransactionCount(address, 'latest') will in most cases keep returning the same value, which is smaller than the value from eth_getTransactionCount(address, 'pending'). This is completely normal: as soon as all pending txs from the txpool are included in blocks, the two values will be the same.

I do not think this e2e is necessary, but I can write it if needed.
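The "wait and skip this iteration" handling described in the manual test can be sketched as a small client loop. This is a minimal Python sketch under stated assumptions: `send_with_fresh_nonce` is a hypothetical helper, `fetch_pending_nonce` is assumed to wrap a JSON-RPC call to eth_getTransactionCount(address, 'pending'), and `send_tx` is assumed to sign and submit a transaction with the given nonce. Neither callback is part of Polygon Edge.

```python
import time
from typing import Callable, List


def send_with_fresh_nonce(
    fetch_pending_nonce: Callable[[], int],
    send_tx: Callable[[int], None],
    count: int,
    wait_seconds: float = 0.1,
    max_attempts: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Send `count` txs, skipping iterations where the pending nonce is stale."""
    used: List[int] = []
    attempts = 0
    while len(used) < count:
        attempts += 1
        if attempts > max_attempts:
            raise TimeoutError("pending nonce did not advance in time")
        nonce = fetch_pending_nonce()
        if used and nonce <= used[-1]:
            # The txpool has not updated its nonce yet: wait, then retry.
            sleep(wait_seconds)
            continue
        send_tx(nonce)
        used.append(nonce)
    return used


# Usage with a fake endpoint that repeats values, as the real one may.
fake_values = iter([0, 0, 1, 1, 1, 2])
sent: List[int] = []
send_with_fresh_nonce(lambda: next(fake_values), sent.append, count=3,
                      sleep=lambda _seconds: None)
print(sent)
```

Injecting `sleep` and the two callbacks keeps the loop testable without a running node; the `max_attempts` bound is an added safeguard so a stuck pool does not hang the script forever.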