leader panic on restart (LastIdNotFound) #1171

Closed
rob-solana opened this issue Sep 10, 2018 · 5 comments

rob-solana commented Sep 10, 2018

https://metrics.solana.com:3000/d/testnet/testnet-hud?orgId=2&var-testnet=testnet-master&from=1536591929833&to=1536599129833

Indications are that the leader OOM'd, then panicked trying to read the ledger that was left after the crash.

Note the OOM event at 9:26 and subsequent panics every ~10min thereafter while reading the ledger.

@rob-solana rob-solana added this to the v0.8 Windansea milestone Sep 10, 2018
@rob-solana

cc #1164

rob-solana commented Sep 14, 2018

I have a "short" (87k entries) ledger that fails to verify with the error LastIdNotFound. I've used ledger-tool to dig around in it. The ledger format is uncorrupted: the index and the data file agree, but the bank is unhappy.

verify failed at entry[65826], err: LastIdNotFound(3WPtcXNYsTUwWYenZ3Ge2awmqUhP6QT6mHowRAz7mgm7)

The last_id that's "not found" (let's call it 3WPtc) first appears in the ledger at index 49,199. The first use of 3WPtc as a last_id comes at ledger index 51,282.

The entry that fails to verify is not the last entry with 3WPtc listed as last_id; there is just one more. I thought there might be some significance to the "off by 2" this represents, but in another (larger) ledger with the same issue, the number of entries that would fail to verify is much larger (586 entries).

Immediately following the last entry to use 3WPtc, there are 2,872 empty entries. Similarly, in the larger ledger, after the last use of the "bad" last_id, there are also lots of empty entries. #1217 to track.

65828:Entry { num_hashes: 0, id: HAoC4NRuqYBSL3T6xFTaqE5rG8Njae9tu1C3MLdvy57o, transactions: [], has_more: false }

Tidbit: There is a block of 14,006 entries (comprising 168,690 transactions) that use 3WPtc as their last_id.
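
For anyone repeating this kind of digging, the bookkeeping above (where an id first appears, where it is first and last used as a last_id, and how many entries and transactions use it) can be sketched as a single pass over the entries. This is only an illustration, not how ledger-tool is implemented: the `Entry` and `Transaction` types here are simplified, hypothetical stand-ins with string ids, and `trace_last_id` is a made-up helper name.

```rust
// Hypothetical, simplified stand-ins for ledger types (ids are plain strings here).
struct Transaction {
    last_id: String,
}

struct Entry {
    id: String,
    transactions: Vec<Transaction>,
}

impl Entry {
    /// Build an entry whose transactions carry the given last_ids.
    fn new(id: &str, last_ids: &[&str]) -> Self {
        Entry {
            id: id.to_string(),
            transactions: last_ids
                .iter()
                .map(|l| Transaction { last_id: l.to_string() })
                .collect(),
        }
    }
}

/// Summary of where one id shows up in a ledger.
#[derive(Debug, Default, PartialEq)]
struct LastIdTrace {
    defined_at: Option<usize>, // index of the entry whose id equals the target
    first_use: Option<usize>,  // first entry with a transaction using it as last_id
    last_use: Option<usize>,   // last entry with a transaction using it as last_id
    entries_using: usize,
    transactions_using: usize,
}

/// Walk the ledger once and record every sighting of `target`.
fn trace_last_id(ledger: &[Entry], target: &str) -> LastIdTrace {
    let mut trace = LastIdTrace::default();
    for (index, entry) in ledger.iter().enumerate() {
        if trace.defined_at.is_none() && entry.id == target {
            trace.defined_at = Some(index);
        }
        let uses = entry
            .transactions
            .iter()
            .filter(|tx| tx.last_id == target)
            .count();
        if uses > 0 {
            trace.first_use.get_or_insert(index);
            trace.last_use = Some(index);
            trace.entries_using += 1;
            trace.transactions_using += uses;
        }
    }
    trace
}
```

Comparing `defined_at` with `last_use` gives the distance, in entries, between an id being registered and its final use, which is the number that matters if the bank only remembers a bounded window of recent ids.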

@rob-solana rob-solana self-assigned this Sep 14, 2018
@rob-solana

When a full node is running, register_last_entry() is called from the record stage, but there may be transactions in flight between the banking stage and record stage that have been verified against a last_id that is about to be pushed out of last_ids by the record stage. When a bank is being initialized from a ledger, register_last_id() is called synchronously.
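
A minimal sketch of the hypothesis in this comment, assuming last_ids behaves like a fixed-size window that evicts its oldest id on each registration. The `LastIds` type, the integer ids, and the capacity values are all hypothetical and chosen for illustration; the real bank code differs. The point is only that a transaction can pass the check on the live path and still fail when the same ledger is replayed synchronously.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the bank's bounded window of recent entry ids.
struct LastIds {
    window: VecDeque<u64>,
    capacity: usize,
}

impl LastIds {
    fn new(capacity: usize) -> Self {
        LastIds {
            window: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Register a new entry id, evicting the oldest once the window is full.
    fn register(&mut self, id: u64) {
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(id);
    }

    /// A transaction is accepted only if its last_id is still in the window.
    fn check(&self, last_id: u64) -> Result<(), String> {
        if self.window.contains(&last_id) {
            Ok(())
        } else {
            Err(format!("LastIdNotFound({})", last_id))
        }
    }
}

/// Returns (live result, replay result) for one transaction that uses id 0.
/// Live: the transaction is checked first, then `in_flight` further ids are
/// registered before it lands in the ledger. Replay: the ledger order is
/// followed, so those ids are registered before the transaction is checked.
fn live_then_replay(capacity: usize, in_flight: u64) -> (Result<(), String>, Result<(), String>) {
    let mut live = LastIds::new(capacity);
    live.register(0);
    let live_result = live.check(0); // checked while id 0 is still in the window
    for id in 1..=in_flight {
        live.register(id); // ids registered while the transaction is in flight
    }

    let mut replay = LastIds::new(capacity);
    for id in 0..=in_flight {
        replay.register(id); // synchronous registration, in ledger order
    }
    let replay_result = replay.check(0);

    (live_result, replay_result)
}
```

With a window of 4 and 4 ids registered while the transaction is in flight, the live check succeeds but the replay check reports LastIdNotFound; with only 2 ids in flight, both succeed.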

@rob-solana rob-solana changed the title leader panic on restart (OOM?) leader panic on restart (LastIdNotFound) Sep 14, 2018
@garious garious modified the milestones: v0.8 Windansea, v0.9 Swamis Sep 14, 2018
@garious garious assigned garious and unassigned rob-solana Sep 17, 2018
rob-solana added a commit that referenced this issue Sep 20, 2018
…er to hashes (#1281)

step one of lastidnotfound

* record_stage->record_service, trim recorder to hashes
* doc updates, hash multiple without alloc()

cc #1171
rob-solana added a commit to rob-solana/solana that referenced this issue Sep 21, 2018
rewrite entry_next_hash in terms of Poh
simplify and unify transaction hashing (no embedded nulls)
register_last_entry from banking stage, fixes solana-labs#1171
rob-solana added a commit to rob-solana/solana that referenced this issue Sep 22, 2018
rewrite entry_next_hash in terms of Poh
simplify and unify transaction hashing (no embedded nulls)
register_last_entry from banking stage, fixes solana-labs#1171
@rob-solana

still an issue at ca96237

STR, using multinode-demo:

setup
start drone and leader
run the client (90 seconds)
start the validator -> boom

@rob-solana rob-solana reopened this Sep 26, 2018
@rob-solana

nevermind, red herring. issue was with write_stage(), which was reversing entry vectors before writing them

#1366
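
The thread does not spell out how reversed entry vectors lead to LastIdNotFound, so the following is only one plausible illustration, under the assumption that replay checks each transaction's last_id against a bounded window of recently registered entry ids. The `Entry` type, `replay`, `write_reversed`, and the tiny window size are all hypothetical. If a batch is written back to front, an entry carrying transactions can end up after the empty entries that were meant to follow it, by which point its last_id has been evicted.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified entry: its own id plus the last_ids its transactions use.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    id: u64,
    tx_last_ids: Vec<u64>,
}

/// Replay a ledger against a bounded window of recent ids, seeded with `genesis_id`.
fn replay(entries: &[Entry], capacity: usize, genesis_id: u64) -> Result<(), String> {
    let mut window: VecDeque<u64> = VecDeque::new();
    window.push_back(genesis_id);
    for (index, entry) in entries.iter().enumerate() {
        // Every transaction must name a last_id that is still in the window.
        for last_id in &entry.tx_last_ids {
            if !window.contains(last_id) {
                return Err(format!(
                    "verify failed at entry[{}], err: LastIdNotFound({})",
                    index, last_id
                ));
            }
        }
        // Then the entry's own id is registered, evicting the oldest if full.
        if window.len() == capacity {
            window.pop_front();
        }
        window.push_back(entry.id);
    }
    Ok(())
}

/// One batch: an entry with a transaction, followed by three empty entries.
fn sample_batch() -> Vec<Entry> {
    vec![
        Entry { id: 1, tx_last_ids: vec![0] },
        Entry { id: 2, tx_last_ids: vec![] },
        Entry { id: 3, tx_last_ids: vec![] },
        Entry { id: 4, tx_last_ids: vec![] },
    ]
}

/// Models the bug described above: the batch is reversed before it is written.
fn write_reversed(batch: &[Entry]) -> Vec<Entry> {
    let mut written = batch.to_vec();
    written.reverse();
    written
}
```

Replaying `sample_batch()` in order with a window of 3 succeeds, while replaying the output of `write_reversed` fails at the last entry because id 0 has already been pushed out of the window.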
