[ProjectTracking] Recovery of archival node missing data #11895

Open
18 of 19 tasks
Trisfald opened this issue Aug 6, 2024 · 4 comments
@Trisfald
Contributor

Trisfald commented Aug 6, 2024

Summary

In early 2024, archival nodes failed to persist a subset of data from hot to cold storage. As a consequence, after five epochs this data was deleted from the DB and lost.

The failure had two main root causes: erroneous manual operations on nodes and issues during the two resharding procedures.

Action plan

  • Identify failing queries
  • Develop tooling to re-apply blocks in the past
  • Develop tooling to perform resharding in the past
  • Create a recovery archival node
  • Recovery of data lost around block 109913255
  • Recovery of data lost during first resharding (114580308)
    • shard0.v1
    • shard1.v1
    • shard2.v1
    • shard3.v1
  • Recovery of data lost during second resharding (115185108)
    • shard0.v2
    • shard1.v2
    • shard2.v2
    • shard3.v2
    • shard4.v2
  • Check that failing queries are fixed
  • Finalize recovery instructions
  • Publish recovered DB snapshot
@Trisfald Trisfald self-assigned this Aug 6, 2024
@Trisfald
Contributor Author

Trisfald commented Aug 6, 2024

Problematic heights

Operational issue(s): block 109913255

First resharding: block 114580308

Second resharding: block 115185108

Known failing queries

Height 109913260

JSON query:

curl -X POST https://archival-rpc.mainnet.near.org \
        -H "Content-Type: application/json" \
        -H "Referer: https://beta.rpc.mainnet.near.org" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "b001b461c65aca5968a0afab3302a5387d128178c99ff5b2592796963407560a", "block_id": 109913260, "request_type": "view_account" } }'

Storage query:

./neard view-state -t cold view-trie --shard-id 2 --shard-version 1 --max-depth 1000 --hash 36SkUU8tgetUtVL2a5JPwKB6F29yKBFjF5PFukZ8HRFH --from b001aea591ef68681e59a4149b1ab8bc56d8f22e34be24 --to b001c0de4c6929c5289b65044249830466ffea27680bc1 --format pretty --record-type account

Height 114580308

JSON query:

curl -X POST https://archival-rpc.mainnet.near.org \
        -H "Content-Type: application/json" \
        -H "Referer: https://beta.rpc.mainnet.near.org" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "token2.near", "block_id": 114580308, "request_type": "view_account" } }'

Storage query:

./neard view-state -t cold view-trie --shard-id 4 --shard-version 2 --max-depth 1000 --hash Fe7oLHaqNq5kWnNkDdZatWRY8CRHzBvBKbeACt8JKQsr --from token1.near --to token3.near --format pretty --record-type account

Height 115185110

JSON query:

curl -X POST https://archival-rpc.mainnet.near.org \
        -H "Content-Type: application/json" \
        -H "Referer: https://beta.rpc.mainnet.near.org" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "timpanic.tg", "block_id": 115185110, "request_type": "view_account" } }'

Storage query:

./neard view-state -t cold view-trie --shard-id 5 --shard-version 3 --max-depth 1000 --hash HucDNVVACPC59SQW9SSmfao5tNjqiaFgZ5mvUrW4xVr3  --from timp8b4kqpff.users.kaiching --to timpanium.tg --format pretty --record-type account
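
The three `view_account` JSON queries above differ only in account and block height, so they could be re-run together once recovery is done. A rough sketch follows. The function names (`view_account_body`, `rpc_reply_ok`, `recheck_known_queries`) are made up for illustration, and treating any reply that contains an `"error"` key as a failure is an assumption about how the RPC reports the missing data, not something confirmed in this thread.

```shell
# Hypothetical helper: build the view_account JSON-RPC body used above.
# $1 = account id, $2 = block height
view_account_body() {
  printf '{ "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "%s", "block_id": %s, "request_type": "view_account" } }' "$1" "$2"
}

# Hypothetical helper: succeed only if the reply contains no "error" key.
# Crude string match (assumption); a JSON parser would be more robust.
rpc_reply_ok() {
  ! printf '%s' "$1" | grep -q '"error"'
}

# Re-run the three known failing view_account queries against the archival RPC.
recheck_known_queries() {
  rpc="https://archival-rpc.mainnet.near.org"
  while read -r account height; do
    reply=$(curl -s -X POST "$rpc" \
      -H "Content-Type: application/json" \
      -H "Referer: https://beta.rpc.mainnet.near.org" \
      -d "$(view_account_body "$account" "$height")")
    if rpc_reply_ok "$reply"; then
      echo "ok   $height $account"
    else
      echo "FAIL $height $account"
    fi
  done <<'EOF'
b001b461c65aca5968a0afab3302a5387d128178c99ff5b2592796963407560a 109913260
token2.near 114580308
timpanic.tg 115185110
EOF
}
```

Calling `recheck_known_queries` prints one line per height, which may be enough for the "Check that failing queries are fixed" step; the storage queries would still need to be run separately with `neard`.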

Other failing queries

curl -X POST https://archival-rpc.mainnet.near.org \
        -H "Content-Type: application/json" \
        -H "Referer: https://beta.rpc.mainnet.near.org" \
        -d '
{ "jsonrpc": "2.0", "id": "dontcare", "method": "query", "params": { "request_type": "call_function", "finality": "final", "account_id": "bisontrails.poolv1.near", "block_id": 114580308, "method_name": "get_reward_fee_fraction", "args_base64": "" }}'
curl -X POST https://archival-rpc.mainnet.near.org \
        -H "Content-Type: application/json" \
        -H "Referer: https://beta.rpc.mainnet.near.org" \
        -d '
{ "jsonrpc": "2.0", "id": "dontcare", "method": "query", "params": { "request_type": "call_function", "account_id": "consensus_finoa_00.poolv1.near", "block_id": 120308219, "method_name": "get_account", "args_base64": "eyJhY2NvdW50X2lkIjoicmVzdGFrZS5uZWFyIn0=" }}'
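
For reference, the `args_base64` value in the `get_account` query above is just the base64-encoded JSON arguments of the call. A small sketch to decode it (the function name `decode_args` is made up here, and `base64 -d` assumes a GNU or BusyBox `base64`; other platforms may spell the flag differently):

```shell
# Hypothetical helper: decode an args_base64 value back into its JSON arguments.
decode_args() {
  printf '%s' "$1" | base64 -d
}

decode_args 'eyJhY2NvdW50X2lkIjoicmVzdGFrZS5uZWFyIn0='
# prints {"account_id":"restake.near"}
```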

@walnut-the-cat
Contributor

Aug 30th report:

  • Finished recovery of data lost after 1st resharding
  • Ran sanity checks to verify integrity of tries
  • Finish replaying the second historical resharding to recover the remaining lost data
  • Healthy node is now catching up with mainnet

@walnut-the-cat
Contributor

  • Andrea OOO.
  • The recovery node has been running the recovery job, but the SRE team discovered that updating the binary to 2.2.0 causes an issue. Andrea will investigate once he's back.

@Trisfald
Contributor Author

Done:

  • Recovery of all data
  • Catching up node to mainnet tip
  • Reduced cold DB storage through Tayfun's changes

Ongoing:

  • Recovering cold DB write lag
    • no human action
  • Publishing snapshot
    • to be handled by SRE
