
Node can't sync with the network #493

Closed · 2 tasks done
nazar-pc opened this issue Aug 4, 2023 · 30 comments · Fixed by #1812 or autonomys/subspace#2089

Comments

@nazar-pc (Contributor) commented Aug 4, 2023

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

We have discovered that with a somewhat recent Substrate (319a7bd) nodes sometimes fail to keep up with the network, and even after a restart they are unable to sync.

We have some heavy blocks on the network and we decreased the block response size in our Substrate fork (autonomys/substrate@7fceb6c), but all blocks are strictly smaller than this lower limit anyway, so that is not an issue.

Here are logs, captured with -l sync=trace, of a node that was previously unable to sync:
dariia-node-log.txt

I can see that it tries to download blocks backwards, but stops before reaching the correctly identified common ancestor block 9715.

Steps to reproduce

Requires "broken" state of the node (that I have access to) in order to reliably reproduce the issue.

@nazar-pc (Contributor, Author) commented Aug 4, 2023

Clarification: it does seem to download all the blocks after all and even tries to import them (multiple times), but it fails to find the parent block, which according to the logs it receives without issues.

This happens over and over again:

2023-08-04 15:20:30.261 TRACE tokio-runtime-worker sync::import-queue: [Consensus] Scheduling 984 blocks for import
2023-08-04 15:20:30.261 TRACE tokio-runtime-worker sync::import-queue: [Consensus] Starting import of 984 blocks  (9719..10702)
2023-08-04 15:20:30.261 TRACE tokio-runtime-worker sync::import-queue: [Consensus] Header 0x2339…7b90 has 4 logs
2023-08-04 15:20:30.261 DEBUG tokio-runtime-worker sync::import-queue: [Consensus] Block with unknown parent 9719: 0x2339e5a4f3f81345855649305a789b1c8956504b5b6e657d1d0385aeb0457b90, parent: 0xa7150f5889ee194ea2fb4ce45d22df11658afc4f278da9dbb3c68dfb586efdaf
2023-08-04 15:20:30.268 TRACE tokio-runtime-worker sync: [Consensus] Imported 0 of 984

Explorer (currently live) for this version of the network: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.devnet.subspace.network%2Fws#/explorer

@nazar-pc (Contributor, Author) commented Aug 4, 2023

The parent is block 9718, which also happens to be the best block number the node is at, but from a different fork. Basically the node stopped at a non-canonical block and can't get back onto the canonical chain. The common ancestor is just 3 blocks deep and not finalized.

@altonen (Contributor) commented Aug 4, 2023

Yeah I agree, that is probably the core issue. If you do a sync from genesis with a clean DB does it work then?

Could possibly be related to this paritytech/substrate#12830

@nazar-pc (Contributor, Author) commented Aug 4, 2023

Yeah I agree, that is probably the core issue. If you do a sync from genesis with a clean DB does it work then?

Yes, just synced a fresh node from genesis successfully. There are also multiple nodes that were live all the time and work fine, but logs are from one of the two nodes so far that got stuck at different heights.

Could possibly be related to this paritytech/substrate#12830

Yes, I also saw similar logs in paritytech/substrate#13295 which might be related/have a similar root cause.

@altonen (Contributor) commented Aug 4, 2023

@dmitry-markin

Before you continue with the sync refactoring, could you take a look at this problem?

@nazar-pc (Contributor, Author) commented Aug 8, 2023

Hey folks, how high is it on your priority list?
I'll have to look into this myself if it is not resolved in the next few days; it happens surprisingly often on our chain for some reason.
Just want to manage expectations on our end better, I understand you might have more pressing things to work on.

@altonen (Contributor) commented Aug 8, 2023

@dmitry-markin is looking into it. I'll give it a go as well if there is no definite answer soonish, but there are other quite high-priority tasks in the pipeline as well.

Did a bisect reveal paritytech/substrate@319a7bd to be the offending commit? It sounds a little suspicious. If this started failing for you recently, have you had a chance to bisect the history of Subspace, in case there is a commit in your history that triggers this behavior in Substrate?

How easy is this to reproduce? Just running a Subspace node with default settings will do?

@nazar-pc (Contributor, Author) commented Aug 8, 2023

It happens eventually, likely related to restarts (at least once it happened on a node that was working fine prior to restart). I'm not yet sure how to trigger it reliably, but it does happen quite often given the tiny size of this test network. We have been doing many major changes lately, so bisecting will be trickier than debugging directly with a local Substrate fork.

I have uploaded zip archive with broken node using which I was generating above logs: https://nextcloud.mokrynskyi.com/s/qCoWgZ6Td49xwHf

It should be 100% reproducible with the above database and https://github.com/subspace/subspace/releases/tag/snapshot-2023-aug-07 when running like this:

subspace-node --base-path xyz --chain devnet --execution wasm --validator --blocks-pruning 256 --state-pruning archive --sync-from-dsn false

The node is already at a certain height and will never move any higher than that; it is stuck. The above release uses our Substrate fork at https://github.com/subspace/substrate/tree/subspace-v5, and --sync-from-dsn false should disable most of the relevant customizations we made to it.

@dmitry-markin (Contributor)

To share my findings so far — I was able to reproduce in tests what looks quite similar to the problem described in paritytech/substrate#12830. If it is the underlying issue, we should now at least have an easy way to reproduce and debug it.

@dmitry-markin (Contributor)

P.S. Here is the failing test if you want to look into it: paritytech/substrate#14735.

@dmitry-markin (Contributor)

It's not completely clear yet, but maybe what's happening is a weird combination of syncing logic and peer disconnection due to the reputation system. At least with your Subspace snapshot/database it looks like this (copying only the relevant lines):

  1. A remote peer is found to have a different fork with common block 9715 and is requested to provide 64 blocks:
2023-08-09 17:17:12.774 TRACE tokio-runtime-worker sync: [Consensus] BlockResponse 0 from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN with 1 blocks  (9715)    
2023-08-09 17:17:12.774 TRACE tokio-runtime-worker sync: [Consensus] Got ancestry block paritytech/substrate#9715 (0xdce2…a1ea) from peer 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN    
2023-08-09 17:17:12.774 TRACE tokio-runtime-worker sync: [Consensus] Ancestry search complete. Ours=0x9aaa…a206 (9718), Theirs=0xf914…1cf2 (52738), Common=Some(0xdce23caa41389cefbc0883ac3d2f060bda389d89c4b6e6eb0558984bcdc0a1ea) (9715)    
2023-08-09 17:17:12.858 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN, (best:52738, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9779), direction: Descending, max: Some(64) }
  2. All the blocks probably do not fit into one response, so a smaller number is downloaded in chunks:
2023-08-09 17:17:13.426 TRACE tokio-runtime-worker sync: [Consensus] BlockResponse 0 from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN with 39 blocks  (9779..9741)
2023-08-09 17:17:13.434 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN, (best:52738, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9740), direction: Descending, max: Some(25) }
2023-08-09 17:17:14.398 TRACE tokio-runtime-worker sync: [Consensus] BlockResponse 0 from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN with 1 blocks  (9740)
2023-08-09 17:17:14.419 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN, (best:52738, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9739), direction: Descending, max: Some(24) }
2023-08-09 17:17:14.826 TRACE tokio-runtime-worker sync: [Consensus] BlockResponse 0 from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN with 1 blocks  (9739)
...
2023-08-09 17:17:24.260 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN, (best:52738, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9728), direction: Descending, max: Some(13) }
2023-08-09 17:17:25.208 TRACE tokio-runtime-worker sync: [Consensus] BlockResponse 0 from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN with 2 blocks  (9728..9727)
2023-08-09 17:17:25.424 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN, (best:52738, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9726), direction: Descending, max: Some(11) }
  3. At this point, because the remote peer is also requesting blocks from us, the reputation system decides to disconnect it:
2023-08-09 17:17:25.889 DEBUG tokio-runtime-worker sync: [Consensus] Handling block request from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN: Starting at `BlockId::Hash(0x9aaa72d9718d997c6894cf3e62c1e48620e3f14db7ad2cc25807814f9285a206)` with maximum blocks of `128`, reputation_change: `Some(ReputationChange { value: -2147483648, reason: "Same block request multiple times" })`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2023-08-09 17:17:25.889 DEBUG tokio-runtime-worker sync: [Consensus] Sending result of block request from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN starting at `BlockId::Hash(0x9aaa72d9718d997c6894cf3e62c1e48620e3f14db7ad2cc25807814f9285a206)`: blocks: None, data: None    
2023-08-09 17:17:25.889 DEBUG tokio-runtime-worker sync: [Consensus] Handled block request from 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN.    
2023-08-09 17:17:25.890  WARN tokio-runtime-worker peerset: [Consensus] Ignoring request to disconnect reserved peer 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN from SetId(2).    
2023-08-09 17:17:25.890 DEBUG tokio-runtime-worker sync: [Consensus] 12D3KooWSrQ1Qq4TKbAw1ZeSUwUc84SH32GgCag1JrUJLvpYsvoN disconnected

And we never finish filling the gap.

What I also find quite annoying is that the block chunks are requested by block number and not by block hash — while this should not cause issues with downloading the canonical chain, it can interfere with downloading forks.
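
To make that ambiguity concrete, here is a minimal sketch with mock types (not the actual sc-network-sync definitions): a range addressed by number can be satisfied by any fork at that height, while a hash pins a specific block and therefore a specific fork.

// Mock types for illustration only — field names and shapes are assumptions,
// not the real request types from sc-network-sync.
#[derive(Debug)]
enum FromBlock {
    Number(u64),    // ambiguous across forks: any block at this height matches
    Hash([u8; 32]), // unambiguous: identifies one block and, transitively, one fork
}

#[derive(Debug)]
struct BlockRequest {
    from: FromBlock,
    descending: bool, // walk from `from` towards the common ancestor
    max: u32,
}

fn main() {
    // What the log above shows: a descending range request addressed by number.
    let by_number = BlockRequest { from: FromBlock::Number(9726), descending: true, max: 11 };
    // What would pin the fork: start from the hash of the last received block's parent.
    let by_hash = BlockRequest { from: FromBlock::Hash([0u8; 32]), descending: true, max: 11 };
    println!("{:?}\n{:?}", by_number, by_hash);
}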

Also, it looks like the issue is not directly related to paritytech/substrate#12830, because in that case the problem was connected with the inability to fill fork gaps > 128 blocks with one request, and in our case we have multiple successful "gap" requests.

What I would try doing next is disabling "Same block request multiple times" reputation punishment to see if it cuts the tie.

@nazar-pc By the way, how do you patch subspace to build against local substrate sources? I tried naive patching with diener but hit the issue of building against multiple versions of substrate crates.

@nazar-pc (Contributor, Author) commented Aug 10, 2023

@dmitry-markin diener is what I use as well, just need to update Cargo.toml and replace https://github.com/paritytech/substrate with https://github.com/subspace/substrate.

@dmitry-markin (Contributor)

It turned out the trick was to patch both https://github.com/paritytech/substrate & https://github.com/subspace/substrate. Most likely I was sidetracked by the other issue reported in the previous comment, as other peers are not disconnected and could fulfill block requests.

It looks like the block download continues normally in a descending manner towards block 9715 at one block per response. Unfortunately, as soon as the client downloads block 9719, the sync state machine decides it's ready to import blocks (as it already has 9718) and tries to import all downloaded blocks. However, they are on a different fork, so the import fails, the blocks are discarded, and the process is restarted.

So, IMO, the problem is that block import is triggered by block number (incorrectly, since the parent is different) instead of by block hash with respect to the parent hash. At the same time, the block download request is seemingly correct; the block import is just triggered too soon, before all the blocks in the range are downloaded. What likely makes the issue manifest here is that the blocks are big in this part of the chain and only one block fits per response, so the range is not fulfilled completely by a single response.
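
A minimal sketch of the missing guard, with hypothetical types (this is not the actual ChainSync code): before handing a contiguous run of downloaded blocks to import, check that the parent hash of the first block is already on our chain; if it is not, we are on a fork and should keep downloading towards the common ancestor instead of importing.

use std::collections::HashSet;

type Hash = [u8; 32];

struct DownloadedBlock {
    parent_hash: Hash,
    number: u64,
}

// Returns true only when the first downloaded block attaches to a block we already
// have; triggering import purely because `first.number == best_queued + 1` is what
// fails in this issue.
fn ready_to_import(downloaded: &[DownloadedBlock], known_blocks: &HashSet<Hash>) -> bool {
    match downloaded.first() {
        Some(first) => known_blocks.contains(&first.parent_hash),
        None => false,
    }
}

fn main() {
    let mut known = HashSet::new();
    known.insert([1u8; 32]); // our local (non-canonical) block 9718

    // Downloaded block 9719 from the canonical fork: its parent is not our 9718.
    let downloaded = vec![DownloadedBlock { parent_hash: [9u8; 32], number: 9719 }];
    assert_eq!(downloaded[0].number, 9719);
    assert!(!ready_to_import(&downloaded, &known)); // keep filling the gap instead of importing
}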

@nazar-pc (Contributor, Author)

Yep, that sounds exactly the way I expected it to be

@dmitry-markin (Contributor)

We clearly have issues with downloading forks. Here is the original range request:

2023-08-11 14:44:54.028 TRACE tokio-runtime-worker sync: [Consensus] Ancestry search complete. Ours=0x9aaa…a206 (9718), Theirs=0x4e1e…4c9b (66055), Common=Some(0xdce23caa41389cefbc0883ac3d2f060bda389d89c4b6e6eb0558984bcdc0a1ea) (9715)    
2023-08-11 14:44:54.041 TRACE tokio-runtime-worker sync: [Consensus] New block request for 12D3KooW9rBrhvZuc8C7ff3sD3sdu9crpZ6hiYDtU2RedWddfjQi, (best:66055, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9779), direction: Descending, max: Some(64) }

There is no message "Added fork target" in between, because the condition
https://github.com/paritytech/substrate/blob/aedfdc977b28a4029ec658dff960a86220f7622b/client/network/sync/src/lib.rs#L912-L913
is false. And even if it were true, the "fork" download would fail due to paritytech/substrate#12830, because the "fork" is longer than 128 blocks.

And due to the logic of downloading the canonical chain, block import is triggered (as I wrote in the previous message) before we actually have all the blocks from the range.

@nazar-pc I will look into the issue more today, but I'm afraid there is no quick fix available. Unfortunately, I'm not available next week, so we'll either need to postpone the fix, or I'll need to ask you to look into it.

@nazar-pc (Contributor, Author)

Understandable, thanks for all the help so far, I appreciate it!

We have an alternative sync in our protocol, so the node will get un-stuck eventually (though it might take a long time). I'll post an update here if I get to fixing this, now that we have a better understanding of why it happens.

@rahulksnv (Contributor)

I have a unit test that reproduces the issue. To summarize the problem:

  1. Node is at 9718 on a non-canonical chain, comes back online.
  2. Other nodes are on the canonical chain, with 9715 being the common ancestor
  3. Node tries to sync from the other nodes (in chunks of up to 64 at a time).
  4. It so happens that one of the peers was requested [9716 .. 9719], but only returned 9719, possibly because the block was too big:
New block request for 12D3KooWGpmCSuWK9nCAc3ZiZJ5wZKnk38A4huEunf1NmwTHoG2n, (best:10699, common:9715) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(9719), direction: Descending, max: Some(4) }
BlockResponse 0 from 12D3KooWGpmCSuWK9nCAc3ZiZJ5wZKnk38A4huEunf1NmwTHoG2n with 1 blocks  (9719)
  5. At this point the node has downloaded blocks [9719 .. 10699]. Since these downloads don't happen on one of the fork targets, it concludes the chain can be extended, tries importing, and fails. This is because even though the block numbers line up, the blocks are on different chains, i.e. the parent of the downloaded 9719 != the local (non-canonical) 9718.

There are several ways to address this. But it seems the issue arises because this is not treated as a fork target upon completion of the ancestry search (because of the condition: peer.best_number < self.best_queued_number). The fork target download path handles exactly this scenario, where the common ancestor would be used as the base to get ready blocks. The check in question is sketched below.
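
Roughly, with made-up types (a sketch of the branch being discussed, not the actual ChainSync code): after the ancestry search, the peer is only recorded as a fork target when its best block is behind our best queued number. In this issue the peer's best (52738) is far ahead of ours (9718), so the fork-target path is skipped and the ordinary number-based range download kicks in instead.

struct PeerSync { best_number: u64 }
struct ChainSync { best_queued_number: u64 }

enum AncestrySearchOutcome {
    AddForkTarget,     // download the short fork explicitly, anchored by hash
    ContinueRangeSync, // fall through to the number-based range download
}

fn on_ancestry_search_complete(chain: &ChainSync, peer: &PeerSync) -> AncestrySearchOutcome {
    if peer.best_number < chain.best_queued_number {
        AncestrySearchOutcome::AddForkTarget
    } else {
        AncestrySearchOutcome::ContinueRangeSync
    }
}

fn main() {
    let chain = ChainSync { best_queued_number: 9718 };
    let peer = PeerSync { best_number: 52738 };
    // With the numbers from this issue, no fork target is added.
    assert!(matches!(
        on_ancestry_search_complete(&chain, &peer),
        AncestrySearchOutcome::ContinueRangeSync
    ));
}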

Can you please share some context around why this check is enforced? Is this to use the fork target path only for smaller branches?

@dmitry-markin (Contributor)

To be honest, I don't know the reason behind this check. Did you try removing it and checking if it helps to unstall the syncing?

rahulksnv added a commit to rahulksnv/polkadot-sdk that referenced this issue Sep 24, 2023
Context: paritytech#493 (comment)

Problem: When downloading from a fork, the peer may send a partial response
that extends the current chain. Even though the response starts at a block
number after the current best number, the parent block from the fork is not
yet downloaded/imported. We currently try to import the partial response
in this scenario, which causes import to fail as parent is not found in the
backend. This causes sync to be aborted/restarted which may not converge.

Fix: when we receive a partial response in this scenario, hold on to the
partial blocks and try to download the missing blocks. And import when the
full range is downloaded.
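
In other words (a rough sketch of the approach with hypothetical types, not the actual patch): partial responses for a fork range are buffered and only handed to the import queue once the whole requested range has been received.

use std::collections::BTreeMap;

struct PendingRange {
    start: u64,
    end: u64,
    // number -> encoded block; a placeholder payload stands in for the real body
    blocks: BTreeMap<u64, Vec<u8>>,
}

impl PendingRange {
    fn insert(&mut self, number: u64, block: Vec<u8>) {
        self.blocks.insert(number, block);
    }

    // Import only when every number in [start..=end] has arrived; otherwise keep
    // requesting the missing blocks instead of importing a partial fork.
    fn ready_for_import(&self) -> bool {
        (self.start..=self.end).all(|n| self.blocks.contains_key(&n))
    }
}

fn main() {
    let mut range = PendingRange { start: 9716, end: 9719, blocks: BTreeMap::new() };
    range.insert(9719, vec![]); // only the top block arrived (the response was size-limited)
    assert!(!range.ready_for_import()); // hold on to it and fetch 9716..=9718 first
}
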
@rahulksnv (Contributor)

I have #1686; please take a look.

I went down the path of trying to reuse the fork targets, but discovered it may not be suitable, as it does not handle cases like partial responses and would require more extensive changes.

@dmitry-markin (Contributor) commented Oct 7, 2023

@arkpar To supplement our discussion, this is the branch that is triggered when selecting which blocks to download in block_requests():
https://github.com/paritytech/substrate/blob/aedfdc977b28a4029ec658dff960a86220f7622b/client/network/sync/src/lib.rs#L2398
and if we look at the implementation of peer_block_request(), the blocks are downloaded in the descending direction:
https://github.com/paritytech/substrate/blob/aedfdc977b28a4029ec658dff960a86220f7622b/client/network/sync/src/lib.rs#L2940-L2946, and this is one of the reasons why we do not reach the common number (together with blocks being large and essentially being requested one at a time, plus import triggering once we have downloaded "best queued + 1", but not from our fork).

@dmitry-markin (Contributor)

This makes me curious whether just switching the direction of the requested range to ascending would fix the issue without breaking syncing somewhere else. (The obvious issue we might run into is that downloading backwards from a given hash makes it clear which fork we are requesting, while downloading forward says nothing about the fork preference.)

@rahulksnv (Contributor)

I believe I have a simpler fix for the issue: #1812, PTAL

@dmitry-markin (Contributor)

I believe I have a simpler fix for the issue: #1812, PTAL

Cool, looks very nice!

(Switching the range direction in fact broke 5 tests in syncing.)

@rahulksnv (Contributor) commented Oct 24, 2023

I was able to isolate the issue from the logs and have opened #1999, which adds a unit test for the failure scenario.

The basic failure scenario is described in the PR. This also shows up in different ways, for instance:

  1. Node has finished downloading up to block 1000 from the peers, from the canonical chain.
  2. Peers are undergoing a re-org around this time. One of the peers is switching back and forth between the canonical and other chains.
  3. The peer announces 1000 from the canonical chain, but appears to have already announced 1001 from a non-canonical chain. The latter appears out of order, gets downloaded, and ends up in the same situation.

This out-of-order scenario could potentially be mitigated by including timestamps in block announcements, so that older announcements can be ignored.

Overall, the issue is the same one that was addressed by #1812: relying on block numbers instead of hashes. There could be several solutions to this.

@rahulksnv (Contributor)

The obvious fix is to check whether the announced block extends the peer's best block. But there may be legitimate cases that would get filtered out by this check. Also, if the announcement is for a block that is not immediately after the current best block, this check is not possible.

@dmitry-markin (Contributor)

@rahulksnv thanks for the investigation. But why does the import work correctly without the check for the parent block being on chain? Shouldn't it fail, as we are importing only part of the fork?

@rahulksnv (Contributor)

@rahulksnv thanks for the investigation. But why does the import work correctly without the check for the parent block being on chain? Shouldn't it fail, as we are importing only part of the fork?

We don't know yet if the import works (we haven't deployed it after #1812 was rolled back). But I guess this is what would happen: the import would fail and the sync would get reset, as before #1812. Then, depending on which peers the blocks are downloaded from next time around, import of the whole chain may complete successfully.

#1812 was basically preventing the import from happening in the first place. We had a head-of-line blocking kind of scenario: 1001 was from the wrong fork, and we got stuck as the peer's common_number also got advanced. If the common_number had not been advanced, we would have tried to download from the other fork and made progress.

@rahulksnv (Contributor)

We don't know yet if the import works (we haven't deployed it after #1812 was rolled back). But I guess this is what would happen: the import would fail and the sync would get reset, as before #1812. Then, depending on which peers the blocks are downloaded from next time around, import of the whole chain may complete successfully.

Can confirm, it happened a few times after #1812 was rolled back as well. So it ends up discarding blocks and restarting the sync, and eventually somehow completes. Here is an instance where it ended up with block 20715 from one branch and 20716 from a different branch after a re-org. There is a burst of import failures and sync restarts:

 2023-10-25T17:15:13.202908Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20716: 0x006a0a1f03368799a4c9a0679dd4f9167f4d1f6077b7e855df04d5d284735836, parent: 0x03e4495c9ee2eb4e916cf9aee8df
e09744032961b48eaf42e2e988658148413e    
2023-10-25T17:15:13.202950Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20717: 0x41331bc3c2f6fc3a23b16b3663e45d59238079f874da9f16b8274629547bd4a7, parent: 0x006a0a1f03368799a4c9a0679dd4
f9167f4d1f6077b7e855df04d5d284735836    
2023-10-25T17:15:13.202981Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x006a0a1f03368799a4c9a0679dd4f9167f4d1f6077b7e855df04d5d284735836: block has an unknown parent    
2023-10-25T17:15:13.202991Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.202991Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20718: 0xf8ca81226814b9cfa1c337cfd8bdb07f065950162a69873f85d89e5061f344a3, parent: 0x41331bc3c2f6fc3a23b16b3663e4
5d59238079f874da9f16b8274629547bd4a7    
2023-10-25T17:15:13.202993Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203026Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20719: 0x532816958384984b3760f23c3ba7c3aca8ea60ee7a756b112c0b603864917f2c, parent: 0xf8ca81226814b9cfa1c337cfd8bd
b07f065950162a69873f85d89e5061f344a3    
2023-10-25T17:15:13.203029Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x41331bc3c2f6fc3a23b16b3663e45d59238079f874da9f16b8274629547bd4a7: block has an unknown parent    
2023-10-25T17:15:13.203031Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203032Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203052Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0xf8ca81226814b9cfa1c337cfd8bdb07f065950162a69873f85d89e5061f344a3: block has an unknown parent    
2023-10-25T17:15:13.203054Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203055Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203075Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x532816958384984b3760f23c3ba7c3aca8ea60ee7a756b112c0b603864917f2c: block has an unknown parent    
2023-10-25T17:15:13.203076Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203078Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203104Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20720: 0xefb53cbb30a4793670f8cf72cbac864cfe7d2ac57513c3de6fbbe0379bc3870a, parent: 0x532816958384984b3760f23c3ba7
c3aca8ea60ee7a756b112c0b603864917f2c    
2023-10-25T17:15:13.203138Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20721: 0x47c5a8395968ff801544aec70044eeaaa7cdaa6c6555c54d72172eed990569b9, parent: 0xefb53cbb30a4793670f8cf72cbac
864cfe7d2ac57513c3de6fbbe0379bc3870a    
2023-10-25T17:15:13.203177Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0xefb53cbb30a4793670f8cf72cbac864cfe7d2ac57513c3de6fbbe0379bc3870a: block has an unknown parent    
2023-10-25T17:15:13.203182Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203183Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203191Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20722: 0xcbf931749c9bd848e7b615d30c6bb61d381a50d95477fbbec38b89303f7a17ee, parent: 0x47c5a8395968ff801544aec70044
eeaaa7cdaa6c6555c54d72172eed990569b9    
2023-10-25T17:15:13.203229Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20723: 0x2b0cce001c84db17391811593d641e4f21f6d76e600127f580aebc2fdd1d3021, parent: 0xcbf931749c9bd848e7b615d30c6b
b61d381a50d95477fbbec38b89303f7a17ee    
2023-10-25T17:15:13.203248Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x47c5a8395968ff801544aec70044eeaaa7cdaa6c6555c54d72172eed990569b9: block has an unknown parent    
2023-10-25T17:15:13.203252Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203254Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203277Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0xcbf931749c9bd848e7b615d30c6bb61d381a50d95477fbbec38b89303f7a17ee: block has an unknown parent    
2023-10-25T17:15:13.203279Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203281Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203283Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20724: 0x5ef789377e5309c4e3b0190981d0ed4d9a452458cee6fd0a4554084e5dba5e27, parent: 0x2b0cce001c84db17391811593d64
1e4f21f6d76e600127f580aebc2fdd1d3021    
2023-10-25T17:15:13.203306Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x2b0cce001c84db17391811593d641e4f21f6d76e600127f580aebc2fdd1d3021: block has an unknown parent    
2023-10-25T17:15:13.203309Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203310Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203335Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20725: 0x3449fec8e400095b4ef0dc48332dfdafbe38a3d44b055c63c083d1c833e921fa, parent: 0x5ef789377e5309c4e3b0190981d0
ed4d9a452458cee6fd0a4554084e5dba5e27    
2023-10-25T17:15:13.203342Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x5ef789377e5309c4e3b0190981d0ed4d9a452458cee6fd0a4554084e5dba5e27: block has an unknown parent    
2023-10-25T17:15:13.203345Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203346Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203370Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x3449fec8e400095b4ef0dc48332dfdafbe38a3d44b055c63c083d1c833e921fa: block has an unknown parent    
2023-10-25T17:15:13.203372Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203374Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203393Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20726: 0x2e13a30aa9e2307bfb22e5236d1f7cd0a2719f82aa36598125abad9e812aabfe, parent: 0x3449fec8e400095b4ef0dc48332d
fdafbe38a3d44b055c63c083d1c833e921fa    
2023-10-25T17:15:13.203454Z DEBUG tokio-runtime-worker sync::import-queue: Block with unknown parent 20727: 0x86777e6dc691163b8d3b78a7df4484546f014d8467e281a4108a4848ee284c14, parent: 0x2e13a30aa9e2307bfb22e5236d1f
7cd0a2719f82aa36598125abad9e812aabfe    
2023-10-25T17:15:13.203481Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x2e13a30aa9e2307bfb22e5236d1f7cd0a2719f82aa36598125abad9e812aabfe: block has an unknown parent    
2023-10-25T17:15:13.203484Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203485Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)    
2023-10-25T17:15:13.203506Z  WARN tokio-runtime-worker sync: [Consensus] 💔 Error importing block 0x86777e6dc691163b8d3b78a7df4484546f014d8467e281a4108a4848ee284c14: block has an unknown parent    
2023-10-25T17:15:13.203508Z TRACE tokio-runtime-worker sync: [Consensus] Restarted sync at #20715 (0x2b8d2f7eb5ebb88eb6ca338d90b0fa40628c046dfe281d4fae77cf9a822099be)    
2023-10-25T17:15:13.203509Z DEBUG tokio-runtime-worker sync: [Consensus] Restarted with 20715 (0x2b8d…99be)

@rahulksnv (Contributor)

#1999 adds a unit test to demonstrate the issue.

dmitry-markin added a commit that referenced this issue Oct 31, 2023
The change adds a test to show the failure scenario that caused #1812 to
be rolled back (more context:
#493 (comment))

Summary of the scenario:
1. Node has finished downloading up to block 1000 from the peers, from
the canonical chain.
2. Peers are undergoing re-org around this time. One of the peers has
switched to a non-canonical chain, announces block 1001 from that chain
3. Node downloads 1001 from the peer, and tries to import which would
fail (as we don't have the parent block 1000 from the other chain)

---------

Co-authored-by: Dmitry Markin <[email protected]>
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
When retrieving the ready blocks, verify that the parent of the first
ready block is on chain. If the parent is not on chain, we are
downloading from a fork. In this case, keep downloading until we have a
parent on chain (common ancestor).

Resolves paritytech#493.

---------

Co-authored-by: Aaro Altonen <[email protected]>