Repaired shreds may not trigger erasure recovery #7450

pgarg66 · 2019-12-12T18:47:20Z

Problem

The repair protocol serves only data shreds: nodes request repair of data shreds only, and coding shreds are never sent in response to repair requests. Data shreds do not carry their FEC block identity in their headers, so the receiver of a repaired data shred cannot tell whether enough shreds (data + code) have been received for the FEC block to trigger a recovery.

For example,

The FEC block is 32 data + 32 code shreds.
The node received N data and M code shreds through turbine, such that N + M < 32. Erasure recovery cannot be triggered at this point.
The node requests repair of a subset of the 32 - N missing data shreds.
The node receives X more data shreds through repair, such that N + M + X >= 32. This should be enough to recover the remaining (32 - N - X) data shreds.
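
The threshold in this example can be written as a small check. This is only a sketch: `FecBlockStatus` and its fields are hypothetical names chosen for illustration and are not blocktree types.

```rust
/// Hypothetical tally of the shreds a node holds for one FEC block.
#[derive(Debug, Clone, Copy)]
struct FecBlockStatus {
    /// Number of data shreds in the block (32 in the example above).
    num_data: usize,
    /// Data shreds received so far, via turbine or repair (N + X).
    data_received: usize,
    /// Coding shreds received so far, via turbine only (M).
    code_received: usize,
}

impl FecBlockStatus {
    /// Recovery is possible once data + code shreds reach `num_data`.
    fn can_recover(&self) -> bool {
        self.data_received + self.code_received >= self.num_data
    }

    /// Data shreds that a recovery would still have to reconstruct.
    fn missing_data(&self) -> usize {
        self.num_data - self.data_received
    }
}
```

With N = 20 and M = 8, `can_recover()` is false. After X = 4 repaired data shreds arrive it becomes true, and `missing_data()` reports the 8 data shreds left to reconstruct.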

However, a data shred generally cannot identify its FEC block, so blocktree does not load the corresponding erasure meta, and recovery is not triggered.

The issue sounds more critical than it is in practice. Generally, the repairs of all 32 - N shreds arrive at around the same time, and blocktree deems the FEC block complete (so no recovery is needed). Also, if all the missing data shreds are already being repaired, it may at times be better to process the received (repaired) shreds than to read the FEC block's other shreds from RocksDB and trigger a recovery.

Proposed Solution

1. Repair code shreds
2. Let repair continue to handle this case
3. Include FEC block identity in data shreds, and load erasure meta for data shreds as well

The 1st approach would not always work, because we no longer store unnecessary code shreds in RocksDB. The 2nd approach works, but it leaves some optimization on the table. The 3rd approach is ideal, but requires changes to the shred structure and end-to-end testing. Streaming data shreds and generating the codes at a later time will also bring complications (albeit solvable ones) to the 3rd approach.
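
To make the 3rd approach concrete, here is a rough Rust sketch of how a data shred that carries its FEC block identity could reach the per-block bookkeeping. Everything in it is an assumption made for illustration: the `fec_set_index` field, the `ErasureIndex` type, and the simplified `ErasureMeta` are hypothetical and do not reflect the actual shred layout or blocktree API. It also ignores duplicate shreds.

```rust
use std::collections::HashMap;

/// Hypothetical data shred header that carries its FEC block identity.
struct DataShredHeader {
    slot: u64,
    /// Assumed new field identifying the FEC block this shred belongs to.
    fec_set_index: u32,
}

/// Simplified stand-in for the per-FEC-block erasure meta.
#[derive(Default)]
struct ErasureMeta {
    /// Data shreds in the block; 0 until a coding shred reports it.
    num_data: usize,
    data_received: usize,
    code_received: usize,
}

impl ErasureMeta {
    /// True when the block is incomplete but holds enough shreds to recover.
    fn should_recover(&self) -> bool {
        self.num_data > 0
            && self.data_received < self.num_data
            && self.data_received + self.code_received >= self.num_data
    }
}

/// Hypothetical index of erasure metas keyed by (slot, fec_set_index).
#[derive(Default)]
struct ErasureIndex {
    metas: HashMap<(u64, u32), ErasureMeta>,
}

impl ErasureIndex {
    /// Records a coding shred, which is assumed to report the block's data count.
    fn insert_code(&mut self, slot: u64, fec_set_index: u32, num_data: usize) -> bool {
        let meta = self.metas.entry((slot, fec_set_index)).or_default();
        meta.num_data = num_data;
        meta.code_received += 1;
        meta.should_recover()
    }

    /// Records a data shred (from turbine or repair) and reports whether
    /// a recovery should now be triggered for its FEC block.
    fn insert_data(&mut self, header: &DataShredHeader) -> bool {
        let meta = self
            .metas
            .entry((header.slot, header.fec_set_index))
            .or_default();
        meta.data_received += 1;
        meta.should_recover()
    }
}
```

In this sketch a repaired data shred takes the same `insert_data` path as one received through turbine, so the shred that brings the block to the threshold is the one that triggers recovery.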

Tag: @aeyakovenko @carllin @sagar-solana

sagar-solana · 2019-12-12T19:43:33Z

I vote for 2. Repair is the fallback in our user-space windowing, so it makes sense to let it handle this.

This was referenced Dec 12, 2019

Allow coding shred index to be different than data shred index #7438

Merged

Perform erasure recovery when repaired data shreds are received #7463

Merged

solana-grimes closed this as completed in #7463 Dec 13, 2019
