Repaired shreds may not trigger erasure recovery #7450

Closed
pgarg66 opened this issue Dec 12, 2019 · 1 comment · Fixed by #7463

Comments

pgarg66 commented Dec 12, 2019

Problem

The repair protocol serves only data shreds: nodes request repair only for data shreds, and coding shreds are never sent in response to repair requests. Data shreds do not carry FEC block identity in their headers, so the receiver of a repaired data shred cannot tell whether enough shreds (data + code) have been received for the FEC block to trigger a recovery.

For example,

  1. FEC block is 32 data + 32 code shreds
  2. Node received N data, and M code shreds through turbine, such that N + M < 32. Erasure recovery cannot be triggered at this point.
  3. The node requests repair of a subset of the 32 - N missing data shreds.
  4. The node receives some more data shreds (X) due to repairs, such that N + M + X >= 32. This should be enough to recover the remaining (32 - X - N) data shreds.

However, a data shred generally cannot be mapped to its FEC block, so blocktree does not load the erasure meta for it, and recovery is not triggered.
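The counting in the example above can be sketched as follows. This is only an illustration of the arithmetic, not blocktree code: the `FecBlockCounts` and `FecBlockStatus` types and the `classify` function are hypothetical names, the 32-shred threshold is taken from the example, and the counts are assumed to cover distinct shreds (no duplicates).

```rust
/// Number of data shreds in one FEC block, as in the example (32 data + 32 code).
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// Hypothetical per-FEC-block tally; not an actual blocktree type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FecBlockCounts {
    data_from_turbine: usize, // N in the example
    code_from_turbine: usize, // M in the example
    data_from_repair: usize,  // X in the example
}

#[derive(Debug, PartialEq, Eq)]
enum FecBlockStatus {
    /// Every data shred is present; no recovery is needed.
    Complete,
    /// Enough shreds (data + code) to rebuild the missing data shreds.
    Recoverable { missing_data: usize },
    /// Still below the threshold; more shreds are needed.
    NeedsMoreShreds { shortfall: usize },
}

/// Classify an FEC block from its shred counts (assumes distinct shreds).
fn classify(counts: FecBlockCounts) -> FecBlockStatus {
    let data = counts.data_from_turbine + counts.data_from_repair;
    let total = data + counts.code_from_turbine;
    if data >= DATA_SHREDS_PER_FEC_BLOCK {
        FecBlockStatus::Complete
    } else if total >= DATA_SHREDS_PER_FEC_BLOCK {
        FecBlockStatus::Recoverable {
            missing_data: DATA_SHREDS_PER_FEC_BLOCK - data,
        }
    } else {
        FecBlockStatus::NeedsMoreShreds {
            shortfall: DATA_SHREDS_PER_FEC_BLOCK - total,
        }
    }
}
```

With N = 10, M = 12 and X = 10, `classify` reports `Recoverable { missing_data: 12 }`, matching the 32 - X - N shreds from step 4. The problem described in this issue is that the node never gets to run such a check for a repaired data shred, because it cannot tell which FEC block's counts to update.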

The issue sounds more critical than it is in practice. Generally, repairs for all 32 - N missing shreds arrive around the same time, and blocktree then deems the FEC block complete (so no recovery is needed). Also, if all the missing data shreds are already being repaired, it may at times be better to process the received (repaired) shreds than to read the FEC block's other shreds from RocksDB and trigger a recovery.

Proposed Solution

  1. Repair code shreds
  2. Let repair continue to handle this case
  3. Include FEC block identity in data shreds, and load erasure meta for data shreds as well

The 1st approach would not always work, because we no longer store unnecessary code shreds in RocksDB. The 2nd approach works, but leaves some optimization on the table. The 3rd approach is ideal, but requires changes to the shred structure and end-to-end testing. Streaming data shreds and generating code shreds at a later time would also complicate the 3rd approach (albeit solvably).
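As a rough illustration of the 3rd approach, the sketch below assumes a data shred header that carries a hypothetical `fec_set_index` field, so that a repaired data shred can locate the per-FEC-block bookkeeping. None of these names (`DataShredHeader`, `ErasureMeta`, `record_data_shred`, `record_code_shred`) are taken from the actual shred or blocktree code, the in-memory map stands in for whatever store holds the erasure meta, and each shred is assumed to be recorded only once.

```rust
use std::collections::HashMap;

/// Data shreds per FEC block, as in the example (32 data + 32 code).
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// Hypothetical data shred header that carries its FEC block identity.
#[derive(Debug, Clone, Copy)]
struct DataShredHeader {
    slot: u64,
    fec_set_index: u32, // assumed field; identifies the FEC block within the slot
}

/// Hypothetical per-FEC-block bookkeeping.
#[derive(Debug, Default)]
struct ErasureMeta {
    data_received: usize,
    code_received: usize,
}

impl ErasureMeta {
    /// True when some data shreds are still missing but enough shreds
    /// (data + code) are present to attempt erasure recovery.
    fn can_recover(&self) -> bool {
        self.data_received < DATA_SHREDS_PER_FEC_BLOCK
            && self.data_received + self.code_received >= DATA_SHREDS_PER_FEC_BLOCK
    }
}

/// Key for the erasure meta lookup: (slot, fec_set_index).
type ErasureKey = (u64, u32);

/// Record a code shred received through turbine.
fn record_code_shred(metas: &mut HashMap<ErasureKey, ErasureMeta>, key: ErasureKey) {
    metas.entry(key).or_default().code_received += 1;
}

/// Record a data shred (from turbine or repair) and report whether recovery
/// can now be triggered. The lookup works only because the header names its
/// FEC block.
fn record_data_shred(
    metas: &mut HashMap<ErasureKey, ErasureMeta>,
    header: DataShredHeader,
) -> bool {
    let meta = metas
        .entry((header.slot, header.fec_set_index))
        .or_default();
    meta.data_received += 1;
    meta.can_recover()
}
```

With 12 code shreds already recorded for a block, `record_data_shred` returns `false` for the first 19 data shreds and `true` on the 20th, which is the point where a repaired shred would trigger recovery instead of waiting for the remaining repairs.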

Tag: @aeyakovenko @carllin @sagar-solana

sagar-solana commented Dec 12, 2019

I vote for 2. Repair is the fallback in our user-space windowing so makes sense to let it handle this.
