[CELEBORN-1567] Support throw FetchFailedException when Data corruption detected #2691

Closed
wants to merge 2 commits into from

@cxzl25 cxzl25 commented Aug 16, 2024

What changes were proposed in this pull request?

Why are the changes needed?

#2655 (review)

Does this PR introduce any user-facing change?

No

How was this patch tested?

GA


@mridulm mridulm left a comment

Looks good to me.

Btw, support for checksum/validation of data would be a good feature to add IMO. There were corner cases where this helped catch issues in Spark, instead of relying on compression/deserialization failing, which need not always happen.


@RexXiong RexXiong left a comment

LGTM

@cfmcgrady

I have a question: when the conf celeborn.client.push.replicate.enabled is set to true, should we retry fetching from another replica before throwing a FetchFailedException?

cxzl25 commented Aug 18, 2024

> I have a question: when the conf celeborn.client.push.replicate.enabled is set to true, should we retry fetching from another replica before throwing a FetchFailedException?

That is not necessarily safe: the task may already have read part of the data, so it is safer to fail and retry the task. This is how Spark handles it.
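
The concern above can be illustrated with a small sketch. It is not Celeborn or Spark code: `Chunk`, `CorruptChunkException`, `switchReplicaInPlace`, and `retryWholeTask` are hypothetical names made up for this example, and the naive replica switch is assumed to replay the replica from the start. The point is only that records emitted before the corruption was detected cannot be taken back, whereas a retried attempt starts from clean state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartialReadSketch {
    /** Hypothetical stand-in for the failure that a FetchFailedException would signal. */
    static class CorruptChunkException extends RuntimeException {
        CorruptChunkException(String message) { super(message); }
    }

    /** Hypothetical chunk: one record plus a flag saying whether validation failed. */
    static final class Chunk {
        final String record;
        final boolean corrupt;
        Chunk(String record, boolean corrupt) { this.record = record; this.corrupt = corrupt; }
    }

    /** Emits records as they are read and fails on the first corrupt chunk. */
    static void consume(List<Chunk> source, List<String> output) {
        for (Chunk chunk : source) {
            if (chunk.corrupt) {
                throw new CorruptChunkException(
                    "corrupt chunk after " + output.size() + " record(s) were already emitted");
            }
            output.add(chunk.record);
        }
    }

    /** Unsafe: switches to the replica mid-read and keeps the records already emitted. */
    static List<String> switchReplicaInPlace(List<Chunk> primary, List<Chunk> replica) {
        List<String> output = new ArrayList<>();
        try {
            consume(primary, output);
        } catch (CorruptChunkException e) {
            consume(replica, output); // assumption: the replica is replayed from the start
        }
        return output;
    }

    /** Safer: every attempt starts with fresh output, as a retried task would. */
    static List<String> retryWholeTask(List<List<Chunk>> sourcesPerAttempt) {
        for (List<Chunk> source : sourcesPerAttempt) {
            List<String> output = new ArrayList<>();
            try {
                consume(source, output);
                return output;
            } catch (CorruptChunkException e) {
                // The partial output is discarded together with the failed attempt.
            }
        }
        throw new IllegalStateException("all attempts failed");
    }

    public static void main(String[] args) {
        List<Chunk> primary = Arrays.asList(new Chunk("a", false), new Chunk("b", true));
        List<Chunk> replica = Arrays.asList(new Chunk("a", false), new Chunk("b", false));
        System.out.println(switchReplicaInPlace(primary, replica));          // [a, a, b]
        System.out.println(retryWholeTask(Arrays.asList(primary, replica))); // [a, b]
    }
}
```

A real reader could try to resume the replica at an offset rather than replaying it, but it would then need to know exactly how much the task had already consumed, which is the part the comment flags as not necessarily safe.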

cxzl25 commented Aug 18, 2024

> support for checksum/validation of data would be a good feature

It looks like we've already done this.

    if ((int) checksum.getValue() != check) {
      logger.error("Checksum not equal! expected: {}, actual: {}.", check, checksum.getValue());
      return -1;
    }
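
For readers unfamiliar with the pattern in the quoted snippet, here is a minimal, self-contained sketch of the same check. It assumes `java.util.zip.CRC32` as the checksum purely for illustration; the thread does not show which `Checksum` implementation the quoted code uses, and `ChecksumSketch`, `checksumOf`, and `verify` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ChecksumSketch {
    /** Computes a CRC32 over the payload and truncates it to int, like the (int) cast above. */
    static int checksumOf(byte[] data, int offset, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, offset, length);
        return (int) crc.getValue();
    }

    /** Returns the payload length when the checksum matches, or -1 when it does not. */
    static int verify(byte[] data, int offset, int length, int expected) {
        int actual = checksumOf(data, offset, length);
        if (actual != expected) {
            System.err.printf("Checksum not equal! expected: %d, actual: %d.%n", expected, actual);
            return -1;
        }
        return length;
    }

    public static void main(String[] args) {
        byte[] payload = "shuffle block".getBytes(StandardCharsets.UTF_8);
        int expected = checksumOf(payload, 0, payload.length);

        // The intact payload passes the check.
        System.out.println(verify(payload, 0, payload.length, expected) == payload.length);

        // Flipping a single bit is detected and reported as -1.
        payload[0] ^= 1;
        System.out.println(verify(payload, 0, payload.length, expected) == -1);
    }
}
```

The -1 return value is what the caller has to translate into a failure; per its title, this PR is about surfacing that case as a FetchFailedException.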


@FMX FMX left a comment

LGTM.


@cfmcgrady cfmcgrady left a comment

LGTM

@cfmcgrady

Thank you, merging to main (v0.6.0), branch-0.5 (v0.5.2), and branch-0.4 (v0.4.3).

@cxzl25 cxzl25 closed this in b8f275d Aug 20, 2024
cxzl25 added a commit that referenced this pull request Aug 20, 2024
…on detected

### What changes were proposed in this pull request?

### Why are the changes needed?
#2655 (review)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2691 from cxzl25/CELEBORN-1567.

Authored-by: sychen <[email protected]>
Signed-off-by: Shaoyun Chen <[email protected]>
(cherry picked from commit b8f275d)
Signed-off-by: Shaoyun Chen <[email protected]>
cxzl25 added a commit that referenced this pull request Aug 20, 2024
(cherry picked from commit b8f275d)
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024