Fix bugs in CheckpointReadWorker #2
This PR fixes bugs in CheckpointReadWorker:
1) Remove the unnecessary training in the processCheckpointIteration method, since the preceding call to processEntityCheckpoint has already done the same thing.
2) In MultiGetResponse, a document that is not found is reported as a boolean value, not as an exception. Move the logic for handling not-found documents out of the exception block and guard it with GetResponse.isExists() instead.
3) Handle the overloaded-cluster case for each GetResponse.
Testing done:
1. Added unit tests for these bugs.
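To illustrate points 2 and 3, here is a minimal, self-contained sketch of the per-item triage. It does not use the real OpenSearch classes: `ItemResult`, `RetryableException`, `OverloadedException`, and `CheckpointTriage` are hypothetical stand-ins invented for this example, where `exists` mirrors `GetResponse.isExists()` and the two exception types stand in for the `ExceptionUtil.isRetryAble` and `ExceptionUtil.isOverloaded` checks. It is not the code from this PR.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one item of a multi-get response. */
final class ItemResult {
    final String modelId;
    final boolean exists;    // mirrors GetResponse.isExists()
    final Exception failure; // null when the item did not fail

    ItemResult(String modelId, boolean exists, Exception failure) {
        this.modelId = modelId;
        this.exists = exists;
        this.failure = failure;
    }
}

/** Hypothetical markers standing in for ExceptionUtil.isRetryAble / isOverloaded. */
final class RetryableException extends RuntimeException {}
final class OverloadedException extends RuntimeException {}

/** Sorts multi-get items into found, not found, and retryable buckets. */
final class CheckpointTriage {
    final Set<String> found = new HashSet<>();
    final Set<String> notFound = new HashSet<>();
    final Set<String> retryable = new HashSet<>();
    boolean coolDownStarted = false;

    void triage(List<ItemResult> items) {
        for (ItemResult item : items) {
            if (item.failure != null) {
                // Failure path: only real exceptions are classified here.
                if (item.failure instanceof RetryableException) {
                    retryable.add(item.modelId);
                } else if (item.failure instanceof OverloadedException) {
                    coolDownStarted = true; // checked per item, not once per batch
                }
                continue;
            }
            // Success path: "not found" is a boolean on the response, not an exception.
            if (item.exists) {
                found.add(item.modelId);
            } else {
                notFound.add(item.modelId);
            }
        }
    }
}
```

The point of the sketch is the ordering: the not-found check lives outside the failure branch, so a missing checkpoint is never mistaken for an error.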
LGTM
} else if (ExceptionUtil.isRetryAble(failure)) {
    if (retryableRequests == null) {
        retryableRequests = new HashSet<>();
    }
    retryableRequests.add(modelId);
} else if (ExceptionUtil.isOverloaded(failure)) {
    LOG.error("too many get AD model checkpoint requests or shard not available");
    setCoolDownStart();
} else {
    LOG.info("Unexpected failure", failure);
For another PR: Should this be logged as an error?
yeah, fixed
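For readers unfamiliar with the `setCoolDownStart()` call in the hunk above, one way such a cooldown could work is sketched below. This is an assumption about the general technique, not the plugin's actual implementation: the `CoolDownGate` class, its `isInCoolDown()` method, and the injected millisecond clock are all hypothetical names chosen for this example.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical cooldown gate: after an overload is observed, callers are
 * expected to back off until the cooldown period has elapsed.
 * Not thread-safe; a real worker would need synchronization or atomics.
 */
final class CoolDownGate {
    private final LongSupplier nowMillis; // injected clock, e.g. System::currentTimeMillis
    private final long coolDownMillis;
    private boolean started = false;
    private long coolDownStartMillis = 0L;

    CoolDownGate(LongSupplier nowMillis, long coolDownMillis) {
        this.nowMillis = nowMillis;
        this.coolDownMillis = coolDownMillis;
    }

    /** Records the moment the overload was seen. */
    void setCoolDownStart() {
        started = true;
        coolDownStartMillis = nowMillis.getAsLong();
    }

    /** True while the cooldown period since the last overload has not elapsed. */
    boolean isInCoolDown() {
        return started && nowMillis.getAsLong() - coolDownStartMillis < coolDownMillis;
    }
}
```

Injecting the clock as a `LongSupplier` keeps the gate easy to unit test, since a test can advance time without sleeping.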
This PR is a conglomerate of the following PRs. #60 #64 #65 #67 #68 #69 #70 #71 #74 #75 #76 #77 #78 #79 #82 #83 #84 #92 #94 #93 #95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Description
Note: this PR has some changes related to #3. Since I cannot get a correct diff, I sent separate PRs.
Issues Resolved
opensearch-project#85
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.