storage: Refactor disaggregated read flow #7530
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:

To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
```diff
@@ -643,7 +643,7 @@ void StorageRemoteCacheConfig::parse(const String & content, const LoggerPtr & l
     readConfig(table, "dtfile_level", dtfile_level);
     RUNTIME_CHECK(dtfile_level <= 100);
     readConfig(table, "delta_rate", delta_rate);
-    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.1) && std::islessequal(delta_rate, 1.0), delta_rate);
+    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
```
Can we set `delta_rate` to `0.0` to totally disable storing pages in the delta layer?
Yes, this change allows setting it to 0.0 in order to bypass the Delta Cache checking.
However, the Delta Cache currently works as "no limit" when the limit is set to 0. That is another story though.
Why do we need "no limit" to bypass the Delta Cache checking?
According to metrics, the occupy-space step usually takes ~500ms at the 80th percentile for now. Allowing an unlimited page cache helps us benchmark the performance without the occupy-space step.
As `0.0` tends to be used only for performance testing, it will lead to a full disk in a production env. Maybe add a warning log and some comments when it is set to `0.0`.
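
A minimal sketch of what the suggested warning could look like, following the parsing code in the diff above (the logger variable `log` and the exact wording are assumptions, not the actual implementation):

```cpp
readConfig(table, "delta_rate", delta_rate);
RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
// delta_rate == 0.0 bypasses the Delta Cache space check. This is meant for
// performance testing only and can fill up the disk in a production env.
if (delta_rate == 0.0)
    LOG_WARNING(log, "delta_rate is set to 0.0, the Delta Cache space check is bypassed; "
                     "this is intended for performance testing only and may fill up the disk in production");
```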
```cpp
Stopwatch w_occupy;
auto occupy_result = page_cache->occupySpace(cf_tiny_oids, seg_task->meta.delta_tinycf_page_sizes);
// This metric is per-segment.
GET_METRIC(tiflash_disaggregated_breakdown_duration_seconds, type_cache_occupy).Observe(w_occupy.elapsedSeconds());
```
A part of `tiflash_disaggregated_breakdown_duration_seconds` is per-segment, while another part of it is per-request. Splitting them into two metrics and two Grafana panels would be clearer than mixing them into one.
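
A sketch of what the split could look like, reusing the `GET_METRIC` pattern from the snippet above (the per-segment metric name `tiflash_disaggregated_segment_breakdown_duration_seconds` and the `type_total`/`w_total` identifiers are hypothetical):

```cpp
// Per-segment timings go to a dedicated histogram...
GET_METRIC(tiflash_disaggregated_segment_breakdown_duration_seconds, type_cache_occupy)
    .Observe(w_occupy.elapsedSeconds());
// ...while per-request timings stay in the request-level one.
GET_METRIC(tiflash_disaggregated_breakdown_duration_seconds, type_total)
    .Observe(w_total.elapsedSeconds());
```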
I will try to refine it in a later PR.
```cpp
LOG_INFO(
    log,
    "Finished reading remote segments, rows={} read_segments={} total_wait_ready_task={:.3f}s total_read={:.3f}s",
    action.totalRows(),
    processed_seg_tasks,
    duration_wait_ready_task_sec,
    duration_read_sec);
```
Seems this logs too frequently?
/run-integration-test
LGTM
/rebuild
/merge
@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests If you have any questions about the PR merge process, please refer to pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: 1ff1d13
I've manually tested that this PR can resolve #7576.
@breezewish: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests
If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
In response to a cherrypick label: new pull request created to branch
What problem does this PR solve?
Issue Number: ref #6827, close #7576
Problem Summary:
What is changed and how it works?
Re-worked the workflow of disaggregated read. The new workflow is mainly based on multiple MPMCQueues flowing into each other (a.k.a. ThreadedWorker, see the sketch below), instead of a shared task map + condition_variables.
ThreadedWorker: concurrently takes tasks from a SrcQueue, works on them, and pushes the results to the ResultQueue. The src task and the result do not need to be the same type.
ThreadedWorkers can be chained: the result queue of one worker serves as the source queue of the next.
A ThreadedWorker populates the FINISH state to the result channel only when the source channel is finished and all tasks are processed; the ThreadedWorker itself never produces the FINISH state on its own.
A ThreadedWorker populates the CANCEL state (when there are errors) to both the result channel and the source channel, so that chained ThreadedWorkers can all be cancelled.
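
A minimal, self-contained sketch of the idea described above. This is an illustration, not the actual tiflash implementation; the `Channel` class is a simplified (unbounded) stand-in for tiflash's MPMCQueue:

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

enum class ChannelStatus { Open, Finished, Cancelled };

// Simplified stand-in for tiflash's MPMCQueue.
template <typename T>
struct Channel
{
    // Blocks for an item; returns false once finished-and-drained or cancelled.
    bool pop(T & out)
    {
        std::unique_lock lock(mu);
        cv.wait(lock, [&] { return !items.empty() || status != ChannelStatus::Open; });
        if (status == ChannelStatus::Cancelled || items.empty())
            return false;
        out = std::move(items.front());
        items.pop_front();
        return true;
    }

    // Returns false if the channel no longer accepts items.
    bool push(T v)
    {
        std::lock_guard lock(mu);
        if (status != ChannelStatus::Open)
            return false;
        items.push_back(std::move(v));
        cv.notify_one();
        return true;
    }

    // finish() is deliberately a no-op on an already-cancelled channel.
    void finish() { std::lock_guard l(mu); if (status == ChannelStatus::Open) status = ChannelStatus::Finished; cv.notify_all(); }
    void cancel() { std::lock_guard l(mu); status = ChannelStatus::Cancelled; cv.notify_all(); }

    std::mutex mu;
    std::condition_variable cv;
    std::deque<T> items;
    ChannelStatus status = ChannelStatus::Open;
};

// Pops tasks from `src`, applies `work_fn` on `concurrency` threads, and
// pushes results to `result`. Src and Result may be different types.
template <typename Src, typename Result>
class ThreadedWorker
{
public:
    ThreadedWorker(std::shared_ptr<Channel<Src>> src_, std::shared_ptr<Channel<Result>> result_, std::function<Result(Src)> work_fn_, size_t concurrency)
        : src(std::move(src_)), result(std::move(result_)), work_fn(std::move(work_fn_)), alive(concurrency)
    {
        for (size_t i = 0; i < concurrency; ++i)
            threads.emplace_back([this] { loop(); });
    }

    ~ThreadedWorker()
    {
        for (auto & t : threads)
            t.join();
    }

private:
    void loop()
    {
        Src task;
        while (src->pop(task))
        {
            try
            {
                if (!result->push(work_fn(std::move(task))))
                    break; // downstream finished or cancelled
            }
            catch (...)
            {
                // On error, cancel BOTH channels so that a whole chain of
                // workers is cancelled, upstream and downstream.
                src->cancel();
                result->cancel();
                break;
            }
        }
        // FINISH is propagated only by the last worker thread to exit, i.e.
        // when the source is finished AND all in-flight tasks are processed.
        if (alive.fetch_sub(1) == 1)
            result->finish();
    }

    std::shared_ptr<Channel<Src>> src;
    std::shared_ptr<Channel<Result>> result;
    std::function<Result(Src)> work_fn;
    std::atomic<size_t> alive;
    std::vector<std::thread> threads;
};
```

Chaining is then just letting worker A's result channel be worker B's source channel. Note that in this sketch the producer must eventually call `finish()` (or `cancel()`) on the source channel, otherwise the worker threads, and thus the destructor, would block forever.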
New read flow: after EstablishDisaggTask, all segment tasks need to work through these steps in order to be "Ready for read".
Currently there are only two steps. It is easy to add another step if we want to, for example, preparing the delta index.
Check List
Tests
After testing, I discovered that this PR still cannot resolve the issue that a disaggregated read may freeze when the cache capacity is low (e.g. 32MB). The possible reason is: when MPP tasks are distributed to multiple TiFlash nodes, each MPP task may get stuck waiting for available space. These stuck tasks cannot proceed, because the available space is already occupied by ReadSegmentTasks in the queue; meanwhile, these ReadSegmentTasks cannot be scheduled, because the active MPP reads have not yet finished.
Considering that this deadlock seems hard to resolve, we may need some re-work (simplification) of the local page cache. For example, throwing errors seems to be better than simply deadlocking...
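
For illustration, a hypothetical shape of the "throw instead of deadlock" behavior (the timeout parameter and the result type below are made up; the current `occupySpace` only takes the oids and page sizes, as shown earlier in this thread):

```cpp
// Hypothetical: fail fast instead of blocking forever when the local page
// cache cannot free enough space within a deadline.
auto occupy_result = page_cache->occupySpace(
    cf_tiny_oids,
    seg_task->meta.delta_tinycf_page_sizes,
    /* wait_timeout = */ std::chrono::seconds(60));
if (!occupy_result.ok())
    throw Exception("Cannot occupy space in the local page cache within 60s; "
                    "the cache capacity may be too small for the current workload");
```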
Side effects
Documentation
Release note