
storage: Refactor disaggregated read flow #7530

Merged

merged 35 commits into pingcap:master from wenxuan/parallel-is on Jun 1, 2023

Conversation

@breezewish (Member) commented May 23, 2023

What problem does this PR solve?

Issue Number: ref #6827, close #7576

Problem Summary:

What is changed and how it works?

Reworked the workflow of disaggregated read. The new workflow is mainly based on multiple MPMCQueues flowing into each other (a.k.a. ThreadedWorker, see below), instead of a shared task map + condition variables.

ThreadedWorker: concurrently takes tasks from a SrcQueue, works on them, and pushes the results to a ResultQueue. The source task and the result do not need to be the same type.


ThreadedWorkers can be chained.


ThreadedWorker propagates the FINISH state to the result channel only when the source channel is finished and all tasks have been processed; it never produces a FINISH state on its own.


ThreadedWorker propagates the CANCEL state (when there are errors) to both the result channel and the source channel, so that chained ThreadedWorkers can all be cancelled.

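For illustration, here is a minimal self-contained C++ sketch of this design. It is not the actual ThreadedWorker / MPMCQueue API added by this PR — ChannelSketch, WorkerPoolSketch and PopResult are invented names — but the FINISH and CANCEL rules follow the description above.

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

enum class PopResult { OK, FINISHED, CANCELLED };

// Toy stand-in for MPMCQueue: besides push/pop, it can be marked finished
// (no more pushes) or cancelled (consumers should stop immediately).
template <typename T>
class ChannelSketch
{
public:
    void push(T v) { { std::lock_guard lk(mu); q.push(std::move(v)); } cv.notify_one(); }
    void finish()  { { std::lock_guard lk(mu); finished = true; } cv.notify_all(); }
    void cancel()  { { std::lock_guard lk(mu); cancelled = true; } cv.notify_all(); }

    PopResult pop(T & out)
    {
        std::unique_lock lk(mu);
        cv.wait(lk, [&] { return cancelled || finished || !q.empty(); });
        if (cancelled)
            return PopResult::CANCELLED;
        if (q.empty())
            return PopResult::FINISHED; // finished and fully drained
        out = std::move(q.front());
        q.pop();
        return PopResult::OK;
    }

private:
    std::mutex mu;
    std::condition_variable cv;
    std::queue<T> q;
    bool finished = false;
    bool cancelled = false;
};

// A pool of worker threads: pop Src from the source channel, run `work`,
// push the Res result to the result channel.
template <typename Src, typename Res>
class WorkerPoolSketch
{
public:
    WorkerPoolSketch(ChannelSketch<Src> & src_, ChannelSketch<Res> & res_,
                     size_t concurrency, std::function<Res(Src)> work_)
        : src(src_), res(res_), work(std::move(work_)), active(concurrency)
    {
        for (size_t i = 0; i < concurrency; ++i)
            threads.emplace_back([this] { loop(); });
    }
    ~WorkerPoolSketch() { for (auto & t : threads) t.join(); }

private:
    void loop()
    {
        Src task;
        while (true)
        {
            switch (src.pop(task))
            {
            case PopResult::OK:
                try { res.push(work(std::move(task))); }
                catch (...) { src.cancel(); res.cancel(); return; } // CANCEL both directions
                break;
            case PopResult::FINISHED:
                // Forward FINISH only when the last worker of this pool exits,
                // i.e. the source is drained and no task is still in flight.
                if (--active == 0)
                    res.finish();
                return;
            case PopResult::CANCELLED:
                res.cancel(); // propagate CANCEL downstream
                return;
            }
        }
    }

    ChannelSketch<Src> & src;
    ChannelSketch<Res> & res;
    std::function<Res(Src)> work;
    std::atomic<size_t> active;
    std::vector<std::thread> threads;
};

The key point is that FINISH is forwarded only by the last worker to exit, while CANCEL is pushed in both directions so that every pool in a chain stops.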

New read flow: after EstablishDisaggTask, all segment tasks need to go through these steps in order to become "Ready for read":

  1. Try to fetch the related pages and keep a CacheGuard.
  2. Build an InputStream (in this step, S3 files will be downloaded).

Currently there are only two steps; adding another step later, for example preparing the delta index, would be easy (see the chained sketch below).
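Continuing the sketch above (the stage and type names below are invented for illustration and are not the PR's actual classes), the two steps chain naturally, and a third step would just mean inserting another worker pool between two channels:

#include <iostream>
#include <string>

struct SegmentTask { int segment_id = 0; };
struct ReadySegment { int segment_id = 0; std::string input_stream; };

int main()
{
    ChannelSketch<SegmentTask> established;   // filled after EstablishDisaggTask
    ChannelSketch<SegmentTask> pages_fetched; // step 1 done: pages fetched, CacheGuard kept
    ChannelSketch<ReadySegment> ready;        // step 2 done: InputStream built, ready for read

    WorkerPoolSketch<SegmentTask, SegmentTask> fetch_pages(
        established, pages_fetched, /*concurrency=*/4,
        [](SegmentTask t) { /* FetchPages; keep CacheGuard */ return t; });

    WorkerPoolSketch<SegmentTask, ReadySegment> build_streams(
        pages_fetched, ready, /*concurrency=*/4,
        [](SegmentTask t) { /* download S3 files, build InputStream */
            return ReadySegment{t.segment_id, "stream"}; });

    for (int i = 0; i < 8; ++i)
        established.push(SegmentTask{i});
    established.finish(); // FINISH propagates stage by stage once tasks are drained

    ReadySegment seg;
    while (ready.pop(seg) == PopResult::OK)
        std::cout << "segment " << seg.segment_id << " is ready for read\n";
}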

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

After testing, I discovered that this PR still cannot resolve the issue that disaggregated reads may freeze when the cache capacity is low (e.g. 32 MB). The possible reason is: when MPP tasks are distributed to multiple TiFlash nodes, each MPP task may get stuck waiting for available space. These stuck tasks cannot proceed, because the available space is already occupied by ReadSegmentTasks in the queue. Additionally, these ReadSegmentTasks cannot be scheduled, because the active MPP readings are not yet finished.

Considering that this deadlock seems hard to resolve, we may need some rework (simplification) of the local page cache. For example, throwing an error seems better than simply deadlocking...

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None


ti-chi-bot bot commented May 23, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JaySon-Huang
  • Lloyd-Pottiger

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 23, 2023
@@ -643,7 +643,7 @@ void StorageRemoteCacheConfig::parse(const String & content, const LoggerPtr & l
     readConfig(table, "dtfile_level", dtfile_level);
     RUNTIME_CHECK(dtfile_level <= 100);
     readConfig(table, "delta_rate", delta_rate);
-    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.1) && std::islessequal(delta_rate, 1.0), delta_rate);
+    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
@JaySon-Huang (Contributor) commented May 24, 2023:

Can we set delta_rate to "0.0" to totally disable storing pages in the delta layer?

@breezewish (Member, author) replied:

Yes, this change is to allow setting it to 0.0 in order to bypass the Delta Cache checking.

However, the Delta Cache currently works as "no limit" when the limit is set to 0. That is another story though.
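For context, the convention being referred to is roughly the following (a hedged sketch, not TiFlash's actual delta cache code; the function and parameter names are made up):

#include <cstdint>

// If the configured capacity is 0, treat the cache as unlimited and skip the
// space check entirely; otherwise enforce the configured limit.
bool withinDeltaCacheLimit(std::uint64_t capacity_bytes, std::uint64_t used_bytes, std::uint64_t incoming_bytes)
{
    if (capacity_bytes == 0)
        return true; // 0 is treated as "no limit", so the check is bypassed
    return used_bytes + incoming_bytes <= capacity_bytes;
}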

Contributor replied:

Why do we need "no limit" to bypass the Delta Cache checking?

@breezewish (Member, author) replied:

According to metrics, occupy space 80% usually takes ~500 ms for now. Allowing an unlimited page cache helps us benchmark the performance without occupy space.

Contributor replied:

Setting 0.0 is intended to allow performance testing, but it will lead to disk full in a prod env. Maybe add a warning log and some comments for when it is set to 0.0.
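In sketch form, the suggested warning could look roughly like this, next to the check in StorageRemoteCacheConfig::parse (hypothetical follow-up code, not part of this PR; the logger variable name is assumed):

readConfig(table, "delta_rate", delta_rate);
RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
if (delta_rate == 0.0)
    LOG_WARNING(log, "storage.remote.cache delta_rate is set to 0.0, which disables the delta cache limit; "
                     "this is intended for performance testing and may fill the disk in a production environment");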

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2023
@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2023
Stopwatch w_occupy;
auto occupy_result = page_cache->occupySpace(cf_tiny_oids, seg_task->meta.delta_tinycf_page_sizes);
// This metric is per-segment.
GET_METRIC(tiflash_disaggregated_breakdown_duration_seconds, type_cache_occupy).Observe(w_occupy.elapsedSeconds());
Contributor commented:

A part of tiflash_disaggregated_breakdown_duration_seconds is per-segment while another part of it is per-request. Splitting them into two metrics and Grafana panels would be clearer than mixing them into one.
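A rough sketch of the suggested split (the metric family names, the type label, and the w_request stopwatch below are invented for illustration):

// Per-segment timing goes to a segment-level family ...
GET_METRIC(tiflash_disaggregated_segment_breakdown_duration_seconds, type_cache_occupy)
    .Observe(w_occupy.elapsedSeconds());
// ... while per-request timing goes to a request-level family.
GET_METRIC(tiflash_disaggregated_request_breakdown_duration_seconds, type_total)
    .Observe(w_request.elapsedSeconds());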

Contributor replied:

I will try to refine it in a later PR.

dbms/src/Common/tests/gtest_threaded_worker.cpp (outdated review thread, resolved)
Comment on lines +28 to +34
LOG_INFO(
log,
"Finished reading remote segments, rows={} read_segments={} total_wait_ready_task={:.3f}s total_read={:.3f}s",
action.totalRows(),
processed_seg_tasks,
duration_wait_ready_task_sec,
duration_read_sec);
Contributor commented:

Seems to log too frequently?

dbms/src/Storages/DeltaMerge/Remote/RNWorkers_fwd.h (outdated review thread, resolved)
dbms/src/Storages/DeltaMerge/Remote/RNWorkers.h (outdated review thread, resolved)
@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 29, 2023
@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2023
@CalvinNeo (Member) commented:

/run-integration-test

@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2023
@JaySon-Huang (Contributor) left a review comment:

LGTM

@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 1, 2023
@JaySon-Huang (Contributor) commented:

/rebuild

@JaySon-Huang (Contributor) commented:

/merge


ti-chi-bot bot commented Jun 1, 2023

@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.


ti-chi-bot bot commented Jun 1, 2023

This pull request has been accepted and is ready to merge.

Commit hash: 1ff1d13

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 1, 2023
@JaySon-Huang (Contributor) commented:

I've manually tested that this PR can resolve #7576:

  • If the EstablishDisaggTask request fails on one store (mocked by DBGInvoke enable_fail_point(force_remote_read_for_batch_cop_once)), the compute node will wait for the requests returned from all write nodes. After that it will rebuild the task and send the next EstablishDisaggTask to all write nodes.
  • If the FetchPages request fails (mocked by DBGInvoke enable_fail_point(exception_when_fetch_disagg_pages)), it can return the error message to TiDB instead of an incorrect result.

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Jun 1, 2023

ti-chi-bot bot commented Jun 1, 2023

@breezewish: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

This also triggers some heavy tests which do not always run when the PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit edf82d2 into pingcap:master Jun 1, 2023
@ti-chi-bot (Member) commented:

In response to a cherrypick label: new pull request created to branch release-7.1: #7582.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Jun 1, 2023
@breezewish breezewish deleted the wenxuan/parallel-is branch June 1, 2023 09:26
@JaySon-Huang JaySon-Huang mentioned this pull request Apr 19, 2024
Labels
  • needs-cherry-pick-release-7.1 — Should cherry pick this PR to release-7.1 branch.
  • release-note-none — Denotes a PR that doesn't merit a release note.
  • size/XXL — Denotes a PR that changes 1000+ lines, ignoring generated files.
  • status/can-merge — Indicates a PR has been approved by a committer.
  • status/LGT2 — Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TiFlash with s3 request result is not correct under disagg arch
5 participants