
storage: Refactor disaggregated read flow #7530

Merged

merged 35 commits into pingcap:master from wenxuan/parallel-is on Jun 1, 2023

Conversation

@breezewish (Member) commented May 23, 2023

What problem does this PR solve?

Issue Number: ref #6827, close #7576

Problem Summary:

What is changed and how it works?

Reworked the workflow of disaggregated read. The new workflow is mainly based on multiple MPMCQueues flowing into each other (a.k.a. ThreadedWorker, see below), instead of a shared task map + condition variables.

ThreadedWorker: concurrently takes tasks from a SrcQueue, works on them, and pushes the results to a ResultQueue. The source task and the result do not need to be the same type.


ThreadedWorkers can be chained.


ThreadedWorker propagates the FINISH state to the result channel only when the source channel is finished and all tasks have been processed; it never produces a FINISH state on its own.


ThreadedWorker propagates the CANCEL state (when there are errors) to both the result channel and the source channel, so that chained ThreadedWorkers can all be cancelled.

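For illustration, here is a minimal self-contained C++ sketch of this design. It is not the actual ThreadedWorker / MPMCQueue API added by this PR — ChannelSketch, WorkerPoolSketch and PopResult are invented names — but the FINISH and CANCEL rules follow the description above.

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

enum class PopResult { OK, FINISHED, CANCELLED };

// Toy stand-in for MPMCQueue: besides push/pop, it can be marked finished
// (no more pushes) or cancelled (consumers should stop immediately).
template <typename T>
class ChannelSketch
{
public:
    void push(T v) { { std::lock_guard lk(mu); q.push(std::move(v)); } cv.notify_one(); }
    void finish()  { { std::lock_guard lk(mu); finished = true; } cv.notify_all(); }
    void cancel()  { { std::lock_guard lk(mu); cancelled = true; } cv.notify_all(); }

    PopResult pop(T & out)
    {
        std::unique_lock lk(mu);
        cv.wait(lk, [&] { return cancelled || finished || !q.empty(); });
        if (cancelled)
            return PopResult::CANCELLED;
        if (q.empty())
            return PopResult::FINISHED; // finished and fully drained
        out = std::move(q.front());
        q.pop();
        return PopResult::OK;
    }

private:
    std::mutex mu;
    std::condition_variable cv;
    std::queue<T> q;
    bool finished = false;
    bool cancelled = false;
};

// A pool of worker threads: pop Src from the source channel, run `work`,
// push the Res result to the result channel.
template <typename Src, typename Res>
class WorkerPoolSketch
{
public:
    WorkerPoolSketch(ChannelSketch<Src> & src_, ChannelSketch<Res> & res_,
                     size_t concurrency, std::function<Res(Src)> work_)
        : src(src_), res(res_), work(std::move(work_)), active(concurrency)
    {
        for (size_t i = 0; i < concurrency; ++i)
            threads.emplace_back([this] { loop(); });
    }
    ~WorkerPoolSketch() { for (auto & t : threads) t.join(); }

private:
    void loop()
    {
        Src task;
        while (true)
        {
            switch (src.pop(task))
            {
            case PopResult::OK:
                try { res.push(work(std::move(task))); }
                catch (...) { src.cancel(); res.cancel(); return; } // CANCEL both directions
                break;
            case PopResult::FINISHED:
                // Forward FINISH only when the last worker of this pool exits,
                // i.e. the source is drained and no task is still in flight.
                if (--active == 0)
                    res.finish();
                return;
            case PopResult::CANCELLED:
                res.cancel(); // propagate CANCEL downstream
                return;
            }
        }
    }

    ChannelSketch<Src> & src;
    ChannelSketch<Res> & res;
    std::function<Res(Src)> work;
    std::atomic<size_t> active;
    std::vector<std::thread> threads;
};

The key point is that FINISH is forwarded only by the last worker to exit, while CANCEL is pushed in both directions so that every pool in a chain stops.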

New read flow: after EstablishDisaggTask, all segment tasks need to go through these steps in order to become "Ready for read":

  1. Try to fetch the related pages and keep a CacheGuard.
  2. Build an InputStream (in this step, S3 files will be downloaded).

Currently there are only two steps; adding another step later, for example preparing the delta index, would be easy (see the chained sketch below).
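Continuing the sketch above (the stage and type names below are invented for illustration and are not the PR's actual classes), the two steps chain naturally, and a third step would just mean inserting another worker pool between two channels:

#include <iostream>
#include <string>

struct SegmentTask { int segment_id = 0; };
struct ReadySegment { int segment_id = 0; std::string input_stream; };

int main()
{
    ChannelSketch<SegmentTask> established;   // filled after EstablishDisaggTask
    ChannelSketch<SegmentTask> pages_fetched; // step 1 done: pages fetched, CacheGuard kept
    ChannelSketch<ReadySegment> ready;        // step 2 done: InputStream built, ready for read

    WorkerPoolSketch<SegmentTask, SegmentTask> fetch_pages(
        established, pages_fetched, /*concurrency=*/4,
        [](SegmentTask t) { /* FetchPages; keep CacheGuard */ return t; });

    WorkerPoolSketch<SegmentTask, ReadySegment> build_streams(
        pages_fetched, ready, /*concurrency=*/4,
        [](SegmentTask t) { /* download S3 files, build InputStream */
            return ReadySegment{t.segment_id, "stream"}; });

    for (int i = 0; i < 8; ++i)
        established.push(SegmentTask{i});
    established.finish(); // FINISH propagates stage by stage once tasks are drained

    ReadySegment seg;
    while (ready.pop(seg) == PopResult::OK)
        std::cout << "segment " << seg.segment_id << " is ready for read\n";
}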

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

After testing, I discovered that this PR still cannot resolve the issue that disaggregated reads may freeze when the cache capacity is low (e.g. 32 MB). The possible reason is: when MPP tasks are distributed to multiple TiFlash nodes, each MPP task may get stuck waiting for available space. These stuck tasks cannot proceed, because the available space is already occupied by ReadSegmentTasks in the queue. Additionally, these ReadSegmentTasks cannot be scheduled, because the active MPP readings are not yet finished.

Considering that this deadlock seems hard to resolve, we may need some rework (simplification) of the local page cache. For example, throwing an error seems better than simply deadlocking...

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None


ti-chi-bot bot commented May 23, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JaySon-Huang
  • Lloyd-Pottiger

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 23, 2023
@@ -643,7 +643,7 @@ void StorageRemoteCacheConfig::parse(const String & content, const LoggerPtr & l
     readConfig(table, "dtfile_level", dtfile_level);
     RUNTIME_CHECK(dtfile_level <= 100);
     readConfig(table, "delta_rate", delta_rate);
-    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.1) && std::islessequal(delta_rate, 1.0), delta_rate);
+    RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
@JaySon-Huang (Contributor) commented May 24, 2023:

Can we set delta_rate to "0.0" to totally disable storing pages in the delta layer?

@breezewish (Member, author) replied:

Yes, this change is to allow setting it to 0.0 in order to bypass the Delta Cache checking.

However, the Delta Cache currently works as "no limit" when the limit is set to 0. That is another story though.
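For context, the convention being referred to is roughly the following (a hedged sketch, not TiFlash's actual delta cache code; the function and parameter names are made up):

#include <cstdint>

// If the configured capacity is 0, treat the cache as unlimited and skip the
// space check entirely; otherwise enforce the configured limit.
bool withinDeltaCacheLimit(std::uint64_t capacity_bytes, std::uint64_t used_bytes, std::uint64_t incoming_bytes)
{
    if (capacity_bytes == 0)
        return true; // 0 is treated as "no limit", so the check is bypassed
    return used_bytes + incoming_bytes <= capacity_bytes;
}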

Contributor replied:

Why do we need "no limit" to bypass the Delta Cache checking?

@breezewish (Member, author) replied:

According to metrics, occupy space 80% usually takes ~500 ms for now. Allowing an unlimited page cache helps us benchmark the performance without occupy space.

Contributor replied:

Setting 0.0 is intended to allow performance testing, but it will lead to disk full in a prod env. Maybe add a warning log and some comments for when it is set to 0.0.
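In sketch form, the suggested warning could look roughly like this, next to the check in StorageRemoteCacheConfig::parse (hypothetical follow-up code, not part of this PR; the logger variable name is assumed):

readConfig(table, "delta_rate", delta_rate);
RUNTIME_CHECK(std::isgreaterequal(delta_rate, 0.0) && std::islessequal(delta_rate, 1.0), delta_rate);
if (delta_rate == 0.0)
    LOG_WARNING(log, "storage.remote.cache delta_rate is set to 0.0, which disables the delta cache limit; "
                     "this is intended for performance testing and may fill the disk in a production environment");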

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2023
@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2023
Stopwatch w_occupy;
auto occupy_result = page_cache->occupySpace(cf_tiny_oids, seg_task->meta.delta_tinycf_page_sizes);
// This metric is per-segment.
GET_METRIC(tiflash_disaggregated_breakdown_duration_seconds, type_cache_occupy).Observe(w_occupy.elapsedSeconds());
Contributor commented:

A part of tiflash_disaggregated_breakdown_duration_seconds is per-segment while another part of it is per-request. Splitting them into two metrics and Grafana panels would be clearer than mixing them into one.
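A rough sketch of the suggested split (the metric family names, the type label, and the w_request stopwatch below are invented for illustration):

// Per-segment timing goes to a segment-level family ...
GET_METRIC(tiflash_disaggregated_segment_breakdown_duration_seconds, type_cache_occupy)
    .Observe(w_occupy.elapsedSeconds());
// ... while per-request timing goes to a request-level family.
GET_METRIC(tiflash_disaggregated_request_breakdown_duration_seconds, type_total)
    .Observe(w_request.elapsedSeconds());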

Contributor replied:

I will try to refine it in a later PR.

dbms/src/Common/tests/gtest_threaded_worker.cpp (outdated review thread, resolved)
Comment on lines +28 to +34
LOG_INFO(
log,
"Finished reading remote segments, rows={} read_segments={} total_wait_ready_task={:.3f}s total_read={:.3f}s",
action.totalRows(),
processed_seg_tasks,
duration_wait_ready_task_sec,
duration_read_sec);
Contributor commented:

Seems to log too frequently?

dbms/src/Storages/DeltaMerge/Remote/RNWorkers_fwd.h (outdated review thread, resolved)
dbms/src/Storages/DeltaMerge/Remote/RNWorkers.h (outdated review thread, resolved)
@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 29, 2023
@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2023
@CalvinNeo (Member) commented:

/run-integration-test

@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2023
@JaySon-Huang (Contributor) left a review comment:

LGTM

@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 1, 2023
@JaySon-Huang (Contributor) commented:

/rebuild

@JaySon-Huang (Contributor) commented:

/merge


ti-chi-bot bot commented Jun 1, 2023

@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.


ti-chi-bot bot commented Jun 1, 2023

This pull request has been accepted and is ready to merge.

Commit hash: 1ff1d13

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 1, 2023
@JaySon-Huang (Contributor) commented:

I've manually tested that this PR can resolve #7576:

  • If the EstablishDisaggTask request fails on one store (mocked by DBGInvoke enable_fail_point(force_remote_read_for_batch_cop_once)), the compute node will wait for the requests returned from all write nodes. After that it will rebuild the task and send the next EstablishDisaggTask to all write nodes.
  • If the FetchPages request fails (mocked by DBGInvoke enable_fail_point(exception_when_fetch_disagg_pages)), it can return the error message to TiDB instead of an incorrect result.

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Jun 1, 2023

ti-chi-bot bot commented Jun 1, 2023

@breezewish: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

This also triggers some heavy tests which do not always run when the PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit edf82d2 into pingcap:master Jun 1, 2023
@ti-chi-bot (Member) commented:

In response to a cherrypick label: new pull request created to branch release-7.1: #7582.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Jun 1, 2023
@breezewish breezewish deleted the wenxuan/parallel-is branch June 1, 2023 09:26
@JaySon-Huang JaySon-Huang mentioned this pull request Apr 19, 2024
Labels
  • needs-cherry-pick-release-7.1 — Should cherry pick this PR to release-7.1 branch.
  • release-note-none — Denotes a PR that doesn't merit a release note.
  • size/XXL — Denotes a PR that changes 1000+ lines, ignoring generated files.
  • status/can-merge — Indicates a PR has been approved by a committer.
  • status/LGT2 — Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TiFlash with s3 request result is not correct under disagg arch
5 participants