Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yield CPU for concurrent flush and concurrent mergeDelta (#5410) #5423

Conversation

ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #5410

What problem does this PR solve?

Issue Number: close #5409

What is changed and how it works?

Add sleep for flushCache and mergeDeltaBySegment:

  • When flushCache is retrying, wait backoff will be 5ms ~ 100ms (considering that existing flushCache usually takes short time to finish).

  • When mergeDeltaBySegment is retrying, wait backoff will be 50ms ~ 1s (considering that split-prepare could take several seconds to finish).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

To test with the fix, I introduced a splitEachSegment debug function locally to manually trigger a split:

void DeltaMergeStore::splitEachSegment(const Context & db_context)
{
    auto dm_context = newDMContext(db_context, db_context.getSettingsRef(), /*tracing_id*/ "split_each_segment");
    std::vector<SegmentPtr> current_segments;
    {
        std::shared_lock lock(read_write_mutex);
        for (const auto & segment_it : segments)
        {
            current_segments.push_back(segment_it.second);
        }
    }
    for (const auto & segment : current_segments)
    {
        segmentSplit(*dm_context, segment, true);
    }
}

The test case is to trigger the split for a 1GB segment, and then perform a mergeDelta at the same time.

Before the fix (using release v6.1):

when there are both split (takes 10s) and mergeDelta (takes 20s in total, blocked by split for 10s), there are 211K retries in 10s when the mergeDelta is blocked:

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | head
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | tail
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | wc -l
  211848

The CPU usage is around 200% during the split+mergeDelta:

5.7   tiflash
128.0 tiflash  <--- split begin, mergeDelta begin
196.4 tiflash
185.1 tiflash
152.7 tiflash
184.9 tiflash
202.0 tiflash
191.3 tiflash
197.7 tiflash
192.1 tiflash
202.6 tiflash  <--- split is finished
115.7 tiflash
102.6 tiflash
103.0 tiflash
101.3 tiflash
101.0 tiflash
103.9 tiflash
100.8 tiflash
102.0 tiflash
102.5 tiflash
68.2  tiflash
4.7   tiflash
0.0   tiflash

After the fix:

there are only 14 retry attempts with exp backoff:

❯ cat ~/.tiup/data/my_test_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment"
[2022/07/19 20:58:44.240 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.295 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.399 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.604 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.007 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.812 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:46.815 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:47.821 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:48.825 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:49.826 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:50.831 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:51.832 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:52.838 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:53.844 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]

The CPU usage keeps around 100% (first 11s for split, next 10s for mergeDelta):

4.3  tiflash
21.4 tiflash  <--- split begin, mergeDelta begin
100.2 tiflash
103.1 tiflash
95.2  tiflash
89.1  tiflash
99.4  tiflash
100.8 tiflash
101.7 tiflash
103.2 tiflash
102.4 tiflash
97.8  tiflash  <--- split is finished
71.7  tiflash
102.8 tiflash
99.1  tiflash
102.6 tiflash
102.5 tiflash
95.5  tiflash
100.7 tiflash
105.1 tiflash
101.5 tiflash
103.4 tiflash
49.9  tiflash
4.2   tiflash

Note: As there is maximum 1s backoff, the CPU usage dropped for a short while when split was finished and the mergeDelta was not yet started.

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Reduce unnecessary CPU usage in some edge cases

@ti-chi-bot
Copy link
Member Author

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/cherry-pick-not-approved release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/cherry-pick-for-release-5.4 This PR is cherry-picked to release-5.4 from a source PR. labels Jul 20, 2022
@VelocityLight VelocityLight added cherry-pick-approved Cherry pick PR approved by release team. and removed do-not-merge/cherry-pick-not-approved labels Jul 20, 2022
@breezewish
Copy link
Member

This PR is working in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Cherry pick PR approved by release team. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/cherry-pick-for-release-5.4 This PR is cherry-picked to release-5.4 from a source PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants