Refactor StorageServer stop #4034

Merged 3 commits on Mar 30, 2022

@critical27 critical27 commented Mar 15, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Description:

There are many cases in which storage cannot exit gracefully: it may hang forever or simply crash. There are several reasons:

  1. There are four thrift servers (raft/storage/admin/internal), and they all share the same thread pools (both IOThreadPool and ThreadManager). If a thread pool is not joined correctly, the process crashes.
  2. Other important components, for example kvstore, RaftPart, TaskManager and MetaClient, use the same thread pools as the rpc servers, and the relationships between them can be quite complicated.

How do you solve it?

  1. Start each thrift server with setup/cleanUp instead of calling serve/stop. In short, with serve/stop, calling stop makes serve return, which triggers cleanUp. If we call stop on several ThriftServer instances concurrently, we may crash because the thread pools are shared.
  2. With setup/cleanUp, we no longer need the extra threads that only wait forever; all of them are removed in this PR.
  3. Components are stopped in exactly the reverse of the order in which they were started.
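The reverse-order rule in point 3 can be sketched in plain C++. This is only an illustration, not the code in this PR: `Lifecycle`, `demoStopOrder`, and the start order of the three named components are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from this PR): remembers components in start order
// and stops them in exactly the reverse order.
class Lifecycle {
 public:
  // Runs startFn immediately and keeps stopFn for shutdown.
  void start(std::string name,
             const std::function<void()>& startFn,
             std::function<void()> stopFn) {
    assert(!stopped_ && "cannot start a component after stopAll()");
    startFn();
    started_.push_back(Entry{std::move(name), std::move(stopFn)});
  }

  // Stops the most recently started component first.
  // Returns the component names in the order they were stopped.
  std::vector<std::string> stopAll() {
    std::vector<std::string> order;
    while (!started_.empty()) {
      Entry entry = std::move(started_.back());
      started_.pop_back();
      entry.stop();
      order.push_back(std::move(entry.name));
    }
    stopped_ = true;
    return order;
  }

 private:
  struct Entry {
    std::string name;
    std::function<void()> stop;
  };
  std::vector<Entry> started_;
  bool stopped_ = false;
};

// Illustrative start order only; the real order in StorageServer may differ.
std::vector<std::string> demoStopOrder() {
  Lifecycle lifecycle;
  auto noop = [] {};
  lifecycle.start("MetaClient", noop, noop);
  lifecycle.start("kvstore", noop, noop);
  lifecycle.start("ThriftServers", noop, noop);
  return lifecycle.stopAll();
}
```

Keeping the stop callbacks in a stack means a component is always torn down before anything it was started after, so a dependency is still alive when its dependents stop.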

As for point 1, the explanation above is brief, so here are some more details:

  1. When a ThriftServer is torn down, three important functions need to be called: stopAcceptingAndJoinOutstandingRequests, stopCPUWorkers and stopWorkers.
  2. Even worse, two parameters (stopWorkersOnStopListening_ and joinRequestsWhenServerStops_) affect when these three functions are called (they are called in cleanUp, in ~ThriftServer, or in both).
  3. Since all four thrift servers share the same thread pools, the process crashes if we don't exit correctly.
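The shared-pool hazard can be modelled without any thrift code. The sketch below uses only the standard library and makes no claim about how ThriftServer works internally: `SharedPool`, `Server` and `demoSharedPoolShutdown` are hypothetical names. Each server only stops accepting work when it is stopped, and the shared pool is drained and joined once, after every server has stopped.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a worker pool shared by several servers.
class SharedPool {
 public:
  explicit SharedPool(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) workers_.emplace_back([this] { run(); });
  }
  ~SharedPool() { join(); }

  // Queues a task; returns false once shutdown has begun.
  bool add(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      if (stopping_) return false;
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
    return true;
  }

  // Drains queued tasks and joins the workers. Repeated calls are no-ops.
  void join() {
    std::call_once(joinOnce_, [this] {
      {
        std::lock_guard<std::mutex> guard(mu_);
        stopping_ = true;
      }
      cv_.notify_all();
      for (auto& worker : workers_) worker.join();
    });
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and fully drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::once_flag joinOnce_;
  std::vector<std::thread> workers_;
};

// Hypothetical server: stop() never touches the pool it shares with others.
class Server {
 public:
  explicit Server(std::shared_ptr<SharedPool> pool) : pool_(std::move(pool)) {}
  bool handle(std::function<void()> request) {
    return accepting_ && pool_->add(std::move(request));
  }
  void stop() { accepting_ = false; }

 private:
  std::shared_ptr<SharedPool> pool_;
  bool accepting_ = true;
};

// Returns the number of requests that completed before shutdown finished.
int demoSharedPoolShutdown() {
  auto pool = std::make_shared<SharedPool>(2);
  std::vector<Server> servers(4, Server(pool));  // four servers, one pool
  std::atomic<int> done{0};
  for (auto& server : servers) server.handle([&done] { ++done; });
  for (auto it = servers.rbegin(); it != servers.rend(); ++it) it->stop();
  bool rejected = !servers[0].handle([&done] { ++done; });
  assert(rejected);  // a stopped server takes no new work
  pool->join();      // joined once, after every server has stopped
  pool->join();      // a second call does nothing
  return done.load();
}
```

If each server instead joined the pool in its own stop path, the first server to stop would tear down workers that the other three still depend on, which is the kind of failure the description above refers to.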

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:
Not only introduced in this version, not related.

@critical27 critical27 added the ready-for-testing PR: ready for the CI test label Mar 15, 2022
@critical27 critical27 changed the title from "Stop" to "Refactor StorageServer stop" Mar 16, 2022
@Sophie-Xie Sophie-Xie added this to the v3.1.0 milestone Mar 21, 2022
liwenhui-soul previously approved these changes Mar 24, 2022
cangfengzhs previously approved these changes Mar 24, 2022
@cangfengzhs cangfengzhs left a comment

LGTM

liwenhui-soul previously approved these changes Mar 30, 2022
5 participants