Support Multi-tenant and Sharded TSO #5895
ref #5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
ref #5895 Some code refinements for `serviceModeKeeper`. Signed-off-by: JmPotato <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
ref #5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
Meeting notes of TSO wg sync-up (attendees: @rleungx , @lhy1024 , @hnes , @binshi-bing ):
ref #5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
After internal sync within the PD TSO Working Group, here is the roadmap to land on TSO microservice:
ref #5895 Fix the problem described above by serializing the tso stream creation. Signed-off-by: Bin Shi <[email protected]> Co-authored-by: lhy1024 <[email protected]>
ref #5895 Add general tso forward/dispatcher for independent pd(tso)/tso services and cross cluster forwarding. Signed-off-by: Bin Shi <[email protected]>
ref #5895 Support basic functions of multi-keyspace-group management Signed-off-by: Bin Shi <[email protected]>
ref #5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
ref #5895
- Refine the TSO allocator manager parameters.
- Always run `tsoAllocatorLoop` to advance the Global TSO.

Signed-off-by: JmPotato <[email protected]>
ref tikv#5895 Add benchmarks for keyspace assignment patrol. Signed-off-by: JmPotato <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…6514) ref tikv#5895 To keep the logging info in on-premises clean, we only print keyspace-group-id zap field for the non-default keyspace group id. Signed-off-by: Bin Shi <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895, close tikv#6304 Rewrite TSO gRPC/HTTP server Close(). Signed-off-by: Bin Shi <[email protected]>
ref tikv#5895 mcs, tso: change keyspace group primary path. The path for non-default keyspace group primary election changes from "/ms/{cluster_id}/tso/{group}/primary" to "/ms/{cluster_id}/tso/keyspace_groups/election/{group}/primary". Default keyspace group keeps /ms/{cluster_id}/tso/00000/primary. Signed-off-by: Bin Shi <[email protected]>
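The path rule described in this commit can be sketched as follows. This is a minimal illustration, not PD's actual code: the helper name `primaryElectionPath` is hypothetical, and it assumes that the default keyspace group has ID 0 and that non-default group IDs use the same five-digit zero padding as the `00000` segment.

```go
package main

import "fmt"

// defaultKeyspaceGroupID is assumed to be 0, matching the "00000" path segment.
const defaultKeyspaceGroupID uint32 = 0

// primaryElectionPath is a hypothetical helper that builds the primary
// election path for a keyspace group under the new layout.
func primaryElectionPath(clusterID uint64, groupID uint32) string {
	if groupID == defaultKeyspaceGroupID {
		// The default keyspace group keeps the old path.
		return fmt.Sprintf("/ms/%d/tso/%05d/primary", clusterID, groupID)
	}
	// Non-default groups move under keyspace_groups/election.
	// Assumption: the group ID is zero-padded to five digits here as well.
	return fmt.Sprintf("/ms/%d/tso/keyspace_groups/election/%05d/primary", clusterID, groupID)
}

func main() {
	fmt.Println(primaryElectionPath(1, 0))
	fmt.Println(primaryElectionPath(1, 2))
}
```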
ref tikv#5895 Add TestUpgradingAPIandTSOClusters to test the scenario that after we restart the API cluster then restart the TSO cluster, the TSO service can still serve TSO requests normally. Signed-off-by: Bin Shi <[email protected]>
ref tikv#5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: Ryan Leung <[email protected]>
ref tikv#5895, ref tikv#6390 Signed-off-by: Ryan Leung <[email protected]>
ref tikv#5895 Improve tso proxy reliability.
1. Add protection mechanisms to TSO Proxy.
   a. Throttle the concurrency of TSO Proxy streamings. Default 5000.
   b. If TSO Proxy hasn't received a TSO request from the client for 1 hour, close the stream.
2. Optimize forceLoad lock with RW lock.
3. Enable stress test.
4. Add deadline for API leader forwarding request to TSO service.
5. Make the tso response channel safer.
6. Move tso proxy stress test away from the test suite as it has an impact on other test cases.
7. Fix grpc client connection pool (server side) resource leak problem.
8. Make MaxConcurrentTSOProxyStreamings (5000 as default) and TSOProxyClientRecvTimeout (1 hour as default) configurable.
9. Add metrics tsoProxyHandleDuration, tsoProxyBatchSize and tsoProxyForwardTimeoutCounter.

Signed-off-by: Bin Shi <[email protected]>
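The two protection mechanisms in item 1 (a cap on concurrent streams and an idle timeout on client requests) could look roughly like the sketch below. It is only an illustration under stated assumptions: `streamLimiter`, `recvWithTimeout`, and both error values are hypothetical names, and the real TSO Proxy may implement the limits differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errTooManyStreams = errors.New("too many concurrent TSO proxy streams")
	errIdleStream     = errors.New("no TSO request received before the timeout")
)

// streamLimiter is a hypothetical counting semaphore that caps the number
// of concurrently open proxy streams.
type streamLimiter struct {
	slots chan struct{}
}

func newStreamLimiter(max int) *streamLimiter {
	return &streamLimiter{slots: make(chan struct{}, max)}
}

// acquire reserves a slot, or fails immediately when the cap is reached.
func (l *streamLimiter) acquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return errTooManyStreams
	}
}

// release frees a slot previously taken by acquire.
func (l *streamLimiter) release() { <-l.slots }

// recvWithTimeout waits for the next request and gives up once the stream
// has been idle for the given duration, so the caller can close it.
func recvWithTimeout(reqs <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case r := <-reqs:
		return r, nil
	case <-timer.C:
		return "", errIdleStream
	}
}

func main() {
	limiter := newStreamLimiter(2) // the commit's default cap is 5000
	fmt.Println(limiter.acquire(), limiter.acquire(), limiter.acquire())
	limiter.release()

	idle := make(chan string) // a client that never sends a request
	_, err := recvWithTimeout(idle, 50*time.Millisecond) // the commit's default is 1 hour
	fmt.Println(err)
}
```

Rejecting a new stream immediately, rather than queueing it, keeps the proxy from accumulating blocked goroutines when it is already at the cap.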
…6581) ref tikv#5895 Add keyspace and keyspace group info to the time fallback log to help debugging time fallback issue in multi-timeline scenario. Signed-off-by: Bin Shi <[email protected]>
ref tikv#5895 Add failure test cases. Signed-off-by: Bin Shi <[email protected]>
…eyspace movement state change in the persistent store (tikv#6596) ref tikv#5895 Fix potential inconsistency caused by applying the state change non-atomically in the persistent store in the following cases: 1. Keyspace group split/merge. 2. Keyspace movement across keyspace groups. Signed-off-by: Bin Shi <[email protected]>
…ikv#6654) ref tikv#5895 Add keyspace group info in the timestamp fallback log in the client. Signed-off-by: Bin Shi <[email protected]>
ref tikv#5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…6657) ref tikv#5895 Fix the keyspace ID RW race inside `tsoServiceDiscovery`. Signed-off-by: JmPotato <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Add more debugging info to time fallback log. [2023/06/27 10:50:54.196 -07:00] [PANIC] [tso_dispatcher.go:764] ["[tso] timestamp fallback"] [dc-location=global] [keyspace=4294967295] [last-ts="(1687888254152, 1)"] [cur-ts="(1687888254052, 2)"] [last-tso-server=127.0.0.1:3380] [cur-tso-server=127.0.0.1:3380] [last-keyspace-group-in-request=0] [cur-keyspace-group-in-request=0] [last-keyspace-group-in-response=0] [cur-keyspace-group-in-response=0] [last-response-received-at=2023/06/27 10:50:54.195 -07:00] [cur-response-received-at=2023/06/27 10:50:54.196 -07:00] Signed-off-by: Bin Shi <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
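The condition behind this log can be illustrated with a small sketch. It assumes that a fallback means the current timestamp is not strictly greater than the last one when `(physical, logical)` pairs are compared in order. The `timestamp` type and `isFallback` function are hypothetical names, not the client's actual API.

```go
package main

import "fmt"

// timestamp is a hypothetical pair mirroring the "(physical, logical)"
// values printed in the log; physical is in Unix milliseconds.
type timestamp struct {
	physical int64
	logical  int64
}

// lessEqual reports whether t <= other, comparing physical first and
// logical as the tie-breaker.
func (t timestamp) lessEqual(other timestamp) bool {
	if t.physical == other.physical {
		return t.logical <= other.logical
	}
	return t.physical < other.physical
}

// isFallback reports whether cur fails to advance past last.
func isFallback(last, cur timestamp) bool {
	return cur.lessEqual(last)
}

func main() {
	// Values taken from the log line above.
	last := timestamp{physical: 1687888254152, logical: 1}
	cur := timestamp{physical: 1687888254052, logical: 2}
	fmt.Println(isFallback(last, cur)) // prints "true"
}
```

In the logged case the current physical part is 100 ms behind the last one, so the larger logical part does not help.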
ref tikv#5895 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895, ref tikv#6706 Signed-off-by: Ryan Leung <[email protected]>
…v#6736) ref tikv#5895, close tikv#6696 Implement `groupSplitPatroller` to speed up the split process. Signed-off-by: JmPotato <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…tso-bench (tikv#6608) ref tikv#5895 support multi-keyspace, fault injection and keyspace-name in pd-tso-bench Signed-off-by: Bin Shi <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Feature Request
Describe your feature request related problem
The current TSO solution is known for its poor scalability and single point of failure. As we are moving to 'Cloud' to provide DBaaS and Serverless Computing, supporting multi-tenancy is one of the primary goals, which brings more requirements to the TSO service, including:
Describe the feature you'd like
By sharding TSO service across tenants, we aim to achieve the following goals:
Big Picture
For more details, please refer to the RFC (TODO: add link to the RFC).
Milestone 1 - (goal) deliver single-group tso microservice (Completed)
All work is tracked in #5836
Milestone 2 - (goal) code complete for multi-group tso microservice (4/21/2023) (Completed)
Milestone 3 - (goal) deploy multi-group tso microservice in dev env (5/5/2023) (Note: May 1st - 3rd, holiday)
Milestone 4 - (goal) deploy multi-group tso microservice in staging env (5/29/2023)
- `GetMinTS` interface tidbcloud/kvproto#16 @rleungx
- `GetMinTS` interface pingcap/kvproto#1116 @rleungx

(Milestone 4 landing plan) timeline breakdown
Serverless service team confirmed that there are existing E2E test cases to cover BR & GCSafePoint scenarios.
Risk: unexpected compatibility issue
- [x] TSO server stuck issue caused PD out of service during EKS upgrading @binshi-bing
  - TSO Server Close() gets stuck sometimes #6530
  - Fix tso server close stuck issue #6529
  - Add test case to simulate EKS upgrading (restart the entire API cluster then TSO cluster) #6534
- [x] mcs, tso: change keyspace group primary path. #6526 @binshi-bing
- [x] The api leader got stuck at tso requests forwarding #6549 @binshi-bing
- [x] "Not enough replicas" caused keyspace group split to fail randomly #6550 @rleungx @lhy1024
Milestone 5 - (goal) deploy multi-group tso microservice in prod env (ETA: N/A) (the exact time will be decided by Serverless service team)
Describe alternatives you've considered
Please see "The Alternative Architectures Considered" in the RFC (TODO: add link to the RFC).
Teachability, Documentation, Adoption, Migration Strategy