The current TiFlash architecture is a typical shared nothing architecture, which brings some drawbacks:

1. Storage and computation cannot be scaled separately
2. Write and read operations land on the same nodes and can affect each other, competing for IO, CPU, and other resources.
   * AP query tasks are typically heavy and can spike suddenly.
3. Scaling is slow because data must be re-synchronized from TiKV nodes to TiFlash nodes.
4. The cost-efficiency is not optimal.
   * For example, all TiFlash instances need to be kept running even during periods of low query demand, so most of the computation resources are wasted.


## Basic ideas
The most critical issue of this architecture is: How can the data generated by WN be accessed by RN?
Options:

* EFS cost is too high to achieve the purpose of cost reduction
* Although AWS's EBS does have IOPS-provisioned volume types that can be attached to multiple nodes, they have many limitations, so they do not meet the needs of distributed databases for shared storage. Furthermore, they are expensive.
* S3 provides high durability and high availability at a low cost, and its throughput is high. The main issue is that the read and write latency are too high.
  * The throughput of S3 can fill the internet bandwidth of an EC2 node. For example, a 16-core EC2 node can easily use up all of its internet bandwidth (625 MB/s) by reading S3 files.
  * The latency of an S3 request is 10 ms ~ 1 s.
* JuiceFS
  * According to the technical documents of JuiceFS, I think JuiceFS can be used as a front end of S3 to store IO latency insensitive data, e.g. the stable data of DeltaTree, which could bring several benefits.
  * However, I do NOT think JuiceFS is suitable for IO latency sensitive data, e.g. raft logs and the delta data of DeltaTree.
    * JuiceFS only provides a "close-to-open" consistency guarantee, which is not enough for TiFlash / TiFlash Proxy (a TiKV fork). We keep a file open while there are write operations and do not close it until writing has finished, e.g. WAL files and BlobFiles in PageStorage.
    * Since the updates still have to be persisted to the real storage (i.e. written to S3) before commit, JuiceFS does NOT solve the latency issue of the write path.

| | S3 (Standard) | EBS (gp3) | Local disk |
| ---- | ---- | ---- | ---- |
| IO Latency | 10 ms ~ 2 s | <10ms | ~100 μs |


### Solution: Use different storage services to store different data

Based on the above analysis, it is impossible to find a perfect shared storage that meets all data requirements. A natural idea is to split the data into parts and store them in different storage services:

1. Most of the data is stored in S3
   * S3 is used to store the latency insensitive data, including the various files written to disk by the DeltaTree storage engine.
   * RN reads most of the data from S3, and requests WN for the rest (the latest data) during queries.
2. IO latency sensitive data is written directly to the Write Node's local disk and uploaded to S3 periodically, i.e. the local disk is used as a buffer.
   * About durability - if the data on the local disk is lost, it is recovered from TiKV. This is equivalent to relying on the upstream TiKV to provide high durability.
   * About availability - data needs to be resynchronized after a restart, and the service recovery time is affected by how much data has to be resynchronized. Since we upload data periodically (for example every 30s), the data to be resynchronized is only the increment relative to the last upload point. The total amount should be small, so it should not take very long; the unavailable duration should therefore not be a big issue. A sketch of this buffering scheme is given after this list.
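
To make the buffering idea in item 2 concrete, here is a minimal C++ sketch. It only illustrates the flow; all names (`S3Client`, `LocalDiskBuffer`, the object key layout, the upload interval) are illustrative assumptions, not the actual TiFlash implementation.

```cpp
// Sketch only: "write to local disk first, upload the increment to S3 periodically".
// Every name here is hypothetical; error handling and fsync are omitted for brevity.
#include <chrono>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct S3Client
{
    // Stand-in for a real S3 SDK call; assumed durable once it returns.
    void putObject(const std::string & key, const std::vector<char> & data) {}
};

class LocalDiskBuffer
{
public:
    LocalDiskBuffer(const std::string & wal_path, S3Client & s3_)
        : wal(wal_path, std::ios::binary | std::ios::app), s3(s3_) {}

    // Latency sensitive write path: touches only the local disk, never waits for S3.
    void write(const std::vector<char> & record)
    {
        std::lock_guard<std::mutex> lock(mu);
        wal.write(record.data(), static_cast<std::streamsize>(record.size()));
        wal.flush();
        pending.insert(pending.end(), record.begin(), record.end());
    }

    // Background loop: every `interval` (e.g. 30s), upload the increment since the
    // last upload point. After a crash, only this increment has to be re-synchronized
    // from the upstream TiKV, so the unavailable duration stays short.
    void uploadLoop(std::chrono::seconds interval)
    {
        for (uint64_t seq = 0;; ++seq)
        {
            std::this_thread::sleep_for(interval);
            std::vector<char> batch;
            {
                std::lock_guard<std::mutex> lock(mu);
                batch.swap(pending);
            }
            if (!batch.empty())
                s3.putObject("wn/incr/" + std::to_string(seq), batch);
        }
    }

private:
    std::ofstream wal;         // IO latency sensitive data lands here first
    S3Client & s3;
    std::mutex mu;
    std::vector<char> pending; // bytes written since the last upload point
};
```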


### How to upload to S3?
To answer the above questions, let's first look at the data distribution and flow.

![](images/2023-02-23-cloud-native-architecture-2.png)

1. <b>Raft log</b> - raft log is the data that has just been synced from the TiKV raft leader to TiFlash and has not yet been applied to the column storage engine (DeltaTree Engine). After being applied to the storage engine successfully, it is deleted.
   * It belongs to IO latency sensitive data and cannot be uploaded to S3 right after it is written.
2. <b>Column data</b> - The column data in the DeltaTree Engine is divided into the memtable and the column files persisted on disk. Note that the DeltaTree Engine has no WAL, so data that has not been flushed depends on raft log replay for recovery after a restart.
   * Although the memtable acts as a write buffer, the write IOPS caused by DeltaTree's flush can still reach 500 IOPS under busy conditions. So this part of the data should also be counted as storage latency sensitive data and cannot be uploaded to S3 immediately.
   * The column files that have just been flushed to disk are usually relatively small. In order to avoid too much fragmentation, compaction workers in the background organize and merge them into larger column files. Column file data is not IO latency sensitive and can be uploaded to S3 before the write is confirmed as successful. The classification is summarized in the sketch after this list.
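
As a summary of the classification above, here is a small C++ sketch; the enum and the helper function are purely illustrative and not part of TiFlash.

```cpp
// Sketch only: which of the data types described above may wait for an S3 upload.
#include <cstdint>

enum class DataKind : uint8_t
{
    RaftLog,             // just synced from TiKV, deleted after being applied
    FreshColumnFile,     // small file produced by a DeltaTree flush (~500 IOPS when busy)
    CompactedColumnFile, // large file produced by background compaction
};

// True if a write of this kind can block on S3 (latency insensitive);
// false if it must be absorbed by the Write Node's local disk first.
constexpr bool canWaitForS3(DataKind kind)
{
    return kind == DataKind::CompactedColumnFile;
}
```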

If we want to use different strategies for different data types, it will undoubtedly be a lot of work.

What's more troublesome is that it is very complicated to achieve consistency across these different kinds of data.


### Solution: All data is stored in PageStorage
PageStorage can be treated as a local key-value store, which stores `PageID -> Page` pairs. A Page is an opaque byte blob whose size ranges from 1KB to 100MB.


1. Pages can be stored permanently while avoiding the resulting write amplification
4. It supports MVCC snapshots, that is, the application layer can obtain a fixed view of the data according to a snapshot
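
The following C++ sketch shows a minimal interface with the properties listed above (persistent `PageID -> Page` pairs plus an MVCC snapshot). The class name, method names, and the copy-based snapshot are assumptions for illustration only; the real PageStorage shares data between versions instead of copying.

```cpp
// Sketch only: a toy stand-in for PageStorage, not the real TiFlash API.
#include <cstdint>
#include <map>
#include <memory>
#include <string>

using PageID = uint64_t;
using Page = std::string; // an opaque blob, typically 1KB ~ 100MB

class PageStorageSketch
{
public:
    // Pages can be written and read back by id (durability is elided here).
    void put(PageID id, Page page) { latest[id] = std::move(page); }

    const Page * get(PageID id) const
    {
        auto it = latest.find(id);
        return it == latest.end() ? nullptr : &it->second;
    }

    // MVCC snapshot: a fixed view of all pages, unaffected by later writes.
    // A full copy keeps the sketch simple; the real engine versions pages instead.
    using Snapshot = std::shared_ptr<const std::map<PageID, Page>>;
    Snapshot getSnapshot() const { return std::make_shared<const std::map<PageID, Page>>(latest); }

private:
    std::map<PageID, Page> latest;
};
```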


Managing these complex IO dependencies at the application layer is doomed to have no future. We choose to store all TiFlash data in PageStorage (PS), and then let PS upload all data changes periodically. Since PS supports snapshots, this solves the consistency problem.


So the basic upload mechanism works like an EBS snapshot, i.e. changes are uploaded to S3 periodically.

1. It is guaranteed that if the PS state on the TiFlash disk at a certain time t is S(t), then after a restart at that time, the PS based on the local disk can be restored to the state S(t).
2. If we take a PS snapshot at time t and upload all the data in the snapshot to S3 to generate a file F, then the TiFlash state recovered from F must be equal to S(t). A sketch of this checkpointing step follows.
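
To illustrate guarantee 2, here is a hedged C++ sketch of turning a snapshot taken at time t into a single checkpoint file F and restoring from it. The serialization format and all names are invented for the example and say nothing about the real on-S3 layout.

```cpp
// Sketch only: build checkpoint file F from the snapshot S(t); restoring from F
// reproduces exactly S(t), no matter what is written after time t.
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

using PageID = uint64_t;
using Page = std::string;
using PSSnapshot = std::map<PageID, Page>; // the fixed view S(t)

std::string buildCheckpointFile(const PSSnapshot & s_at_t)
{
    std::ostringstream out;
    for (const auto & [id, page] : s_at_t)
        out << id << ' ' << page.size() << '\n' << page; // toy length-prefixed format
    return out.str(); // F: upload this object to S3, keyed by t
}

PSSnapshot restoreFromCheckpoint(const std::string & f)
{
    PSSnapshot restored;
    std::istringstream in(f);
    PageID id;
    size_t size;
    while (in >> id >> size)
    {
        in.get(); // skip the '\n' separator
        Page page(size, '\0');
        in.read(&page[0], static_cast<std::streamsize>(size));
        restored[id] = std::move(page);
    }
    return restored; // equal to S(t) by construction
}
```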

Here is the new architecture:

## Key Implementation Points

* WN periodically uploads checkpoints and delta data to S3.
* Store raft log data in PageStorage instead of in [raft-engine](https://github.com/tikv/raft-engine).
* Support generating query snapshots on WN and sending the latest data to RN.
* Support RN reading data directly from S3, and caching data on the local disk for faster access (see the sketch after this list).
* Support Fast Add Peer, i.e. a TiFlash WN uses the data uploaded by other WNs as the region snapshot when applying a snapshot. The detailed design will be covered in another document.
* Support S3 GC. The data files on S3 are shared among WNs, so it is difficult to decide whether a file can be deleted or not. The detailed design will be covered in another document.
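
A rough C++ sketch of the read path on RN mentioned above (read from S3, cache on the local disk). The `S3Client` stub, the cache layout, and the lack of eviction are all simplifying assumptions; they do not reflect the actual implementation.

```cpp
// Sketch only: read-through local disk cache for RN; no eviction, no error handling.
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

struct S3Client
{
    // Stand-in for a real S3 SDK download call.
    std::string getObject(const std::string & key) { return {}; }
};

class LocalFileCache
{
public:
    LocalFileCache(std::filesystem::path dir_, S3Client & s3_) : dir(std::move(dir_)), s3(s3_) {}

    // Hit the local disk first; on a miss, download from S3 and keep a copy on disk.
    std::string read(const std::string & key)
    {
        const auto path = dir / key;
        if (std::ifstream in{path, std::ios::binary})
            return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());

        std::string data = s3.getObject(key); // slow path: S3 latency is 10 ms ~ 1 s
        std::filesystem::create_directories(path.parent_path());
        std::ofstream(path, std::ios::binary).write(data.data(), static_cast<std::streamsize>(data.size()));
        return data;
    }

private:
    std::filesystem::path dir;
    S3Client & s3;
};
```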
