HBASE-27516 Document the table based replication queue storage in ref guide #5203
Conversation
Replication State in ZooKeeper::
By default, the state is contained in the base node _/hbase/replication_.
Usually this node contains two child nodes: the `peers` znode is for storing replication peer state, and the `rs` znode is for storing replication queue state.
Currently, this node contains only one child node, namely the `peers` znode, which is used for storing replication peer state.
As we now only have one ref guide, on the master branch, we should include the documentation for all branches. So here I think we should mention that after 3.0.0 it only contains one child node, but before 3.0.0 we still use zk to store queue data.
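To make that layout concrete, here is a minimal sketch (not part of this PR) that lists the children of the default base node with the plain ZooKeeper client; the quorum address is illustrative only.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListReplicationZnodes {
  public static void main(String[] args) throws Exception {
    // Point this at the cluster's ZooKeeper ensemble; the address is made up.
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> { });
    try {
      List<String> children = zk.getChildren("/hbase/replication", false);
      // Before 3.0.0 this typically prints "peers" and "rs";
      // from 3.0.0 on, only "peers" (queue state lives in hbase:replication).
      children.forEach(System.out::println);
    } finally {
      zk.close();
    }
  }
}
```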
Each master cluster region server has its own znode in the replication znodes hierarchy.
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contains a queue of WALs to process.
Each master cluster region server has its queue state in the hbase:replication table.
It contains one row per peer cluster (if 5 slave clusters, 5 rows are created), and each of these contains a queue of WALs to process.
Here things are a bit different. For ZooKeeper it is like a tree: we have a znode for a peer cluster, and under that znode we have lots of WAL files.
But for the table based implementation, we have the server name in the row key, which means we will have lots of rows for a given peer...
After queues are all transferred, they are deleted from the old location.
The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
When a region server fails, the HMaster of the master cluster will trigger the SCP, and all replication queues on the failed region server will be claimed in the SCP.
The claim queue operation is just to remove the row of a replication queue and insert a new row, where we change the server name to the region server which claims the queue.
We need to mention that we use the multi row mutate endpoint here, so the data for a single peer must be in the same region.
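To illustrate this point: the claim operation deletes the old row and inserts the new one atomically through the multi row mutation coprocessor endpoint, which only operates within a single region, so all rows of a given peer must stay in the same region. The following is a rough sketch of that client-side pattern, not the actual ReplicationQueueStorage code; the row keys, column family, and qualifier are made up for illustration, and it assumes the MultiRowMutationEndpoint coprocessor is available on the table.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos.MutationProto.MutationType;
import org.apache.hadoop.hbase.protobuf.generated.MultiRowMutationProtos.MultiRowMutationService;
import org.apache.hadoop.hbase.protobuf.generated.MultiRowMutationProtos.MutateRowsRequest;
import org.apache.hadoop.hbase.util.Bytes;

public class ClaimQueueSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("hbase:replication"))) {
      // Hypothetical row keys: the queue of peer "1" owned by the dead server,
      // and the same queue re-keyed to the claiming server.
      byte[] oldRow = Bytes.toBytes("1-dead.example.com,16020,1680000000000");
      byte[] newRow = Bytes.toBytes("1-claimer.example.com,16020,1680000000001");

      MutateRowsRequest.Builder request = MutateRowsRequest.newBuilder();
      request.addMutationRequest(
          ProtobufUtil.toMutation(MutationType.DELETE, new Delete(oldRow)));
      request.addMutationRequest(ProtobufUtil.toMutation(MutationType.PUT,
          new Put(newRow).addColumn(Bytes.toBytes("queue"), Bytes.toBytes("offset"),
              Bytes.toBytes("..."))));

      // Both mutations are sent to the region hosting oldRow, which is why all
      // rows of a given peer have to live in the same region for this to be atomic.
      CoprocessorRpcChannel channel = table.coprocessorService(oldRow);
      MultiRowMutationService.BlockingInterface service =
          MultiRowMutationService.newBlockingStub(channel);
      service.mutateRows(null, request.build());
    }
  }
}
```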
The `Peers` Znode::
The `peers` znode is stored in _/hbase/replication/peers_ by default.
It consists of a list of all peer replication clusters, along with the status of each of them.
The value of each peer is its cluster key, which is provided in the HBase Shell.
The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase on that cluster.

The `RS` Znode::
We'd better keep this unchanged, as it describes what we have before 3.0.0. And we can introduce a new section to describe the hbase:replication table storage.
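As a concrete illustration of the cluster key described in the `peers` znode excerpt above, here is a hedged sketch of adding a peer through the Java Admin API; the peer id and ZooKeeper addresses are made up. The HBase Shell `add_peer` command accepts the same cluster key string.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddPeerSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Cluster key = <zookeeper quorum>:<client port>:<base znode of the peer cluster>.
      ReplicationPeerConfig peerConfig = ReplicationPeerConfig.newBuilder()
          .setClusterKey("zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase")
          .build();
      admin.addReplicationPeer("1", peerConfig);
    }
  }
}
```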
Replication State in ZooKeeper::
By default, the state is contained in the base node _/hbase/replication_.
Usually this node contains two child nodes: the `peers` znode is for storing replication peer state, and the `rs` znode is for storing replication queue state.
After 3.0.0, it only contains one child node; before 3.0.0, we still use zk to store queue data.
"Usually this nodes contains two child nodes, the peers
znode is for storing replication peer state, and the rs
znodes is for storing replication queue state. And if you choose the file system based replication peer storage, you will not see the peers
znode. And starting from 3.0.0, we have moved the replication queue state to hbase:replication table, so you will not see the rs
znode."
@@ -2433,26 +2433,22 @@ Replication State Storage::
`ReplicationPeerStorage` and `ReplicationQueueStorage`. The former one is for storing the
replication peer related states, and the latter one is for storing the replication queue related
states.
HBASE-15867 is only half done, as although we have abstracted these two interfaces, we still only
have zookeeper based implementations.
And in HBASE-27109, we have implemented the `ReplicationQueueStorage` interface to store the replication queue in the hbase:replication table.
And in HBASE-27110, we have implemented a file system based replication peer storage, to store replication peer state on the file system. Of course you can still use the zookeeper based replication peer storage.
And in HBASE-27109, we have changed the replication queue storage from zookeeper based to hbase table based. See the 'Replication Queue State in hbase:replication table' section below for more details.
@@ -2475,14 +2471,14 @@ When nodes are removed from the slave cluster, or if nodes go down or come back

==== Keeping Track of Logs

Each master cluster region server has its own znode in the replication znodes hierarchy.
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contains a queue of WALs to process.
Before 3.0.0, for the zookeeper based implementation, it is like a tree: we have a znode for a peer cluster, and under the znode we have lots of WAL files.
I think here we'd better make two different sections to describe the logic before and after 3.0.0, as on zookeeper we store all the WAL files, while for the table based solution we only store an offset.
Each of these queues will track the WALs created by that region server, but they can differ in size.
For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
See <<rs.failover.details,rs.failover.details>> for an example.

When a source is instantiated, it contains the current WAL that the region server is writing to.
During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
During log rolling, the new file is added to the queue of each slave cluster's record just before it is made available.
This is different for the table based replication queue storage, and it is the key point here. For zookeeper, it is an external system so there is no problem letting log rolling depend on it, but if we want to store the state in an hbase table, we can not let log rolling depend on it as it would introduce a dead lock...
We only write to hbase:replication when we want to record an offset after replicating something.
When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.

Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
I think we still need to keep this, as it is the logic for some hbase releases. Let me check the version where we started to use SCP to claim replication queues.
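For the watcher-based failover described in this hunk, the race is essentially every surviving region server trying to create the same `lock` znode and treating "node already exists" as losing. Below is a minimal sketch under those assumptions (plain ZooKeeper client, illustrative path and server names, persistent create mode); the real HBase logic is more involved.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClaimLockSketch {
  // Returns true if this region server won the race for the dead server's queues.
  static boolean tryLockDeadServerQueues(ZooKeeper zk, String deadServer, String myServer)
      throws KeeperException, InterruptedException {
    String lockPath = "/hbase/replication/rs/" + deadServer + "/lock";
    try {
      zk.create(lockPath, myServer.getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      return true; // won: go on to copy the dead server's queues under our own znode
    } catch (KeeperException.NodeExistsException e) {
      return false; // lost: another region server already holds the lock
    }
  }
}
```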
@Apache9 sir. Could you take a look? Thanks.
@@ -2454,6 +2450,12 @@ The `RS` Znode::
The child znode name is the region server's hostname, client port, and start code.
This list includes both live and dead region servers.

[[hbase:replication]]
hbase:replication::
Use "The hbase:replication
Table"?
After 3.0.0, for the table based implementation, we have the server name in the row key, which means we will have lots of rows for a given peer.

For a normal replication queue, where the WAL files belong to it is still alive, all the WAL files are kept in memory, so we do not need to get the WAL files from replication queue storage.
"the region server is still alive"
@@ -2519,12 +2533,12 @@ The next time the cleaning process needs to look for a log, it starts by using i
NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.

[[rs.failover.details]]
==== Region Server Failover
==== Region Server Failover (based on ZooKeeper)
Here we do not need to use two different sections. First, we could mention the 'setting a watcher' way; this is the way in the old releases.
And starting from 2.5.0, the failover logic has been moved to SCP, where we add a `SERVER_CRASH_CLAIM_REPLICATION_QUEUES` step in SCP to claim the replication queues for a dead server.
And starting from 3.0.0, where we changed the replication queue storage from zookeeper to table, the update to the replication queue storage is async, so we also need an extra step to add the missing replication queues before claiming.
And on how to claim the replication queue, you can have two sections, to describe the layout and the claiming way for the zookeeper based implementation and the table based implementation.
No big concerns from me, just a minor nit.
Thanks @2005hithlj !
Assume that 1.1.1.2 failed.
The survivors will claim queue of that, and, arbitrarily, 1.1.1.3 wins.
It will claim all the queue of 1.1.1.2, including removing the row of a replication queue, and inserting a new row(where we change the server name to the region server which claims the queue).
Finally ,the layout will look like the following:
The position of the ',' is incorrect?
OK sir, I have revised.
… guide (#5203) Signed-off-by: Duo Zhang <[email protected]>
https://issues.apache.org/jira/browse/HBASE-27516