*: fix node bootstrap not idempotent #1774
Conversation
Please add a test for the bootstrap flow.
src/server/node.rs
Outdated
for _ in 0..MAX_CHECK_CLUSTER_BOOTSTRAPPED_RETRY_COUNT {
    match self.pd_client.get_region(b"") {
        Ok(region) => {
            if region.get_id() == region_id {
I think we must check the region epoch here; if it is not equal, there must be some fatal error.
Any update? @nolouch
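A minimal sketch of such a check, assuming kvproto's metapb types; the helper name is illustrative and mirrors the check_region_epoch used later in this diff:

use kvproto::metapb;

// Returns true when the local first region's epoch matches the one PD
// reports. A mismatch during bootstrap means the stored state diverged
// from PD, which should be treated as fatal.
fn region_epoch_matches(local: &metapb::Region, pd: &metapb::Region) -> bool {
    let l = local.get_region_epoch();
    let p = pd.get_region_epoch();
    l.get_version() == p.get_version() && l.get_conf_ver() == p.get_conf_ver()
}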
fn bootstrap_cluster(&mut self, engine: &DB, region: metapb::Region) -> Result<()> {
    let region_id = region.get_id();
    match self.pd_client.bootstrap_cluster(self.store.clone(), region) {
        Err(PdError::ClusterBootstrapped(_)) => {
            error!("cluster {} is already bootstrapped", self.cluster_id);
            try!(store::clear_region(engine, region_id));
            try!(store::clear_prepare_bootstrap(engine, region_id));
I guess you want clear_prepare_bootstrap_state() if PD returns Ok() at L237.
You are right.
src/raftstore/store/bootstrap.rs
Outdated
pub fn clear_prepare_bootstrap_state(engine: &DB) -> Result<()> {
    let wb = WriteBatch::new();
    try!(wb.delete(&keys::prepare_bootstrap_key()));
    try!(engine.write(wb));
Use engine.delete directly.
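A sketch of the suggested simplification, assuming rust-rocksdb's DB::delete and the Result and keys imports already in this file:

// A single-key delete does not need a WriteBatch.
pub fn clear_prepare_bootstrap_state(engine: &DB) -> Result<()> {
    try!(engine.delete(&keys::prepare_bootstrap_key()));
    Ok(())
}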
src/raftstore/store/keys.rs
Outdated
@@ -60,6 +61,9 @@ pub const REGION_STATE_SUFFIX: u8 = 0x01;
pub fn store_ident_key() -> Vec<u8> {
    STORE_IDENT_KEY.to_vec()
}
pub fn prepare_bootstrap_key() -> Vec<u8> {
Add a blank line between the functions.
src/server/node.rs
Outdated
Ok(region) => {
    if region.get_id() == first_region.get_id() {
        if !self.check_region_epoch(region.clone(), first_region.clone()) {
            return Err(box_err!("first region epoch inconsistent with pd info"));
Add more info to the error message here.
src/server/node.rs
Outdated
            Ok(())
        }
        // TODO: should we clean region for other errors too?
        Err(e) => panic!("bootstrap cluster {} err: {:?}", self.cluster_id, e),
        Ok(_) => {
            if let Err(e) = store::clear_prepare_bootstrap_state(engine) {
                warn!("clear prepare bootstrap state failed: {:?}", e);
Return the error instead of only logging a warning.
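A sketch of that change, replacing the warn! arm above (try! and the surrounding match as in the diff):

Ok(_) => {
    // Propagate the failure instead of swallowing it with a warning.
    try!(store::clear_prepare_bootstrap_state(engine));
    Ok(())
}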
src/server/node.rs
Outdated
match self.pd_client.get_region(b"") {
    Ok(region) => {
        if region.get_id() == first_region.get_id() {
            if !self.check_region_epoch(region.clone(), first_region.clone()) {
You can use & here; no need to clone.
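For example, assuming check_region_epoch is changed to take references:

if !self.check_region_epoch(&region, &first_region) {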
tests/raftstore/test_bootstrap.rs
Outdated
// Assume a node has bootstrapped the cluster and added its region to PD successfully.
cluster.add_first_region().unwrap();
// Now start another node at the same time; it sees the cluster as not yet
// bootstrapped and tries to bootstrap with a new region.
I think we need to test both cases: when PD is bootstrapped and when it is not.
I start the cluster twice; the second time, the cluster is already bootstrapped.
PTAL @siddontang @disksing
LGTM.
LGTM
tests/raftstore/pd.rs
Outdated
@@ -362,6 +373,10 @@ impl TestPdClient {
    Ok(())
}

fn is_regions_empty(&self) -> Result<(bool)> {
Do we need Result here?
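A sketch of the simpler signature, assuming the test PD client keeps its regions in a map; the field access is illustrative:

// No fallible work happens here, so a plain bool is enough.
fn is_regions_empty(&self) -> bool {
    self.regions.is_empty()
}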
tests/raftstore/test_bootstrap.rs
Outdated
use super::node::new_node_cluster;
use super::util::*;

fn test_bootstrap_idempotent<T: Simulator>(cluster: &mut Cluster<T>) {
Can we check both conf_ver and version inconsistency in check_region_epoch?
Any update? @nolouch
cluster.check_regions_number(1);
cluster.shutdown();
sleep_ms(500);
cluster.start();
Start twice to check what?
I think we must check the prepare bootstrap key in RocksDB too.
Any update?
The first time the cluster is not bootstrapped; the second time it is bootstrapped.
- For the first start, check the bootstrap key in RocksDB directly
- For the second start, check again too.
I have checked it here: https://github.com/pingcap/tikv/pull/1774/files#diff-7d211b12fe1e345db17e9dbfb322abd3R186
No, not the same. Here we must check the prepare key in RocksDB explicitly.
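A sketch of such an explicit check after each start, assuming rust-rocksdb's DB::get and the keys module; the engine handle would come from the test cluster:

// The prepare bootstrap key must be gone once bootstrap has completed.
let v = engine.get(&keys::prepare_bootstrap_key()).unwrap();
assert!(v.is_none());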
… into shuning/fix-bootstrap
CI failed
PTAL @siddontang
tests/raftstore/test_bootstrap.rs
Outdated
// Check that bootstrap meets an inconsistent epoch; an error is expected here.
let e = sim.wl()
    .run_node_with_handle_error(1, cluster.cfg.clone(), engine.clone())
If you only want to check whether the epoch function works right or not, you don't need to do it this way.
You can just extract the check_epoch function into a common function, outside Node, and then test it directly.
IMO, your test is complex and not easy to understand.
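A sketch of that extraction as a free function that can be unit-tested without constructing a Node, using box_err! and the crate's Result as elsewhere in this diff:

// Fails when the epochs of the two regions diverge.
pub fn check_region_epoch(region: &metapb::Region, other: &metapb::Region) -> Result<()> {
    let e1 = region.get_region_epoch();
    let e2 = other.get_region_epoch();
    if e1.get_version() != e2.get_version() || e1.get_conf_ver() != e2.get_conf_ver() {
        return Err(box_err!("region epoch not match: {:?} != {:?}", e1, e2));
    }
    Ok(())
}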
… into shuning/fix-bootstrap
Ping @nolouch
src/server/node.rs
Outdated
r3.set_end_key(keys::EMPTY_KEY.to_vec());
r3.mut_region_epoch().set_version(1);
r3.mut_region_epoch().set_conf_ver(2);
match check_region_epoch(&r1, &r2).unwrap_err() {
You can use assert!(check_region_epoch().is_err()) directly.
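Applied to the snippet above, the match collapses to a single assertion:

assert!(check_region_epoch(&r1, &r2).is_err());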
match self.sim.try_read() {
    Ok(s) => keys = s.get_node_ids(),
    Err(sync::TryLockError::Poisoned(e)) => {
        let s = e.into_inner();
Why do we need to check this?
I hit a panic-on-panic and the test core dumped: the test case panics, then Drop runs and calls shutdown, and the poisoned lock panics again. This check prevents that and preserves the first panic's stack information.
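A sketch of the full recovery arm, continuing the snippet above; into_inner recovers the guard from the poisoned lock:

Err(sync::TryLockError::Poisoned(e)) => {
    // Recover the guard so the Drop-driven shutdown does not panic
    // again and hide the original stack trace.
    let s = e.into_inner();
    keys = s.get_node_ids();
}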
LGTM. PTAL @disksing
tests/raftstore/test_bootstrap.rs
Outdated
cfg.raft_store.use_sst_file_snapshot);

// assume there is a node has bootstraped the cluster and add region in pd successfully
bootstrapped
tests/raftstore/test_bootstrap.rs
Outdated
#[test]
fn test_bootstrap_with_prepared_data() {
    test_node_bootstrap_with_prepared_data();
Why not add #[test] to test_node_bootstrap_with_prepared_data directly?
Like other tests do; it makes the #[test] easy to find.
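For example:

#[test]
fn test_node_bootstrap_with_prepared_data() {
    // test body unchanged; no extra wrapper needed
}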
… into shuning/fix-bootstrap
PTAL @disksing
LGTM.
close #1761
PTAL @disksing @siddontang