
*: fix node bootstrap not idempotent #1774

Merged: 23 commits from shuning/fix-bootstrap into master on Apr 28, 2017

Conversation

@nolouch (Contributor) commented Apr 17, 2017

@CLAassistant commented Apr 17, 2017

CLA assistant check
All committers have signed the CLA.

@siddontang (Contributor):

please add a test for the bootstrap flow.

for _ in 0..MAX_CHECK_CLUSTER_BOOTSTRAPPED_RETRY_COUNT {
match self.pd_client.get_region(b"") {
Ok(region) => {
if region.get_id() == region_id {
Contributor:

I think we must check region epoch, if not equal, there must be some fatal error here.
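A sketch of such an epoch check (the helper name, signature, and error text here are placeholders, not the code this PR ends up with):

fn check_region_epoch(local: &metapb::Region, pd_region: &metapb::Region) -> Result<()> {
    let le = local.get_region_epoch();
    let pe = pd_region.get_region_epoch();
    // any version or conf_ver mismatch means the local bootstrap data is stale
    if le.get_version() != pe.get_version() || le.get_conf_ver() != pe.get_conf_ver() {
        return Err(box_err!("region epoch inconsistent: local {:?}, pd {:?}", le, pe));
    }
    Ok(())
}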

@siddontang (Contributor):

any update? @nolouch

fn bootstrap_cluster(&mut self, engine: &DB, region: metapb::Region) -> Result<()> {
let region_id = region.get_id();
match self.pd_client.bootstrap_cluster(self.store.clone(), region) {
Err(PdError::ClusterBootstrapped(_)) => {
error!("cluster {} is already bootstrapped", self.cluster_id);
try!(store::clear_region(engine, region_id));
try!(store::clear_prepare_bootstrap(engine, region_id));
Contributor:

I guess you want to clear_prepare_bootstrap_state() if pd returns Ok() at L237.

Contributor Author:

you are right

pub fn clear_prepare_bootstrap_state(engine: &DB) -> Result<()> {
let wb = WriteBatch::new();
try!(wb.delete(&keys::prepare_bootstrap_key()));
try!(engine.write(wb));
Contributor:

use engine.delete directly.
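For example, the WriteBatch can go away entirely (a sketch, assuming the engine's delete error converts into this Result type):

pub fn clear_prepare_bootstrap_state(engine: &DB) -> Result<()> {
    // a single-key delete does not need a WriteBatch
    try!(engine.delete(&keys::prepare_bootstrap_key()));
    Ok(())
}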

@@ -60,6 +61,9 @@ pub const REGION_STATE_SUFFIX: u8 = 0x01;
pub fn store_ident_key() -> Vec<u8> {
STORE_IDENT_KEY.to_vec()
}
pub fn prepare_bootstrap_key() -> Vec<u8> {
Contributor:

add a blank line

Ok(region) => {
if region.get_id() == first_region.get_id() {
if !self.check_region_epoch(region.clone(), first_region.clone()) {
return Err(box_err!("first region epoch inconsistent with pd info"));
Contributor:

add more info here.
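For example, the error could include both epochs so the mismatch is visible in the logs (the wording is illustrative, not the final text of the PR):

return Err(box_err!("first region epoch {:?} is inconsistent with pd's {:?}",
                    first_region.get_region_epoch(),
                    region.get_region_epoch()));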

Ok(())
}
// TODO: should we clean region for other errors too?
Err(e) => panic!("bootstrap cluster {} err: {:?}", self.cluster_id, e),
Ok(_) => {
if let Err(e) = store::clear_prepare_bootstrap_state(engine) {
warn!("clear prepare bootstrap state failed: {:?}", e);
Contributor:

return error
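That is, roughly (a sketch; bootstrap_cluster already returns Result<()>, so the failure can simply be propagated):

Ok(_) => {
    // propagate the failure instead of only logging a warning
    try!(store::clear_prepare_bootstrap_state(engine));
    Ok(())
}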

match self.pd_client.get_region(b"") {
Ok(region) => {
if region.get_id() == first_region.get_id() {
if !self.check_region_epoch(region.clone(), first_region.clone()) {
Contributor:

you can use & here, no need to clone
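That is, something along these lines (which implies check_region_epoch taking &metapb::Region instead of owned values):

if region.get_id() == first_region.get_id() {
    // borrow instead of cloning both regions
    if !self.check_region_epoch(&region, &first_region) {
        return Err(box_err!("first region epoch inconsistent with pd info"));
    }
}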

// assume that a node has bootstrapped the cluster and added the region to pd successfully
cluster.add_first_region().unwrap();
// now start another node at the same time; it will see that the cluster is not bootstrapped
// try to bootstrap with a new region
Contributor:

I think we need to test both when pd is bootstrapped and pd is not bootstrapped.

Contributor Author:

I start the cluster twice; the second time, the cluster is already bootstrapped.

@nolouch (Contributor Author) commented Apr 20, 2017

PTAL @siddontang @disksing

@disksing (Contributor):

LGTM.

@andelf (Contributor) left a comment:

LGTM

@@ -362,6 +373,10 @@ impl TestPdClient {
Ok(())
}

fn is_regions_empty(&self) -> Result<(bool)> {
Contributor:

need Result here?
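A sketch of the Result-free version being suggested (the field access is illustrative; the point is only dropping the Result wrapper when nothing can fail):

fn is_regions_empty(&self) -> bool {
    // nothing here can fail, so return a plain bool
    self.regions.is_empty()
}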

use super::node::new_node_cluster;
use super::util::*;

fn test_bootstrap_idempotent<T: Simulator>(cluster: &mut Cluster<T>) {
Contributor:

can we check both conf_ver and version inconsistency for check_region_epoch?

Contributor:

any update? @nolouch

cluster.check_regions_number(1);
cluster.shutdown();
sleep_ms(500);
cluster.start();
Contributor:

start twice to check what?

I think we must check the prepare bootstrap key in RocksDB too.

Contributor:

any update?

Contributor Author:

The first time the cluster is not bootstrapped; the second time it is bootstrapped.

Contributor:

  1. For the first start, check the bootstrap key in RocksDB directly
  2. For the second start, check again too.

Contributor Author:

Contributor:

no, not the same.
Here we must check the prepare key in RocksDB explicitly.
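A sketch of that direct check against RocksDB (assuming the test can reach the node's engine; the assertion message is illustrative):

// after each start, the prepare-bootstrap marker must be gone from the engine
let marker = engine.get(&keys::prepare_bootstrap_key()).unwrap();
assert!(marker.is_none(), "prepare bootstrap key should have been cleared");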

@siddontang (Contributor):

@nolouch

CI failed

thread 'raftstore::test_bootstrap::test_node_bootstrap_witch_check_epoch' panicked at 'called `Result::unwrap()` on an `Err` value: Other(StringError("region version inconsist: 1 with 2"))', /checkout/src/libcore/result.rs:859

@nolouch (Contributor Author) commented Apr 25, 2017

PTAL @siddontang


// check that bootstrapping with an inconsistent epoch produces an error here
let e = sim.wl()
.run_node_with_handle_error(1, cluster.cfg.clone(), engine.clone())
Contributor:

if you only want to check that the epoch check function works correctly, you don't need to do it this way.

You can just extract the check_epoch function into a common function outside Node, then test it directly.

IMO, your test is complex and not easy to understand.
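A sketch of that direct unit test (a minimal example only; the check_region_epoch signature and the epochs compared are assumptions):

#[test]
fn test_check_region_epoch() {
    let mut r1 = metapb::Region::new();
    r1.mut_region_epoch().set_version(1);
    r1.mut_region_epoch().set_conf_ver(1);

    let mut r2 = r1.clone();
    r2.mut_region_epoch().set_version(2);

    assert!(check_region_epoch(&r1, &r1).is_ok());
    assert!(check_region_epoch(&r1, &r2).is_err());
}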

@siddontang (Contributor):

Ping @nolouch

r3.set_end_key(keys::EMPTY_KEY.to_vec());
r3.mut_region_epoch().set_version(1);
r3.mut_region_epoch().set_conf_ver(2);
match check_region_epoch(&r1, &r2).unwrap_err() {
Contributor:

You can use assert!(check_region_epoch().is_err()) directly.
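For instance (using the r1, r2 and r3 built earlier in this test):

// no need to match on the concrete error variant; only failure matters here
assert!(check_region_epoch(&r1, &r2).is_err());
assert!(check_region_epoch(&r1, &r3).is_err());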

match self.sim.try_read() {
Ok(s) => keys = s.get_node_ids(),
Err(sync::TryLockError::Poisoned(e)) => {
let s = e.into_inner();
Contributor:

why do we need to check this?

@nolouch (Contributor Author) Apr 26, 2017:

I met a panic-within-panic and the test core dumped: the test case panics, then Drop runs shutdown, and the poisoned lock panics again. This check prevents that and keeps the first panic's stack information.
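A sketch of that recovery path (the WouldBlock handling here is an assumption; only the Poisoned arm is the point being made above):

let keys = match self.sim.try_read() {
    Ok(s) => s.get_node_ids(),
    Err(sync::TryLockError::Poisoned(e)) => {
        // another thread panicked while holding the lock; reuse the guard
        // instead of panicking again inside Drop, so the first panic's
        // backtrace is the one that gets reported
        e.into_inner().get_node_ids()
    }
    Err(sync::TryLockError::WouldBlock) => return,
};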

@siddontang (Contributor):

LGTM

PTAL @disksing

cfg.raft_store.use_sst_file_snapshot);


// assume there is a node has bootstraped the cluster and add region in pd successfully
Contributor:

bootstrapped


#[test]
fn test_bootstrap_with_prepared_data() {
test_node_bootstrap_with_prepared_data();
Contributor:

Why not add #[test] to test_node_bootstrap_with_prepared_data directly?

Contributor Author:

like other tests do, and it makes the #[test] entry point easy to find

@siddontang (Contributor):

PTAL @disksing

@disksing (Contributor):

LGTM.

@nolouch nolouch merged commit 80d8dad into master Apr 28, 2017
@nolouch nolouch deleted the shuning/fix-bootstrap branch April 28, 2017 13:14
Successfully merging this pull request may close these issues.

bootstrap may leave unexpected region data in storage.