
panicked at 'assertion failed: (left == right) when adding learner #471

Closed
ppamorim opened this issue Jul 25, 2022 · 2 comments · Fixed by #473

Comments

@ppamorim
Contributor

ppamorim commented Jul 25, 2022

When calling app.raft.add_learner(node_id, Some(node), true), the application crashes with the error below.

Payload:

POST - /add-learner [8117459876464397197, "0.0.0.0:8101"]

Error:

thread 'actix-server worker 1' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `0`: zero-th log entry must be (0,0,0), but 0-0 2', /Users/ppamorim/.cargo/git/checkouts/openraft-0677bbb78cc57964/db047b9/openraft/src/raft_types.rs:66:13

0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
   4: openraft::raft_types::LogId<NID>::new
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  12: sentry_core::hub::Hub::run
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  14: <actix_web::middleware::logger::LoggerResponse<S,B> as core::future::future::Future>::poll
  15: <actix_web::middleware::compress::CompressResponse<S,B> as core::future::future::Future>::poll
  16: actix_http::h1::dispatcher::InnerDispatcher<T,S,B,X,U>::poll_response
  17: <actix_http::h1::dispatcher::Dispatcher<T,S,B,X,U> as core::future::future::Future>::poll
  18: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: std::thread::local::LocalKey<T>::with
  21: tokio::task::local::LocalSet::tick
  22: tokio::macros::scoped_tls::ScopedKey<T>::set
  23: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  24: std::thread::local::LocalKey<T>::with
  25: tokio::runtime::basic_scheduler::Context::enter
  26: tokio::macros::scoped_tls::ScopedKey<T>::set
  27: tokio::runtime::basic_scheduler::BasicScheduler::block_on
  28: tokio::runtime::Runtime::block_on

Full log: https://gist.github.com/ppamorim/520c04892db587790e07cadf2b30c6e4

Openraft is in sync with the main branch.
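For context, the assertion that fires lives in openraft's LogId construction: a leader id of (0, 0) is only legal for the zero-th log entry. The following is a hedged, self-contained sketch of that invariant (the types and `new` function here are illustrative stand-ins, not openraft's actual API), showing how an uninitialized (0, 0) leader id combined with a non-zero index reproduces the panic above:

```rust
// Illustrative mirror of the invariant openraft asserts in
// openraft/src/raft_types.rs: a (0, 0) leader id may only appear at index 0.
// These types are simplified stand-ins, not openraft's real definitions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LeaderId {
    term: u64,
    node_id: u64,
}

#[derive(Debug)]
struct LogId {
    leader_id: LeaderId,
    index: u64,
}

impl LogId {
    fn new(leader_id: LeaderId, index: u64) -> Self {
        if leader_id.term == 0 && leader_id.node_id == 0 {
            // This is the assertion that fires in the report: with an
            // uninitialized (0, 0) leader id, only index 0 is acceptable.
            assert_eq!(index, 0, "zero-th log entry must be (0,0,0)");
        }
        LogId { leader_id, index }
    }
}

fn main() {
    // Fine: the very first log entry.
    let _ok = LogId::new(LeaderId { term: 0, node_id: 0 }, 0);

    // Panics, matching the report: `matched` is still (0, 0) but a LogId
    // is requested at a non-zero index.
    let result = std::panic::catch_unwind(|| {
        LogId::new(LeaderId { term: 0, node_id: 0 }, 2)
    });
    assert!(result.is_err());
    println!("constructing LogId((0,0), 2) panicked as expected");
}
```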

EDIT 1:

It seems not to be deterministic: if I call this a couple of seconds after the nodes start, and attempt multiple times, the error doesn't happen.

This bug is visible on this branch: https://shorturl.at/cgOU0 (shorted to prevent indexing by search engines)

To reproduce: run ./script/start-cluster.sh and check n0.log.

EDIT 2:

Adding a retry loop around the add-learner request seems to sort the issue, but it's only a workaround. It feels like openraft on the main branch is not ready to accept learners immediately after its initial setup. Interesting.

As:

pub async fn add_learner(
  opt: &Opt,
  client: &reqwest::Client,
  id: &CognitoNodeId,
) -> Result<Response, Error> {
  let url = format!(
    "http://{}/add-learner",
    opt
      .leader_addr
      .clone()
      .unwrap()
  );
  let ip = if let Some(ip_addr) = opt.ip_addr.clone() {
    ip_addr
  } else {
    opt.http_addr.clone()
  };
  let json: (&CognitoNodeId, String) = (id, ip);
  let mut attempts = opt.expected_nodes_count;
  loop {
    match client
      .post(&url)
      .json(&json)
      .send()
      .await
    {
      Ok(response) => break Ok(response),
      Err(e) => {
        println!("Error: {}", e);
        attempts -= 1;
        // Stop retrying once the attempt budget is exhausted.
        if attempts == 0 {
          break Err(e);
        }
      }
    }
  }
}
@github-actions

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Member

I believe the panic happens here:
https://github.com/datafuselabs/openraft/blob/d62183517be4b6e38360cc107d10f2488123641b/openraft/src/raft.rs#L439

This is to check whether replication to the newly added learner has become up to date.
But here it tries to use matched.leader_id to build a LogId, while matched is still an uninitialized value (0,0). In that case, it should avoid creating a LogId and just compare the index instead.

I will remove this LogId::new() to fix this issue.
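The fix described above can be sketched as follows. This is a hedged illustration, not the actual openraft code: the function name and `Option<u64>` index representation are assumptions chosen to show the idea of comparing raw log indexes so that no LogId (and no (0, 0) leader id) is ever constructed for an uninitialized `matched`:

```rust
// Illustrative sketch of "compare the index instead of building a LogId".
// `matched_index` is None until the learner has acknowledged any entry,
// so no LogId with an uninitialized leader id is created here.
fn replication_lag(matched_index: Option<u64>, last_log_index: Option<u64>) -> u64 {
    last_log_index
        .unwrap_or(0)
        .saturating_sub(matched_index.unwrap_or(0))
}

fn main() {
    // Learner has not replicated anything yet; leader is at index 2.
    assert_eq!(replication_lag(None, Some(2)), 2);
    // Learner fully caught up.
    assert_eq!(replication_lag(Some(2), Some(2)), 0);
    println!("lag computed by index only: ok");
}
```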

Thank you man! 😆

@lichuang lichuang assigned lichuang and unassigned lichuang Jul 26, 2022
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Jul 26, 2022
When waiting for a newly added learner to become up to date,
it tries to compare last-log-id and the reported `matched` replication
state.
But `matched` may not have received any update yet and is uninitialized; in that case, it tries to create a temporary LogId with
`leader_id(0, 0)`, which is illegal.

The fix is simple: do not use a log id. Just calculate the replication lag by log index.

Add test to reproduce it: openraft/tests/membership/t99_issue_471_adding_learner_uses_uninit_leader_id.rs

- Fix: databendlabs#471