Cluster entering a recovery loop may be caused by Kafka sink QueueFull (Local: Queue full) #16640
Comments
@tabVersion Any recommendations / thoughts on this?
Seeing the log here
It indicates the batching is too small, which triggers a small failover that re-ingests the data within the same epoch over and over again.
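(The thread does not quote the exact properties being suggested. Assuming they refer to librdkafka's producer queue and batching settings, the sketch below shows where such options would apply on the producer side; the values are placeholders, not tuning recommendations. If I understand correctly, a RisingWave Kafka sink passes these librdkafka options through via the properties. prefix in the sink's WITH clause.)

```rust
use rdkafka::config::ClientConfig;
use rdkafka::error::KafkaError;
use rdkafka::producer::BaseProducer;

/// Build a producer whose local queue/batching limits are set explicitly.
/// The numbers are placeholders, not recommendations.
fn build_producer(brokers: &str) -> Result<BaseProducer, KafkaError> {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // Maximum number of messages allowed in the local producer queue.
        .set("queue.buffering.max.messages", "100000")
        // Maximum total size of the local producer queue, in kilobytes.
        .set("queue.buffering.max.kbytes", "1048576")
        // How long to wait for more messages before sending a batch (ms).
        .set("queue.buffering.max.ms", "100")
        // Upper bound on the number of messages batched into one request.
        .set("batch.num.messages", "10000")
        .create()
}
```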
@tabVersion Thanks for the suggestions, but I have a concern about how to set those properties well. In practice, the upstream throughput can change frequently: a configuration may work fine when the upstream throughput is 1k/s, yet enter a recovery loop when the throughput gets high (100k/s) or Kafka is experiencing high load. And when we encounter this error, the only thing we can do is drop the sink to prevent the cluster from continuing to crash. The relevant retry logic is in risingwave/src/connector/src/sink/kafka.rs, lines 511 to 522 at 91b7ee2.
In this case, I think we should await all in-flight deliveries being drained, or wait a sufficient amount of time, before retrying; otherwise the producer queue may keep reaching full.
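A minimal sketch of that idea (not the actual RisingWave implementation), using rust-rdkafka's BaseProducer: when send fails with QueueFull, the producer is polled so in-flight deliveries can drain before the same record is retried. The broker address and topic below are placeholders.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::error::{KafkaError, RDKafkaErrorCode};
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};

/// Send one record, backing off whenever librdkafka's local queue is full.
fn send_with_backpressure(
    producer: &BaseProducer,
    topic: &str,
    key: &[u8],
    payload: &[u8],
) -> Result<(), KafkaError> {
    let mut record = BaseRecord::to(topic).key(key).payload(payload);
    loop {
        match producer.send(record) {
            Ok(()) => return Ok(()),
            // Local queue is full: poll so in-flight deliveries can drain,
            // instead of retrying immediately in a tight loop.
            Err((KafkaError::MessageProduction(RDKafkaErrorCode::QueueFull), rec)) => {
                producer.poll(Duration::from_millis(100));
                record = rec;
            }
            Err((e, _)) => return Err(e),
        }
    }
}

fn main() -> Result<(), KafkaError> {
    // Placeholder broker address and topic.
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;
    send_with_backpressure(&producer, "demo-topic", b"key-1", b"hello queue")?;
    // Block until everything queued so far has been delivered (or timed out).
    let _ = producer.flush(Duration::from_secs(10));
    Ok(())
}
```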
It seems this issue is not urgent, because I can't reproduce it. 😅
remove from milestone, keep open for tracking
It seems that this error (Queue Full) can happen when the network bandwidth reaches its limit; after we increased the bandwidth, the error disappeared.
close as false alarm
Describe the bug
Description
The cluster entered a recovery loop when creating a Kafka sink from an MV (about 4 million records); when the problematic sink was dropped, the system went back to normal.
About 20 minutes later, the sink was successfully recreated.
Other observed phenomena
Kafka itself should be working fine, because within the recovery period the topic had 140 million records written into it while the upstream MV only had 4 million records (which suggests the same data was re-ingested repeatedly during the recovery loop).
Not sure whether increasing properties.retry.max can solve this issue.
Error message/log
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
v1.8.2
Additional context
No response