Stream 'crash' when using Single Active Consumer and restarting stream writer node #5889

Closed
racorn opened this issue Sep 27, 2022 · 9 comments · Fixed by #5897

racorn commented Sep 27, 2022

I am evaluating RabbitMQ Single Active Consumer for Streams.

Setup

  • 3-node RabbitMQ 3.11.0 cluster in Docker containers using the official 3.11.0 image; host running Ubuntu 18.04 LTS
  • HAProxy in front of the brokers
  • 1 Java producer using rabbitmq-stream-client 0.7.0, sending 1 message per second (see the sketch after this list)
  • 2 Java consumers using rabbitmq-stream-client 0.7.0, with single active consumer enabled, a manual tracking strategy, and the offset stored for each consumed message
  • plain stream, not a super stream
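
For concreteness, here is a minimal sketch of the producer side of this setup. It assumes the com.rabbitmq.stream API of rabbitmq-stream-client 0.7.0; the stream name radiusEvent is taken from the logs below, while the class name, host name and payload are illustrative only.

import com.rabbitmq.stream.Environment;
import com.rabbitmq.stream.Producer;

public class SteadyProducer {
    public static void main(String[] args) throws InterruptedException {
        // Environment pointing at the HAProxy in front of the 3 brokers (hypothetical host name)
        Environment environment = Environment.builder()
                .host("haproxy.local")
                .port(5552)
                .build();
        // Producer on the stream used in this report
        Producer producer = environment.producerBuilder()
                .stream("radiusEvent")
                .build();
        while (true) {
            producer.send(
                    producer.messageBuilder().addData("ping".getBytes()).build(),
                    confirmationStatus -> { /* see the confirmation handler sketch below */ });
            Thread.sleep(1_000); // 1 message per second, as described above
        }
    }
}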

Procedure

  1. Start java producers and consumers
  2. Find the RabbitMQ node that has the writer role for the stream. Restart that node (e.g. $ docker restart rabbit-1)
  3. Watch the clients discover the new topology (e.g. the new writer), reconnect, and resume activity.
  4. When everything looks OK, repeat from step 2.

Outcome

After a couple of restarts of the writer node, the producer fails to send messages to the cluster; the client returns error code 10002 (CODE_PRODUCER_NOT_AVAILABLE). The consumers log the following message:

  [OffsetTrackingCoordinator] [] [ ] - Error while flushing tracker: Not possible to query offset for consumer test-consumer on stream radiusEvent for now
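
For reference, a minimal sketch (not part of the original report) of how the producer-side failure described above can be observed through the Java client's confirmation callback; Constants.CODE_PRODUCER_NOT_AVAILABLE carries the 10002 value mentioned above, and the helper method name is hypothetical.

import com.rabbitmq.stream.Constants;
import com.rabbitmq.stream.Message;
import com.rabbitmq.stream.Producer;

class ProducerErrorLogging {
    // Hypothetical helper: send a payload and log the 10002 case explicitly
    static void sendWithErrorLogging(Producer producer, byte[] payload) {
        Message message = producer.messageBuilder().addData(payload).build();
        producer.send(message, confirmationStatus -> {
            if (!confirmationStatus.isConfirmed()
                    && confirmationStatus.getCode() == Constants.CODE_PRODUCER_NOT_AVAILABLE) {
                // the error code observed after restarting the writer node a few times
                System.err.println("Producer not available (code 10002)");
            }
        });
    }
}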

Stream status output:

Status of stream radiusEvent on node rabbit@rabbit-1 ...
Error:
{{:badmatch, {:error, :noproc}}, [{:rabbit_stream_queue, :get_counters, 1, [file: 'rabbit_stream_queue.erl', line: 659]}, {:rabbit_stream_queue, :status, 2, [file: 'rabbit_stream_queue.erl', line: 649]}]}

The server logs show this error (truncated):

2022-09-27 14:29:32.111898+00:00 [error] <0.535.0> handle_leader err {'EXIT',
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                       {{badmatch,#{}},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                        [{rabbit_stream_sac_coordinator,ensure_monitors,4,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"rabbit_stream_sac_coordinator.erl"},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                              {line,384}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {rabbit_stream_coordinator,apply,3,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"rabbit_stream_coordinator.erl"},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                              {line,395}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_server,apply_with,2,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"src/ra_server.erl"},{line,2186}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_log_reader,mem_tbl_fold,5,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"src/ra_log_reader.erl"},{line,159}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_log_reader,'-fold/5-fun-0-',4,

And I don't know how to get out of this state without resetting the whole cluster.

In my environment this is quite easy to reproduce, but the number of writer node restarts needed varies from 1 to 10.

node-2.log
node-3.log
node-1.log

Server logs for all 3 nodes are attached. The errors are reported near the end of the files.

michaelklishin (Member) commented:

Can you please put together an executable example that does not use any proxies?

acogoluegnes (Contributor) commented:

I'm trying to reproduce the issue. Do the 2 consumers share the same name, or do they use distinct names?

racorn (Author) commented Sep 28, 2022

Can you please put together an executable example that does not use any proxies?

Do you mean I should try to reproduce the issue without the HAProxy that load-balances the 3 brokers?

acogoluegnes (Contributor) commented:

Yes, the idea is to reduce noise as much as possible.

I tried to reproduce just by stop_app / start_app the stream leader node but the client recovers.

racorn (Author) commented Sep 28, 2022

I'm trying to reproduce the issue. Do the 2 consumers share the same name, or do they use distinct names?

They use the same name:

ConsumerBuilder builder = environment.consumerBuilder()
                .singleActiveConsumer()
                .manualTrackingStrategy().builder()
                .name("test-consumer")
                .messageHandler((context, message) -> {
                    ...
                    context.storeOffset();
                });
...
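
For completeness, a hedged sketch of what the full builder chain might look like; the stream name radiusEvent comes from the logs above, and everything not shown in the snippet above (stream(), build(), the Consumer variable) is an assumption rather than the reporter's exact code.

// environment is a com.rabbitmq.stream.Environment, as in the producer sketch earlier
Consumer consumer = environment.consumerBuilder()
        .stream("radiusEvent")
        .name("test-consumer")              // both consumers share this name
        .singleActiveConsumer()
        .manualTrackingStrategy().builder() // manual tracking, back to the ConsumerBuilder
        .messageHandler((context, message) -> {
            // process the message, then store the offset for each consumed message
            context.storeOffset();
        })
        .build();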

racorn (Author) commented Sep 28, 2022

Yes, the idea is to reduce noise as much as possible.

I tried to reproduce just by stop_app / start_app the stream leader node but the client recovers.

OK, I can try that. I will just have to change my setup to use 'stream.advertised_host' for the containers.

acogoluegnes (Contributor) commented:

Sounds good, thanks.

acogoluegnes added a commit that referenced this issue Sep 28, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

acogoluegnes (Contributor) commented:

@racorn I pushed a fix (#5897), could you try to reproduce with the pivotalrabbitmq/rabbitmq:rabbitmq-server-5889-sac-coordinator-crash-in-monitors-otp-max-bazel image?

Thanks.

racorn (Author) commented Sep 28, 2022

Thank you @acogoluegnes. I will build and try it out tomorrow.

mergify bot pushed a commit that referenced this issue Sep 30, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

(cherry picked from commit 3767401)

kjnilsson pushed a commit that referenced this issue Oct 3, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

michaelklishin added this to the 3.11.1 milestone Oct 12, 2022