Stream 'crash' when using Single Active Consumer and restarting stream writer node #5889

Closed
racorn opened this issue Sep 27, 2022 · 9 comments · Fixed by #5897

racorn commented Sep 27, 2022

I am evaluating RabbitMQ Single Active Consumer for Streams.

Setup

  • 3-node RabbitMQ 3.11.0 cluster in Docker containers using the official 3.11.0 image; host running Ubuntu 18.04 LTS
  • HAProxy in front of the brokers
  • 1 Java producer using rabbitmq-stream-client 0.7.0, sending 1 message per second (see the sketch after this list)
  • 2 Java consumers using rabbitmq-stream-client 0.7.0, with single active consumer enabled, a manual tracking strategy, and the offset stored for each consumed message
  • plain stream, not a super stream
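
For concreteness, here is a minimal sketch of the producer side of this setup. It assumes the com.rabbitmq.stream API of rabbitmq-stream-client 0.7.0; the stream name radiusEvent is taken from the logs below, while the class name, host name and payload are illustrative only.

import com.rabbitmq.stream.Environment;
import com.rabbitmq.stream.Producer;

public class SteadyProducer {
    public static void main(String[] args) throws InterruptedException {
        // Environment pointing at the HAProxy in front of the 3 brokers (hypothetical host name)
        Environment environment = Environment.builder()
                .host("haproxy.local")
                .port(5552)
                .build();
        // Producer on the stream used in this report
        Producer producer = environment.producerBuilder()
                .stream("radiusEvent")
                .build();
        while (true) {
            producer.send(
                    producer.messageBuilder().addData("ping".getBytes()).build(),
                    confirmationStatus -> { /* see the confirmation handler sketch below */ });
            Thread.sleep(1_000); // 1 message per second, as described above
        }
    }
}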

Procedure

  1. Start java producers and consumers
  2. Find the RabbitMQ node that has the writer role for the stream. Restart that node (e.g. $ docker restart rabbit-1)
  3. Watch the clients discover the new topology (e.g. the new writer), reconnect, and resume activity.
  4. When everything looks OK, repeat from step 2.

Outcome

After a couple of restarts of the writer node, the producer fails to send messages to the cluster; the client returns error code 10002 (CODE_PRODUCER_NOT_AVAILABLE). The consumers log the following message:

  [OffsetTrackingCoordinator] [] [ ] - Error while flushing tracker: Not possible to query offset for consumer test-consumer on stream radiusEvent for now
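
For reference, a minimal sketch (not part of the original report) of how the producer-side failure described above can be observed through the Java client's confirmation callback; Constants.CODE_PRODUCER_NOT_AVAILABLE carries the 10002 value mentioned above, and the helper method name is hypothetical.

import com.rabbitmq.stream.Constants;
import com.rabbitmq.stream.Message;
import com.rabbitmq.stream.Producer;

class ProducerErrorLogging {
    // Hypothetical helper: send a payload and log the 10002 case explicitly
    static void sendWithErrorLogging(Producer producer, byte[] payload) {
        Message message = producer.messageBuilder().addData(payload).build();
        producer.send(message, confirmationStatus -> {
            if (!confirmationStatus.isConfirmed()
                    && confirmationStatus.getCode() == Constants.CODE_PRODUCER_NOT_AVAILABLE) {
                // the error code observed after restarting the writer node a few times
                System.err.println("Producer not available (code 10002)");
            }
        });
    }
}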

Stream status output:

Status of stream radiusEvent on node rabbit@rabbit-1 ...
Error:
{{:badmatch, {:error, :noproc}}, [{:rabbit_stream_queue, :get_counters, 1, [file: 'rabbit_stream_queue.erl', line: 659]}, {:rabbit_stream_queue, :status, 2, [file: 'rabbit_stream_queue.erl', line: 649]}]}

The server logs show this error (truncated):

2022-09-27 14:29:32.111898+00:00 [error] <0.535.0> handle_leader err {'EXIT',
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                       {{badmatch,#{}},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                        [{rabbit_stream_sac_coordinator,ensure_monitors,4,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"rabbit_stream_sac_coordinator.erl"},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                              {line,384}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {rabbit_stream_coordinator,apply,3,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"rabbit_stream_coordinator.erl"},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                              {line,395}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_server,apply_with,2,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"src/ra_server.erl"},{line,2186}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_log_reader,mem_tbl_fold,5,
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                             [{file,"src/ra_log_reader.erl"},{line,159}]},
2022-09-27 14:29:32.111898+00:00 [error] <0.535.0>                         {ra_log_reader,'-fold/5-fun-0-',4,

And I don't know how to get out of this state without resetting the whole cluster.

In my environment this is quite easy to reproduce, but the number of writer node restarts needed varies from 1 to 10.

node-2.log
node-3.log
node-1.log

Server logs for all 3 nodes are attached. The errors are reported near the end of the files.

michaelklishin (Member) commented:

Can you please put together an executable example that does not use any proxies?

acogoluegnes (Contributor) commented:

I'm trying to reproduce the issue. Do the 2 consumers share the same name, or do they use distinct names?

racorn (Author) commented Sep 28, 2022

Can you please put together an executable example that does not use any proxies?

Do you mean I should try to reproduce the issue without the HAProxy that load-balances the 3 brokers?

acogoluegnes (Contributor) commented:

Yes, the idea is to reduce noise as much as possible.

I tried to reproduce just by stop_app / start_app the stream leader node but the client recovers.

racorn (Author) commented Sep 28, 2022

I'm trying to reproduce the issue. Do the 2 consumers share the same name, or do they use distinct names?

They use the same name:

ConsumerBuilder builder = environment.consumerBuilder()
                .singleActiveConsumer()
                .manualTrackingStrategy().builder()
                .name("test-consumer")
                .messageHandler((context, message) -> {
                    ...
                    context.storeOffset();
                });
...
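
For completeness, a hedged sketch of what the full builder chain might look like; the stream name radiusEvent comes from the logs above, and everything not shown in the snippet above (stream(), build(), the Consumer variable) is an assumption rather than the reporter's exact code.

// environment is a com.rabbitmq.stream.Environment, as in the producer sketch earlier
Consumer consumer = environment.consumerBuilder()
        .stream("radiusEvent")
        .name("test-consumer")              // both consumers share this name
        .singleActiveConsumer()
        .manualTrackingStrategy().builder() // manual tracking, back to the ConsumerBuilder
        .messageHandler((context, message) -> {
            // process the message, then store the offset for each consumed message
            context.storeOffset();
        })
        .build();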

racorn (Author) commented Sep 28, 2022

Yes, the idea is to reduce noise as much as possible.

I tried to reproduce just by stop_app / start_app the stream leader node but the client recovers.

OK, I can try that. I will just have to change my setup to use 'stream.advertised_host' for the containers.

acogoluegnes (Contributor) commented:

Sounds good, thanks.

acogoluegnes added a commit that referenced this issue Sep 28, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

acogoluegnes (Contributor) commented:

@racorn I pushed a fix (#5897), could you try to reproduce with the pivotalrabbitmq/rabbitmq:rabbitmq-server-5889-sac-coordinator-crash-in-monitors-otp-max-bazel image?

Thanks.

racorn (Author) commented Sep 28, 2022

Thank you @acogoluegnes. I will build and try it out tomorrow.

mergify bot pushed a commit that referenced this issue Sep 30, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

(cherry picked from commit 3767401)

kjnilsson pushed a commit that referenced this issue Oct 3, 2022
Do not assume the connection PID of a consumer is still
known from the state on state cleaning when unregistering
a consumer.

Fixes #5889

michaelklishin added this to the 3.11.1 milestone Oct 12, 2022