Replace clustered RabbitMQ with something simpler #5443

chrismeyersfsu · 2019-12-04T16:03:54Z

ISSUE TYPE

Feature Idea

SUMMARY

Replace our clustered implementation of RabbitMQ with something that is easier to understand and operate (and that matches AWX's needs better).

AWX currently makes extensive use of clustered RabbitMQ:

As a form of direct topic-based RPC for dispatching jobs (e.g., playbook runs) to underlying AWX instances. This process involves a periodic scheduler that wakes up, finds work to do, picks an available node with capacity, and places a message on its queue, which is treated as a sort of per-instance "task queue" ala https://python-rq.org or https://docs.celeryproject.org/en/stable/. Certain special messages (which generally are used to perform internal housekeeping tasks in AWX) are "broadcast" to all nodes instead of following a direct RPC topology.
As a buffer for processing job output (Ansible callback events/stdout) via AWX's "callback receiver" process running on each AWX instance.
As a backend for AWX's websocket support for in-browser streaming stdout and live job status updates. Our websocket implementation is based on a custom AMQP-specific ASGI backend which we wrote and maintain, https://github.com/ansible/asgi_amqp/. As time has marched on, and the upstream channels library has drastically changed its architecture in anticipation of native async support in python3, it has become an increased maintenance burden for us to continue to support a custom backend specific to AMQP (especially when it appears that pretty much everybody upstream that uses Channels is just using Redis).
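To make the first item above concrete, here is a minimal sketch of the dispatch pattern in plain Python. It is not AWX's code: the `Instance` class, the `impact` parameter, and the "most free capacity wins" rule are all assumptions made for illustration, and in-memory deques stand in for the per-instance RabbitMQ queues.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for one AWX instance and its task queue."""
    name: str
    capacity: int
    queue: deque = field(default_factory=deque)


def dispatch(instances, task, impact=1):
    """Direct RPC: place the task on the queue of exactly one instance."""
    candidates = [i for i in instances if i.capacity >= impact]
    if not candidates:
        return None  # no room anywhere; the scheduler retries on its next wake-up
    # Assumed policy: prefer the instance with the most free capacity.
    target = max(candidates, key=lambda i: i.capacity)
    target.capacity -= impact
    target.queue.append(task)
    return target.name


def broadcast(instances, message):
    """Housekeeping messages go to every instance instead of just one."""
    for instance in instances:
        instance.queue.append(message)


cluster = [Instance("node-a", capacity=2), Instance("node-b", capacity=5)]
chosen = dispatch(cluster, {"type": "run_playbook", "job_id": 42}, impact=3)
broadcast(cluster, {"type": "housekeeping"})
```

Only `node-b` has room for a task of impact 3, so it receives the job, and both nodes receive the housekeeping message.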

When we originally designed this system years ago, we optimized as heavily as possible for data integrity and safety. But in the scenarios described above, the data we manage under this system is largely ephemeral. In the most extreme cases, it doesn't persist beyond the lifetime of a running playbook. In other words, if a node running a playbook were to suddenly go offline, we can't really recover from that sort of scenario anyways without re-running the playbook. Similarly, if messages are lost in flight in rare circumstances, you can always just relaunch a playbook.

We're paying a heavy cost for this cluster-wide data mirroring/replication. Historically, we've heard from many of our users that:

RabbitMQ clustering doesn't work well in environments unless cluster peers have very low latency. In fact, this is a limitation called out repeatedly in RabbitMQ's clustering documentation. It's an aspect of RabbitMQ clustering that we knew about when we chose it years ago, but it's turned out to be much more painful than we anticipated.
Especially in environments with unreliable networks, RabbitMQ can be very difficult to administer and troubleshoot. In particular, users regularly report network partitioning scenarios that require manual Erlang and/or RabbitMQ-specific remediation.
When cluster nodes disappear for prolonged periods of time (hours, days), we've seen many situations where RabbitMQ clustering just isn't able to recover on its own, which causes a myriad of issues when the node returns. Detecting and remediating this often leads to service outages.
The firewall/security group requirements for inter-node replication are a common source of confusion for users, and failing to configure them properly can result in situations where adding a node to an existing cluster fails and causes an unanticipated cluster-wide outage.

What we've come to realize is that this architecture is likely not worth the operational and architectural cost we're paying.

Long-term, we'd prefer to move to a model that does not require a control plane that relies on a clustered message bus, but instead one where members of the control plane can largely drop off with minimal effect beyond lowered total execution capacity. RabbitMQ clustering is explicitly not reliable across AZs, and especially not across regions. While the newer topologies we're considering don't absolve us of this entirely, our goal is to move AWX to a model that is much more forgiving of high-latency networks in general.

In the next major version of AWX, we'd like to investigate replacing RabbitMQ with a combination of features provided by Redis (a new dependency) and Postgres itself. This would most likely look something like this:

Dispatching tasks is still treated as “direct RPC”. In other words, when the task manager runs, it picks one cluster node with capacity, and assigns it as the “execution node”. Dispatcher processes running on every node listen for “tasks” via PostgreSQL channel notification support (https://www.postgresql.org/docs/10/sql-notify.html)
Events emitted from playbooks are no longer sent to a distributed message queue (previously RabbitMQ), but instead a local redis running on each node. Callback receivers on each node listen for events on that node and persist them into the database.
When an event is persisted to the database by the callback receiver, it is also broadcast to all cluster peers via ASGI. In this way, if a playbook runs on Node A, users connected to Daphne on Nodes B, C, and D will receive a broadcast of these events and see the output in their browser tabs.
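As a rough illustration of the first step above, dispatch over PostgreSQL notifications could look something like the sketch below. It is an assumption-laden example, not the AWX implementation: the function names, the capacity dictionary, and the choice to name each channel after its node are hypothetical. The only PostgreSQL feature it relies on is `pg_notify(channel, payload)`, the function form of `NOTIFY`, and `cursor` can be any DB-API cursor connected to PostgreSQL.

```python
import json


def pick_execution_node(capacity_by_node, impact):
    """Return the name of one node with enough free capacity, or None."""
    eligible = {n: c for n, c in capacity_by_node.items() if c >= impact}
    if not eligible:
        return None
    # Assumed policy: choose the node with the most free capacity.
    return max(eligible, key=eligible.get)


def dispatch_task(cursor, capacity_by_node, task, impact=1):
    """Assign a task to one node and notify it on that node's channel.

    Assumption: each dispatcher has issued LISTEN on a channel named
    after its own node, so a NOTIFY on that channel reaches only it.
    """
    node = pick_execution_node(capacity_by_node, impact)
    if node is None:
        return None  # no capacity; try again on the next task manager run
    # pg_notify() accepts the channel and payload as ordinary parameters.
    cursor.execute("SELECT pg_notify(%s, %s)", (node, json.dumps(task)))
    return node
```

Because NOTIFY payloads must be shorter than 8000 bytes in PostgreSQL's default configuration, a message like this would most likely carry a small task description or an ID rather than bulky data.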

Longer term, introducing Redis would potentially allow us to also lose our dependence on memcached (so in other words, we might be able to swap out two dependencies, and replace them with one single new dependency).

kdelee · 2019-12-19T20:48:12Z

some additional work items under this:

include an awx-manage based health check for the redis system
include ^ health check in the sos report as well as find way to depend on/enable the redis sos report https://github.com/ansible/awx/blob/devel/tools/sosreport/tower.py so we can get redis logs in sos report
send one final unsubscribed message back to ws client when tower ACKs the unsubscribe request so we can know when we have actually been unsubscribed

ryanpetrello · 2020-03-17T15:23:12Z

cc @MrMEEE in case you haven't seen this yet

also: https://groups.google.com/forum/#!topic/awx-project/lRnm2vB1oEQ

MrMEEE · 2020-03-18T13:03:27Z

@ryanpetrello Thanks for the heads-up.. I will follow this closely :)

ryanpetrello · 2020-03-18T13:10:53Z

@MrMEEE the biggest change is "install and configure Redis, not RabbitMQ". Also, we lost a number of RabbitMQ toggle-ables in the installer.

You may be interested in any changes under ./installer in the PR:

https://github.com/ansible/awx/pull/6034/files#diff-bfa9126dc8059138bf7554d741cb6a5d
https://github.com/ansible/awx/pull/6034/files#diff-fabe539e09ace3de67486bba9b5b3be6
https://github.com/ansible/awx/pull/6034/files#diff-0091f8a83b63dafea8313c794ba726b3

elyezer · 2020-03-18T23:33:38Z

Extensive testing was done before merge to ensure the installation was working as expected and that replacing rabbitmq with redis would not introduce regressions.

With that said, we can consider this verified. Any further polishing will be handled in separate issues (some of which are already open).

ryanpetrello · 2020-03-30T17:35:02Z

Just a heads up @MrMEEE - 10.0.0 is out now, and includes this change.

MrMEEE · 2020-03-30T17:38:49Z

@ryanpetrello thanks.. A completely new build platform, CentOS8/RHEL8 support and the Redis changes are in the works.. I hope for a release after easter

aak1989 · 2020-03-31T10:59:05Z

Hey there, I installed AWX on Kubernetes after Redis was introduced. The installation completed with no issues, but when I access the UI and try to do anything, I get errors related to the API. I am attaching a couple of screenshots of the errors I am getting.

aak1989 · 2020-03-31T12:42:35Z

I have installed version 10.0.0 of AWX, and that has resolved the above issue. However, the AWX UI no longer appears to refresh automatically. For example, after starting a job, the job stays at pending until the browser is manually refreshed.

ryanpetrello · 2020-03-31T15:21:04Z

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

ryanpetrello · 2020-03-31T15:21:19Z

Also, could you file a new issue describing what you're encountering? Thanks.

rkatta22 · 2020-05-12T12:47:04Z

I have installed version 10.0.0 of AWX, and that has resolved the above issue. However, the AWX UI no longer appears to refresh automatically. For example, after starting a job, the job stays at pending until the browser is manually refreshed.

Hi, I got the same error when upgrading Ansible Tower from 7.0.0 to 11.0.0 using a Docker Compose file. When I run the docker-compose up command, I get the error below. Please help me find the mistake in my configuration.

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276
task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | Traceback (most recent call last):

below is my compose file:
version: '2'
services:

  web:
    image: ansible/awx_web:11.0.0
    container_name: awx_web
    depends_on:
      - redis
      - memcached
    ports:
      - "80:8052"
      - "443:8443"
    hostname: awxweb
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
      - "/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  task:
    image: ansible/awx_task:11.0.0
    container_name: awx_task
    depends_on:
      - redis
      - memcached
      - web
    hostname: awx
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  redis:
    image: redis:6.0-rc4-alpine3.11
    container_name: tools_redis_1
    environment:
      REDIS_PASSWORD: password
    ports:
      - "6379:6379"
    volumes:
      - "/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
      - "/var/lib/awx/redis_socket_standalone:/var/run/redis/"
    command: ["/usr/local/etc/redis/redis.conf"]

  memcached:
    image: "memcached:alpine"
    container_name: awx_memcached
    restart: unless-stopped
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

And in the environment.sh file I have the following configuration:
REDIS_URL=redis://ansible-ro.rbkm0e.ng.0002.use1.cache.amazonaws.com:6379
REDIS_PORT=6379
REDIS_SOCKET=/var/lib/awx/redis.sock
REDIS_PASSWORD=password

rkatta22 · 2020-05-12T12:47:42Z

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276
task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | Traceback (most recent call last):

tima · 2020-05-12T22:45:52Z

@rkatta22 making comments on a closed ticket is not going to receive a reply. You need to file a new issue if you think you've encountered a bug.

ryanpetrello · 2020-05-12T23:43:37Z

@rkatta22 you haven't encountered a bug - you just have old configuration of some sort lying around that points at an AMQP connection string from a prior install (which is no longer valid):

('Unsupported URI scheme', 'amqp')
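One way to hunt for a stale value like this is to check the scheme of each broker URL found in your settings files. The helper below is only a sketch: `check_broker_url` is a hypothetical name, not part of AWX, and which setting actually holds the old URL in a given install is something you would need to find yourself. The accepted schemes are the ones named in the ValueError quoted above.

```python
from urllib.parse import urlparse

# Schemes listed in the "Redis URL must specify..." error message.
REDIS_SCHEMES = {"redis", "rediss", "unix"}


def check_broker_url(url):
    """Return (ok, message) for a broker URL found in a settings file."""
    scheme = urlparse(url).scheme
    if scheme in REDIS_SCHEMES:
        return True, f"{scheme}:// is a Redis scheme"
    if scheme in {"amqp", "amqps"}:
        # Left over from a RabbitMQ-era install; no longer valid.
        return False, f"{scheme}:// is a leftover RabbitMQ (AMQP) URL"
    return False, f"unrecognized scheme {scheme!r}"
```

For example, `check_broker_url("amqp://guest:guest@rabbitmq:5672/awx")` reports a leftover AMQP URL, while a `redis://` or `unix://` URL passes.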

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project

ryanpetrello · 2020-05-13T13:25:41Z

@rkatta22,

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project

chrismeyersfsu added the component:api label Dec 4, 2019

chrismeyersfsu self-assigned this Dec 4, 2019

chrismeyersfsu added component:installer priority:high type:enhancement labels Dec 4, 2019

chrismeyersfsu mentioned this issue Dec 4, 2019

Replace RabbitMQ w/ Redis #5441

Closed

ryanpetrello mentioned this issue Dec 4, 2019

Replacement of RabbitMQ to eliminate complexity #5414

Closed

AlanCoding mentioned this issue Dec 16, 2019

Improve event processing performance by passing host_id through event data #5514

Closed

kdelee self-assigned this Jan 21, 2020

elyezer self-assigned this Feb 18, 2020

chrismeyersfsu mentioned this issue Feb 21, 2020

Replace rabbitmq with redis #6034

Merged

ryanpetrello changed the title ~~Replace RabbitMQ with something simpler~~ Replace clustered RabbitMQ with something simpler Mar 5, 2020

ryanpetrello mentioned this issue Mar 14, 2020

Change local_docker rabbitmq version to match kubernetes deployments #6292

Closed

ryanpetrello mentioned this issue Mar 17, 2020

AWX UI returns "500 A server error has occurred", appears to be a RabbitMQ issue #3897

Closed

ryanpetrello added state:needs_test and removed state:in_progress labels Mar 18, 2020

elyezer closed this as completed Mar 18, 2020

elyezer removed the state:needs_test label Mar 18, 2020

geerlingguy mentioned this issue Mar 20, 2020

Prepare for AWX/Tower new versions using Redis instead of RabbitMQ geerlingguy/tower-operator#39

Closed

StepBee mentioned this issue Mar 29, 2020

Unable to save any forms #6460

Closed

ryanpetrello mentioned this issue Mar 30, 2020

Failed to save settings. Returned status: 504 #6391

Closed

sebstyle mentioned this issue Apr 20, 2020

10.0 redis clustering sujiar37/AWX-HA-InstanceGroup#26

Closed

geerlingguy mentioned this issue May 26, 2020

Update to AWX 11.2.0, Tower 3.7 geerlingguy/tower-operator#42

Closed
