Replace clustered RabbitMQ with something simpler #5443

Closed
chrismeyersfsu opened this issue Dec 4, 2019 · 17 comments
Comments

@chrismeyersfsu
Member

chrismeyersfsu commented Dec 4, 2019

ISSUE TYPE
  • Feature Idea
SUMMARY

Replace our clustered implementation of RabbitMQ with something that is easier to understand and operate (and that matches AWX's needs better).

AWX currently makes extensive use of clustered RabbitMQ:

  1. As a form of direct topic-based RPC for dispatching jobs (e.g., playbook runs) to underlying AWX instances. This process involves a periodic scheduler that wakes up, finds work to do, picks an available node with capacity, and places a message on its queue, which is treated as a sort of per-instance "task queue" à la https://python-rq.org or https://docs.celeryproject.org/en/stable/. Certain special messages (which generally are used to perform internal housekeeping tasks in AWX) are "broadcast" to all nodes instead of following a direct RPC topology.

  2. As a buffer for processing job output (Ansible callback events/stdout) via AWX's "callback receiver" process running on each AWX instance.

  3. As a backend for AWX's websocket support for in-browser streaming stdout and live job status updates. Our websocket implementation is based on a custom AMQP-specific ASGI backend which we wrote and maintain, https://github.com/ansible/asgi_amqp/. As time has marched on, and the upstream channels library has drastically changed its architecture in anticipation of native async support in python3, it has become an increased maintenance burden for us to continue to support a custom backend specific to AMQP (especially when it appears that pretty much everybody upstream that uses Channels is just using Redis).
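
The direct versus broadcast routing described in item 1 can be pictured with a small in-process model. This is a minimal sketch, not AWX code: the `Instance` class, the capacity numbers, and the "most remaining capacity wins" selection rule are all assumptions made for illustration, and plain Python deques stand in for the per-instance RabbitMQ queues.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for an AWX instance with its own task queue."""
    hostname: str
    capacity: int                      # total units of work the node can take
    consumed: int = 0                  # units currently assigned
    queue: deque = field(default_factory=deque)

    @property
    def remaining(self) -> int:
        return self.capacity - self.consumed


def dispatch_direct(instances, task, impact=1):
    """Place the task on the queue of one node that still has capacity."""
    candidates = [i for i in instances if i.remaining >= impact]
    if not candidates:
        return None                    # nothing can run it; the task stays pending
    # Assumed rule: pick the node with the most remaining capacity.
    target = max(candidates, key=lambda i: i.remaining)
    target.queue.append(task)
    target.consumed += impact
    return target.hostname


def dispatch_broadcast(instances, task):
    """Housekeeping-style message: every node gets its own copy."""
    for instance in instances:
        instance.queue.append(task)


nodes = [Instance("awx-1", capacity=2), Instance("awx-2", capacity=5)]
chosen = dispatch_direct(nodes, {"task": "run_playbook", "job_id": 1})
dispatch_broadcast(nodes, {"task": "housekeeping"})
```

In this model a direct task lands on exactly one queue, while a broadcast message lands on all of them, which is the distinction the summary draws.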

When we originally designed this system years ago, we optimized as heavily as possible for data integrity and safety. But in the scenarios described above, the data we manage under this system is largely ephemeral. In the most extreme cases, it doesn't persist beyond the lifetime of a running playbook. In other words, if a node running a playbook were to suddenly go offline, we can't really recover from that sort of scenario anyways without re-running the playbook. Similarly, if messages are lost in flight in rare circumstances, you can always just relaunch a playbook.

We're paying a heavy cost for this cluster-wide data mirroring/replication. Historically, we've heard from many of our users that:

  • RabbitMQ clustering doesn't work well in environments unless cluster peers have very low latency. In fact, this is a limitation called out repeatedly in RabbitMQ's clustering documentation. It's an aspect of RabbitMQ clustering that we knew about when we chose it years ago, but it's turned out to be much more painful than we anticipated.

  • Especially in environments with unreliable networks, RabbitMQ can be very difficult to administer and troubleshoot. In particular, we regularly have users who report network partitioning scenarios that require manual intervention using Erlang- and/or RabbitMQ-specific remediation.

  • When cluster nodes disappear for prolonged periods of time (hours, days), we've seen many situations where RabbitMQ clustering just isn't able to recover on its own, which causes a myriad of issues when the node returns. Detecting and remediating this often leads to service outages.

  • The firewall/security group requirements for inter-node replication are a common source of confusion for users, and failing to configure them properly can result in situations where adding a node to an existing cluster fails and results in an unanticipated cluster-wide outage.

What we've come to realize is that this architecture is likely not worth the operational and architectural cost we're paying.

Long-term, we'd prefer to move to a model that does not require a control plane that relies on a clustered message bus, but instead one where members of the control plane can largely drop off with minimal effect beyond lowered total execution capacity. RabbitMQ clustering explicitly is not reliable across AZs, and especially not regions, and while newer topologies we're considering don't absolve us of this entirely, our goal is to move AWX to a model which is much more forgiving of high-latency networks in general.

In the next major version of AWX, we'd like to investigate replacing RabbitMQ with a combination of features provided by Redis (a new dependency) and Postgres itself. This would most likely look something like this:

  • Dispatching tasks is still treated as “direct RPC”. In other words, when the task manager runs, it picks one cluster node with capacity, and assigns it as the “execution node”. Dispatcher processes running on every node listen for “tasks” via PostgreSQL channel notification support (https://www.postgresql.org/docs/10/sql-notify.html)

  • Events emitted from playbooks are no longer sent to a distributed message queue (previously RabbitMQ), but instead a local redis running on each node. Callback receivers on each node listen for events on that node and persist them into the database.

  • When an event is persisted to the database by the callback receiver, it is also broadcast to all cluster peers via ASGI. In this way, if a playbook runs on Node A, users connected to Daphne on Nodes B, C, and D will receive a broadcast of these events and see the output in their browser tabs.
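
The three bullets above can be read as one pipeline: notify a node, buffer events locally, then persist and fan out. The following is a toy sketch under stated assumptions, not the AWX implementation. Deques and lists stand in for the node-local Redis, the database, and the websocket sessions; the `Node` class, the per-node channel name, and the payload fields are hypothetical. The dispatch step only builds the statement a real dispatcher might send through PostgreSQL's `pg_notify` function, whose payload must be shorter than 8000 bytes in the default configuration.

```python
import json
from collections import deque

NOTIFY_PAYLOAD_LIMIT = 8000  # PostgreSQL default: payload must be shorter than this


def build_notify(node, task, args):
    """Return (sql, params) for a per-node NOTIFY; the channel name is assumed."""
    payload = json.dumps({"task": task, "args": args})
    if len(payload.encode("utf-8")) >= NOTIFY_PAYLOAD_LIMIT:
        raise ValueError("payload too large for NOTIFY; pass an ID instead")
    return "SELECT pg_notify(%s, %s)", (node, payload)


class Node:
    """Toy cluster member: local event buffer plus connected browser sessions."""

    def __init__(self, name):
        self.name = name
        self.local_buffer = deque()    # stands in for the node-local Redis
        self.browser_feed = []         # events pushed to this node's websocket clients

    def emit_event(self, event):
        """A running playbook writes events only to the local buffer."""
        self.local_buffer.append(event)

    def run_callback_receiver(self, database, peers):
        """Drain the local buffer, persist each event, then broadcast it."""
        while self.local_buffer:
            event = self.local_buffer.popleft()
            database.append(event)                 # persist first
            for peer in peers:                     # then fan out to every node
                peer.browser_feed.append(event)


database = []
cluster = [Node("node-a"), Node("node-b"), Node("node-c")]
sql, params = build_notify("node-a", "run_job", [42])

cluster[0].emit_event({"job": 42, "stdout": "PLAY [all]"})
cluster[0].run_callback_receiver(database, cluster)
```

The point of the model is that losing a node only loses that node's local buffer; nothing is mirrored cluster-wide before it reaches the database.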

Longer term, introducing Redis would potentially allow us to also lose our dependence on memcached (so in other words, we might be able to swap out two dependencies, and replace them with one single new dependency).

@kdelee
Member

kdelee commented Dec 19, 2019

some additional work items under this:

  • include an awx-manage based health check for the redis system
  • include the health check above in the sos report, and find a way to depend on/enable the redis sos report (see https://github.com/ansible/awx/blob/devel/tools/sosreport/tower.py) so we can get redis logs in the sos report
  • send one final unsubscribed message back to ws client when tower ACKs the unsubscribe request so we can know when we have actually been unsubscribed
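
A health check along the lines of the first item might look roughly like the following. This is only a sketch under assumptions: `check_redis` and its return shape are hypothetical and not an existing awx-manage command, and the only thing required of `client` is a `ping()` method such as the one provided by redis-py's `Redis` class. A stub client is used here so the example runs without a Redis server.

```python
import time


def check_redis(client):
    """Ping the broker and report status plus round-trip time in milliseconds."""
    started = time.monotonic()
    try:
        ok = bool(client.ping())
        error = None
    except Exception as exc:           # connection refused, timeout, auth failure, ...
        ok = False
        error = str(exc)
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"ok": ok, "error": error, "elapsed_ms": elapsed_ms}


class StubClient:
    """Test double; a real check would pass e.g. redis.Redis.from_url(...)."""

    def __init__(self, fail=False):
        self.fail = fail

    def ping(self):
        if self.fail:
            raise ConnectionError("connection refused")
        return True


healthy = check_redis(StubClient())
broken = check_redis(StubClient(fail=True))
```

Returning a plain dictionary rather than raising keeps the result easy to embed in a report such as the sos report mentioned above.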

@kdelee kdelee self-assigned this Jan 21, 2020
@elyezer elyezer self-assigned this Feb 18, 2020
@ryanpetrello ryanpetrello changed the title from "Replace RabbitMQ with something simpler" to "Replace clustered RabbitMQ with something simpler" Mar 5, 2020
@ryanpetrello
Contributor

ryanpetrello commented Mar 17, 2020

cc @MrMEEE in case you haven't seen this yet

also: https://groups.google.com/forum/#!topic/awx-project/lRnm2vB1oEQ

@MrMEEE
Contributor

MrMEEE commented Mar 18, 2020

@ryanpetrello Thanks for the heads-up.. I will follow this closely :)

@ryanpetrello
Contributor

ryanpetrello commented Mar 18, 2020

@MrMEEE the biggest change is "install and configure Redis, not RabbitMQ". Also, we lost a number of RabbitMQ toggle-ables in the installer.

You may be interested in any changes under ./installer in the PR:

https://github.com/ansible/awx/pull/6034/files#diff-bfa9126dc8059138bf7554d741cb6a5d
https://github.com/ansible/awx/pull/6034/files#diff-fabe539e09ace3de67486bba9b5b3be6
https://github.com/ansible/awx/pull/6034/files#diff-0091f8a83b63dafea8313c794ba726b3

@elyezer
Member

elyezer commented Mar 18, 2020

Extensive testing was done before merge to ensure the installation was working as expected and that replacing rabbitmq with redis would not introduce regressions.

With that said, we can consider this verified. Any further polishing will be handled in separate issues (some of which are already open).

@ryanpetrello
Contributor

Just a heads up @MrMEEE - 10.0.0 is out now, and includes this change.

@MrMEEE
Contributor

MrMEEE commented Mar 30, 2020

@ryanpetrello thanks.. A completely new build platform, CentOS8/RHEL8 support and the Redis changes are in the works.. I hope for a release after easter

@aak1989

aak1989 commented Mar 31, 2020

Hey there, I installed AWX on Kubernetes after Redis was introduced. The installation completed with no issues, but when I access the UI and try to do anything, I get errors related to the API. I am attaching a couple of screenshots of the errors I am getting.
[Screenshot: Screen Shot 2020-03-31 at 3 56 59 AM]
[Screenshot: Screen Shot 2020-03-31 at 3 58 48 AM]

@aak1989

aak1989 commented Mar 31, 2020

I have installed version 10.0.0 of AWX, and that has resolved the above issue. However, the AWX UI no longer appears to refresh automatically. For example, after starting a job, the job stays at pending until the browser is manually refreshed.

@ryanpetrello
Contributor

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

@ryanpetrello
Contributor

Also, could you file a new issue describing what you're encountering? Thanks.

@rkatta22

I have installed 10.0.0 version of AWX and that has resolved the above issue, however the AWX UI no longer appears to refresh automatically, For example, starting a job, the job always remains at pending unless you manually refresh the page. The job appears to stay in pending until the browser is manually refreshed.

Hi, I got the same error while upgrading Ansible Tower from 7.0.0 to 11.0.0 using a Docker Compose file. Now when I run the docker-compose up command, I get the error below. Please help me understand what mistake I am making here in the configuration.

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276
task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | Traceback (most recent call last):

below is my compose file:
version: '2'
services:

  web:
    image: ansible/awx_web:11.0.0
    container_name: awx_web
    depends_on:
      - redis
      - memcached
    ports:
      - "80:8052"
      - "443:8443"
    hostname: awxweb
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
      - "/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  task:
    image: ansible/awx_task:11.0.0
    container_name: awx_task
    depends_on:
      - redis
      - memcached
      - web
    hostname: awx
    user: root
    restart: unless-stopped
    volumes:
      - "/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
      - "/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
      - "/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
      - "/var/lib/awx/projects:/var/lib/awx/projects:rw"
    dns:
      - 10.204.226.77
      - 10.204.226.111
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

  redis:
    image: redis:6.0-rc4-alpine3.11
    container_name: tools_redis_1
    environment:
      REDIS_PASSWORD: password
    ports:
      - "6379:6379"
    volumes:
      - "/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
      - "/var/lib/awx/redis_socket_standalone:/var/run/redis/"
    command: ["/usr/local/etc/redis/redis.conf"]

  memcached:
    image: "memcached:alpine"
    container_name: awx_memcached
    restart: unless-stopped
    environment:
      http_proxy:
      https_proxy:
      no_proxy:

In the environment.sh file I have the following configuration:
REDIS_URL=redis://ansible-ro.rbkm0e.ng.0002.use1.cache.amazonaws.com:6379
REDIS_PORT=6379
REDIS_SOCKET=/var/lib/awx/redis.sock
REDIS_PASSWORD=password

@rkatta22

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://)
task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected)
task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276
task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285
task_1 | Traceback (most recent call last):

@tima

tima commented May 12, 2020

@rkatta22 making comments on a closed ticket is not going to receive a reply. You need to file a new issue if you think you've encountered a bug.

@ryanpetrello
Contributor

ryanpetrello commented May 12, 2020

@rkatta22 you haven't encountered a bug - you just have old configuration of some sort lying around, pointed at an old AMQP connection string from a prior install (which is no longer valid):

('Unsupported URI scheme', 'amqp')
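
A stale setting like this can be caught before the services start by checking the scheme of each configured broker URL. The sketch below is an illustration under assumptions, not part of AWX: the function name and the example URLs are made up, and the accepted schemes are simply the three listed in the ValueError quoted earlier in this thread.

```python
from urllib.parse import urlparse

# Schemes named in the ValueError above: redis://, rediss://, unix://
ALLOWED_SCHEMES = {"redis", "rediss", "unix"}


def check_broker_url(url):
    """Return None if the URL scheme is acceptable, else a readable complaint."""
    scheme = urlparse(url).scheme
    if scheme in ALLOWED_SCHEMES:
        return None
    if scheme in {"amqp", "amqps"}:
        return f"{url!r} is a leftover RabbitMQ (AMQP) URL; point it at Redis instead"
    return f"{url!r} has unsupported scheme {scheme!r}"


# Hypothetical values, as they might appear in an old and a new config file.
problems = [
    msg
    for msg in map(check_broker_url, [
        "amqp://guest:guest@rabbitmq:5672/awx",
        "redis://redis:6379",
        "unix:///var/run/redis/redis.sock",
    ])
    if msg is not None
]
```

Here only the first URL is reported, which mirrors the situation in the logs: the Redis settings may be fine while an older AMQP string is still being picked up from a leftover file.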

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project

@ryanpetrello
Contributor

@rkatta22,

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx
https://groups.google.com/forum/#!forum/awx-project


8 participants