Replace clustered RabbitMQ with something simpler #5443
Comments
some additional work items under this:
cc @MrMEEE in case you haven't seen this yet. Also: https://groups.google.com/forum/#!topic/awx-project/lRnm2vB1oEQ
@ryanpetrello Thanks for the heads-up.. I will follow this closely :)
@MrMEEE the biggest change is "install and configure Redis, not RabbitMQ". Also, we lost a number of RabbitMQ toggle-ables in the installer. You may be interested in any changes under https://github.com/ansible/awx/pull/6034/files#diff-bfa9126dc8059138bf7554d741cb6a5d
Extensive testing was done before merge to ensure the installation works as expected and that replacing RabbitMQ with Redis does not introduce regressions. With that said, we can consider this verified; any further polishing will be handled in separate issues (some of those are already open).
Just a heads up @MrMEEE - 10.0.0 is out now, and includes this change. |
@ryanpetrello thanks.. A completely new build platform, CentOS 8/RHEL 8 support, and the Redis changes are in the works.. I hope for a release after Easter
I have installed AWX 10.0.0 and that has resolved the above issue; however, the AWX UI no longer appears to refresh automatically. For example, after starting a job, the job stays in "pending" until the browser page is manually refreshed.
@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the |
Also, could you file a new issue describing what you're encountering? Thanks. |
Hi, I got the same error when upgrading Ansible Tower from 7.0.0 to 11.0.0 via a docker-compose file. Now when I run the `docker-compose up` command, I get the error below. Please help me find the mistake I am making in the configuration: ValueError: Redis URL must specify one of the following schemes (redis://, rediss://, unix://) below is my compose file: web:
task: and in the environment.sh file I have done the below configuration:
ValueError: Redis URL must specify one of the following schemes (redis://, rediss://, unix://)
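For context on the error above: it is the kind of message raised by a broker-URL scheme check. A minimal sketch of such a check (illustrative only, not AWX's actual validation code; `validate_broker_url` is a made-up name):

```python
from urllib.parse import urlparse

# Schemes named in the error message; anything else (e.g. a leftover
# amqp:// connection string from a pre-10.0 install) is rejected.
VALID_SCHEMES = {"redis", "rediss", "unix"}

def validate_broker_url(url: str) -> str:
    """Raise ValueError unless the URL uses a Redis-compatible scheme."""
    if urlparse(url).scheme not in VALID_SCHEMES:
        raise ValueError(
            "Redis URL must specify one of the following schemes "
            "(redis://, rediss://, unix://)"
        )
    return url

# A stale AMQP URL from a prior install fails this check;
# a redis:// URL passes.
validate_broker_url("redis://localhost:6379/0")
```

This is why the fix is configuration cleanup rather than a code change: any setting still pointing at an `amqp://` URL needs to be updated or removed.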
@rkatta22 making comments on a closed ticket is not going to receive a reply. You need to file a new issue if you think you've encountered a bug. |
@rkatta22 you haven't encountered a bug - you just have old configuration of some sort laying around pointed at an old AMQP connection string from a prior install (which is no longer valid):
If you need help troubleshooting an AWX install, try our mailing list or IRC channel: http://webchat.freenode.net/?channels=ansible-awx |
ISSUE TYPE
SUMMARY
Replace our clustered implementation of RabbitMQ with something that is easier to understand and operate (and that matches AWX's needs better).
AWX currently makes extensive use of clustered RabbitMQ:
As a form of direct topic-based RPC for dispatching jobs (e.g., playbook runs) to underlying AWX instances. This process involves a periodic scheduler that wakes up, finds work to do, picks an available node with capacity, and places a message on its queue, which is treated as a sort of per-instance "task queue" à la https://python-rq.org or https://docs.celeryproject.org/en/stable/. Certain special messages (which generally are used to perform internal housekeeping tasks in AWX) are "broadcast" to all nodes instead of following a direct RPC topology.
As a buffer for processing job output (Ansible callback events/stdout) via AWX's "callback receiver" process running on each AWX instance.
As a backend for AWX's websocket support for in-browser streaming stdout and live job status updates. Our websocket implementation is based on a custom AMQP-specific ASGI backend which we wrote and maintain, https://github.com/ansible/asgi_amqp/. As time has marched on and the upstream channels library has drastically changed its architecture in anticipation of native async support in Python 3, it has become an increasing maintenance burden for us to continue to support a custom backend specific to AMQP (especially when it appears that pretty much everybody upstream that uses Channels is just using Redis).
When we originally designed this system years ago, we optimized as heavily as possible for data integrity and safety. But in the scenarios described above, the data we manage under this system is largely ephemeral. In the most extreme cases, it doesn't persist beyond the lifetime of a running playbook. In other words, if a node running a playbook were to suddenly go offline, we can't really recover from that sort of scenario anyways without re-running the playbook. Similarly, if messages are lost in flight in rare circumstances, you can always just relaunch a playbook.
We're paying a heavy cost for this cluster-wide data mirroring/replication. Historically, we've heard from many of our users that:
RabbitMQ clustering doesn't work well in environments unless cluster peers have very low latency. In fact, this is a limitation called out repeatedly in RabbitMQ's clustering documentation. It's an aspect of RabbitMQ clustering that we knew about when we chose it years ago, but it's turned out to be much more painful than we anticipated.
Especially in environments with unreliable networks, RabbitMQ can be very difficult to administer and troubleshoot. In particular, we regularly have users that report network partitioning scenarios that require manual intervention via manual erlang and/or RabbitMQ-specific remediation.
When cluster nodes disappear for prolonged periods of time (hours, days), we've seen many situations where RabbitMQ clustering just isn't able to recover on its own, which causes a myriad of issues when the node returns. Detecting and remediating this often leads to service outages.
The firewall/security group requirements for inter-node replication are a common source of confusion for users, and getting them wrong can cause adding a node to an existing cluster to fail, sometimes triggering an unanticipated cluster-wide outage.
What we've come to realize is that this architecture is likely not worth the operational and architectural cost we're paying.
Long-term, we'd prefer to move to a model that does not require a control plane that relies on a clustered message bus, but instead one where members of the control plane can largely drop off with minimal effect beyond lowered total execution capacity. RabbitMQ clustering explicitly is not reliable across AZs, and especially not regions, and while the newer topologies we're considering don't absolve us of this entirely, our goal is to move AWX to a model which is much more forgiving of network latency in general.
In the next major version of AWX, we'd like to investigate replacing RabbitMQ with a combination of features provided by Redis (a new dependency) and Postgres itself. This would most likely look something like this:
Dispatching tasks is still treated as “direct RPC”. In other words, when the task manager runs, it picks one cluster node with capacity, and assigns it as the “execution node”. Dispatcher processes running on every node listen for “tasks” via PostgreSQL channel notification support (https://www.postgresql.org/docs/10/sql-notify.html)
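A sketch of what a dispatcher listening via PostgreSQL channel notifications might look like, using `psycopg2`. The channel name, connection string, and the `encode_task`/`decode_task` helpers are all assumptions for illustration, not AWX's actual implementation:

```python
import json
import select

def encode_task(task_name: str, args: dict) -> str:
    """Serialize a task into a NOTIFY payload (payloads are size-limited,
    so large data would have to live in the database, not the message)."""
    return json.dumps({"task": task_name, "args": args})

def decode_task(payload: str) -> dict:
    return json.loads(payload)

if __name__ == "__main__":
    # Requires psycopg2 and a reachable PostgreSQL; details are made up.
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=awx user=awx")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    # Each dispatcher listens on its own per-node channel; the task
    # manager targets it with: SELECT pg_notify('awx-node-1', payload);
    cur.execute('LISTEN "awx-node-1";')
    while True:
        if select.select([conn], [], [], 5) != ([], [], []):
            conn.poll()
            while conn.notifies:
                note = conn.notifies.pop(0)
                task = decode_task(note.payload)
                print("dispatching", task["task"])
```

One consequence of this design is that the database becomes the single shared dependency for dispatch, which fits the goal of dropping the clustered message bus.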
Events emitted from playbooks are no longer sent to a distributed message queue (previously RabbitMQ), but instead a local redis running on each node. Callback receivers on each node listen for events on that node and persist them into the database.
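The local-Redis buffering described above could be sketched with `redis-py` as a simple list used as a queue. The key name, event fields, and producer/consumer split here are illustrative assumptions, not AWX's actual callback-receiver code:

```python
import json

EVENT_QUEUE = "callback_events"  # illustrative key name, not AWX's

def serialize_event(job_id: int, counter: int, stdout: str) -> str:
    return json.dumps({"job_id": job_id, "counter": counter, "stdout": stdout})

def deserialize_event(raw) -> dict:
    return json.loads(raw)

if __name__ == "__main__":
    # Requires redis-py and the per-node Redis on localhost; a sketch of
    # the producer/consumer split, not a real callback receiver.
    import redis
    r = redis.Redis()  # localhost:6379, the local node's Redis

    # Producer (the playbook's callback plugin) pushes events locally:
    r.rpush(EVENT_QUEUE, serialize_event(42, 1, "PLAY [all] ****"))

    # Consumer (the callback receiver on the same node) blocks until an
    # event arrives, then persists it to the database:
    _key, raw = r.blpop(EVENT_QUEUE)
    event = deserialize_event(raw)
    print("persisting event", event["counter"], "for job", event["job_id"])
```

Because the buffer is node-local, losing a node loses only that node's in-flight events, which matches the "ephemeral data" argument made earlier in the issue.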
When an event is persisted to the database by the callback receiver, it is also broadcast to all cluster peers via ASGI. In this way, if a playbook runs on Node A, users connected to Daphne on Nodes B, C, and D will receive a broadcast of these events and see the output in their browser tabs.
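The fan-out behavior described above can be modeled with a small in-process stand-in for a channel-layer group (the real implementation would go through an ASGI channel layer; `BroadcastGroup` is a made-up illustration of the topology, not the channels API):

```python
from collections import defaultdict
from typing import Dict, List

class BroadcastGroup:
    """In-process stand-in for a channel-layer group: when Node A persists
    an event, every subscribed peer (Daphne on Nodes B, C, D) gets a copy
    to relay to its connected browser tabs."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[dict]] = defaultdict(list)

    def subscribe(self, node: str) -> None:
        self.subscribers[node]  # touch the key to create an empty inbox

    def broadcast(self, event: dict) -> None:
        # Deliver a copy of the event to every peer's inbox.
        for inbox in self.subscribers.values():
            inbox.append(event)

group = BroadcastGroup()
for node in ("node-b", "node-c", "node-d"):
    group.subscribe(node)

# Node A persists an event, then fans it out to all connected peers:
group.broadcast({"job_id": 42, "stdout": "TASK [ping] ****"})
```

The key property is that the node running the playbook does not need to know which peers have interested websocket clients; it broadcasts once and each peer filters for its own connections.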
Longer term, introducing Redis would potentially allow us to also lose our dependence on memcached (so in other words, we might be able to swap out two dependencies, and replace them with one single new dependency).