This repository has been archived by the owner on Jul 11, 2020. It is now read-only.

Update to AWX 11.2.0, Tower 3.7 #42

Closed
geerlingguy opened this issue May 18, 2020 · 11 comments

Comments

@geerlingguy
Owner

geerlingguy commented May 18, 2020

@geerlingguy geerlingguy added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels May 26, 2020
@geerlingguy
Owner Author

Strangely, the AWX/Tower installer seems to connect to Redis through a shared volume (a Unix socket) instead of communicating with a service/port over TCP...?

@geerlingguy
Owner Author

Ah, maybe it gets configured via the BROKER_URL.

@geerlingguy
Owner Author

It seems the Tower image used to be published on Quay (quay.io/ansible-tower/ansible-tower), but the official Tower OpenShift installer now pulls it from the Red Hat Registry (registry.redhat.io/ansible-tower-37/ansible-tower-rhel7), which requires a valid Red Hat subscription, and your cluster must be authenticated to that registry before it can pull the images.

It's a bit annoying, but I guess the intention may be to discourage running Tower on non-OpenShift Kubernetes clusters? Or maybe someone just forgot to run the job that pushes the Tower images to Quay.io.
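Authenticating a plain Kubernetes cluster to registry.redhat.io usually means creating a Secret of type kubernetes.io/dockerconfigjson (for example with kubectl create secret docker-registry) and listing it under imagePullSecrets in the pod spec. A minimal sketch of what that Secret contains, assuming placeholder credentials; the secret name and helper function names here are hypothetical:

```python
import base64
import json


def dockerconfigjson(registry: str, username: str, password: str) -> str:
    """Build the JSON payload Docker/Kubernetes expect for registry credentials."""
    # "auth" is base64("username:password"), as in Docker's config.json.
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    })


def pull_secret_manifest(name: str, namespace: str, registry: str,
                         username: str, password: str) -> dict:
    """Wrap the payload in a Secret manifest; Secret data values are base64-encoded."""
    payload = dockerconfigjson(registry, username, password)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": base64.b64encode(payload.encode()).decode()},
    }
```

Dumping pull_secret_manifest("redhat-registry-pull", "example-tower", "registry.redhat.io", user, token) to YAML or JSON gives something you can kubectl apply; the Tower pods would then reference redhat-registry-pull in imagePullSecrets.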

@geerlingguy
Owner Author

For now I'll push up my working branch (with the Redis changes), since it could help get at least the latest AWX versions supported.

@geerlingguy
Owner Author

Hmm... getting:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 122, in run
    res = queue.blpop(self.queues)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 1865, in blpop
    return self.execute_command('BLPOP', *keys)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 875, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 1185, in get_connection
    connection.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 557, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to example-tower-redis.example-tower.svc.cluster.local:6379. Connection refused.

@geerlingguy
Owner Author

I tried connecting with a debug container:

$ kubectl run redis-cli --rm -n example-tower -it --image=goodsmileduck/redis-cli
/ # redis-cli -h example-tower-redis.example-tower.svc.cluster.local -p 6379 ping
Could not connect to Redis at example-tower-redis.example-tower.svc.cluster.local:6379: Connection refused
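The same check can be done without redis-cli by speaking the Redis protocol (RESP) directly over a TCP socket: PING is sent as a one-element array and a healthy server answers +PONG. A minimal sketch using only the standard library; redis_ping is a hypothetical helper name:

```python
import socket

PING = b"*1\r\n$4\r\nPING\r\n"  # RESP array containing one bulk string: PING


def redis_ping(host: str, port: int = 6379, timeout: float = 2.0) -> str:
    """Return 'PONG', or a short description of why the check failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(PING)
            reply = sock.recv(64)
    except ConnectionRefusedError:
        # Nothing is listening on host:port (e.g. Redis has TCP disabled).
        return "connection refused"
    except OSError as exc:
        # DNS failure, timeout, unreachable network, etc.
        return f"error: {exc}"
    if reply.startswith(b"+PONG"):
        return "PONG"
    return f"unexpected reply: {reply!r}"
```

Run from any pod in the namespace, redis_ping("example-tower-redis.example-tower.svc.cluster.local") should report "connection refused" in the situation above.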

So then I checked the redis container logs:

$ kubectl logs -n example-tower example-tower-redis-6d5c655f9f-h495j
1:C 26 May 2020 20:30:40.828 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 26 May 2020 20:30:40.828 # Redis version=6.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 26 May 2020 20:30:40.828 # Configuration loaded
1:M 26 May 2020 20:30:40.829 * Running mode=standalone, port=0.
1:M 26 May 2020 20:30:40.829 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 26 May 2020 20:30:40.829 # Server initialized
1:M 26 May 2020 20:30:40.829 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 26 May 2020 20:30:40.830 * The server is now ready to accept connections at /var/run/redis/redis.sock

So it looks like it's only listening on the socket, and not on TCP (port=0 in the log means Redis isn't opening a TCP listener at all)...
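To spot this from a container's startup log, a small parser can pull out the TCP port and the Unix socket path. This is a sketch that assumes the Redis 6 log wording shown above; the function name is hypothetical:

```python
import re

PORT_RE = re.compile(r"Running mode=\w+, port=(\d+)\.")
SOCKET_RE = re.compile(r"ready to accept connections at (\S+)")


def listeners(log_text: str) -> dict:
    """Summarise where Redis says it is listening, based on its startup log."""
    port = None
    unix_socket = None
    for line in log_text.splitlines():
        m = PORT_RE.search(line)
        if m:
            port = int(m.group(1))
        m = SOCKET_RE.search(line)
        if m:
            unix_socket = m.group(1)
    # port=0 (or no port line at all) means no TCP listener.
    return {"tcp": port if port else None, "unix_socket": unix_socket}
```

For the log above this returns {"tcp": None, "unix_socket": "/var/run/redis/redis.sock"}, which matches the refused TCP connections.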

@geerlingguy
Owner Author

Got that fixed by switching Redis to listen on its TCP port only, but now I get the following when I try to launch a job from a template:

Call to /api/v2/job_templates/7/launch failed. POST returned status: 500. A server error has occurred.

@geerlingguy
Owner Author

Error in task container:

2020-05-26 21:12:35,014 ERROR    awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 67, in _make_backend
    backend_class = import_string(self.configs[name]["BACKEND"])
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/utils/module_loading.py", line 17, in import_string
    module = import_module(module_path)
  File "/var/lib/awx/venv/awx/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'awx.main.channels'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/tasks.py", line 16, in run_task_manager
    TaskManager().schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 583, in schedule
    self._schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/transaction.py", line 284, in __exit__
    connection.set_autocommit(True)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 410, in set_autocommit
    self.run_and_clear_commit_hooks()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 636, in run_and_clear_commit_hooks
    func()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1255, in <lambda>
    connection.on_commit(lambda: self._websocket_emit_status(status))
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1245, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/consumers.py", line 230, in emit_channel_notification
    channel_layer = get_channel_layer()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 363, in get_channel_layer
    return channel_layers[alias]
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 80, in __getitem__
    self.backends[key] = self.make_backend(key)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 46, in make_backend
    return self._make_backend(name, config)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 73, in _make_backend
    % (self.configs[name]["BACKEND"], name)
channels.exceptions.InvalidChannelLayerError: Cannot import BACKEND 'awx.main.channels.RedisGroupBroadcastChannelLayer' specified for default
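The first traceback shows what went wrong: Channels resolves the BACKEND string with Django's import_string, and the module part (awx.main.channels) doesn't exist in the installed AWX, so the settings file is pointing at a class this version doesn't ship. A minimal sketch of the same resolution step, which could be run inside the task container against whatever BACKEND the settings specify; check_backend is a hypothetical helper:

```python
import importlib


def check_backend(dotted_path: str) -> str:
    """Resolve 'package.module.ClassName' roughly the way import_string does."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        return f"{dotted_path!r} is not a dotted path"
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        # Same failure as above: the module part of the path can't be imported.
        return f"missing module: {exc.name}"
    if not hasattr(module, attr):
        return f"module {module_path!r} has no attribute {attr!r}"
    return "ok"
```

Something like check_backend("awx.main.channels.RedisGroupBroadcastChannelLayer") should report a missing module on an image where that class was never shipped.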

@geerlingguy
Owner Author

It looks like my CHANNEL_LAYERS / BROKER_URL needed some updating: https://github.com/ansible/awx/blob/devel/awx/settings/defaults.py#L932-L941

It looks like AWX's default install uses:

BROKER_URL = 'unix:///var/run/redis/redis.sock'

That's a little wild, as it assumes Redis is running on the same host with its Unix socket available... not a very sustainable solution if you want to run Redis with HA or as a separate, scalable instance.
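The difference is visible in the URL scheme alone: unix:// names a socket file that only processes sharing that filesystem can reach, while a redis:// URL names a host and port that can sit behind a Kubernetes Service. A sketch that classifies a broker URL, assuming the redis:// / rediss:// form for TCP (the Service hostname is the one from the earlier traceback; describe_broker is hypothetical):

```python
from urllib.parse import urlparse


def describe_broker(url: str) -> dict:
    """Classify a Redis broker URL as a local Unix socket or a TCP endpoint."""
    parsed = urlparse(url)
    if parsed.scheme == "unix":
        # Only reachable from the same host/pod that mounts the socket file.
        return {"transport": "unix", "path": parsed.path}
    if parsed.scheme in ("redis", "rediss"):
        # Reachable over the network, e.g. through a Kubernetes Service.
        return {
            "transport": "tcp",
            "host": parsed.hostname,
            "port": parsed.port or 6379,  # Redis default port if omitted
            "tls": parsed.scheme == "rediss",
        }
    raise ValueError(f"unsupported broker scheme: {parsed.scheme!r}")
```

describe_broker("unix:///var/run/redis/redis.sock") gives {"transport": "unix", "path": "/var/run/redis/redis.sock"}, while a URL like redis://example-tower-redis.example-tower.svc.cluster.local:6379 is classified as TCP on port 6379.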

@geerlingguy
Owner Author

I asked about the choice of a socket instead of TCP as the default, and two main reasons were given:

  • Slightly better latency when running on the same machine.
  • Slightly better security than running over TCP.

I concede that these reasons are fine for single-server deployments, but things get a bit murky when deploying to K8s/OCP, or even Docker (though a Docker setup is probably on a single machine 99% of the time).

@geerlingguy
Owner Author

The latest commit works fine in CI now that I've reduced the memory allocations (I think with the task and web containers using 1Gi each, we were bumping into the CI instance's RAM limits!).
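For a rough sense of scale, Kubernetes memory quantities use binary suffixes (Ki, Mi, Gi) and decimal ones (k, M, G), so two containers at 1Gi each already ask for 2 GiB before anything else on the CI node is counted. A small sketch that only handles integer quantities; to_bytes is a hypothetical helper, not the full Kubernetes quantity grammar:

```python
import re

# Binary and decimal suffixes used in Kubernetes resource quantities.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "": 1,
}
QUANTITY_RE = re.compile(r"^(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def to_bytes(quantity: str) -> int:
    """Convert an integer quantity like '1Gi' or '512Mi' to bytes."""
    m = QUANTITY_RE.match(quantity)
    if not m:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = m.groups()
    return int(number) * SUFFIXES[suffix or ""]


# task + web containers at 1Gi each
total = sum(to_bytes(q) for q in ["1Gi", "1Gi"])
```

Here total is 2147483648 bytes (2 GiB), which is the amount the CI runner has to fit alongside the rest of the deployment.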

I haven't fully tested Tower 3.7.0 yet, but may try again tonight when I have a little more time to set up the pull secret for the Red Hat Registry.

However, this image is ready to go, and anyone using it for AWX will be happy to be able to install the latest version again, now using Redis for the queue.
