Vanilla install 11.0.0 fails #6792
Comments
I can confirm this is happening. I just did a clean install as well. Same problem.
Yep, I'm able to reproduce this. Looking into it.
Actually, I can't reproduce this; as the error message above suggested, it was just migrating (which took a minute).
Here's my asciinema. This is on a cleanly installed Debian 10 host with docker and ansible. https://asciinema.org/a/b1jgaeSFWiv6jHkFmPpO8aLTI?t=13 This is the vars.yml:
I'm trying to set up AWX using docker-compose and I'm having the same problem as the OP: an infinite loop (30 minutes so far) of Ansible trying to perform the migrations. I will test again from scratch and report as soon as possible.
It never finishes the migrations on my hosts, at least not within an hour. I still have it running, so I can have a look again tomorrow ;-)
Do you see any errors related to migrations? What happens if you exec into the web container and run:
by hand?
Maybe unrelated to this issue, but release 11.1.0 has the same errors. After about 15 minutes of error messages it seems to resume its proper routine.
There must be a difference somewhere. The Python runtime environment perhaps?
Hmmm, I get different output than @JG127.
After this I do get the login prompt, but somehow I cannot log in.
After the migrations I still get the crashing dispatcher:
Adding the logs of the installer and the logs of the very first "docker-compose up" command: initial_start.log.tar.gz. I've got the impression the web service is waking up too early: it logs all sorts of errors for as long as the task service hasn't finished the migrations. Or rather, the task service wakes up late for the migrations; it's only at the end of the log file that it begins to actually do something. The net result, however, is that the system is functional, albeit it took its time to get to that point. Maybe a network timeout somewhere?
I agree, @JG127, this does sound like some sort of timing issue/race on startup. I've still been unable to reproduce it, so if any of you find any additional clues, please let me know and I'm glad to help dig.
The only thing coming to mind is the Python environment used to run the installer and the application. I always work in a virtualenv environment when doing Python projects; otherwise I'd end up using the libraries of the OS-installed software. This is the elaborate description of what I do to set up the runtime environment:
I repeat the process with Python 3, since Python 2 is deprecated and AWX uses Python 3 to run Ansible. It's the very same routine as described above except for the virtualenv command. Replace
with
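For anyone following along, the Python 2 vs. Python 3 substitution described above can be sketched roughly like this (the env name `awx-env` and the commented package list are my own illustration, not from the original comment):

```shell
# Python 2 era command (deprecated):
#   virtualenv awx-env
# Python 3 equivalent, using the stdlib venv module:
python3 -m venv awx-env

# Activate and confirm the interpreter comes from the venv
. awx-env/bin/activate
python -V

# The installer requirements would be installed at this point, e.g.
#   pip install ansible docker docker-compose   # assumed package list
deactivate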
Check of the releases
The very same errors pop up. A long shot is that by some fluke you are using different versions of the involved Docker images; you might want to check the image IDs to make certain those are the same.
(Why use both Redis and Memcached, by the way?) And a very long shot is that it makes a difference to do this in a virtual machine. I am using VirtualBox 6.1.16 on a Windows 10 host, as per company regulations.
Maybe this will shed some light ... While the task service is just sitting there it consumes 100% CPU. The processes:
Is there something I can try to find out what it is doing?
What do the task logs/stdout say? Maybe try something like
No logging at all, neither in the docker logs nor in the log files in /var/log. This means the issue happens very early in the code, before it logs anything.
I'm experiencing the same problem. I'm not able to start any release. I've tried 9.3.0, 10.0.0, and 11.1.0.
If this were a Java-based system I would dump the threads or do a CPU profile. Python, however, is a bit new to me. Is there a way to CPU-profile a Python process?
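Python does have first-class CPU profiling. For a script you launch yourself the stdlib cProfile module works; for attaching to an already-running process (like the spinning task container), the external py-spy tool (`py-spy dump` / `py-spy top --pid PID`) is the usual choice. A toy cProfile run (the script and its function name are made up for illustration):

```shell
# Create a toy workload standing in for the busy task process
cat > /tmp/busy.py <<'EOF'
def spin():
    total = 0
    for i in range(200_000):
        total += i * i
    return total

if __name__ == "__main__":
    spin()
EOF

# -s cumtime sorts by cumulative time so the hottest call paths surface first
python3 -m cProfile -s cumtime /tmp/busy.py | head -n 12
```

The profile table lists each function with its call count and cumulative time, which is usually enough to see where a busy-looping process is stuck.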
I ran into this problem today as well. I was able to fix it by deleting all of the running docker containers and then running
Hope this helps.
I am getting the same error as @JG127 . Starting with:
The fresh install does not work for me, for whatever reason. In addition to that, I get the error I posted earlier. Therefore I investigated pg_hba.conf; there you can find no allow entry for that host. I then modified pg_hba.conf like this, allowing everything in 172.16.0.0/12:
After saving the changes and starting the containers again, the problem is gone. It looks like, because of the error in the init.py, pg_hba.conf does not get populated correctly.
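For reference, an entry of the kind described would look something like this (illustrative only; the database/user columns and auth method depend on your setup, and 172.16.0.0/12 is the private range Docker's default bridge networks are drawn from):

```
# pg_hba.conf — hypothetical allow entry for the Docker bridge range
# TYPE  DATABASE  USER  ADDRESS         METHOD
host    all       all   172.16.0.0/12   md5
```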
@Naf3tsR It works for me too!
Yes! :-) Can this be fixed via Docker (Compose)? I'd rather not hack my way in.
@Naf3tsR Thanks, it works for me. I'm using the 11.2.0 version, which still has the issue.
Hi, I had the same problem; it was resolved by using PostgreSQL 12.
Upgrading to PostgreSQL 12 didn't help for me.
I'm having the same problem in version 11.2.0 using docker-compose; however, on the second attempt it always works. I'm using PostgreSQL in compose as well. My logs are similar to @roedie's. Out of curiosity, did anyone have the problem using OpenShift or Kubernetes?
I just tried a fresh install with the 13.0.0 (docker-compose mode on debian 10). It seems to give the "main_instance" error too:
Then I ran
Tried with the latest version (v13.0.0) on a clean env (Debian 10) and didn't have issues. NOTE: Using an external PostgreSQL 11.8 database.
I see this problem in my CI job that builds my AWX containers. Approximately one out of 10 starts fails. It seems that something is killing (or restarting) the postgres container quite early:
Which in turn causes the migration task inside the awx_task container to fail:
Restarting the awx_task container seems to restart the migration process, which then works. So the question is: what is restarting the awx_postgres container? From the postgres container entrypoint:
If the migrate process starts during the postgresql initialization, then the connection will be dropped as soon as the temp server stops.
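Given that race, a single successful connection is not a reliable readiness signal; the check has to be retried until it succeeds against the final server. A generic retry sketch (the `pg_isready` invocation in the comment is a hypothetical example of the real check, not code from the installer):

```shell
wait_for() {
  # Retry "$@" until it succeeds, up to 30 attempts 2 seconds apart.
  # In the real setup "$@" would be a readiness check such as:
  #   pg_isready -h postgres -U awx   (hypothetical host/user)
  attempts=0
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 30 ]; then
      echo "gave up waiting" >&2
      return 1
    fi
    sleep 2
  done
}

wait_for true && echo "ready"
```

Even this only narrows the window, though: if the check happens to succeed against the temporary init-time server, the loop exits and the race remains.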
Came here while investigating a release 13.0.0 install issue that was also giving me
I ended up reverting to release 12.0.0 while troubleshooting, assuming it was a dirty release. I was eventually able to get into the web UI for 12.0.0 using two steps.
This appears to be a valid workaround for a new docker-compose install, at least with my config.
Ran into the same
as mentioned by @johnrkriter (thanks a lot!)
I had the same problem, and it was resolved the same way...
The workaround works for me as well. Is this going to be fixed?
This issue is still around; the workaround is still
docker exec awx_task awx-manage migrate
docker container restart awx_task
A pity that nobody from the project seems interested... :-(
This looks like a race condition between the pg container and the awx-task container. Since I am not familiar with the project structure, it will probably take me some time to find the right place to look :) I will update this as soon as I find something.
So I think the description from @anxstj is pretty accurate and complete; we now just need to figure out a good way to wait for the postgres init to finish before we start the migration. Does anybody have a good idea how to do that?
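One common pattern (a sketch, not the project's actual compose file; service names, image tags, and the `awx` user here are assumptions) is to give postgres a healthcheck and gate the task container on it. Note that `depends_on` with `condition: service_healthy` requires a Compose version that supports conditions (v2.1 files or the newer Compose specification, not classic v3):

```yaml
services:
  postgres:
    image: postgres:11
    healthcheck:
      # pg_isready exits 0 once the server accepts connections
      test: ["CMD-SHELL", "pg_isready -U awx"]
      interval: 5s
      retries: 12
  task:
    image: ansible/awx_task
    depends_on:
      postgres:
        condition: service_healthy
```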
Judging from the issue itself, I think the best option to fix this is to make the awx-task container (or the script running in it) fail if the migration fails, instead of trying to continue with something that will never succeed. The DB migrations themselves should be idempotent, so just failing and starting fresh should be fine.
This change will make docker-compose wait for both the redis and the postgres container before starting the other services. related ansible#6792
Does it work after applying #8497?
This change will make the task container wait for postgres before running the migrations. related ansible#6792
@dasJ I think the patch could still run into the issue where the "while loop" exits because it successfully accesses the instance of the "docker_temp_server" that will be killed directly after. It makes it a lot more unlikely, but I do not see how the patch would completely avoid this case. I think the real fix here is to make the awx-task container completely fail on this error and start fresh (not sure if the "set -e" is enough here, since IIRC all errors are still retried when running "awx-migrate").
@jklare What do you mean by "docker_temp_server"? I can find it neither in this repo nor when googling.
@dasJ It is mentioned in the description @anxstj gave a couple of posts above. This temp postgres server instance is basically run when starting the postgres container, to perform the initial setup (https://github.com/docker-library/postgres/blob/master/11/docker-entrypoint.sh#L297-L302). And since this temp server is available and accepting connections, and then killed/stopped shortly after, it causes the race condition.
Oof. But I can't think of a way to prevent this issue :/
But it's probably a lot less severe, since the second loop only succeeds when connecting as the `awx` user succeeds, which it only does when the postgres init script has already run (and not while it's running).
Yeah, I think it will be a lot less likely to happen (just running any command will already help here), but I think the real fix would be to make the awx-task container fail hard and exit if the DB migration fails for any reason. It will be automatically restarted after exiting, the migration will be retried, and it will work the second time for sure. I have not had time yet to investigate where to make the migration task inside the awx-task container fail; sorry for not being super helpful here :(
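The fail-hard idea sketched in shell (illustrative only; the placeholder function stands in for whatever the real entrypoint runs, such as an `awx-manage migrate` call, so the sketch is self-contained):

```shell
#!/bin/sh
# Exit non-zero on any failure so Docker's restart policy relaunches the
# container from a clean state instead of looping on a dead connection.
set -e

run_migrations() {
  # Placeholder for the real call, e.g.: awx-manage migrate --noinput
  echo "running migrations"
}

run_migrations
echo "migrations done, starting service"
```

Combined with a `restart: always` policy, a migration that hits the init-time race would kill the container, and the retry would find a fully initialized database.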
I was hoping my
Had a lot of head-scratching in the last week, as I'd been getting this 100% repeatably on my system. Two solutions are referenced in this issue, but bear in mind you have to leave about a 2½ minute gap after running the initial playbook (assuming the database is being created from scratch) before taking the remedial action (my system is a quad-core Atom C2550 with 8 GB RAM running Ubuntu 20.04 LTS). It would be very useful to get this resolved, since the out-of-the-box experience with the simplest AWX configuration seems to be problematic (and so likely inhibits adoption) and has been for a while (I tried from HEAD back to, I think, 10 as the earliest). So, to reiterate, the two scriptable solutions are:
as mentioned earlier in this thread by a few people (although it gave me an exception on the upgrade:
as per #6931 |
Any chance this is fixed in 16.0.0?
No updates on this issue in a while. Going to assume it was fixed or not relevant for newer versions. |
TASK [local_docker : Check for existing Postgres data (run from inside the container for access to file)] ***********************************************************************************************************************************
I'm facing the above-mentioned issue while running
ISSUE TYPE
SUMMARY
A fresh install of the 11.0.0 release doesn't work, even though the installation instructions are followed. There are SQL errors and a recurring error about clustering.
ENVIRONMENT
STEPS TO REPRODUCE
The installation playbook runs without apparent errors. However, when checking the Docker Compose logs there are loads of SQL errors and cluster errors, as shown below.
The procedure was repeated with the line "dockerhub_base=ansible" commented out in the inventory file, to make certain the AWX Docker images are built locally and in sync with the installer. The very same errors happen.
EXPECTED RESULTS
No errors in the logs and a fully functional application.
ACTUAL RESULTS
The logs are filling with errors and the application is not fully functional. Sometimes I'm getting an angry potato logo; I've added a screenshot as an attachment. What is it used for? :-)
The odd thing, however, is that when there is no angry potato logo the application seems to be functional (i.e. management jobs can be run successfully), despite the huge number of errors in the logs.
When there is an angry potato logo I can log in but not run jobs.
ADDITIONAL INFORMATION
These SQL statement errors below are repeated very frequently: the relations "conf_setting" and "main_instance" do not exist.
This error about clustering is repeated very frequently: