Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logic for upgrading from Postgres 13 to 15 #122

Merged
merged 12 commits into from
Apr 9, 2024

Conversation

rooftopcellist
Copy link
Member

SUMMARY

Upgrading from postgres 13 to 15.

Related PR:

ISSUE TYPE
  • New or Enhanced Feature

@rooftopcellist
Copy link
Member Author

I can't figure out why CI is failing when running this line:

pulp content list

@aknochow @dsavineau any ideas?

@dsavineau
Copy link
Contributor

I can't figure out why CI is failing when running this line:

pulp content list

@aknochow @dsavineau any ideas?

I think this comment should have landed on the galaxy-operator repository instead of the eda operator right ?

@dsavineau
Copy link
Contributor

dsavineau commented Mar 21, 2024

not sure if I'm testing this correctly because this should be the same code than awx-operator but the upgrade fails when searching for the previous postgresql pod.

TASK [Set info for previous postgres pod] **************************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:22

ok: [localhost] => {
    "ansible_facts": {
        "sorted_old_postgres_pods": "<list_reverseiterator object at 0x7f539eaa5610>"
    },
    "changed": false
}

TASK [Set info for previous postgres pod] **************************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:29

        "old_postgres_pod": "<"
    },
    "changed": false
}
(...)
TASK [postgres : Get old PostgreSQL version] ***********************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:56

fatal: [localhost]: FAILED! => {
    "msg": "The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'metadata'\n\nThe error appears to be in '/opt/ansible/roles/postgres/tasks/check_postgres_version.yml': line 56, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n    - name: Get old PostgreSQL version\n      ^ here\n"

Also I noticed that postgres_keep_pvc_after_upgrade is used but not defined in defaults nor the CRD

@dsavineau
Copy link
Contributor

alright this seems related to the ansible version (and/or python version) used by the operator

  • EDA operator uses v1.27.0

    • ansible 2.9.27
    • python 3.8
  • AWX operator uses v1.34.0

    • ansible-core 2.15.8
    • python 3.9

if I switch to v1.34.0 then I don't have the issue anymore. I'll need to look more to understand why it's failing with the old version because we're likely to still use ansible 2.9 downstream

@dsavineau
Copy link
Contributor

dsavineau commented Mar 21, 2024

The reverse filter in ansible 2.9 returns an iterator and we cast it to list to avoid this issue.
With ansible-core 2.15 we don't have to do that but keeping the list filter isn't an issue.
I would say that we should add the list filter even if we plan to update the operator sdk to a newer image because ansible 2.9 is still use for downstream

@dsavineau
Copy link
Contributor

@rooftopcellist I've pushed some fixes for this PR so don't forget to fetch the changes if you want to push new things

With those fixes, the error I'm facing occurs during the postgresql dump/restore on the pg 15 pod

pg_restore: error: could not execute query: ERROR:  must be member of role "postgres"
Command was: ALTER SCHEMA public OWNER TO postgres;

pg_restore: warning: errors ignored on restore: 1

Note sure why we have this. I'll continue to debug it next week and try to do a similar pg upgrade for the awx operator to see if we don't have the same symptoms

@dsavineau
Copy link
Contributor

@rooftopcellist with the latest changes I pushed, I was able to upgrade from 13 to 15 without issue.

However, the change about granting the postgres role temporary to the eda postgresql user seems weird to me because we never had to do this in other operators for other postgresql upgrade (12 or 13)

c2b13fd

@kurokobo
Copy link
Contributor

@dsavineau
Copy link
Contributor

ansible/galaxy-operator#80 (comment)

I have less concerns about the postgresql upgrade from 13 to 15 for the EDA operator because it was already using postgresql image from sclorg as default.

@rooftopcellist
Copy link
Member Author

rooftopcellist commented Apr 9, 2024

@kurokobo I tested out a fresh install on k3s following similar steps to what you've shown in your awx-on-k3s repo and was not able to reproduce the permission error in the postgres pod that caused us to add these changes to the awx-operator:

I pinned the redis image to c9s and will cut an eda-server-operator 1.0.2 release so that users have a working release before the PostgreSQL 15 upgrade (which we will likely put out as a 2.0.0 release). @kurokobo I plan to wait on releasing the PG15 changes until you and/or some other community folks get a chance to try out these.

Since all of the comments have been addressed and tests are passing in every scenario I can think of, I plan to merge this shortly.

Testing

On Openshift:

I ran through a fresh install, backup, restore, and upgraded from main --> the pg15 branch all without error.

On k3s:
I did a fresh install, backup, restore and upgrade with no operator errors and no errors in the logs of the resulting deployments.

Notes for follow-up:
I noticed two things about how the pods scale down before upgrading the database:

  • Currently, it loops through deployments and scales them down one by one, which is slow. It would be faster to scale down all of the deployments at once if possible. But this is only for postgresql upgrades, which are rare, so I'm inclined to leave it the way it is.
  • we may want to scale down ui and redis as well while upgrade (just so that errors don't show in the eda-ui pod.

dsavineau and others added 7 commits April 8, 2024 20:30
With ansible 2.9.27 (operator-sdk v1.27.0) then the reverse filter
returns an iterator so we need to cast it to list.
The behavior doesn't exist when using a more recent operator-sdk
version like v1.34.0 (ansible-core 2.15.8) but using the list
filter on that version works too (even if not needed)

"sorted_old_postgres_pods": "<list_reverseiterator object at 0x7f539eaa5610>"

Signed-off-by: Dimitri Savineau <[email protected]>
During postgresql upgrade we're scaling down EDA deployments but
this requires to have proper permissions configured (patch on
deployments/scale)

message: deployments.apps "eda-default-worker" is forbidden: User
"system:serviceaccount:aap:eda-server-operator-controller-manager"
cannot patch resource "deployments/scale" in API group "apps" in
the namespace aap
reason:Forbidden

Signed-off-by: Dimitri Savineau <[email protected]>
The variable for the postgresql version had a typo. It was using
supported_postgres_version rather than supported_pg_version.

fatal: [localhost]: FAILED! => {
"msg": "The task includes an option with an undefined variable.
The error was: 'supported_postgres_version' is undefined
}

Signed-off-by: Dimitri Savineau <[email protected]>
The postgres_keep_pvc_after_upgrade variable wasn't defined in the
postgres role leading to the following error:

"msg": "The conditional check 'postgres_keep_pvc_after_upgrade' failed.
The error was: error while evaluating conditional
(postgres_keep_pvc_after_upgrade): 'postgres_keep_pvc_after_upgrade' is
undefined

Signed-off-by: Dimitri Savineau <[email protected]>
During the postgresql upgrade, we need to grant temporary the postgres
role to the eda postgresql user and remove it after the pg_restore is
over.

pg_restore: error: could not execute query: ERROR:  must be member of role "postgres"
Command was: ALTER SCHEMA public OWNER TO postgres;

Signed-off-by: Dimitri Savineau <[email protected]>
- This was removed because it is not respected by the sclorg PostgreSQL
  image
@kurokobo
Copy link
Contributor

kurokobo commented Apr 9, 2024

@rooftopcellist
Thanks, I will test 1.0.2 first, then test main branch, and let you know the results.
F.Y.I., I don't know if you know this but my repository already contains the guide to deploy EDA Server using EDA Server Operator. I've added this the last summer when I started contribution for EDA Server Operator : https://github.com/kurokobo/awx-on-k3s/tree/main/rulebooks

@rooftopcellist
Copy link
Member Author

I hadn't seen that yet, thanks for the link! I'll check out the similar one for galaxy. Thanks for checking out the changes!

@kurokobo
Copy link
Contributor

@rooftopcellist
Sorry for my delay, and sorry for adding a comment on closed PR.
I've tested for both 1.0.2 and main branch and I can confirm that both work as expected in senarios for fresh installation and upgrading.

One concern is that the postgres_data_volume_init implemented in AWX Operator is not implemented in EDA Server Operator.
As long as we use sclorg's PSQL, the permission error for UID:26 that occurred in AWX Operator can also occur in this Operator if users are using specific backend for PVs, e.g. hostPath, Longhorn, etc..
I suspect that the reason you could not reproduce it in your environment was because you reused the directory you created for AWX with chowned to 26:0.

To reproduce, please follow my guide again, with skipping sudo chown 26:0 /data/eda/postgres-13/data. Currently my guide deployes 1.0.2: https://github.com/kurokobo/awx-on-k3s/blob/main/rulebooks/README.md

I think it is a good idea to implement postgres_data_volume_init at this time, as the same issues will probably be reported as the number of users grows. I guess the reason why this is not reported yet is simply there are not many users for this Operator. Also, if the users uses my guide to deploy EDA Server, they would not faced this issue since I already added chown for UID:26 to the guide from the beginning.

Any thought?

@rooftopcellist
Copy link
Member Author

@kurokobo ahh you are absolutely correct. I did have the chown 26 line in my reproducer script because I saw it in your repo instructions.

But I agree, it's better to implement postgres_data_volume_init here as well, then users won't need that extra line for k3s hostPath and longhorn, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants