Upgrading to PostgreSQL 15 and moving to sclorg images #80

rooftopcellist · 2024-03-07T22:23:17Z

SUMMARY

Upgrading from PostgreSQL 13 to 15.

Major changes

Use new v1.34.1 ansible-operator base image and operator_sdk.util 0.5.0 - commit
The boolean for deleting the old Postgres PVC by default after Postgres upgrade is now fixed - commit
- postgres_keep_pvc_after_upgrade: false means the old PG13 PVC will be deleted after upgrade by default
Add checksum for secrets and configmaps to deployments so containers - commit
Set new postgres configuration secret if managed database. This means you no longer need to delete existing postgres_configuration secrets in the namespace before restoring - commit
Add initContainer to initial Postgres data volume permissions if needed - commit

This is for k8s deployments only.

Add postgres init container if postgres_data_volume_init is true.

This is aimed to solve the issue where users may need to chmod or chown the postgres data volume for user 26, which is the user that is running postgres in the sclorg image.

For example, one can now set the follow on the AWX spec:

spec:
  postgres_data_volume_init: true
  postgres_init_container_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data

Minor changes

Update scale down utility playbook to scale down the correct deployments - commit
Improved logic for checking PG_VERSION paths for both postgres and sclorg postgresql image commit
Set POSTGRES_* env vars from secrets so when pods cycle values are updated - commit
Scale up web and content replicas after upgrade and restore (previously it waited for the next reconciliation loop which took more time - commit
Specify Redis 7 image explicitly so that when we bump the version it will be intentional - commit
PostgreSQL: Cast sorted_old_postgres_pods as list for compatiblity with Ansible 2.9.x and 2.15.x - commit
PostgreSQL: Grant postgres role to galaxy for compatibility with new sclorg image - commit
Change default postgres_data_path to that of the new sclorg image (/var/lib/pgsql/data/userdata) - commit
Add database configuration docs - commit
Refactor backup content logic into k8s job to enable custom securityContext - commit
- Set runAsUser to 1000 (galaxy user) for management pods in k8s
  so that it can access the contents of /var/lib/pulp
- Add initContainer for copying content from /var/lib/pulp during
  backups
- Add separate k8s job for copying content for file storage during
  restores
- add rbac rules for cronjobs and jobs to operator SA
- In k8s, set content pod user to 1000 like in the api deployment
- Set UID 1000 in k8s for backup and restore management pods
- Add a ttl for the k8s jobs so that the content PVC can be deleted if
  desired without ownerReference issues
Fix a bug where ! characters in the PGPASSWORD are not parsed correctly - commit
Add --ansible-log-events flag to Dockerfile to make it easier to change verbosity - commit
- Options are documented here: https://sdk.operatorframework.io/docs/building-operators/ansible/reference/advanced_options/#using-ansible-log-events
Rewrite backup and restore cr_object handling to store in yaml instead of json that can be invalid in some scenarios - commit
- Refactor how cluster_name is set so that the period is excluded if set
  to an empty string.
Wait for the Postgres Service to be ready & create server secret earlier - commit
- If the content or worker pods come up too fast and the postgres
  service is not ready, we get errors in those containers and it crash
  loops. This resolves that.
- Move all of the configuration and variables needed to create the
  galaxy-server secret to the common role. Prior to this, we would see
  errors when applying the content and worker deployments because the
  server-secret did not yet exist.
- Before applying the content and worker deployments, check to make sure
  the galaxy-server secret exists.
Refactor: Split Galaxy Server secret into a dedicated role - commit
- without this change, nodeport deployments failed because the web
  service was node created by the time get_node_ip.yml was run.
- Re-enable all PR checks
During backup, use the deployed application image specified on the CR for the backup k8s job that copies the contents of /var/lib/pulp - commit
- Add ability to configure backup_resource_requirements and
  restore_resource_requirements
- Deprecate backup_pvc_namespace parameter
- Only add file-storage-pvc checksum if file_storage is enabled
Always specify api_version with k8s tasks using the Pod resource - commit

config/manager/manager.yaml

rooftopcellist · 2024-03-08T22:34:37Z

I am about to be OOO for 2 weeks, so just a note to anyone who merges this. We should not squash the commits here as some of the changes here we may want to port to other operators (specifically the change that checks both paths for PG_VERSION).

roles/postgres/defaults/main.yml

roles/postgres/tasks/upgrade_postgres.yml

roles/redis/defaults/main.yml

aknochow · 2024-03-22T01:07:20Z

Looking into why those 2 checks are failing...

dsavineau · 2024-03-25T23:49:08Z

@rooftopcellist I rebased this PR and added two commits (same as eda operator) in order to get the upgrade working

kurokobo · 2024-03-27T06:40:44Z

Hi, thanks for working on this!

In changing PSQL from docker.io to sclorg, some considerations need to be made.

The path where PV should be mounted is /var/lib/pgsql/data, not /var/lib/postgresql/data
- Change default value for postgres_data_path awx-operator#1766
- This causes data loss, and this is why AWX Operator 2.13.0 is marked as DO NOT USE and removed from quay.io
Data path cannot be changed using the PGDATA environment variable
- postgres_data_path not respected by postgres-15 causing data lost awx-operator#1790
postgres_data_path should be fixed to /var/lib/pgsql/data/userdata and should be ignored if postgres_data_path is modified in Galaxy CR. If the PV is mounted on a path that is not actually used, data persistence may be broken.
- postgres_data_path should be deprecated in the CRD, and docs should be updated
- As a test, restart the PSQL pod after installation to make sure no data is lost.
The executing user is UID 26, not UID 0. This may cause permission issues for subpath data
- Postgres 15 pod: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied" awx-operator#1770

AWX Operator 2.13.0 has adopted sqlorg's PSQL, but there are a number of open issues, so it may be safer for Galaxy Operator and EDA Server Operator to collectively address them after things settle down.
EDA Server Operator has been using sclorg's PSQL 13 from the start, so the postgres_data_path problem seems unlikely.

rooftopcellist · 2024-04-01T17:29:32Z

Thank you for the patches @dsavineau !

@kurokobo Thank for your notes. I think I have addressed all of these now:

The default postgres data path is now /var/lib/pgsql/data/userdata
Since the PGDATA environment variable is no longer respected by the sclorg image, I removed the ability to configure it as we did in the awx-operator. If a user has postgres_data_path defined on their Galaxy spec, it will be ignored (the k8s validator omits it silently when Galaxy CR is applied).

Note: I confirmed that in the sclorg and registry.redhat.io provided PostgreSQL images, the data dir path that is actually used is /var/lib/pgsql/data/userdata/, even though PGDATA gets set to the value passed into the StatefulSet yaml. So we no longer specify the PGDATA env var to avoid confusion, since it is not respected by the PostgreSQL image anyways.

I still need to try to reproduce the UID 26 permissions issue you mentioned (link), I don't see it on my openshift cluster, but that makes sense because permissions are handled differently there. I'll try to reproduce on a k8s cluster.

kurokobo · 2024-04-02T06:14:42Z

Just added link to my comment on the issue: ansible/awx-operator#1770 (comment)

rooftopcellist · 2024-04-03T20:04:40Z

I have now added in the postgres_data_path related changes from this PR:

Remove the ability to customize the postgres_data_dir awx-operator#1798

I also just included the changes to make it possible initialize the postgresql data volume with the correct permissions if running on k8s with a hostMounted volume.

Add postgres init container to resolve permissions for some k3s deployments awx-operator#1805

…ontext - Set runAsUser to 1000 (galaxy user) for management pods in k8s so that it can access the contents of /var/lib/pulp - Add initContainer for copying content from /var/lib/pulp during backups - Add separate k8s job for copying content for file storage during restores - add rbac rules for cronjobs and jobs to operator SA - In k8s, set content pod user to 1000 like in the api deployment - Set UID 1000 in k8s for backup and restore management pods - Add a ttl for the k8s jobs so that the content PVC can be deleted if desired without ownerReference issues

* This makes fixes a bug where ! characters in the PGPASSWORD are not parsed correctly

…ge verbosity * Options are documented here: https://sdk.operatorframework.io/docs/building-operators/ansible/reference/advanced_options/#using-ansible-log-events

…nvalid json * Refactor how cluster_name is set so that the period is excluded if set to an empty string.

* If the content or worker pods come up too fast and the postgres service is not ready, we get errors in those containers and it crash loops. This resolves that. * Move all of the configuration and variables needed to create the galaxy-server secret to the common role. Prior to this, we would see errors when applying the content and worker deployments because the server-secret did not yet exist. * Before applying the content and worker deployments, check to make sure the galaxy-server secret exists.

…cycle to pick up changes

- without this change, nodeport deployments failed because the web service was node created by the time get_node_ip.yml was run. - Re-enable all PR checks

rooftopcellist · 2024-04-30T03:11:51Z

@dsavineau @kurokobo @aknochow This PR is now ready for merge IMO. I was able to do a fresh install, upgrade (main --> pg15 branch) and backup and restore on k3s using mounted volumes. There are details in the commit history for the changes needed and why.

roles/backup/tasks/remove_management_pod.yml

roles/backup/templates/backup-content-k8s-job.yaml.j2

config/rbac/role.yaml

roles/galaxy-content/templates/galaxy-content.deployment.yaml.j2

* Add ability to configure backup_resource_requirements and restore_resource_requirements * Deprecate backup_pvc_namespace parameter * Only add file-storage-pvc checksum if file_storage is enabled

rooftopcellist · 2024-04-30T19:21:38Z

Ok, testing went well. There is one caveat, upstream users on k3s will have a manual step:

If you see the following error in the postgres pod's logs upon upgrade, or when running a restore after the fact, then you are likely running on k3s and need to enable the init container to set permissions on the postgresql directory.

$ kubectl -n galaxy logs statefulset/new-galaxy-postgres-15
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Follow these steps to remediate the issue by setting the postgres_data_volume_init parameter true and deleting the new postgresql StatefulSet.

# Patch your Galaxy custom resource
kubectl -n galaxy patch galaxy <deployment-name> --type=merge -p '{"spec": {"postgres_data_volume_init": true}}'

# Delete the new postgres stateful set
kubectl -n galaxy delete statefulset <deployment-name>-postgres-15

cc @kurokobo

I will be sure to include a note in the release notes.

As we are ready to merge this, I will cut an upstream release first since a few things have merged since the last one, so that community members could get those fixes without doing the database upgrade if desired.

kurokobo · 2024-05-01T01:34:50Z

@rooftopcellist
Thanks for pointing that out. Currently my guide does not include any upgrade guide for Galaxy Operator (I have only for AWX Operator), but you are right, some manualy interventions are required.
For AWX Operator, I was provided the guide to create new PV for PSQL 15 that includes chown 26:0 command, so if the users read my guide carefully (I believe so 😃 ), I believe users never face the issue: https://github.com/kurokobo/awx-on-k3s/blob/main/tips/upgrade-operator.md

Anyway I will give new Galaxy Operator try in next days. Thanks!

rooftopcellist · 2024-05-01T01:39:56Z

Ahh, even better! And should they miss a step and end up needing the initContainer to set the permissions to get the deployment back in a good state, they have the remediation steps on my earlier comment, which I'll include in the release notes.

Thanks for taking a look!

rooftopcellist · 2024-05-01T01:40:40Z

I plan to cut another release including the changes in this PR tomorrow morning.

kurokobo · 2024-05-01T14:14:02Z

@rooftopcellist
Made a minimal test and seems both 2024.4.30 and 2024.5.1 work fine😃 I've updated my guide to support 2024.5.1.
Thanks for working hard and releasing this with detailed release notes!

rooftopcellist · 2024-05-01T15:05:08Z

Wonderful news! Thanks for testing that out @kurokobo

rooftopcellist · 2024-05-09T05:10:52Z

roles/restore/tasks/deploy_galaxy.yml

@@ -14,66 +14,27 @@

 - name: Set custom resource spec variable from backup
  set_fact:
-    cr_spec: "{{ cr_object.stdout }}"
+    cr_spec_from_backup: "{{ cr_object.stdout }}"
    cr_spec_strip: "{ "


We should remove this cr_spec_strip var, it looks like it is no longer used.

Same with the other variables here that seem to have been for building the cr_spec... unless I've misunderstood this.

rooftopcellist force-pushed the pg15 branch from 992b3f0 to 363570e Compare March 7, 2024 22:33

rooftopcellist marked this pull request as draft March 7, 2024 22:33

rooftopcellist requested a review from aknochow March 8, 2024 02:41

rooftopcellist marked this pull request as ready for review March 8, 2024 02:41

rooftopcellist force-pushed the pg15 branch from 363570e to d33567f Compare March 8, 2024 02:53

kurokobo reviewed Mar 8, 2024

View reviewed changes

config/manager/manager.yaml Outdated Show resolved Hide resolved

rooftopcellist requested a review from dsavineau March 8, 2024 13:59

rooftopcellist force-pushed the pg15 branch 3 times, most recently from 5726c68 to fccdc09 Compare March 8, 2024 19:06

rooftopcellist force-pushed the pg15 branch 4 times, most recently from 3c8b29e to a10ad34 Compare March 11, 2024 18:31

dsavineau reviewed Mar 11, 2024

View reviewed changes

roles/postgres/defaults/main.yml Outdated Show resolved Hide resolved

roles/postgres/tasks/upgrade_postgres.yml Outdated Show resolved Hide resolved

roles/redis/defaults/main.yml Outdated Show resolved Hide resolved

rooftopcellist force-pushed the pg15 branch 2 times, most recently from caf718e to d4a7294 Compare March 12, 2024 13:38

dsavineau force-pushed the pg15 branch from d4a7294 to 2aea904 Compare March 12, 2024 13:38

dsavineau force-pushed the pg15 branch from 2aea904 to 272d38b Compare March 25, 2024 21:43

kurokobo mentioned this pull request Mar 27, 2024

Add logic for upgrading from Postgres 13 to 15 ansible/eda-server-operator#122

Merged

rooftopcellist force-pushed the pg15 branch from 272d38b to 2452f66 Compare March 28, 2024 00:24

rooftopcellist force-pushed the pg15 branch 2 times, most recently from 568f3a5 to 98536bf Compare April 4, 2024 19:08

rooftopcellist force-pushed the pg15 branch 3 times, most recently from a7f8f83 to e99c6b9 Compare April 30, 2024 00:39

rooftopcellist added 8 commits April 29, 2024 23:04

Turn off bash history expansion temporarily for restore

8d5ca16

* This makes fixes a bug where ! characters in the PGPASSWORD are not parsed correctly

Add --ansible-log-events flag to Dockerfile to make it easier to chan…

1ec057a

…ge verbosity * Options are documented here: https://sdk.operatorframework.io/docs/building-operators/ansible/reference/advanced_options/#using-ansible-log-events

Set new postgres configuration secret if managed database

aedd5b6

Rewrite backup and restore cr_object handling to store in yaml, not i…

b414529

…nvalid json * Refactor how cluster_name is set so that the period is excluded if set to an empty string.

Add checksum for secrets and configmaps to deployments so containers …

6215f54

…cycle to pick up changes

Refactor: Split Galaxy Server secret into a dedicated role

bbbc8b0

- without this change, nodeport deployments failed because the web service was node created by the time get_node_ip.yml was run. - Re-enable all PR checks

rooftopcellist force-pushed the pg15 branch from 3ec845f to bbbc8b0 Compare April 30, 2024 03:08

rooftopcellist mentioned this pull request Apr 30, 2024

Backup signing secret #113

Closed

dsavineau reviewed Apr 30, 2024

View reviewed changes

rooftopcellist force-pushed the pg15 branch from 197eac3 to 29a9806 Compare April 30, 2024 17:31

rooftopcellist requested a review from dsavineau April 30, 2024 17:34

dsavineau approved these changes Apr 30, 2024

View reviewed changes

rooftopcellist added 2 commits April 30, 2024 13:47

During backup, use the deployed application image specified on the CR

d72fe0a

* Add ability to configure backup_resource_requirements and restore_resource_requirements * Deprecate backup_pvc_namespace parameter * Only add file-storage-pvc checksum if file_storage is enabled

Always specify api_version with k8s tasks using the Pod resource

fe46e19

rooftopcellist force-pushed the pg15 branch from 29a9806 to fe46e19 Compare April 30, 2024 17:48

rooftopcellist merged commit 9035df1 into ansible:main Apr 30, 2024
5 checks passed

rooftopcellist mentioned this pull request May 1, 2024

Refactor galaxy-server secret logic #99

Closed

rooftopcellist commented May 9, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrading to PostgreSQL 15 and moving to sclorg images #80

Upgrading to PostgreSQL 15 and moving to sclorg images #80

rooftopcellist commented Mar 7, 2024 •

edited

Loading

rooftopcellist commented Mar 8, 2024

aknochow commented Mar 22, 2024

dsavineau commented Mar 25, 2024

kurokobo commented Mar 27, 2024

rooftopcellist commented Apr 1, 2024

kurokobo commented Apr 2, 2024

rooftopcellist commented Apr 3, 2024

rooftopcellist commented Apr 30, 2024

rooftopcellist commented Apr 30, 2024

kurokobo commented May 1, 2024

rooftopcellist commented May 1, 2024

rooftopcellist commented May 1, 2024

kurokobo commented May 1, 2024

rooftopcellist commented May 1, 2024

rooftopcellist May 9, 2024

rooftopcellist May 9, 2024

Upgrading to PostgreSQL 15 and moving to sclorg images #80

Upgrading to PostgreSQL 15 and moving to sclorg images #80

Conversation

rooftopcellist commented Mar 7, 2024 • edited Loading

SUMMARY

rooftopcellist commented Mar 8, 2024

aknochow commented Mar 22, 2024

dsavineau commented Mar 25, 2024

kurokobo commented Mar 27, 2024

rooftopcellist commented Apr 1, 2024

kurokobo commented Apr 2, 2024

rooftopcellist commented Apr 3, 2024

rooftopcellist commented Apr 30, 2024

rooftopcellist commented Apr 30, 2024

kurokobo commented May 1, 2024

rooftopcellist commented May 1, 2024

rooftopcellist commented May 1, 2024

kurokobo commented May 1, 2024

rooftopcellist commented May 1, 2024

rooftopcellist May 9, 2024

Choose a reason for hiding this comment

rooftopcellist May 9, 2024

Choose a reason for hiding this comment

rooftopcellist commented Mar 7, 2024 •

edited

Loading