Upgrading to PostgreSQL 15 and moving to sclorg images #80
Conversation
Force-pushed from 5726c68 to fccdc09
I am about to be OOO for 2 weeks, so just a note to anyone who merges this. We should not squash the commits here as some of the changes here we may want to port to other operators (specifically the change that checks both paths for PG_VERSION).
Force-pushed from 3c8b29e to a10ad34
Force-pushed from caf718e to d4a7294
Looking into why those 2 checks are failing...
@rooftopcellist I rebased this PR and added two commits (same as the eda operator) in order to get the upgrade working.
Hi, thanks for working on this! In changing PSQL from docker.io to sclorg, some considerations need to be made.
AWX Operator 2.13.0 has adopted sclorg's PSQL, but there are a number of open issues, so it may be safer for Galaxy Operator and EDA Server Operator to collectively address them after things settle down.
Thank you for the patches @dsavineau! @kurokobo Thanks for your notes. I think I have addressed all of these now:
I still need to try to reproduce the UID 26 permissions issue you mentioned (link). I don't see it on my openshift cluster, but that makes sense because permissions are handled differently there. I'll try to reproduce on a k8s cluster.
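For context on the UID 26 issue: the sclorg postgresql image runs as UID 26, so with a host-mounted volume the data directory typically needs its ownership fixed before the database starts. A minimal sketch of an init container that does this (image reference, names, and mount path are illustrative, not the operator's actual template):

```yaml
initContainers:
  - name: init-postgres-data          # illustrative name
    image: quay.io/sclorg/postgresql-15-c9s:latest   # illustrative image reference
    command: ["sh", "-c", "chown -R 26:26 /var/lib/pgsql/data"]
    securityContext:
      runAsUser: 0                    # must run as root to change ownership
    volumeMounts:
      - name: postgres-data
        mountPath: /var/lib/pgsql/data
```

On OpenShift this step is usually unnecessary because fsGroup/SCC handling adjusts volume permissions automatically, which matches the observation above.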
Just added a link to my comment on the issue: ansible/awx-operator#1770 (comment)
I have now added that in. I also just included the changes to make it possible to initialize the postgresql data volume with the correct permissions when running on k8s with a hostMounted volume.
Force-pushed from 568f3a5 to 98536bf
Force-pushed from a7f8f83 to e99c6b9
…ontext
- Set runAsUser to 1000 (galaxy user) for management pods in k8s so that it can access the contents of /var/lib/pulp
- Add initContainer for copying content from /var/lib/pulp during backups
- Add separate k8s job for copying content for file storage during restores
- Add rbac rules for cronjobs and jobs to operator SA
- In k8s, set content pod user to 1000 like in the api deployment
- Set UID 1000 in k8s for backup and restore management pods
- Add a ttl for the k8s jobs so that the content PVC can be deleted if desired without ownerReference issues
* This fixes a bug where `!` characters in the PGPASSWORD are not parsed correctly
…nvalid json
* Refactor how cluster_name is set so that the period is excluded if set to an empty string.
* If the content or worker pods come up too fast and the postgres service is not ready, we get errors in those containers and it crash loops. This resolves that.
* Move all of the configuration and variables needed to create the galaxy-server secret to the common role. Prior to this, we would see errors when applying the content and worker deployments because the server-secret did not yet exist.
* Before applying the content and worker deployments, check to make sure the galaxy-server secret exists.
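A guard like the secret-existence check described here could be sketched in Ansible roughly as follows (a hypothetical sketch: `ansible_operator_meta` follows operator-sdk conventions, and the secret name and retry counts are illustrative):

```yaml
- name: Check that the galaxy-server secret exists before applying deployments
  kubernetes.core.k8s_info:
    kind: Secret
    namespace: "{{ ansible_operator_meta.namespace }}"
    name: "{{ ansible_operator_meta.name }}-server"   # illustrative secret name
  register: server_secret
  retries: 30
  delay: 5
  until: server_secret.resources | length > 0
```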
…cycle to pick up changes
- Without this change, nodeport deployments failed because the web service was not created by the time get_node_ip.yml was run.
- Re-enable all PR checks
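The nodeport race could be avoided with a wait before the node IP lookup, along these lines (a hypothetical sketch; the service name and timing values are assumptions):

```yaml
- name: Wait for the web service to exist before running get_node_ip.yml
  kubernetes.core.k8s_info:
    kind: Service
    namespace: "{{ ansible_operator_meta.namespace }}"
    name: "{{ ansible_operator_meta.name }}-web-svc"  # illustrative service name
  register: web_svc
  retries: 12
  delay: 5
  until: web_svc.resources | length > 0
```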
@dsavineau @kurokobo @aknochow This PR is now ready for merge IMO. I was able to do a fresh install, upgrade (main --> pg15 branch), and backup and restore on k3s using mounted volumes. There are details in the commit history for the changes needed and why.
* Add ability to configure backup_resource_requirements and restore_resource_requirements
* Deprecate backup_pvc_namespace parameter
* Only add file-storage-pvc checksum if file_storage is enabled
Ok, testing went well. There is one caveat, upstream users on k3s will have a manual step: If you see the following error in the postgres pod's logs upon upgrade, or when running a restore after the fact, then you are likely running on k3s and need to enable the init container to set permissions on the postgresql directory.
Follow these steps to remediate the issue by setting the postgres_data_volume_init parameter to true and deleting the new postgresql StatefulSet.
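The remediation described amounts to two steps, sketched below (the resource names are illustrative, not the exact objects the operator creates):

```yaml
# Step 1: enable the data volume init container on the custom resource spec
spec:
  postgres_data_volume_init: true

# Step 2: delete the new postgresql StatefulSet so the operator recreates it
# with the init container, for example:
#   kubectl delete statefulset <deployment-name>-postgres-15
```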
cc @kurokobo I will be sure to include a note in the release notes. Since we are ready to merge this, I will cut an upstream release first, as a few things have merged since the last one, so that community members can get those fixes without doing the database upgrade if desired.
@rooftopcellist Anyway, I will give the new Galaxy Operator a try in the next few days. Thanks!
Ahh, even better! And should they miss a step and end up needing the initContainer to set the permissions to get the deployment back in a good state, they have the remediation steps in my earlier comment, which I'll include in the release notes. Thanks for taking a look!
I plan to cut another release including the changes in this PR tomorrow morning. |
@rooftopcellist |
Wonderful news! Thanks for testing that out @kurokobo |
@@ -14,66 +14,27 @@
```yaml
- name: Set custom resource spec variable from backup
  set_fact:
    cr_spec: "{{ cr_object.stdout }}"
    cr_spec_from_backup: "{{ cr_object.stdout }}"
    cr_spec_strip: "{ "
```
We should remove this `cr_spec_strip` var, it looks like it is no longer used.
Same with the other variables here that seem to have been for building the cr_spec... unless I've misunderstood this.
SUMMARY
Upgrading from PostgreSQL 13 to 15.
Major changes
- Add `postgres_keep_pvc_after_upgrade: false`, which means the old PG13 PVC will be deleted after upgrade by default. This is for k8s deployments only.
- Add postgres init container if `postgres_data_volume_init` is true. This is aimed to solve the issue where users may need to chmod or chown the postgres data volume for user 26, which is the user that is running postgres in the sclorg image.
For example, one can now set the following on the AWX spec:
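The spec snippet itself appears to have been lost in rendering; based on the parameters introduced above, it presumably looks something like this (a sketch, not the exact snippet from the PR):

```yaml
spec:
  postgres_data_volume_init: true
  postgres_keep_pvc_after_upgrade: true
```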
Minor changes
- … `POSTGRES_*` env vars from secrets so when pods cycle values are updated - commit
- … (`/var/lib/pgsql/data/userdata`) - commit
- Set runAsUser to 1000 (galaxy user) for management pods in k8s so that it can access the contents of /var/lib/pulp
- Add initContainer for copying content from /var/lib/pulp during backups
- Add separate k8s job for copying content for file storage during restores
- Add a ttl for the k8s jobs so that the content PVC can be deleted if desired without ownerReference issues
- Fix a bug where `!` characters in the `PGPASSWORD` are not parsed correctly - commit
- Add `--ansible-log-events` flag to Dockerfile to make it easier to change verbosity - commit
- Refactor `cr_object` handling to store in yaml instead of json that can be invalid in some scenarios - commit
- Refactor how cluster_name is set so that the period is excluded if set to an empty string.
- If the content or worker pods come up too fast and the postgres service is not ready, we get errors in those containers and it crash loops. This resolves that.
- Move all of the configuration and variables needed to create the galaxy-server secret to the common role. Prior to this, we would see errors when applying the content and worker deployments because the server-secret did not yet exist.
- Before applying the content and worker deployments, check to make sure the galaxy-server secret exists.
- Without this change, nodeport deployments failed because the web service was not created by the time get_node_ip.yml was run.
- Add ability to configure backup_resource_requirements and restore_resource_requirements
- … `api_version` with k8s tasks using the Pod resource - commit