Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db upgrage: upgrade postgres 15 to postgres 16 #1415

Closed
wants to merge 0 commits into from
Closed

Conversation

vh05
Copy link
Contributor

@vh05 vh05 commented Aug 19, 2024

The Postgres upgrade from 15 to 16 or any later versions can be done by adding the env variable named POSTGRES_UPGRADE = copy in the pg db pod.

More about upgrade process is here: https://www.postgresql.org/docs/current/pgupgrade.html

In our case, we achieve the upgrade by setting this env var in noobaa-db-pg STS.

Once after upgrade is done, we have to remove the POSTGRES_UPGRADE env from the STS. Otherwise pg db pod fails to come up because the db pod check PG directory version against upgrade image version and exits if there is no upgrade required and POSTGRES_UPGRADE is still provided.

To remove this env, we store the upgrade status in noobaa-core CR after the PG upgrade and refer the same status during db reconcile. If the upgrade is already done, we remove this env from desired sts status.

This will help to upgrade the db smoothly

@dannyzaken
Copy link
Contributor

@vh05, can you please improve the PR description? What are the changes, technical details on the env used, etc.

@vh05
Copy link
Contributor Author

vh05 commented Oct 7, 2024

@vh05, can you please improve the PR description? What are the changes, technical details on the env used, etc.

sure

Copy link
Contributor

@dannyzaken dannyzaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that instead of deciding on upgrade and applying changes inside SetDesiredNooBaaDB it will be more manageable, easy to follow, and less error-prone if you implement an UpgradeDB function that will be performed before the DB reconciliation. This method will ensure that everything is upgraded correctly, and will only return nil if the reconciliation is ready to proceed.

@vh05
Copy link
Contributor Author

vh05 commented Nov 12, 2024

I think that instead of deciding on upgrade and applying changes inside SetDesiredNooBaaDB it will be more manageable, easy to follow, and less error-prone if you implement an UpgradeDB function that will be performed before the DB reconciliation. This method will ensure that everything is upgraded correctly, and will only return nil if the reconciliation is ready to proceed.

Sure Danny. Let me check and do the changes

@vh05 vh05 force-pushed the pg16 branch 6 times, most recently from 999ccc6 to e0c629d Compare November 18, 2024 04:42
@vh05 vh05 requested a review from dannyzaken November 18, 2024 05:58
@vh05 vh05 force-pushed the pg16 branch 3 times, most recently from 5307eec to 7873068 Compare November 20, 2024 03:01
@vh05 vh05 requested a review from dannyzaken November 20, 2024 03:15
@vh05 vh05 force-pushed the pg16 branch 6 times, most recently from 88d449c to 4526d87 Compare November 22, 2024 09:31
@vh05 vh05 requested review from dannyzaken and nadavMiz November 22, 2024 09:39
Copy link
Contributor

@jackyalbo jackyalbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my comments, I think the main issue is to check for issues in the middle of the upgrade. I would recommend testing the flows with failing in the middle of every phase. try to delete the pod in the middle of upgrade/cleanning and so on, and make sure we recover from it in a good manner.

pkg/system/phase2_creating.go Outdated Show resolved Hide resolved
r.NooBaa.Status.PostgresUpdatePhase = nbv1.UpgradePhasePrepare
} else {
if r.NooBaa.Status.PostgresMajorVersion == nbv1.PostgresMajorVersionV16 {
r.NooBaa.Status.PostgresMajorVersion = nbv1.PostgresMajorVersionV16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I follow this... didn't we check this in the if statement?

Copy link
Contributor Author

@vh05 vh05 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for this condition, If the desired_image is still pg_15 and current_image = pg15 but major version is not 16 then we come in the else case. Then in this case we, assign phaseNone, so that we go to phaseNone case and check for desired image and do upgrade reqd

This is to distinguish, desired image pg_16_SOME_HASH and current_image = pg16 but major version is 16 then we need not upgrade but simply assign phaseFinished flag

pkg/system/phase2_creating.go Outdated Show resolved Hide resolved
pkg/system/phase2_creating.go Outdated Show resolved Hide resolved
return nil
}
// make sure previous step has finished
if dbPod.ObjectMeta.DeletionTimestamp != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this check is enough to make sure the upgrade finished successfully? I think we better check the pod status, check it's not it CLBO, other error, still initializing.

Copy link
Contributor Author

@vh05 vh05 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here. When the pod goes into ERROR or CLBO, most oberserved reason is either

  1. POSTGRES_COPY env removed in between upgrade or
  2. POSTGRES_COPY env is still present after upgrade.

I am not sure how to distinguish using the status. The pod logs gives clue about this but not sure how and whether we should read the logs and come to conclusion

We can't really revert the situation as the db upgrade is not in our control.

Ex: case 1: During upgrade to PG16, PG upgrade process changes the directory where the config etc present to POSTGRES_16 dir. If POSTGRES_COPY flag is removed at this stage it goes into CLBO. We have to place that flag at this moment to get the pod into running status. We can't really revert the upgrade state.

Likewise, case 2: Once after upgrade, we got remove the flag and not sure we can revert the status.

These are some 2 cases where we dont really decide upon pod status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think we should check as much as we can before moving to next phase. you set the env POSTGRES_COPY before the upgrade and the pod doen't come up, you will try this phase again and set it again, only when the pod is running with no errors we will assume the phase was successful. moving to the next phase will make us finish the upgrade and mark it as done while the upgrade completely failed.

r.Logger.Errorf("UpgradePostgresDB::got error on endpoints deployment reconcile %v", err)
return err
}
r.NooBaa.Status.PostgresMajorVersion = nbv1.PostgresMajorVersionV16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here maybe we should double-check the db pod is up and running before setting the upgrade to finished.

Copy link
Contributor Author

@vh05 vh05 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is CLBO or Error, Its not clear how to handle the errors as explained above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should at least fail the upgrade no? add an event? Finished should be set when we tried starting with v16 and to pod starts with no issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot fail the upgrade. Once the env to do the upgrade is set, the postgres will take the responsibility of upgrading. For any further errors we need to track the logs of the db pod

Comment on lines 380 to 417
if r.NooBaa.Spec.DBResources != nil {
c.Resources = *r.NooBaa.Spec.DBResources
}

c.Lifecycle = &corev1.Lifecycle{
PreStop: &corev1.LifecycleHandler{
Exec: &corev1.ExecAction{
Command: []string{"/bin/sh", "-c", "pg_ctl -D /var/lib/pgsql/data/userdata/ -w -t 60 -m fast stop"},
},
},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if this is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure of this. PG upgrade takes care of this. Just doing it explicitly that db upgrade can start safely. Your opinion ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to reconcile only the parts we want to change. and these parts are supposed to be already set. but verify me if it can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this part

pkg/system/phase2_creating.go Outdated Show resolved Hide resolved
pkg/util/util.go Outdated Show resolved Hide resolved
sts.Namespace = r.NooBaa.Namespace
stsObj, err := util.KubeGetObject(sts)
// Postgres database doesn't exists, no need to upgrade.
if err != nil {
Copy link
Contributor

@dannyzaken dannyzaken Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to perform this check outside of the switch, as this is not part of the upgrade.

you should perform all gating checks before starting the upgrade process: are we already in the desired version? are we on a new installation. did we already perform the upgrade?, etc.
Then, when we realize that we should start the upgrade, set the phase accordingly (preparing?) and then go into the switch.
Try not to use an empty phase as an expected value for the upgrade init. As mentioned before, we can have other values from previous upgrades

Copy link
Contributor Author

@vh05 vh05 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty phrase is used for the cases when we move from older version. On older version upgradephase flag was not used/updated.

sure. I will move this case and phaseNone case outside since both are doing the samething and start from phaseprepate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this part out of switch

@dannyzaken
Copy link
Contributor

@vh05 @vh05, it will be helpful to add a markdown document in the PR (under doc dir) similar to what we had for 15 upgrade.
In this doc, describe the primary upgrade function in pseudo-code so we can provide feedback on the design before you implement it into the code. It will simplify the implementation once all the steps are defined.

I believe that such a document will help us both with the implementation and as a reference for debugging later.

@vh05
Copy link
Contributor Author

vh05 commented Dec 17, 2024

@vh05 @vh05, it will be helpful to add a markdown document in the PR (under doc dir) similar to what we had for 15 upgrade. In this doc, describe the primary upgrade function in pseudo-code so we can provide feedback on the design before you implement it into the code. It will simplify the implementation once all the steps are defined.

I believe that such a document will help us both with the implementation and as a reference for debugging later.

Here is the PR for doc:
#1496

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants