Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(upgrade): add support to migrate jiva volumes resources to openebs namespace #1646

Merged
merged 5 commits into from
Apr 6, 2020

Conversation

shubham14bajpai
Copy link
Contributor

Signed-off-by: shubham [email protected]

What this PR does / why we need it:
This PR adds support to migrate the old jiva volume resources to openebs namespace.

Test job logs:

mayadata:upgrade$ k logs -f jiva-vol-170180-pvc-fc34fbbd-6661-4f40-9-39a3-qjgtt
I0402 11:54:14.024428       1 jiva_volume.go:83] Started upgrading jivaVolume{pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af} from 1.6.0 to 1.9.0-RC1
I0402 11:54:14.024504       1 jiva_volume.go:90] Upgrading to 1.9.0-RC1
I0402 11:54:14.117842       1 jiva_upgrade.go:413] Scaling down old deployments
I0402 11:54:14.117872       1 helper.go:325] Scaling deploy pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-ctrl in default namespace to 0
I0402 11:54:29.171796       1 helper.go:325] Scaling deploy pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-rep in default namespace to 0
I0402 11:54:44.241182       1 jiva_upgrade.go:695] splitting replica deployment
I0402 11:54:44.244362       1 jiva_upgrade.go:717] creating replica deployment pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-rep-1 in openebs namespace
I0402 11:54:49.270392       1 jiva_upgrade.go:717] creating replica deployment pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-rep-2 in openebs namespace
I0402 11:54:54.305663       1 jiva_upgrade.go:717] creating replica deployment pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-rep-3 in openebs namespace
I0402 11:54:59.352312       1 jiva_upgrade.go:762] creating controller deployment pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-ctrl in openebs namespace
I0402 11:55:04.389390       1 jiva_upgrade.go:781] removing controller service pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-ctrl-svc in default namespace
I0402 11:55:04.425101       1 jiva_upgrade.go:786] creating controller service pvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af-ctrl-svc in openebs namespace
I0402 11:55:09.522279       1 jiva_upgrade.go:318] Verifying replica sync
I0402 11:55:15.686415       1 jiva_upgrade.go:350] Synced replica quorum is reached
I0402 11:55:43.949692       1 jiva_upgrade.go:354] Replica syncing complete
I0402 11:55:43.991904       1 jiva_upgrade.go:654] Upgrade Successful forpvc-be1d95bf-205d-49e2-8bc6-83f4c8ba42af
I0402 11:55:44.002157       1 jiva_volume.go:106] Upgraded successfully

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Checklist:

  • Fixes #
  • Labelled this PR & related issue with documentation tag
  • PR messages has document related information
  • Labelled this PR & related issue with breaking-changes tag
  • PR messages has breaking changes related information
  • Labelled this PR & related issue with requires-upgrade tag
  • PR messages has upgrade related information
  • Commit has unit tests
  • Commit has integration tests

@kmova kmova added this to the 1.9 milestone Apr 2, 2020
@kmova
Copy link
Contributor

kmova commented Apr 2, 2020

How does the code handle the following cases:

  • The previous version jiva volume was deployed with "DeployInOpenEBSNamespace=true"
  • If there is a failure after the previous target service is deleted.
  • And in general if during migrate, the job fails and on subsequent restarts it doesn't come backup up and user wants to revert to old version. This case need not be fixed, but at least the steps to recover from this state should be documented.

@@ -411,6 +409,20 @@ func (j *jivaVolumeOptions) preupgrade(pvName, openebsNamespace string) error {
return err
}

if currentVersion < "1.9.0" {
klog.Infof("Scaling down old deployments")
err = j.preMigration(openebsNamespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name preMigration as something to say if less than 1.9.0

if currentVersion < "1.9.0" {
klog.Infof("Scaling down old deployments")
err = j.preMigration(openebsNamespace)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of failures, do we need to revert the partial upgrade that has been done so that operations go fine? or would that be out of scope of upgrade jobs?
Did we handled such things like reverting partial upgrade in any previous failures?
is this scope for higher level operator (or) need to follow blue-green deployments?

return nil
}

func (j *jivaVolumeOptions) migrateReplica(openebsNamespace string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this function name as well to reflect version < 1.9.0

// Scaling down controller deployment before patching replica deployment
// if the replica is not upgraded already.
if j.replicaObj.version == currentVersion {
err = scaleDeploy(j.controllerObj.name, j.ns, ctrlDeployLabel, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved to part of preUpgarde rather than being part of replicaUpgrade?

func jivaUpgrade(pvName, openebsNamespace string, utaskObj *utask.UpgradeTask) (*utask.UpgradeTask, error) {

var (
pvLabel = "openebs.io/persistent-volume=" + pvName
err error
pvLabel = "openebs.io/persistent-volume=" + pvName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert these hardcoded strings as constants like apis.PersistentVolumeCPK

return err
}
nodeNames := deployObj.Spec.Template.Spec.Affinity.NodeAffinity.
RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchExpressions[0].Values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be sure that first element is the one that we always want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see if-else in CASTemplates

RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchExpressions[0].Values
var one int32 = 1
for i, node := range nodeNames {
replicaDeploy := deployObj.DeepCopy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add validation for each nodeName to verify that it exists

// updating the replica deployments image and version before
// creating them in openebs namespace.
lastIndex := strings.LastIndex(replicaDeploy.Spec.Template.Spec.Containers[0].Image, ":")
replicaDeploy.Spec.Template.Spec.Containers[0].Image = replicaDeploy.Spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be avoided and let regular patch takes care of updating image name, version etc?

replicaDeploy.Name = replicaDeploy.Name + "-" + strconv.Itoa(i+1)
replicaDeploy.Namespace = openebsNamespace
replicaDeploy.ResourceVersion = ""
replicaDeploy.Spec.Replicas = &one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as part of postMigration i.e., after replicaPatch is done, scale up the deployments

return err
}
var one int32 = 1
deployObj.Spec.Replicas = &one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as in migrateReplica applies here

if err != nil {
return err
}
svcList, err := serviceClient.WithNamespace(j.ns).List(metav1.ListOptions{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to migrateService

if err != nil {
return err
}
klog.Infof("creating controller service %s in %s namespace", svcObj.Name, openebsNamespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if upgrade job dies after deleting service? we will be losing the IP..
is it possible to create service with same IP but in disabled mode?
If so, first create in disabled mode, delete old one and then enable new one?

}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code structure doesn't look neat..
better to allow old code to get executed by reducing code in new path as provided in comments there
Same applies for target also

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its also required to allow executing existing code path as 'patch' may keep changing as we make new releases..
By combining upgrade code into migration path, we will miss the patches that need to be done in new releases.

Copy link
Contributor

@vishnuitta vishnuitta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My other suggestion is:

  • As backward compatibility is already there, document the steps for migration and let user perform the steps

- segregate migration from upgarde

Signed-off-by: shubham <[email protected]>
return nil, err
}
svcObj := &corev1.Service{}
svcObj.ObjectMeta = metav1.ObjectMeta{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about getting the existing service object and then doing a deep copy, patching the namespace. I guess this logic is written to account for the case where the previous service object is already deleted and then possible job also is restarted and will have to now go ahead and create the new service.

Can there be an alternate solution like:

  • create a temporary service - with a new name and deleting the older cluster ip.
  • delete orig service
  • create new service, by copying from temp service and using the target ip (from PV) as the cluster ip
  • as part of cleanup, delete the temporary service.

}
if err == nil {
klog.Info("getting old deployment")
if deployObj.Name == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can never happen

if j.ns == openebsNamespace {
replicasLeft = replicasLeft - 1
}
if len(deployList.Items) != replicaCount {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a chance where old replica deployment got scaled down and upgrade process got restarted without completing upgrade?
If such thing happens, replicaCount will be 0 here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of depending on spec.Replicas to get replicationFactor, better to depend on len of nodeNames in NodeAffinity..

RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchExpressions[0].Values = []string{nodeNames[i]}
klog.Infof("creating replica deployment %s in %s namespace", replicaDeploy.Name, openebsNamespace)
replicaDeploy, err := deployClient.WithNamespace(openebsNamespace).Create(replicaDeploy)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by handling 'IsAlreadyExists' error, we can just keep creating on every iteration without much logic in getting the count.
Current code of getting count of replicas to be created is also fine.

return err
}
replicaCount := int(*oldDeployObj.Spec.Replicas)
replicasLeft := len(deployList.Items)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable name should be replicasCreated?


func (j *jivaVolumeOptions) migrateTarget(pvName, openebsNamespace string) error {
// get the controller deployment in openebs namespace
deployObj, err := deployClient.WithNamespace(openebsNamespace).Get(j.controllerObj.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing Get, by just checking 'j.ns' for openebsNamespace and returning should be fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.. this Get call helps in case of upgrade restart also..

func (j *jivaVolumeOptions) migrateTarget(pvName, openebsNamespace string) error {
// get the controller deployment in openebs namespace
deployObj, err := deployClient.WithNamespace(openebsNamespace).Get(j.controllerObj.name)
if err != nil && !k8serror.IsNotFound(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return for err as nil also, and can avoid next 'if' condition

if j.replicaObj.version == currentVersion {
err = scaleTargetDeploy(j.controllerObj.name, j.ns, 0)
if currentVersion < "1.9.0" {
err = scaleDeploy(j.controllerObj.name, j.ns, ctrlDeployLabel, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want to scale down controller post 1.9 also?


func (j *jivaVolumeOptions) migrateTargetSVC(pvName, openebsNamespace string) error {
// migrate service only if service not in openebs namespace
if j.ns != openebsNamespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do reversing the condition

return nil
}

func getTargetSVC(pvName, openebsNamespace string) (*corev1.Service, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs validation with CASTemplate..
@utkarshmani1997 can you do this review?

if j.replicaObj.version == currentVersion && j.replicaObj.name != "" {
klog.Info("cleaning old replica deployment")
err = deployClient.WithNamespace(j.ns).Delete(j.replicaObj.name, &metav1.DeleteOptions{})
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle isNotFound error as not error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the job restarts and the replica deployment is missing the getReplica function will not populate replicaObj.name and the check j.replicaObj.name != "" will skip the deletion.

return err
}
}
if j.controllerObj.version == currentVersion && j.ns != openebsNamespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if upgrade job gets restarted at this point, replicaObj.version will not be set in getReplica call.. and all the upgrade process is dependent on this field replicaObj.version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same problem with replicaObj.name also

Copy link
Contributor

@utkarshmani1997 utkarshmani1997 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if err != nil {
return nil, err
}
replicaObj.replicas[deployObj.Name] = version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider adding logs at required places

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logs like those in patchReplica would be good to be there in new functions

@@ -59,6 +60,23 @@ func getOpenEBSVersion(d *appsv1.Deployment) (string, error) {
return d.Labels["openebs.io/version"], nil
}

func checkOpenEBSVersion(d *appsv1.Deployment) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function be called isValidVersion? Also there is a message in this function about replica.

return patchDetails, nil
}

func getReplica(pvName, replicaLabel, namespace, openebsNamespace string) (*replicaDetails, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_: just to be clear, the variable namespace can be renamed as volumeNamespace

metav1.ListOptions{
LabelSelector: controllerLabel,
})
func (j *jivaVolumeOptions) preReplicaUpgradeLessThan190(pvName, openebsNamespace string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments for the functions

"openebs.io/persistent-volume": pvName,
"openebs.io/persistent-volume-claim": pvObj.Spec.ClaimRef.Name,
"pvc": pvObj.Spec.ClaimRef.Name,
"openebs.io/version": currentVersion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this supposed to be UpgradeVersion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.. we are updating this in patchService fn

})
func (j *jivaVolumeOptions) preReplicaUpgradeLessThan190(pvName, openebsNamespace string) (string, error) {
if currentVersion < "1.9.0" {
err := scaleDeploy(j.replicaObj.name, j.ns, replicaDeployLabel, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can replicaObj.name be nil here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added appropriate checks.

}
replicaObj.name = deployObj.Name
version, err := getOpenEBSVersion(deployObj)
func getPatchDetailsForDeploy(pvName string, deployObj *appsv1.Deployment) (*replicaPatchDetails, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_: As this is specific to Replica Patch. Can we change this to getPatchDetailsForReplicaDeploy

// check if old replica is present for currentVersion < 1.9.0
// if present then migration is not complete and store the old
// replica details
if currentVersion < "1.9.0" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above comments are good..
add few more like
'replicaObj.name and replicaObj.version would be empty if old replica got deleted as part of upgrade.
So, later on code uses replicaObj.name to perform replica related migration.'


klog.Infof("splitting replica deployment")
var zero int32
for i := replicasCreated; i < replicaCount; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment above this for loop to say that:
replica deployment pv-name-rep will be split into multiple replicas like pv-name-rep-1, pv-name-rep-2,... pv-name-rep-n, where n is the replica count for this volume.


func (j *jivaVolumeOptions) cleanup(openebsNamespace string) error {
var err error
if j.replicaObj.version == currentVersion && j.replicaObj.name != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add a check on the namespace here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace check is not required as irrespective of namespace either openebs or pvc, the old deployment pvName-rep is deleted after splitting it.

}
}
if j.controllerObj.version == currentVersion && j.ns != openebsNamespace {
klog.Info("cleaning old replica deployment")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_: this should be deleting controller(or target) deployment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -584,6 +611,18 @@ func (j *jivaVolumeOptions) verify(pvLabel, openebsNamespace string) error {
}
return err
}
if currentVersion < "1.9.0" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be possible that the job dies after cleaning up old replica deployments. So when the job restarts, will it still know that it has to go and cleanup the pending target deployment. One thought is -- just call the j.cleanup anyway without version check, and that cleanup will only delete stale entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If job dies after cleanup is successful it will not attempt to delete the deployment as it will not be able to find old deployment and the replicaObj.name variable will be empty and we check on that before deleting the deployment.

Copy link
Contributor

@vishnuitta vishnuitta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes are good

@vishnuitta vishnuitta merged commit d1c471e into openebs-archive:master Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants