
ClusterPool Inventory #1672

Merged: 27 commits merged into openshift:master on Aug 1, 2022

Conversation

@abraverm (Contributor) commented Jan 28, 2022

@openshift-ci openshift-ci bot requested review from 2uasimojo and joelddiaz January 28, 2022 14:42
apis/hive/v1/clusterpool_types.go (outdated; resolved)
apis/hive/v1/clusterpool_types.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
}

func (r *ReconcileClusterPool) getCustomizedInstallConfigSecretRef(cd *hivev1.ClusterDeployment, customizationRef *corev1.LocalObjectReference, logger log.FieldLogger) error {
	// TODO: Am I doing this right?
Contributor:

Instead of patching the YAML secret, why don't you patch installConfigTemplate before passing it to the builder? This library might help you do it: https://github.com/evanphx/json-patch.

Contributor Author:

That was my first implementation, but the Builder merges the provided template with its own values or generates a new install config (https://github.com/openshift/hive/blob/master/pkg/clusterresource/builder.go#L201), and the merge process favors the Builder's values (func (o *Builder) mergeInstallConfigTemplate() (*corev1.Secret, error)), so it would be counterintuitive for the user if the patches were overwritten by Hive's internal logic. That is why I have put the patching process at the very end of ClusterDeployment creation, and it uses json-patch (see the applyPatches function).

Contributor:

The only things changed by the mergeInstallConfigTemplate function are name and baseDomain. Extracting and then patching data from the generated installConfigSecret, as currently written, looks like a tedious process to me. But yes, theoretically it should work. I would like to get a second opinion here. @2uasimojo @suhanime, any thoughts?

@akhil-rane (Contributor)

@abraverm I gave the first pass through the code and added some review comments.


@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2022
pkg/clusterresource/openstack.go (outdated; resolved)
Comment on lines 1377 to 1384

	cdLog.Infof("Deprovision request completed, releasing inventory customization")
	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
			return reconcile.Result{}, err
		}
	}
	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
Contributor:

Suggested change

-	cdLog.Infof("Deprovision request completed, releasing inventory customization")
-	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
-		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
-			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
-			return reconcile.Result{}, err
-		}
-	}
-	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
+	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizers")
+	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
+		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
+			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
+			return reconcile.Result{}, err
+		}
+	}

Comment on lines 1441 to 1442
"ClusterDeploymentCustomizationReleased",
"Cluster Deployment Customization was released",
Contributor:

I would go with a message that says the customization is available instead of mentioning that it was released.

@codecov bot commented Feb 9, 2022

Codecov Report

Merging #1672 (1786bbc) into master (6c77c5f) will decrease coverage by 0.01%.
The diff coverage is 42.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1672      +/-   ##
==========================================
- Coverage   41.60%   41.58%   -0.02%     
==========================================
  Files         355      362       +7     
  Lines       33129    34117     +988     
==========================================
+ Hits        13782    14189     +407     
- Misses      18184    18725     +541     
- Partials     1163     1203      +40     
Impacted Files Coverage Δ
cmd/hiveadmission/main.go 0.00% <0.00%> (ø)
...ed/typed/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...ive/v1/fake/fake_clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...t/versioned/typed/hive/v1/fake/fake_hive_client.go 0.00% <0.00%> (ø)
...t/clientset/versioned/typed/hive/v1/hive_client.go 0.00% <0.00%> (ø)
pkg/client/informers/externalversions/generic.go 0.00% <0.00%> (ø)
...versions/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...nt/informers/externalversions/hive/v1/interface.go 0.00% <0.00%> (ø)
.../listers/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
pkg/clusterresource/openstack.go 91.30% <0.00%> (-8.70%) ⬇️
... and 19 more

	for _, entry := range clp.Spec.Inventory {
		cdc := &hivev1.ClusterDeploymentCustomization{}
		ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
		r.Get(context.TODO(), ref, cdc)
Contributor:

I think the Get call here needs error handling.
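
For illustration only, a hedged sketch of what that error handling could look like; getInventoryCustomizations is a hypothetical helper name, and it assumes k8s.io/apimachinery's api/errors package (as apierrors), the types package for NamespacedName, and a logrus FieldLogger.

// getInventoryCustomizations fetches every customization referenced by the
// pool's inventory, reporting missing entries instead of using empty objects.
func (r *ReconcileClusterPool) getInventoryCustomizations(clp *hivev1.ClusterPool, logger log.FieldLogger) ([]*hivev1.ClusterDeploymentCustomization, error) {
	result := []*hivev1.ClusterDeploymentCustomization{}
	for _, entry := range clp.Spec.Inventory {
		cdc := &hivev1.ClusterDeploymentCustomization{}
		ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
		if err := r.Get(context.TODO(), ref, cdc); err != nil {
			if apierrors.IsNotFound(err) {
				// A missing entry should surface on the pool (e.g. via the
				// inventory condition) rather than fail the whole reconcile.
				logger.WithField("customization", entry.Name).Warn("ClusterDeploymentCustomization not found")
				continue
			}
			return nil, err
		}
		result = append(result, cdc)
	}
	return result, nil
}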

Comment on lines 572 to 574

+	config := &hivev1.FailedProvisionConfig{}
 	if len(path) == 0 {
-		return nil, nil
+		return config, nil
Contributor:

why do we need this change?

Contributor Author:

A bug that I stumbled on. When running the controller locally (for development), it doesn't have any of the configuration that the Hive operator adds when it creates the controller: https://github.com/openshift/hive/blob/master/pkg/operator/hive/hive_controller.go#L516 (fpConfigHash -> failedProvisionConfigMapInfo). That creates the environment variable FAILED_PROVISION_CONFIG_FILE, which points at the created configuration. The bug is that when that environment variable is not defined, readProvisionFailedConfig returns nil, and that causes a panic when a provision is retried, because the caller tries to access an attribute on a nil object.

I would prefer to run the controller locally the same way it is started by the Hive operator, which would have avoided this bug, but I don't know how to properly reproduce the configuration the operator creates.
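
For reference, a hedged sketch of the resulting pattern (not the exact hive function; assumed imports are os and sigs.k8s.io/yaml): return an empty config instead of nil when the environment variable is unset, so callers never dereference nil.

// readProvisionFailedConfig loads FailedProvisionConfig from the file named by
// FAILED_PROVISION_CONFIG_FILE, or returns an empty config when the variable is
// unset (e.g. when running the controller locally).
func readProvisionFailedConfig() (*hivev1.FailedProvisionConfig, error) {
	config := &hivev1.FailedProvisionConfig{}
	path := os.Getenv("FAILED_PROVISION_CONFIG_FILE")
	if len(path) == 0 {
		return config, nil
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if err := yaml.Unmarshal(data, config); err != nil {
		return nil, err
	}
	return config, nil
}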

Contributor:

I would really appreciate it if you created a separate PR for this, as it is unrelated to this work.

	default:
-		cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
+		cdLog.Infof("DNSZone gone, customization released and deprovision request completed, removing finalizer")
Contributor:

The customization will only be released if it exists, so this does not sound right to me.

Contributor Author:

You are right, and it is also true for the "DNSZone gone" part - it is true by default when managed DNS is disabled. So how about I remove both?

Member:

It's not great English, but I suspect "gone" was chosen to cover both "we removed it" and "it was never there". We could just use the same convention for the customization and not worry about it :)


		cd.Spec.ClusterPoolRef = &poolRef
		lastIndex := len(objs) - 1
		objs[i], objs[lastIndex] = objs[lastIndex], objs[i]
	}
	// Apply inventory customization
	if clp.Spec.Inventory != nil {
		for _, obj := range objs {
Contributor:

We already iterate through the objects once. Do we need to iterate again?

Contributor Author:

In the first iteration the ClusterDeployment was moved to the end of the slice, and with all the additional logic the entire block looked too complicated to me. I decided to separate them so it would be easier to maintain, at the cost of a duplicate iteration of a small loop.

Member:

I agree with Akhil that we ought to be able to loop just once.
But I also agree with Alex that I don't want to see all 30 LOC embedded in the previous loop.
But also, modifying the slice order while iterating over the slice is not cool IMO (a latent bug). See this example -- we end up visiting one object twice and skipping another.
So let's do this:

  • Factor those 30 LOC out into a separate function, e.g. patchInstallConfig()
  • The above loop should:
    • edit the CD (as it does today)
    • patch the install config
    • not reorder the slice
  • Swizzle the slice order after the loop

So like:

	var cdPos int
	for i, obj := range objs {
		if cdTyped, ok := obj.(*hivev1.ClusterDeployment); ok {
			cd = cdTyped
			cdPos = i
			poolRef := poolReference(clp)
			if cdc != nil {
				poolRef.ClusterDeploymentCustomizationRef = &corev1.LocalObjectReference{Name: cdc.Name}
			}
			cd.Spec.ClusterPoolRef = &poolRef
		} else if secret := isInstallConfigSecret(obj); secret != nil {
			if err := patchInstallConfig(cdc, secret, logger); err != nil {
				return nil, err
			}
		}
	}
	// Move the ClusterDeployment to the end of the slice
	lastIndex := len(objs) - 1
	objs[cdPos], objs[lastIndex] = objs[lastIndex], objs[cdPos]

Comment on lines 750 to 753

	var ok bool
	if ics, ok = obj.(*corev1.Secret); ok {
		installConfig, ok := ics.StringData["install-config.yaml"]
		if ok {
Contributor:

Let's have a function that, given an object, determines whether it is the install-config secret. That way we have more concise code.
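
For illustration, a minimal sketch of such a helper; the name isInstallConfigSecret and the plain interface{} parameter are assumptions, not the PR's actual code.

// isInstallConfigSecret returns the Secret if the object is a Secret carrying
// an install-config.yaml entry, and nil otherwise.
func isInstallConfigSecret(obj interface{}) *corev1.Secret {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return nil
	}
	if _, hasInstallConfig := secret.StringData["install-config.yaml"]; !hasInstallConfig {
		return nil
	}
	return secret
}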

		if ok {
			newInstallConfig, err := r.getCustomizedInstallConfig([]byte(installConfig), cdc, logger)
			if err != nil {
				r.setInventoryValidCondition(clp, false, fmt.Sprintf("failed to customize with %s", cdc.Name), logger)
Contributor:

What happens when multiple customizations fail for a cluster pool? How do we display that info on the InventoryInvalid condition?

Contributor:

Also, error handling is missing.

Contributor Author:

What happens when multiple customizations fail for a cluster pool? How do we display that info on the InventoryInvalid condition?

It is really hard to produce a meaningful message for invalid inventory if we don't keep track of the individual status of each customization in the cluster pool.
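
For illustration, one hedged way to fold several failures into a single condition message; the helper name and reason strings are hypothetical, and the sketch assumes fmt, sort, strings, and corev1.

// inventoryConditionMessage summarizes per-customization results into one
// status/reason/message triple for the pool's inventory condition.
func inventoryConditionMessage(failed []string) (corev1.ConditionStatus, string, string) {
	if len(failed) == 0 {
		return corev1.ConditionTrue, "InventoryValid", "Inventory is valid"
	}
	sort.Strings(failed)
	return corev1.ConditionFalse, "CustomizationsFailed",
		fmt.Sprintf("Failed to apply customizations: %s", strings.Join(failed, ", "))
}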

Comment on lines 757 to 764

	cdc.Status.Conditions = controllerutils.SetClusterDeploymentCustomizationCondition(
		cdc.Status.Conditions,
		hivev1.ClusterDeploymentCustomizationAvailableCondition,
		corev1.ConditionFalse,
		"CustomizationFailed",
		"Failed to customize install config",
		controllerutils.UpdateConditionIfReasonOrMessageChange,
	)
Contributor:

I do not see any update of the customization for these changes to take effect.

Also, what/who is responsible for making the customization available again once the underlying errors are resolved?

Comment on lines 981 to 982

	reason := "Valid" // Maybe a different readon?
	message := "Inventory customization succesfuly applied and reserved"
Contributor:

Suggested change

-	reason := "Valid" // Maybe a different readon?
-	message := "Inventory customization succesfuly applied and reserved"
+	reason := "InventoryValid"
+	message := "Inventory is valid"

		return nil, err
	}
	cloudBuilder := clusterresource.NewOpenStackCloudBuilderFromSecret(credsSecret)
	cloudBuilder.Cloud = platform.OpenStack.Cloud
Contributor:

This is also set in NewOpenStackCloudBuilderFromSecret. I think we can remove it from there, as it is not fetched from the secret.

Contributor Author:

Agreed, and because it is a required field.

@abraverm (Contributor Author)

/retest-required

@abraverm (Contributor Author)

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2022
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2022
@abraverm changed the title from "ClusterPool Inventory - WIP" to "ClusterPool Inventory" on Feb 21, 2022
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@abraverm force-pushed the cp_onprem_pr branch 2 times, most recently from 792300e to 7f5d7d4 on February 23, 2022 12:18
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@2uasimojo (Member) commented Jul 26, 2022

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_hive/1672/pull-ci-openshift-hive-master-verify/1551895261679194112#1:build-log.txt%3A58

vet: pkg/controller/clusterpool/clusterpool_controller_test.go:1773:48: FindClusterPoolCondition not declared by package utils

As noted above:

Alas, you've straddled #1820 where we've collapsed all the Find*Condition funcs down to one using generics. Easy fix -- see that PR for examples.
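
For reference, a hedged sketch of the kind of generics-based helper being referred to; the signature of the real function from #1820 may differ, and the usage shown in the trailing comment is hypothetical.

// FindCondition returns a pointer to the first condition whose type matches
// condType, or nil if none does. typeOf extracts the Type field from a
// condition of the given (generic) condition struct.
func FindCondition[C any, T comparable](conditions []C, condType T, typeOf func(C) T) *C {
	for i := range conditions {
		if typeOf(conditions[i]) == condType {
			return &conditions[i]
		}
	}
	return nil
}

// Hypothetical usage in a test:
//   cond := FindCondition(clp.Status.Conditions, hivev1.ClusterPoolInventoryValidCondition,
//       func(c hivev1.ClusterPoolCondition) hivev1.ClusterPoolConditionType { return c.Type })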

@2uasimojo (Member)
The finalizer fix looks right to me -- and actually seems entirely appropriate now that I see it in action, not hacky like I thought it would be.

Let's get this thing landed so we can iterate without choking the GH UI.

Amazing work @abraverm!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 1, 2022
@openshift-ci bot commented Aug 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, abraverm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 1, 2022
@openshift-ci bot commented Aug 1, 2022

@abraverm: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 5daa572 into openshift:master Aug 1, 2022
2uasimojo added a commit to 2uasimojo/hive that referenced this pull request Aug 8, 2023
Goal: reduce e2e-pool wallclock time by ~35m.

Problem Statement: When ClusterPool inventory
(ClusterDeploymentCustomization) testing was added to e2e-pool (4fddbe7
/ openshift#1672), it triggered ClusterPool's staleness algorithm such that we
were actually wasting a whole cluster while waiting for the real pool to
become ready. Grab a cup of coffee...

To make the flow of the test a little bit easier, we were creating the
real pool, then using its definition to generate the fake pool
definition -- which does not have inventory -- and then adding inventory
to the real pool.

But if you add or change a pool's inventory, we mark all its clusters
stale. So because of the flow above, when we initially created the real
pool without inventory, it started provisioning a cluster. Then when we
updated it (mere seconds later, if that), that cluster immediately
became stale.

Now, the way we decided to architect replacement of stale clusters, we
prioritize _having claimable clusters_ over _all clusters being
current_. Thus in this scenario we were actually ending up waiting until
the stale cluster was fully provisioned before deleting it and starting
over with the (inventory-affected) cluster.

Solution: Create the real pool with an initial `size=0`. Scale it up to
`size=1` _after_ adding the inventory.