
ClusterPool Inventory #1672

Merged: 27 commits merged into openshift:master on Aug 1, 2022

Conversation

@abraverm (Contributor) commented Jan 28, 2022

@openshift-ci openshift-ci bot requested review from 2uasimojo and joelddiaz January 28, 2022 14:42
apis/hive/v1/clusterpool_types.go (outdated; resolved)
apis/hive/v1/clusterpool_types.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
pkg/controller/clusterpool/clusterpool_controller.go (outdated; resolved)
}

func (r *ReconcileClusterPool) getCustomizedInstallConfigSecretRef(cd *hivev1.ClusterDeployment, customizationRef *corev1.LocalObjectReference, logger log.FieldLogger) error {
	// TODO: Am I doing this right?
Contributor:

Instead of patching the YAML secret, why don't you patch installConfigTemplate before passing it to the builder? This library might help you do it: https://github.com/evanphx/json-patch.

Contributor Author:

That was my first implementation, but the Builder merges the provided template with its own values or generates a new install config (https://github.com/openshift/hive/blob/master/pkg/clusterresource/builder.go#L201), and the merge process favors the Builder's values (func (o *Builder) mergeInstallConfigTemplate() (*corev1.Secret, error)), so it would be counterintuitive for the user if the patches were overwritten by Hive's internal logic. That is why I have put the patching process at the very end of ClusterDeployment creation, and it uses json-patch (see the applyPatches function).

Contributor:

The only things changed by the mergeInstallConfigTemplate function are name and baseDomain. Extracting and then patching data from the generated installConfigSecret, as currently written, looks like a tedious process to me. But yes, theoretically it should work. I would like to get a second opinion here. @2uasimojo @suhanime, any thoughts?

@akhil-rane (Contributor)

@abraverm I gave the first pass through the code and added some review comments.


@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2022
pkg/clusterresource/openstack.go (outdated; resolved)
Comment on lines 1377 to 1384

	cdLog.Infof("Deprovision request completed, releasing inventory customization")
	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
			return reconcile.Result{}, err
		}
	}
	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
Contributor:

Suggested change

-	cdLog.Infof("Deprovision request completed, releasing inventory customization")
-	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
-		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
-			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
-			return reconcile.Result{}, err
-		}
-	}
-	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
+	cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizers")
+	if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil {
+		if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil {
+			cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization")
+			return reconcile.Result{}, err
+		}
+	}

Comment on lines 1441 to 1442
"ClusterDeploymentCustomizationReleased",
"Cluster Deployment Customization was released",
Contributor:

I would go with a message that says the customization is available instead of mentioning that it was released.

@codecov bot commented Feb 9, 2022

Codecov Report

Merging #1672 (1786bbc) into master (6c77c5f) will decrease coverage by 0.01%.
The diff coverage is 42.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1672      +/-   ##
==========================================
- Coverage   41.60%   41.58%   -0.02%     
==========================================
  Files         355      362       +7     
  Lines       33129    34117     +988     
==========================================
+ Hits        13782    14189     +407     
- Misses      18184    18725     +541     
- Partials     1163     1203      +40     
Impacted Files Coverage Δ
cmd/hiveadmission/main.go 0.00% <0.00%> (ø)
...ed/typed/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...ive/v1/fake/fake_clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...t/versioned/typed/hive/v1/fake/fake_hive_client.go 0.00% <0.00%> (ø)
...t/clientset/versioned/typed/hive/v1/hive_client.go 0.00% <0.00%> (ø)
pkg/client/informers/externalversions/generic.go 0.00% <0.00%> (ø)
...versions/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
...nt/informers/externalversions/hive/v1/interface.go 0.00% <0.00%> (ø)
.../listers/hive/v1/clusterdeploymentcustomization.go 0.00% <0.00%> (ø)
pkg/clusterresource/openstack.go 91.30% <0.00%> (-8.70%) ⬇️
... and 19 more

	for _, entry := range clp.Spec.Inventory {
		cdc := &hivev1.ClusterDeploymentCustomization{}
		ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
		r.Get(context.TODO(), ref, cdc)
Contributor:

I think the Get call here needs error handling.
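
For illustration only, a hedged sketch of what that error handling could look like; getInventoryCustomizations is a hypothetical helper name, and it assumes k8s.io/apimachinery's api/errors package (as apierrors), the types package for NamespacedName, and a logrus FieldLogger.

// getInventoryCustomizations fetches every customization referenced by the
// pool's inventory, reporting missing entries instead of using empty objects.
func (r *ReconcileClusterPool) getInventoryCustomizations(clp *hivev1.ClusterPool, logger log.FieldLogger) ([]*hivev1.ClusterDeploymentCustomization, error) {
	result := []*hivev1.ClusterDeploymentCustomization{}
	for _, entry := range clp.Spec.Inventory {
		cdc := &hivev1.ClusterDeploymentCustomization{}
		ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
		if err := r.Get(context.TODO(), ref, cdc); err != nil {
			if apierrors.IsNotFound(err) {
				// A missing entry should surface on the pool (e.g. via the
				// inventory condition) rather than fail the whole reconcile.
				logger.WithField("customization", entry.Name).Warn("ClusterDeploymentCustomization not found")
				continue
			}
			return nil, err
		}
		result = append(result, cdc)
	}
	return result, nil
}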

Comment on lines 572 to 574

+	config := &hivev1.FailedProvisionConfig{}
 	if len(path) == 0 {
-		return nil, nil
+		return config, nil
Contributor:

why do we need this change?

Contributor Author:

A bug that I stumbled on. When running the controller locally (for development), it doesn't have any of the configuration that the Hive operator adds when it creates the controller: https://github.com/openshift/hive/blob/master/pkg/operator/hive/hive_controller.go#L516 (fpConfigHash -> failedProvisionConfigMapInfo). That creates the environment variable FAILED_PROVISION_CONFIG_FILE, which points at the created configuration. The bug is that when that environment variable is not defined, readProvisionFailedConfig returns nil, and that causes a panic when a provision is retried, because the caller tries to access an attribute on a nil object.

I would prefer to run the controller locally the same way it is started by the Hive operator, which would have avoided this bug, but I don't know how to properly reproduce the configuration the operator creates.
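
For reference, a hedged sketch of the resulting pattern (not the exact hive function; assumed imports are os and sigs.k8s.io/yaml): return an empty config instead of nil when the environment variable is unset, so callers never dereference nil.

// readProvisionFailedConfig loads FailedProvisionConfig from the file named by
// FAILED_PROVISION_CONFIG_FILE, or returns an empty config when the variable is
// unset (e.g. when running the controller locally).
func readProvisionFailedConfig() (*hivev1.FailedProvisionConfig, error) {
	config := &hivev1.FailedProvisionConfig{}
	path := os.Getenv("FAILED_PROVISION_CONFIG_FILE")
	if len(path) == 0 {
		return config, nil
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if err := yaml.Unmarshal(data, config); err != nil {
		return nil, err
	}
	return config, nil
}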

Contributor:

I would really appreciate it if you created a separate PR for this, as it is unrelated to this work.

	default:
-		cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
+		cdLog.Infof("DNSZone gone, customization released and deprovision request completed, removing finalizer")
Contributor:

The customization will only be released if it exists, so this does not sound right to me.

Contributor Author:

You are right, and it is also true for the "DNSZone gone" part - it is true by default when managed DNS is disabled. So how about I remove both?

Member:

It's not great English, but I suspect "gone" was chosen to cover both "we removed it" and "it was never there". We could just use the same convention for the customization and not worry about it :)


		cd.Spec.ClusterPoolRef = &poolRef
		lastIndex := len(objs) - 1
		objs[i], objs[lastIndex] = objs[lastIndex], objs[i]
	}
	// Apply inventory customization
	if clp.Spec.Inventory != nil {
		for _, obj := range objs {
Contributor:

We already iterate through the objects once. Do we need to iterate again?

Contributor Author:

In the first iteration the ClusterDeployment was moved to the end of the slice, and with all the additional logic the entire block looked too complicated to me. I decided to separate them so it would be easier to maintain, at the cost of a duplicate iteration of a small loop.

Member:

I agree with Akhil that we ought to be able to loop just once.
But I also agree with Alex that I don't want to see all 30 LOC embedded in the previous loop.
But also, modifying the slice order while iterating over the slice is not cool IMO (a latent bug). See this example -- we end up visiting one object twice and skipping another.
So let's do this:

  • Factor those 30 LOC out into a separate function, e.g. patchInstallConfig()
  • The above loop should:
    • edit the CD (as it does today)
    • patch the install config
    • not reorder the slice
  • Swizzle the slice order after the loop

So like:

	var cdPos int
	for i, obj := range objs {
		if cdTyped, ok := obj.(*hivev1.ClusterDeployment); ok {
			cd = cdTyped
			cdPos = i
			poolRef := poolReference(clp)
			if cdc != nil {
				poolRef.ClusterDeploymentCustomizationRef = &corev1.LocalObjectReference{Name: cdc.Name}
			}
			cd.Spec.ClusterPoolRef = &poolRef
		} else if secret := isInstallConfigSecret(obj); secret != nil {
			if err := patchInstallConfig(cdc, secret, logger); err != nil {
				return nil, err
			}
		}
	}
	// Move the ClusterDeployment to the end of the slice
	lastIndex := len(objs) - 1
	objs[cdPos], objs[lastIndex] = objs[lastIndex], objs[cdPos]

Comment on lines 750 to 753

	var ok bool
	if ics, ok = obj.(*corev1.Secret); ok {
		installConfig, ok := ics.StringData["install-config.yaml"]
		if ok {
Contributor:

Let's have a function that, given an object, determines whether it is the install-config secret. That way we have more concise code.
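
For illustration, a minimal sketch of such a helper; the name isInstallConfigSecret and the plain interface{} parameter are assumptions, not the PR's actual code.

// isInstallConfigSecret returns the Secret if the object is a Secret carrying
// an install-config.yaml entry, and nil otherwise.
func isInstallConfigSecret(obj interface{}) *corev1.Secret {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return nil
	}
	if _, hasInstallConfig := secret.StringData["install-config.yaml"]; !hasInstallConfig {
		return nil
	}
	return secret
}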

		if ok {
			newInstallConfig, err := r.getCustomizedInstallConfig([]byte(installConfig), cdc, logger)
			if err != nil {
				r.setInventoryValidCondition(clp, false, fmt.Sprintf("failed to customize with %s", cdc.Name), logger)
Contributor:

What happens when multiple customizations fail for a cluster pool? How do we display that info on the InventoryInvalid condition?

Contributor:

Also, error handling is missing.

Contributor Author:

What happens when multiple customizations fail for a cluster pool? How do we display that info on the InventoryInvalid condition?

It is really hard to produce a meaningful message for invalid inventory if we don't keep track of the individual status of each customization in the cluster pool.
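
For illustration, one hedged way to fold several failures into a single condition message; the helper name and reason strings are hypothetical, and the sketch assumes fmt, sort, strings, and corev1.

// inventoryConditionMessage summarizes per-customization results into one
// status/reason/message triple for the pool's inventory condition.
func inventoryConditionMessage(failed []string) (corev1.ConditionStatus, string, string) {
	if len(failed) == 0 {
		return corev1.ConditionTrue, "InventoryValid", "Inventory is valid"
	}
	sort.Strings(failed)
	return corev1.ConditionFalse, "CustomizationsFailed",
		fmt.Sprintf("Failed to apply customizations: %s", strings.Join(failed, ", "))
}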

Comment on lines 757 to 764

	cdc.Status.Conditions = controllerutils.SetClusterDeploymentCustomizationCondition(
		cdc.Status.Conditions,
		hivev1.ClusterDeploymentCustomizationAvailableCondition,
		corev1.ConditionFalse,
		"CustomizationFailed",
		"Failed to customize install config",
		controllerutils.UpdateConditionIfReasonOrMessageChange,
	)
Contributor:

I do not see any update of the customization for these changes to take effect.

Also, what/who is responsible for making the customization available again once the underlying errors are resolved?

Comment on lines 981 to 982

	reason := "Valid" // Maybe a different readon?
	message := "Inventory customization succesfuly applied and reserved"
Contributor:

Suggested change

-	reason := "Valid" // Maybe a different readon?
-	message := "Inventory customization succesfuly applied and reserved"
+	reason := "InventoryValid"
+	message := "Inventory is valid"

		return nil, err
	}
	cloudBuilder := clusterresource.NewOpenStackCloudBuilderFromSecret(credsSecret)
	cloudBuilder.Cloud = platform.OpenStack.Cloud
Contributor:

This is also set in NewOpenStackCloudBuilderFromSecret. I think we can remove it from there, as it is not fetched from the secret.

Contributor Author:

Agreed, and because it is a required field.

@abraverm (Contributor Author)

/retest-required

@abraverm (Contributor Author)

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2022
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2022
@abraverm changed the title from "ClusterPool Inventory - WIP" to "ClusterPool Inventory" on Feb 21, 2022
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@abraverm force-pushed the cp_onprem_pr branch 2 times, most recently from 792300e to 7f5d7d4 on February 23, 2022 12:18
@abraverm (Contributor Author)

/retest

1 similar comment

@abraverm (Contributor Author)

/retest

@2uasimojo (Member) commented Jul 26, 2022

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_hive/1672/pull-ci-openshift-hive-master-verify/1551895261679194112#1:build-log.txt%3A58

vet: pkg/controller/clusterpool/clusterpool_controller_test.go:1773:48: FindClusterPoolCondition not declared by package utils

As noted above:

Alas, you've straddled #1820 where we've collapsed all the Find*Condition funcs down to one using generics. Easy fix -- see that PR for examples.
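
For reference, a hedged sketch of the kind of generics-based helper being referred to; the signature of the real function from #1820 may differ, and the usage shown in the trailing comment is hypothetical.

// FindCondition returns a pointer to the first condition whose type matches
// condType, or nil if none does. typeOf extracts the Type field from a
// condition of the given (generic) condition struct.
func FindCondition[C any, T comparable](conditions []C, condType T, typeOf func(C) T) *C {
	for i := range conditions {
		if typeOf(conditions[i]) == condType {
			return &conditions[i]
		}
	}
	return nil
}

// Hypothetical usage in a test:
//   cond := FindCondition(clp.Status.Conditions, hivev1.ClusterPoolInventoryValidCondition,
//       func(c hivev1.ClusterPoolCondition) hivev1.ClusterPoolConditionType { return c.Type })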

@2uasimojo (Member)
The finalizer fix looks right to me -- and actually seems entirely appropriate now that I see it in action, not hacky like I thought it would be.

Let's get this thing landed so we can iterate without choking the GH UI.

Amazing work @abraverm!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 1, 2022
@openshift-ci bot commented Aug 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, abraverm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 1, 2022
@openshift-ci bot commented Aug 1, 2022

@abraverm: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 5daa572 into openshift:master Aug 1, 2022
2uasimojo added a commit to 2uasimojo/hive that referenced this pull request Aug 8, 2023
Goal: reduce e2e-pool wallclock time by ~35m.

Problem Statement: When ClusterPool inventory
(ClusterDeploymentCustomization) testing was added to e2e-pool (4fddbe7
/ openshift#1672), it triggered ClusterPool's staleness algorithm such that we
were actually wasting a whole cluster while waiting for the real pool to
become ready. Grab a cup of coffee...

To make the flow of the test a little bit easier, we were creating the
real pool, then using its definition to generate the fake pool
definition -- which does not have inventory -- and then adding inventory
to the real pool.

But if you add or change a pool's inventory, we mark all its clusters
stale. So because of the flow above, when we initially created the real
pool without inventory, it started provisioning a cluster. Then when we
updated it (mere seconds later, if that), that cluster immediately
became stale.

Now, the way we decided to architect replacement of stale clusters, we
prioritize _having claimable clusters_ over _all clusters being
current_. Thus in this scenario we were actually ending up waiting until
the stale cluster was fully provisioned before deleting it and starting
over with the (inventory-affected) cluster.

Solution: Create the real pool with an initial `size=0`. Scale it up to
`size=1` _after_ adding the inventory.