ClusterPool Inventory #1672
Conversation
vendor/github.com/openshift/hive/apis/hive/v1/clusterdeploymentcustomization_types.go (outdated review thread, resolved)
}

func (r *ReconcileClusterPool) getCustomizedInstallConfigSecretRef(cd *hivev1.ClusterDeployment, customizationRef *corev1.LocalObjectReference, logger log.FieldLogger) error {
	// TODO: Am I doing this right?
Instead of patching the YAML secret, why don't you patch installConfigTemplate before passing it to the builder? This library might help you do it: https://github.com/evanphx/json-patch.
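For illustration, a rough sketch of that suggestion, not the PR's implementation: it assumes the template arrives as YAML and the patch is a set of RFC 6902 operations, and the helper name plus the use of sigs.k8s.io/yaml for conversion are my assumptions.

```go
package installconfig // hypothetical package for this sketch

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
	"sigs.k8s.io/yaml"
)

// patchInstallConfigTemplate converts the install-config template to JSON,
// applies an RFC 6902 patch with evanphx/json-patch, and converts the result
// back to YAML. Hypothetical helper for illustration only.
func patchInstallConfigTemplate(installConfigYAML, patchJSON []byte) ([]byte, error) {
	docJSON, err := yaml.YAMLToJSON(installConfigYAML)
	if err != nil {
		return nil, fmt.Errorf("converting install-config template to JSON: %w", err)
	}
	patch, err := jsonpatch.DecodePatch(patchJSON)
	if err != nil {
		return nil, fmt.Errorf("decoding patch: %w", err)
	}
	patchedJSON, err := patch.Apply(docJSON)
	if err != nil {
		return nil, fmt.Errorf("applying patch: %w", err)
	}
	return yaml.JSONToYAML(patchedJSON)
}
```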
That was my first implementation, but Builder merges the provided template with its own values or generates a new install config (https://github.com/openshift/hive/blob/master/pkg/clusterresource/builder.go#L201), and the merge process favors Builder values (pkg/clusterresource/builder.go, line 424 at 8dc97bc: func (o *Builder) mergeInstallConfigTemplate() (*corev1.Secret, error)). The patching itself is done with json-patch (see the applyPatches function).
The only things changed by the mergeInstallConfigTemplate function are name and baseDomain. Extracting and then patching data from the generated installConfigSecret, as currently written, looks like a tedious process to me. But yes, theoretically it should work. I would like to get a second opinion here. @2uasimojo @suhanime any thoughts?
@abraverm I gave a first pass through the code and added some review comments.
pkg/controller/clusterdeployment/clusterdeployment_controller.go (five outdated review threads, resolved)
cdLog.Infof("Deprovision request completed, releasing inventory customization") | ||
if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil { | ||
if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil { | ||
cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization") | ||
return reconcile.Result{}, err | ||
} | ||
} | ||
cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer") |
cdLog.Infof("Deprovision request completed, releasing inventory customization") | |
if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil { | |
if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil { | |
cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization") | |
return reconcile.Result{}, err | |
} | |
} | |
cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer") | |
cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizers") | |
if cd.Spec.ClusterPoolRef != nil && cd.Spec.ClusterPoolRef.ClusterDeploymentCustomizationRef != nil { | |
if err := r.releaseClusterDeploymentCustomization(cd, cdLog); err != nil { | |
cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error releasing inventory customization") | |
return reconcile.Result{}, err | |
} | |
} |
pkg/controller/clusterdeployment/clusterdeployment_controller.go (another outdated review thread, resolved)
"ClusterDeploymentCustomizationReleased", | ||
"Cluster Deployment Customization was released", |
I would go with a message that says the customization is available instead of mentioning that it was released.
Codecov Report
@@            Coverage Diff             @@
##           master    #1672      +/-   ##
==========================================
- Coverage   41.60%   41.58%   -0.02%
==========================================
  Files         355      362       +7
  Lines       33129    34117     +988
==========================================
+ Hits        13782    14189     +407
- Misses      18184    18725     +541
- Partials     1163     1203      +40
	for _, entry := range clp.Spec.Inventory {
		cdc := &hivev1.ClusterDeploymentCustomization{}
		ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
		r.Get(context.TODO(), ref, cdc)
I think the Get call here needs error handling.
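A minimal sketch of what that might look like, reusing the names from the snippet above; the exact logging and return behavior is an assumption:

```go
for _, entry := range clp.Spec.Inventory {
	cdc := &hivev1.ClusterDeploymentCustomization{}
	ref := types.NamespacedName{Namespace: clp.Namespace, Name: entry.Name}
	if err := r.Get(context.TODO(), ref, cdc); err != nil {
		// Surface the failure instead of silently continuing with an empty object.
		logger.WithError(err).WithField("customization", entry.Name).Error("error reading ClusterDeploymentCustomization")
		return err
	}
	// ...
}
```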
 	config := &hivev1.FailedProvisionConfig{}
 	if len(path) == 0 {
-		return nil, nil
+		return config, nil
Why do we need this change?
This is a bug I stumbled on. When running the controller locally (for development), it doesn't have any of the configuration that the Hive operator adds when it creates the controller: https://github.com/openshift/hive/blob/master/pkg/operator/hive/hive_controller.go#L516 (fpConfigHash -> failedProvisionConfigMapInfo). That code creates the environment variable FAILED_PROVISION_CONFIG_FILE, which points at the generated configuration.
The bug is that when that environment variable is not defined, readProvisionFailedConfig returns nil, and that causes a panic when a provision is retried, because it tries to access an attribute of a nil object.
I would prefer to run the controller locally the same way it is started by the Hive operator, which would have avoided this bug, but I don't know how to reproduce the configuration created by the operator properly.
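For context, a sketch of the shape of the fix under discussion; the env var name comes from the comment above, while the surrounding function body is abbreviated and approximate:

```go
// Return an empty (but non-nil) FailedProvisionConfig when the env var is
// unset, so callers that dereference the result don't panic on retries.
func readProvisionFailedConfig() (*hivev1.FailedProvisionConfig, error) {
	config := &hivev1.FailedProvisionConfig{}
	path := os.Getenv("FAILED_PROVISION_CONFIG_FILE")
	if len(path) == 0 {
		return config, nil
	}
	// ... read the file at path and unmarshal it into config ...
	return config, nil
}
```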
I would really appreciate it if you created a separate PR for this, as it is unrelated to this work.
 	default:
-		cdLog.Infof("DNSZone gone and deprovision request completed, removing finalizer")
+		cdLog.Infof("DNSZone gone, customization released and deprovision request completed, removing finalizer")
The customization will only be released if it exists, so this does not sound right to me.
You are right, and the same is true of the "DNSZone gone" part - it is true by default when managed DNS is disabled. So how about I remove both?
It's not great English, but I suspect "gone" was chosen to cover both "we removed it" and "it was never there". We could just use the same convention for the customization and not worry about it :)
		cd.Spec.ClusterPoolRef = &poolRef
		lastIndex := len(objs) - 1
		objs[i], objs[lastIndex] = objs[lastIndex], objs[i]
	}
	// Apply inventory customization
	if clp.Spec.Inventory != nil {
		for _, obj := range objs {
We already iterate through the objects once. Do we need to iterate again?
In the first iteration the ClusterDeployment was moved to the end of the slice, and with all the additional logic the entire block looked too complicated to me. I decided to separate them so it would be easier to maintain, at the cost of a duplicate iteration of a small loop.
I agree with Akhil that we ought to be able to loop just once.
But also agree with Alex that I don't want to see all 30LOC embedded in the previous loop.
But also, modifying slice order while iterating over the slice is not cool IMO (a latent bug). See this example -- we end up running over one object twice and ignoring another.
So let's do this:
- Factor those 30LOC out into a separate function, e.g. patchInstallConfig() (a possible shape is sketched after the loop example below)
- The above loop should:
- edit the CD (as it does today)
- patch the install config
- not reorder the slice
- Swizzle the slice order after the loop
So like:
var cdPos int
for i, obj := range objs {
	if cdObj, ok := obj.(*hivev1.ClusterDeployment); ok {
		cd = cdObj
		cdPos = i
		poolRef := poolReference(clp)
		if cdc != nil {
			poolRef.ClusterDeploymentCustomizationRef = &corev1.LocalObjectReference{Name: cdc.Name}
		}
		cd.Spec.ClusterPoolRef = &poolRef
	} else if secret := isInstallConfigSecret(obj); secret != nil {
		if err := patchInstallConfig(cdc, secret, logger); err != nil {
			return nil, err
		}
	}
}
// Move the ClusterDeployment to the end of the slice
lastIndex := len(objs) - 1
objs[cdPos], objs[lastIndex] = objs[lastIndex], objs[cdPos]
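And a rough sketch of what the factored-out patchInstallConfig could look like; the applyPatches helper and the install-config key are taken from the surrounding snippets, everything else is an assumption:

```go
// patchInstallConfig applies the customization's patches to the
// install-config secret in place. Sketch only, not the PR's implementation.
func patchInstallConfig(cdc *hivev1.ClusterDeploymentCustomization, secret *corev1.Secret, logger log.FieldLogger) error {
	if cdc == nil {
		// No customization selected for this cluster; nothing to patch.
		return nil
	}
	installConfig, ok := secret.StringData["install-config.yaml"]
	if !ok {
		return errors.New("install-config.yaml not found in secret")
	}
	patched, err := applyPatches(cdc, []byte(installConfig), logger)
	if err != nil {
		return err
	}
	secret.StringData["install-config.yaml"] = string(patched)
	return nil
}
```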
	var ok bool
	if ics, ok = obj.(*corev1.Secret); ok {
		installConfig, ok := ics.StringData["install-config.yaml"]
		if ok {
Let's have a function that, given an object, determines whether it is the install-config secret. That way we have more concise code.
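Something along these lines, assuming the objects are runtime.Objects and the install-config lives under the install-config.yaml key as in the snippet above:

```go
// isInstallConfigSecret returns the Secret if obj is an install-config
// secret, nil otherwise. Hypothetical helper matching the suggestion above.
func isInstallConfigSecret(obj runtime.Object) *corev1.Secret {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return nil
	}
	if _, hasInstallConfig := secret.StringData["install-config.yaml"]; !hasInstallConfig {
		return nil
	}
	return secret
}
```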
	if ok {
		newInstallConfig, err := r.getCustomizedInstallConfig([]byte(installConfig), cdc, logger)
		if err != nil {
			r.setInventoryValidCondition(clp, false, fmt.Sprintf("failed to customize with %s", cdc.Name), logger)
What happens when multiple customizations fail for the ClusterPool? How do we display that info on the InventoryInvalid condition?
Also, error handling is missing.
"What happens when multiple customizations fail for the ClusterPool? How do we display that info on the InventoryInvalid condition?"
It is really hard to make a meaningful message for an invalid inventory if we don't keep track of the individual status of each customization in the ClusterPool.
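One possible direction (a sketch, not what the PR does): collect every failing customization while iterating the inventory and report them all in a single condition message.

```go
// Aggregate failing customization names into one InventoryValid message.
// setInventoryValidCondition comes from the snippets above; customizeOne is a
// placeholder for the per-entry fetch-and-customize logic.
var failed []string
for _, entry := range clp.Spec.Inventory {
	if err := customizeOne(entry); err != nil {
		failed = append(failed, entry.Name)
	}
}
if len(failed) > 0 {
	msg := fmt.Sprintf("failed to customize with: %s", strings.Join(failed, ", "))
	r.setInventoryValidCondition(clp, false, msg, logger)
}
```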
	cdc.Status.Conditions = controllerutils.SetClusterDeploymentCustomizationCondition(
		cdc.Status.Conditions,
		hivev1.ClusterDeploymentCustomizationAvailableCondition,
		corev1.ConditionFalse,
		"CustomizationFailed",
		"Failed to customize install config",
		controllerutils.UpdateConditionIfReasonOrMessageChange,
	)
I do not see any status update on the customization for these changes to take effect.
Also, what/who is responsible for making the customization available again when the underlying errors are resolved?
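For reference, the kind of missing step being pointed at is roughly this, assuming a controller-runtime client on the reconciler:

```go
// Persist the condition change; without this the in-memory update is lost.
if err := r.Status().Update(context.TODO(), cdc); err != nil {
	logger.WithError(err).Error("failed to update ClusterDeploymentCustomization status")
	return err
}
```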
reason := "Valid" // Maybe a different readon? | ||
message := "Inventory customization succesfuly applied and reserved" |
reason := "Valid" // Maybe a different readon? | |
message := "Inventory customization succesfuly applied and reserved" | |
reason := "InventoryValid" | |
message := "Inventory is valid" |
		return nil, err
	}
	cloudBuilder := clusterresource.NewOpenStackCloudBuilderFromSecret(credsSecret)
	cloudBuilder.Cloud = platform.OpenStack.Cloud
This is also set in NewOpenStackCloudBuilderFromSecret. I think we can remove it from there, as it is not fetched from the secret.
Agreed, and also because it is a required field.
As noted above:
The finalizer fix looks right to me -- and actually seems entirely appropriate now that I see it in action, not hacky like I thought it would be. Let's get this thing landed so we can iterate without choking the GH UI. Amazing work @abraverm!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 2uasimojo, abraverm
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
@abraverm: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Goal: reduce e2e-pool wallclock time by ~35m.

Problem Statement: When ClusterPool inventory (ClusterDeploymentCustomization) testing was added to e2e-pool (4fddbe7 / openshift#1672), it triggered ClusterPool's staleness algorithm such that we were actually wasting a whole cluster while waiting for the real pool to become ready. Grab a cup of coffee...

To make the flow of the test a little bit easier, we were creating the real pool, then using its definition to generate the fake pool definition -- which does not have inventory -- and then adding inventory to the real pool. But if you add or change a pool's inventory, we mark all its clusters stale. So because of the flow above, when we initially created the real pool without inventory, it started provisioning a cluster. Then when we updated it (mere seconds later, if that), that cluster immediately became stale.

Now, the way we decided to architect replacement of stale clusters, we prioritize _having claimable clusters_ over _all clusters being current_. Thus in this scenario we were actually ending up waiting until the stale cluster was fully provisioned before deleting it and starting over with the (inventory-affected) cluster.

Solution: Create the real pool with an initial `size=0`. Scale it up to `size=1` _after_ adding the inventory.
Implementation of https://github.com/openshift/hive/blob/master/docs/enhancements/clusterpool-inventory.md
x-ref: https://issues.redhat.com/browse/HIVE-1367