fix(feature): cleanup funcs should not rely on feature's data #1184

Merged

Conversation

bartoszmajsak
Contributor

@bartoszmajsak bartoszmajsak commented Aug 16, 2024

Description

Initially, Cleanup functions did not rely on loading Feature's data for invocation, as certain data was held as fields of the Feature struct, making it implicitly available. With the refactoring of Feature to become a "glorified map" container, this behavior has changed, leading to errors under certain circumstances.

With DSCI.ServiceMesh now being a pointer to a struct, there is a corner case where the (now needed) ServiceMesh spec is nil, leading to a panic.

When the reconciliation of DSC is triggered, it is invoked for each component, even if it is never defined or changed. In a simple case where DSCI does not have a specified ServiceMesh (nil instead of the default), and DSC has a single Managed component that is not KServe/Serverless, the reconciliation will trigger the removal of KServe resources regardless. This, in turn, will call for the removal of Serverless features, which rely on SMCP config when they are applied. With data now being loaded in the cleanup step, this leads to an error, as there is no SMCP config to read from.

By default, the cleanup logic removes the owner FeatureTracker for each Feature, and this does not require loading Feature's data, as FeatureTracker is used as OwnerReference for each resource created for a given Feature.
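To illustrate why removing the owner is enough, here is a toy in-memory model of owner-based garbage collection. It is not the Kubernetes garbage collector and uses no Kubernetes API; the `object` type, `deleteWithDependents`, and the resource names are all hypothetical. The point is only that the cascade needs the owner's identity, not any Feature data.

```go
package main

import "fmt"

// object is a hypothetical, heavily simplified model of a cluster resource:
// a name plus the names of the objects that own it.
type object struct {
	name   string
	owners []string
}

// anyPresent reports whether at least one of the named objects still exists.
func anyPresent(store map[string]object, names []string) bool {
	for _, n := range names {
		if _, ok := store[n]; ok {
			return true
		}
	}
	return false
}

// deleteWithDependents removes the named owner, then repeatedly removes every
// object that declares owners but has none of them left in the store.
func deleteWithDependents(store map[string]object, owner string) {
	delete(store, owner)
	for changed := true; changed; {
		changed = false
		for name, obj := range store {
			if len(obj.owners) > 0 && !anyPresent(store, obj.owners) {
				delete(store, name)
				changed = true
			}
		}
	}
}

// newStore builds an example store: one tracker, two resources it owns, and
// one unrelated resource. All names are made up for the example.
func newStore() map[string]object {
	return map[string]object{
		"tracker":   {name: "tracker"},
		"configmap": {name: "configmap", owners: []string{"tracker"}},
		"service":   {name: "service", owners: []string{"tracker"}},
		"unrelated": {name: "unrelated"},
	}
}

func main() {
	store := newStore()
	deleteWithDependents(store, "tracker")
	fmt.Println(len(store), "object(s) left")
}
```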

In addition, the API allows defining custom functions that can be invoked during the cleanup phase. There is one custom cleanup function requiring the Service Mesh control plane spec to remove part of the config introduced by its patch. The code in question only worked because it had default values to work on and was not failing if the patched object was not found.

The refactoring wrongly enforced the loading of Feature data during cleanup for this single case, and this cascaded to other Feature sets where it was unnecessary, exposing this faulty behavior.

With this change, no reference to Feature is enforced at the API level by introducing the CleanupFunc(ctx, client) type for defining custom cleanup logic. The custom patch function has been reworked to comply with this.

How Has This Been Tested?

Create the following DSCI and DSC:

apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default
spec:
  applicationsNamespace: opendatahub
  trustedCABundle:
    managementState: Removed
  monitoring:
    managementState: Managed
    namespace: opendatahub
---
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default
spec:
  components:
    modelregistry:
      managementState: "Managed"
Observe the panic:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]                                                                                                           
    panic: runtime error: invalid memory address or nil pointer dereference                                                                                                                   
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x243ac21]                                                                                                                       
                                                                                                                                                                                              
goroutine 597 [running]:                                                                                                                                                                      
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()                                                                                                        
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1e5                                                                      
panic({0x26904e0?, 0x3e1baa0?})                                                                                                                                                               
    /usr/lib/golang/src/runtime/panic.go:914 +0x21f                                                                                                                                           
github.com/opendatahub-io/opendatahub-operator/v2/pkg/feature/servicemesh.glob..func1.1({0xc003f6d810?, 0x7?}, {0xc003f6d820?, 0xf?})                                                         
    /workspace/pkg/feature/servicemesh/data.go:35 +0x21                                                                                                                                       
github.com/opendatahub-io/opendatahub-operator/v2/pkg/feature.DataEntry[...].func1(0xc00355e500?)                                                                                             
    /workspace/pkg/feature/feature_data.go:17 +0x65                                                                                                                                           
github.com/opendatahub-io/opendatahub-operator/v2/pkg/feature.(*Feature).Cleanup(0xc00355e500, {0x2c8ef30, 0xc0033e1b00})                                                                     
    /workspace/pkg/feature/feature.go:146 +0x154                                                                                                                                              
github.com/opendatahub-io/opendatahub-operator/v2/pkg/feature.(*FeaturesHandler).Delete(0xc000ecbf20, {0x2c8ef30, 0xc0033e1b00})                                                              
    /workspace/pkg/feature/handler.go:87 +0x165                                                                                                                                               
github.com/opendatahub-io/opendatahub-operator/v2/components/kserve.(*Kserve).removeServerlessFeatures(0xc0004a59e8, {0x2c8ef30, 0xc0033e1b00}, 0xc0000194a0)                                 
    /workspace/components/kserve/kserve_config_handler.go:155 +0x185                                                                                                                          
github.com/opendatahub-io/opendatahub-operator/v2/components/kserve.(*Kserve).ReconcileComponent(0xc0004a59e8, {0x2c8ef30, 0xc0033e1b00}, {0x2c98728, 0xc000016480}, {{0x2c940c0?, 0xc000711bf0?}, 0x3ff0000000000000?}, {0x2ca3f70, 0xc0006a0a80}, ...)
    /workspace/components/kserve/kserve.go:110 +0x25f                                                                                                                                         
github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).reconcileSubComponent(0xc000019450, {0x2c8ef30, 0xc0033e1b00}, 0xc0006a0a80, {0x7f52b80c8d30, 0xc0004a59e8})
    /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:317 +0x2c6                                                                                                     
github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).Reconcile(0xc000019450, {0x2c8ef30, 0xc0033e1b00}, {{{0x0?, 0x0?}, {0xc003fb7a76?, 0xe64a85?}}})
    /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:243 +0xf77                                                                                                     
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2c8ef30?, {0x2c8ef30?, 0xc0033e1b00?}, {{{0x0?, 0x25886e0?}, {0xc003fb7a76?, 0x2c7dde8?}}})                  
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122 +0xb7                                                                       
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000484780, {0x2c8ef68, 0xc000018730}, {0x27422c0?, 0xc003520120?})                                   
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323 +0x368                                                                      
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000484780, {0x2c8ef68, 0xc000018730})                                                             
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274 +0x1c9                                                                      
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()                                                                                                          
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x79                                                                       
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 132                                                                                  
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0x565

Deploy image quay.io/bmajsak/opendatahub-operator:cleanup-fix with this fix

.... no error \o/

Screenshot or short clip

[Image: PANIC]

[Image: panic]

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@openshift-ci openshift-ci bot requested review from grdryn and Sara4994 August 16, 2024 12:40
@zdtsw zdtsw requested review from VaishnaviHire and removed request for grdryn and Sara4994 August 16, 2024 13:06

openshift-ci bot commented Aug 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit a88cef0 into opendatahub-io:incubation Aug 16, 2024
8 checks passed
zdtsw pushed a commit to zdtsw-forking/rhods-operator that referenced this pull request Aug 29, 2024
…tahub-io#1184)


(cherry picked from commit a88cef0)
openshift-merge-bot bot pushed a commit to red-hat-data-services/rhods-operator that referenced this pull request Aug 29, 2024
…tahub-io#1184)


(cherry picked from commit a88cef0)