ROX-15941: Run multiple operators from FSS #951

Merged: 30 commits, Apr 28, 2023
242c381
Run multiple operators from FSS
kurlov Apr 13, 2023
64516ec
Refactor installation
kurlov Apr 13, 2023
d38aa18
Refactor chart isntallation
kurlov Apr 13, 2023
d17d882
Add TODO
kurlov Apr 14, 2023
935fa2c
Review comments
kurlov Apr 14, 2023
0740bdf
Review comments
kurlov Apr 19, 2023
c70bfc9
Add rollback doc
kurlov Apr 19, 2023
99bc121
Merge remote-tracking branch 'origin/main' into akurlov/ROX-15941_run…
kurlov Apr 19, 2023
43e3c5f
Fix deployment
kurlov Apr 19, 2023
3d95a06
Update fleetshard/README.md
kurlov Apr 19, 2023
cf717fe
Fix typo
kurlov Apr 19, 2023
818ed8f
Ignore long tag version for operator deployment
kurlov Apr 19, 2023
5ddc9ab
Update fleetshard/pkg/central/operator/upgrade.go
kurlov Apr 20, 2023
5e387b4
Update fleetshard/pkg/runtime/runtime.go
kurlov Apr 20, 2023
3697f26
Fail fast on parsing images
kurlov Apr 20, 2023
f40e752
Add TODO to disable installing operator subscription
kurlov Apr 20, 2023
73e743b
Extend rollback in README.md
kurlov Apr 20, 2023
0b8dd59
Set TODOs for dropping acs operator template
kurlov Apr 21, 2023
38bd083
Add re-terraform in case of rollback
kurlov Apr 21, 2023
7087bd4
Log namespace
kurlov Apr 21, 2023
d6b8a9d
Drop CRD check
kurlov Apr 24, 2023
c296bbc
Apply suggestions from code review
kurlov Apr 27, 2023
e525a86
Review coments
kurlov Apr 27, 2023
5b126c9
Update fleetshard/pkg/central/operator/upgrade_test.go
kurlov Apr 27, 2023
23e5e8e
Update fleetshard/pkg/central/charts/data/rhacs-operator/templates/rh…
kurlov Apr 27, 2023
9469605
Fix formatting
kurlov Apr 28, 2023
433b272
Add note about deleting OperatorGroup
kurlov Apr 28, 2023
2eb72bf
Rename unique set for unique images
kurlov Apr 28, 2023
6ac190d
Update fleetshard/pkg/central/operator/upgrade.go
kurlov Apr 28, 2023
6babd77
Rename deployment prefix and pass it completely
kurlov Apr 28, 2023
@@ -1,3 +1,4 @@
## TODO(ROX-16646): drop acs-operator.yaml template
{{- if .Values.acsOperator.enabled }}
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
1 change: 1 addition & 0 deletions dp-terraform/helm/rhacs-terraform/terraform_cluster.sh
@@ -100,6 +100,7 @@ if [[ "${OPERATOR_USE_UPSTREAM}" == "true" ]]; then
OPERATOR_SOURCE="rhacs-operators"
fi

# TODO(ROX-16645): set acsOperator.enabled to false
invoke_helm "${SCRIPT_DIR}" rhacs-terraform \
--namespace rhacs \
--set acsOperator.enabled=true \
1 change: 1 addition & 0 deletions dp-terraform/helm/rhacs-terraform/test.sh
@@ -5,6 +5,7 @@ CLUSTER_ID="test-clusterId"
FM_ENDPOINT="127.0.0.1:443"
OCM_TOKEN="example-token"

# TODO(ROX-16645): set acsOperator.enabled to false
helm template rhacs-terraform \
--debug \
--namespace rhacs \
98 changes: 98 additions & 0 deletions fleetshard/README.md
@@ -81,3 +81,101 @@ STATIC_TOKEN=<generated value | bitwarden value> \
AUTH_TYPE=STATIC_TOKEN \
./dev/env/scripts/exec_fleetshard_sync.sh
```

### Manage ACS Operator(s)

The Fleetshard-sync service can manage installation and update
of the ACS Operator based on the running and desired versions of ACS instances.
ACS Operator management by Fleetshard-sync is intended to replace the OLM-based approach.

#### Rollout installation/update of ACS Operator:

1. Make sure that the OLM subscription for the ACS Operator is deleted.
OLM uses the Subscription resource to subscribe to the latest version of an operator,
and it reinstalls a new version of the operator even if the operator’s ClusterServiceVersion (CSV) was deleted earlier.
Deleting the ACS Operator subscription tells OLM that new versions of the operator must not be installed.
```
kubectl get subscription -n <operator_namespace>
kubectl delete subscription <subscription> -n <operator_namespace>
```

2. Delete the Operator’s ClusterServiceVersion.
The ClusterServiceVersion contains all the information that OLM needs to manage an operator,
and it effectively represents an operator that is installed on the cluster.

```
kubectl get clusterserviceversion -n <operator_namespace>
kubectl delete clusterserviceversion rhacs-operator.<version> -n <operator_namespace>
```

3. Delete the Operator’s OperatorGroup.
```
kubectl get OperatorGroup -n <operator_namespace>
kubectl delete OperatorGroup <operator_group> -n <operator_namespace>
```

4. Check that no ACS Operator pod is running

```
kubectl get pods -n <operator_namespace>
NAME READY STATUS RESTARTS AGE
```

5. Turn on the ACS Operator management feature flag

Set `FEATURE_FLAG_UPGRADE_OPERATOR_ENABLED` to `true` and redeploy the Fleetshard-sync service.

6. Check that the ACS Operator is running again

```
kubectl get pods -n <operator_namespace>
NAME READY STATUS RESTARTS AGE
rhacs-operator-controller-manager-3.74.1-5765676ffc-l9bpp 2/2 Running 0 13s
...
```

7. Check the deployment

```
kubectl get deployments -n <operator_namespace>
NAME READY UP-TO-DATE AVAILABLE AGE
rhacs-operator-controller-manager-3.74.1 1/1 1 1 27s
```

#### Rollback installation/update of ACS Operator:

1. Redeploy Fleetshard-sync with the feature flag disabled by setting the `FEATURE_FLAG_UPGRADE_OPERATOR_ENABLED=false` environment variable
2. Delete existing ACS Operator deployment(s)

```
kubectl get deployments -n <operator_namespace>
kubectl delete deployment <deployment> -n <operator_namespace>
```

Also delete the metrics Service and the ServiceAccount:
```
kubectl delete service rhacs-operator-controller-manager-metrics-service -n <operator_namespace>
kubectl delete serviceAccount rhacs-operator-controller-manager -n <operator_namespace>
```

3. Check that no ACS Operator pods are running

```
kubectl get pods -n <operator_namespace>
NAME
...
```

4. Re-terraform the cluster
```
./terraform_cluster.sh <environment> <cluster>
```

5. Check that ACS Operator is running

```
kubectl get pods -n <operator_namespace>
NAME READY STATUS RESTARTS AGE
rhacs-operator-controller-manager-688d74ffb5-lkbm7 2/2 Running 0 13s
...
```
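The "check that the operator is (not) running" steps in both procedures can be scripted. The following is a minimal sketch, not part of the PR: it assumes the plain-text column layout shown in the sample `kubectl get pods` output above (NAME, READY, STATUS, ...), and the function name `runningOperatorPods` and the `rhacs-operator-` prefix argument are hypothetical choices for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// runningOperatorPods scans the text output of `kubectl get pods` and returns
// the names of pods whose name starts with prefix and whose STATUS column is
// "Running". The column order (NAME READY STATUS ...) is an assumption taken
// from the sample output in the README.
func runningOperatorPods(kubectlOutput, prefix string) []string {
	var pods []string
	scanner := bufio.NewScanner(strings.NewReader(kubectlOutput))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Skip the header row, blank lines, and rows without a STATUS column.
		if len(fields) < 3 || fields[0] == "NAME" {
			continue
		}
		if strings.HasPrefix(fields[0], prefix) && fields[2] == "Running" {
			pods = append(pods, fields[0])
		}
	}
	return pods
}

func main() {
	sample := "NAME READY STATUS RESTARTS AGE\n" +
		"rhacs-operator-controller-manager-3.74.1-5765676ffc-l9bpp 2/2 Running 0 13s\n"
	fmt.Println(runningOperatorPods(sample, "rhacs-operator-"))
}
```

An empty result corresponds to the "no running operator" checks, and a non-empty result to the "operator is running again" checks.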
@porridge (Collaborator), Apr 18, 2023:

Overall, what is the plan for keeping this file in sync with the upstream operator releases? Who will copy the changes across the repos going forward, and when and how?

Member Author:

CRDs will be downloaded directly from the main stackrox repository.
I don't have a good solution for the deployment yet. There will be an alert for a failed operator start, which could help spot a deployment that is incorrect for a particular version.

@@ -1,10 +1,14 @@
## Iterate over operator versions passed by fleet-shard sync
{{- range .Values.operator.images }}
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: rhacs-operator
control-plane: controller-manager
name: rhacs-operator-controller-manager
## Name field must contain up to 63 characters
## https://www.rfc-editor.org/rfc/rfc1123
name: {{ $.Values.operator.deploymentPrefix | lower }}{{ .tag | lower }}
namespace: stackrox-operator
spec:
replicas: 1
@@ -78,13 +82,11 @@ spec:
containerName: manager
resource: limits.memory
divisor: '0'
- name: OPERATOR_CONDITION_NAME
value: rhacs-operator.v3.74.0
{{ if .Values.operator.centralLabelSelector -}}
{{- if $.Values.operator.centralLabelSelector -}}
Collaborator:

Suggested change
{{- if $.Values.operator.centralLabelSelector -}}
{{- if $.Values.operator.centralLabelSelector -}}

Collaborator:

I think this still needs to be removed

- name: CENTRAL_LABEL_SELECTOR
value: {{ .Values.operator.centralLabelSelector | quote }}
value: "rhacs.redhat.com/tag={{ substr 0 64 (lower .tag) }},rhacs.redhat.com/repository={{ substr 0 64 (lower .repository )}}"
{{- end }}
Collaborator:

Suggested change
{{- end }}
{{- end }}

image: {{ .Values.operator.image }}
image: "{{ .repository }}:{{ .tag }}"
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
@@ -130,3 +132,5 @@ spec:
serviceAccount: rhacs-operator-controller-manager
serviceAccountName: rhacs-operator-controller-manager
terminationGracePeriodSeconds: 10
---
{{- end }}
@@ -5,7 +5,7 @@ metadata:
labels:
app: rhacs-operator
control-plane: controller-manager
name: rhacs-operator-controller-manager-metrics-service
name: rhacs-operator-manager-metrics-service
spec:
ports:
- name: https
@@ -3,5 +3,5 @@ imagePullSecrets:
- name: quay-ips
kind: ServiceAccount
metadata:
name: rhacs-operator-controller-manager
name: rhacs-operator-manager
namespace: stackrox-operator
15 changes: 12 additions & 3 deletions fleetshard/pkg/central/charts/data/rhacs-operator/values.yaml
@@ -3,6 +3,15 @@
# Declare variables to be passed into your templates.

operator:
Collaborator:

Not sure about this, but consider renaming images to operators, or deployments such as

deployments:
- operator:
    image:
      repository: foo
      tag: bar
    resources: {}
  kubeProxy:
    resources: {}
 ...

Member Author:

I agree, but I would prefer to do it in a separate PR where more fields will be added

# TODO: Split image into registry and image name + tag
image: quay.io/rhacs-eng/stackrox-operator:3.74.0
# TODO: and values for resource limits and requests for both proxy and manager container
deploymentPrefix: rhacs-operator-manager-

# Each item in images should contain `repository` and `tag` key
# example:
# images:
# - repository: quay.io/rhacs-eng/stackrox-operator
# tag: 3.74.0
# - repository: quay.io/rhacs-eng/stackrox-operator
# tag: 3.74.1
images: []
# TODO: add values for resource limits and requests for both proxy and manager container
# TODO: add value for kube-rbac-proxy image
79 changes: 58 additions & 21 deletions fleetshard/pkg/central/operator/upgrade.go
@@ -4,42 +4,80 @@ package operator
import (
"context"
"fmt"
"strings"

"github.com/golang/glog"
apiErrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

"github.com/stackrox/acs-fleet-manager/fleetshard/pkg/central/charts"
"helm.sh/helm/v3/pkg/chart"
"helm.sh/helm/v3/pkg/chartutil"
apiErrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
ctrlClient "sigs.k8s.io/controller-runtime/pkg/client"
)

const (
operatorNamespace = "stackrox-operator"
releaseName = "rhacs-operator"
operatorImage = "quay.io/rhacs-eng/stackrox-operator:3.74.0"
operatorNamespace = "stackrox-operator"
releaseName = "rhacs-operator"
operatorDeploymentPrefix = "rhacs-operator-manager"

// deployment names should contain at most 63 characters
// RFC 1035 Label Names: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#rfc-1035-label-names
maxOperatorDeploymentNameLength = 63
)

func parseOperatorImages(images []string) ([]chartutil.Values, error) {
if len(images) == 0 {
return nil, fmt.Errorf("the list of images is empty")
}
var operatorImages []chartutil.Values
uniqueImages := make(map[string]bool)
for _, img := range images {
if !strings.Contains(img, ":") {
return nil, fmt.Errorf("failed to parse image %q", img)
}
strs := strings.Split(img, ":")
if len(strs) != 2 {
return nil, fmt.Errorf("failed to split image and tag from %q", img)
}
repo, tag := strs[0], strs[1]
if len(operatorDeploymentPrefix+"-"+tag) > maxOperatorDeploymentNameLength {
return nil, fmt.Errorf("%s-%s contains more than %d characters and cannot be used as a deployment name", operatorDeploymentPrefix, tag, maxOperatorDeploymentNameLength)
}
if _, used := uniqueImages[repo+tag]; !used {
uniqueImages[repo+tag] = true
img := chartutil.Values{"repository": repo, "tag": tag}
operatorImages = append(operatorImages, img)
}
}
return operatorImages, nil
}

// ACSOperatorManager keeps data necessary for managing ACS Operator
type ACSOperatorManager struct {
client ctrlClient.Client
resourcesChart *chart.Chart
}

// InstallOrUpgrade provisions or upgrades an existing ACS Operator from helm chart template
func (u *ACSOperatorManager) InstallOrUpgrade(ctx context.Context) error {
func (u *ACSOperatorManager) InstallOrUpgrade(ctx context.Context, images []string) error {
operatorImages, err := parseOperatorImages(images)
if err != nil {
return fmt.Errorf("failed to parse images: %w", err)
}
chartVals := chartutil.Values{
"operator": chartutil.Values{
"image": operatorImage,
"deploymentPrefix": operatorDeploymentPrefix + "-",
"images": operatorImages,
},
}

u.resourcesChart = charts.MustGetChart("rhacs-operator")
objs, err := charts.RenderToObjects(releaseName, operatorNamespace, u.resourcesChart, chartVals)
if err != nil {
return fmt.Errorf("installing operator chart: %w", err)
return fmt.Errorf("failed rendering operator chart: %w", err)
}

// TODO(ROX-16338): handle namespace assigning with refactoring of chart deployment
for _, obj := range objs {
if obj.GetNamespace() == "" {
obj.SetNamespace(operatorNamespace)
@@ -49,22 +87,21 @@ func (u *ACSOperatorManager) InstallOrUpgrade(ctx context.Context) error {
out.SetGroupVersionKind(obj.GroupVersionKind())
err := u.client.Get(ctx, key, &out)
if err == nil {
glog.V(10).Infof("Updating ACS Operator %s/%s", obj.GetNamespace(), obj.GetName())
glog.V(10).Infof("Updating %s/%s in %s namespace", obj.GetKind(), obj.GetName(), obj.GetNamespace())
obj.SetResourceVersion(out.GetResourceVersion())
err := u.client.Update(ctx, obj)
if err != nil {
return fmt.Errorf("failed to update ACS Operator %s/%s of type %v %s", key.Namespace, key.Name, obj.GroupVersionKind(), err)
return fmt.Errorf("failed to update object %s/%s in %s namespace: %w", obj.GetKind(), key.Name, key.Namespace, err)
}
} else {
if !apiErrors.IsNotFound(err) {
return fmt.Errorf("failed to retrieve object %s/%s in %s namespace: %w", obj.GetKind(), key.Name, key.Namespace, err)
}
err = u.client.Create(ctx, obj)
glog.Infof("Creating %s/%s in %s namespace", obj.GetKind(), obj.GetName(), obj.GetNamespace())
if err != nil && !apiErrors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create object %s/%s in %s namespace: %w", obj.GetKind(), key.Name, key.Namespace, err)
}

continue
}
if !apiErrors.IsNotFound(err) {
return fmt.Errorf("failed to retrieve object %s/%s of type %v %s", key.Namespace, key.Name, obj.GroupVersionKind(), err)
}
err = u.client.Create(ctx, obj)
glog.V(10).Infof("Creating object %s/%s", obj.GetNamespace(), obj.GetName())
if err != nil && !apiErrors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create object %s/%s of type %v: %w", key.Namespace, key.Name, obj.GroupVersionKind(), err)
}
}
