
[MIMO] Move cluster certificate functionality to ClientHelper #3736

Merged
merged 5 commits into master from hawkowl/cleanups-cluster-tls on Sep 5, 2024

Conversation

@hawkowl (Collaborator) commented Jul 30, 2024

Which issue this PR addresses:

Split out of the MIMO M1 work

What this PR does / why we need it:

Retrofits clienthelper support into some functions that MIMO tasks require

Test plan for issue:

CI, E2E

Is there any documentation that needs to be updated for this PR?

N/A

How do you know this will function as expected in production?

E2E should cover it, since it applies during install as well

@hawkowl hawkowl changed the title from "move over TLS applying, as well as some clienthelper work" to "[MIMO] Move cluster certificate functionality to ClientHelper" on Jul 30, 2024
@hawkowl hawkowl force-pushed the hawkowl/cleanups-cluster-tls branch from 1895b7b to 4c797bc on July 30, 2024 06:39
@hawkowl (Collaborator, Author) commented Jul 30, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

@mociarain (Collaborator) left a comment

A few things:

  • ClientHelper.New is never used. I think we can kill it without trouble.
  • Can we collapse the New stuff here so we have private constructors and a public GetClient that builds or returns a client as needed? Basically, should this be a singleton? (See the sketch after this list.)
  • A general ask to add some more descriptive naming. Can we say what kind of client it is? I think k8sClient, but I'm not sure. Maybe a better name for clientHelper is AROK8sClient, i.e. our wrapped version of the k8s client.
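A minimal sketch of that GetClient/singleton idea, assuming the helper is built lazily from whatever inputs the private constructor holds (all names here are hypothetical, not from this PR):

import (
    "sync"

    "k8s.io/client-go/rest"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type clientHelper struct {
    restConfig *rest.Config // hypothetical input held by the private constructor

    once sync.Once
    c    client.Client
    err  error
}

// GetClient builds the underlying controller-runtime client on first use and
// returns the same instance (or construction error) on every subsequent call.
func (h *clientHelper) GetClient() (client.Client, error) {
    h.once.Do(func() {
        h.c, h.err = client.New(h.restConfig, client.Options{})
    })
    return h.c, h.err
}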

return err
}

// initializeKubernetesClients initializes clients which are used
// once the cluster is up later on in the install process.
func (m *manager) initializeOperatorDeployer(ctx context.Context) (err error) {
-	m.aroOperatorDeployer, err = deploy.New(m.log, m.env, m.doc.OpenShiftCluster, m.arocli, m.client, m.extensionscli, m.kubernetescli, m.operatorcli)
+	m.aroOperatorDeployer, err = deploy.New(m.log, m.env, m.doc.OpenShiftCluster, m.arocli, m.ch.Client(), m.extensionscli, m.kubernetescli, m.operatorcli)
Contributor

Wrapping the client into the helper only to ask the helper to give the client right back is awkward. Either the client helper should close over the client, or we should hold on to both values, IMO. The in-between seems like an incorrect abstraction that closes over nothing.

I agree with that. My question: is there a good reason to do it here that we are not aware of?

I would simply make the client public instead of providing a getter, as this indirect access doesn't bring much value (especially when we have multiple things called "Client": methods, struct fields, etc.) IMHO.

Collaborator Author

I thought about it a bit, and I've made it close over the whole client instead, since that was pretty easy. This means that in the places where we currently use a controller-runtime client, we can upgrade to the clienthelper fairly easily.
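A hedged sketch of what "closing over the whole client" can look like: the helper embeds the controller-runtime client, so it satisfies client.Client itself, and call sites that want a raw client can take the helper directly (names assumed, not copied from the diff):

import (
    "context"

    kruntime "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// clientHelper embeds client.Client, so Get/List/Create/Update/Patch/Delete
// come for free and no Client() getter round-trip is needed.
type clientHelper struct {
    client.Client
}

func NewWithClient(c client.Client) *clientHelper {
    return &clientHelper{Client: c}
}

// Ensure is the helper's own behavior layered on top of the embedded client.
func (ch *clientHelper) Ensure(ctx context.Context, objs ...kruntime.Object) error {
    // ... create-or-update each object using the embedded client ...
    return nil
}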

@AldoFusterTurpin left a comment

Thank you @hawkowl for this, I like that! 🙂
I just left some comments in case any of them make sense to you; none are blockers.

pkg/cluster/apply.go (outdated)
utilpem "github.com/Azure/ARO-RP/pkg/util/pem"
)

func EnsureTLSSecretFromKeyvault(ctx context.Context, env env.Interface, ch clienthelper.Interface, target types.NamespacedName, certificateName string) error {

In the same direction as my previous comment, and talking about the ch clienthelper.Interface parameter: there is something I have seen multiple times in ARO-RP that I would consider a bad pattern. We tend to pass big interfaces where just one of the methods is used, something pretty common in our beloved friend Java, but easily avoided in Go.

In those cases, I think it would be a good idea to define an interface with just the method this function needs, and update the signature of EnsureTLSSecretFromKeyvault to use that new one-method interface. That is one of the beauties of implicit interfaces in Go: we can define the interface IN THE CONSUMER package, not the implementer's, and still fulfill this function's requirement. A sketch follows.
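A minimal sketch of that suggestion, with a hypothetical one-method interface defined in the consumer package (secretEnsurer is an invented name; the Ensure signature is assumed from how ch is used later in this function):

import (
    "context"

    kruntime "k8s.io/apimachinery/pkg/runtime"
)

// secretEnsurer is the narrow, consumer-defined view of clienthelper:
// only the single method this function actually calls.
type secretEnsurer interface {
    Ensure(ctx context.Context, objs ...kruntime.Object) error
}

// The concrete clienthelper still satisfies secretEnsurer implicitly, so
// callers do not change, but this function no longer depends on the big
// clienthelper.Interface.
func EnsureTLSSecretFromKeyvault(ctx context.Context, ch secretEnsurer /* , ... */) error {
    var secret kruntime.Object
    // ... build the TLS secret from the keyvault certificate ...
    return ch.Ensure(ctx, secret)
}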

Collaborator Author

I've altered the code to have Reader and Writer interfaces, similar to the controller-runtime client. This should let us define things a bit more nicely if we want to restrict the interface. I'm not too eager to narrow the interfaces down further than this, though.

I think that making single-method interfaces (especially defining them in consumers) is overly restrictive and can cause a lot of work if any of those interfaces need to change in the future. Every time I've gone to use that pattern, I've later wished I hadn't. More granular interfaces from the provider can mitigate some of this duplication, but I'm also not sure how often we can get away with having super-granular interfaces. A Reader/Writer split should hopefully do for now.

I also don't think it makes code easier to test when we have verified fakes (e.g. the controller-runtime client). It does make things easier when we're writing mocks for interfaces, since the mocks (and what the code might call) are therefore smaller, but I firmly believe that mocks are an anti-pattern, and that even big interfaces like env.Interface would be a lot more tolerable if they were designed with a proper fake in parallel.
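For context, a sketch of the Reader/Writer shape being described, mirroring the split in controller-runtime's client package (the method sets are assumed from this thread, not copied from the final diff):

import (
    "context"

    kruntime "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Reader is the read-only view, delegating to controller-runtime.
type Reader interface {
    client.Reader
}

// Writer is the mutating view, plus the helper's own convenience methods.
type Writer interface {
    client.Writer

    Ensure(ctx context.Context, objs ...kruntime.Object) error
    EnsureDeleted(ctx context.Context, gvk schema.GroupVersionKind, key types.NamespacedName) error
}

// Interface composes both; narrower call sites can accept just Reader or Writer.
type Interface interface {
    Reader
    Writer
}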


var cb []byte
for _, cert := range certs {
    cb = append(cb, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: cert.Raw})...)

Super opinionated and optional: could we please move the expression pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: cert.Raw}) into an auxiliary variable, for readability?
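A sketch of that refactor, continuing the snippet above (the encoded name is illustrative):

var cb []byte
for _, cert := range certs {
    // name the encoded PEM block before appending, for readability
    encoded := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: cert.Raw})
    cb = append(cb, encoded...)
}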

Type: corev1.SecretTypeTLS,
}

return ch.Ensure(ctx, secret)

As previously mentioned, since we only use the Ensure method of ch here, maybe we could define an interface with just this method, to stick to the Interface Segregation Principle.

@hawkowl hawkowl force-pushed the hawkowl/cleanups-cluster-tls branch from 4c797bc to 5aefdaf on August 5, 2024 03:56
@hawkowl (Collaborator, Author) commented Aug 5, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

    EnsureDeleted(ctx context.Context, gvk schema.GroupVersionKind, key types.NamespacedName) error

type Writer interface {
    client.Writer

    // Ensure applies self-contained objects to a Kubernetes API, merging
Contributor

I know you're not changing this in the PR, so this is not a blocker here, but we should use server-side apply instead of whatever client-side logic exists, as it's guaranteed to be a better implementation :)

Collaborator Author

I would love to use server-side apply, but the fake implementation of controller-runtime doesn't yet support it :(

Refs:

Contributor

No comment, just waving at @hawkowl 👋🏽

Contributor

Yes, the controller-runtime fakes are horrendous and IMO moving away from them as soon as possible is also a great boon to the project :)

SSA is very complex, and faking it is likely not going to happen, not just in some soon time-frame but perhaps ever. Client fakes have fallen out of favor, and everyone has a horror story of working around the mess that is envtest without gaining any confidence in their application functioning in the real world against a real API server. Even past that, the concerns Alvaro has in Troy's PR are a mountain of work that the upstream SIGs are not interested in taking up or sponsoring, from my understanding.

This is neither here nor there, but you might want to consider not making SSA support in the controller-runtime fakes a precondition to using this (amazingly useful and beneficial) technology in your production deployments.

See an example of how to test controllers and the like using static data in, static data out here: https://github.com/openshift/hypershift/blob/main/control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

Contributor

The issue, beyond the requisite work, is the apply types that are not scaffolded by crd/operator-sdk/kubebuilder. That work is a side effect of controller-runtime not implementing apply, whereas client-go has the necessary logic.

The interface client being used here is the mechanism that prevents using it. I have no context around what this is trying to accomplish, but if it is indeed using corev1 types, where a kubernetes.Interface can be used, those applyConfigurations are available to fulfill server-side apply.

The fake clients are backed by a "tracker", and trying to track a server-side apply in any client would mean holding state for things outside the client's responsibility. Having a test suite where that can happen seems outside the scope of what the client/fakes could do.
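For illustration, a minimal sketch of server-side apply via client-go apply configurations for a corev1 Secret; the function name and field manager here are hypothetical, not part of this PR:

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    applycorev1 "k8s.io/client-go/applyconfigurations/core/v1"
    "k8s.io/client-go/kubernetes"
)

func applyTLSSecret(ctx context.Context, cs kubernetes.Interface, ns, name string, certPEM, keyPEM []byte) error {
    // the intent is encoded as static data: an apply configuration
    secret := applycorev1.Secret(name, ns).
        WithType(corev1.SecretTypeTLS).
        WithData(map[string][]byte{
            corev1.TLSCertKey:       certPEM,
            corev1.TLSPrivateKeyKey: keyPEM,
        })

    // the server merges this with existing state, tracked per field manager
    _, err := cs.CoreV1().Secrets(ns).Apply(ctx, secret, metav1.ApplyOptions{
        FieldManager: "aro-rp", // hypothetical field-manager name
        Force:        true,
    })
    return err
}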

Contributor

I have worked with the controller-runtime ecosystem and upstream Kubernetes client-sets for many years, so I'm pretty familiar with the structure. I don't think I communicated the "static data in, static data out" link well: my suggestion is not that some client helper hold on to apply configurations, nor is it to somehow teach the controller-runtime client how to apply.

In case we get too far into the weeds: I just want to reiterate that the comments here are not review comments for this PR, just general commentary. I'm not suggesting any changes be made right now.


What I am suggesting, though, is that operator reconciliation loops are written as pure functions: either take some set of cluster state as input, or allow dependency injection to provide producers of cluster state, and produce intended state as output. Apply configurations have a wonderful property in that they encode your intent in static data, so testing your lower-case-r reconcile() is straightforward, and no mocks of any persuasion are necessary.

The top-level Reconcile() might look like:

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // assemble the data necessary by using listers, etc
    input := r.getInput()

    // transform input state into intended state
    intent, err := r.reconcile(input)
    if err != nil {
        return ctrl.Result{}, err
    }

    // realize intended state onto the cluster
    return r.execute(intent)
}

Where r.execute(intent) takes apply configurations and uses a kubernetes.Clientset to call Apply.
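Under that structure, the pure part can be unit-tested with static data and no fakes at all; a self-contained sketch with invented types:

import (
    "testing"

    "github.com/google/go-cmp/cmp"
)

// hypothetical pure business logic: static state in, static intent out
type input struct{ Replicas int }
type intent struct{ DesiredReplicas int }

func reconcile(in input) (intent, error) {
    return intent{DesiredReplicas: in.Replicas}, nil
}

func TestReconcile(t *testing.T) {
    got, err := reconcile(input{Replicas: 3})
    if err != nil {
        t.Fatal(err)
    }
    // intent is plain data, so it diffs cleanly against an expected value
    if diff := cmp.Diff(intent{DesiredReplicas: 3}, got); diff != "" {
        t.Errorf("unexpected intent (-want +got):\n%s", diff)
    }
}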


In my experience, relying heavily on mocks leads many teams to stumble into deeply frustrating issues, often many years down the line. Examples I've seen:

  • implementing highly-performant caching using metadata-only watches requires 100% of tests to be rewritten, as the fakes don't know how to propagate data between the metadata fake and the "real" fake
  • simplifying controller code to reduce maintenance burden and eliminate incorrect client-side patching by using server-side apply means all unit tests fail
  • fragile assumptions about the number of times your reconciliation is called and when, in what order, etc, lead integration tests to be a poor substitute for end-to-end testing against actual API servers, leading to test duplication
  • fakes make many highly surprising decisions, taking valuable mental overhead and productivity away from the team, like:
    • faking out field selectors (usually handled in the server) with indices, which requires test code to be written to create indices
    • objects with deletion timestamps but no finalizers cause some internal panic even though it's totally valid for your controller to see one in production
    • fake clients need subresources registered manually

Teams that spend time writing unit tests for unit-level concerns and do not reach for mocks, in my experience, move faster and get a higher ROI on their tests. End-to-end tests and chaos tests validate production modalities that integration tests attempt to, but cannot.

@AldoFusterTurpin commented Aug 7, 2024

@stevekuznetsov I completely agree with that. 🥇

Even though I am by far the least experienced person in this conversation about this domain, I have seen similar problems in other domains, which made your comment resonate a lot with me as I read it.

Sometimes we focus too much on trying to reproduce the state of the world where our app will run, and we forget that we should primarily test what our app does and what our business logic is. It is almost always much simpler, and easier to understand, to keep our business logic in pure functions that expect input data and produce output data. We can then forget about how that data is gathered and what will be done with the result, and simply ensure that domainLogicFn(x) == expectedOutput.

This is a common concern that applies to a lot of different domains, but it is especially common in k8s, because most of the time the business logic gets mixed up or lost in the effort to understand/simulate the "state" where all that business logic will happen (due to the stateful nature of k8s controllers). So I just want to reaffirm that avoiding complex mocks with pure functions, when possible, is a great idea, and not just in this context 🙂

@hawkowl hawkowl force-pushed the hawkowl/cleanups-cluster-tls branch from 9906f6d to 4126dbc on August 14, 2024 03:06
@hawkowl (Collaborator, Author) commented Aug 14, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Aug 27, 2024
Please rebase pull request.

@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Sep 3, 2024
@hawkowl (Collaborator, Author) commented Sep 3, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

@hawkowl (Collaborator, Author) commented Sep 3, 2024

/azp run e2e

Azure Pipelines successfully started running 1 pipeline(s).

@hawkowl (Collaborator, Author) commented Sep 5, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

@jaitaiwan (Contributor) left a comment

LGTM!

@jaitaiwan jaitaiwan merged commit 44bc3cc into master Sep 5, 2024
24 checks passed
gouthamMN pushed a commit that referenced this pull request Sep 10, 2024
@hawkowl hawkowl deleted the hawkowl/cleanups-cluster-tls branch October 7, 2024 01:17