Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic resource pruning EP #109

Merged
merged 8 commits into from
May 3, 2022
Merged

Automatic resource pruning EP #109

merged 8 commits into from
May 3, 2022

Conversation

ryantking
Copy link
Contributor

Signed-off-by: Ryan King [email protected]

Signed-off-by: Ryan King <[email protected]>
Copy link
Member

@jmrodri jmrodri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review. A few typos. I will do a more thorough review next week.

enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
Copy link
Contributor

@fgiloux fgiloux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good work. Some input from my side

enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Outdated Show resolved Hide resolved
enhancements/automatic-resource-pruning.md Show resolved Hide resolved

#### Story 4

As an operator author, I want to prune a custom resources with specific status information when there is a certain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "a custom resource"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that we want to limit it to "status" informtaion. Fields in spec may in addition also get considered, a class for VolumeSnapshotContent for instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not take status to mean Kubernetes status field. But a more generic status as in a specific condition or state. @ryantking thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not take status to mean Kubernetes status field. But a more generic status as in a specific condition or state. @ryantking thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in practice it will be contained in the status field, but the best practice for where to store operator-maintained data is out of scope for this EP so yes this can refer to any generic state information stored on the resource.

enhancements/automatic-resource-pruning.md Show resolved Hide resolved
type IsPruneableFunc func(obj *runtime.Object) error

// RegisterIsPruneableFunc registers a function to check whether it is safe to prune a resources of a certain type.
func RegisterIsPrunableFunc(gvk schema.GroupVersionKind, isPruneable IsPruneableFunc) { /* ... */ }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why IsPruneableFunc get registered and StrategyFunc not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because IsPrunableFunc has intimate knowledge of the GVK. While the Strategy is agnostic

Copy link
Member

@jmrodri jmrodri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished my review. A few suggested rephrasings. Nicely done.

- Swap word "ephemeral" out for "unbounded"
- Add alternative Go API proposal.

Signed-off-by: Ryan King <[email protected]>
Signed-off-by: Ryan King <[email protected]>
Signed-off-by: Ryan King <[email protected]>
Signed-off-by: Ryan King <[email protected]>
@jmrodri
Copy link
Member

jmrodri commented Feb 2, 2022

@ryantking the API proposed in Appendix A was easier for me to rationalize and understand. I was able to easily picture a workflow for that API. The API proposed in Appendix B was definitely more flexible but the workflow felt more awkward to me. Like it was more complicated but for little gain in the short term.

Basically I could not think of a scenario that was not doable with the API from Appendix A.

@fgiloux
Copy link
Contributor

fgiloux commented Feb 2, 2022

@ryantking
One thing that I am missing is that operators are often used in multi-tenant environments. Each tenant may want to have, if not a separate retention policy, at least the possibility to parameterize the retention policy.
Let take a CI/CD pipeline operator as an example. You may have a Pipeline custom resource, which creates pods for the processing. In case of nightly runs these pods (with the logs) may be kept for 3 days. In case of the execution of a Pipeline for a final release you may want to keep them for longer. Another scenario is that teams may want to have different retention periods for different projects.
With the approach in appendix A my understanding is that the IsPruneableFunc would need to be re-registered each time a new Pipeline is created possibly using a closure holding the retention period for each Pipeline. Is that the way you see it?


### Implementation-specific

- What type of Kubernetes object should we generically work with? E.g. `metav1.Object`or `runtime.Object`?
Copy link
Member

@joelanford joelanford Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote would be client.Object from controller-runtime, which is:

type Object interface {
	metav1.Object
	runtime.Object
}

That interface supports all well-behaved root API types and users get compile-time checks rather than having to resort to runtime type assertions when they need to access the other half of the object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelanford will this work with an operator that was written with library-go?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know for sure, but I would think so. You can pass a client.Object to any function that accepts a metav1.Object or a runtime.Object, and all API root types (e.g. custom objects of CRDs and all built-in API types that are apply-able on-cluster) implement both interfaces.

Comment on lines +155 to +157
An alternative approach would be adding this logic to the core SDK and scaffolding it optionally during operation
generation. The primary drawbacks with this approach are the increased complexity to the implementation and adding it to
existing operators.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second alternative would be to implement a brand new operator that exposes a new set of prune APIs, e.g.:

  • PruneStrategy
  • ClusterPruneStrategy

Where the spec could look something like:

spec:
  objects:
    - group: example.com
      version: v1
      kind: MyKind
      matchers:
        - selector:
            matchLabels:
              example.com/is-completed: true
        - maxAge: 10h
  default:
    matchers:
      - selector:
          matchLabels:
            example.com/is-completed: true
      - maxCount: 50
  

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelanford would this operator be deployed with each operator? I can't see how a new operator would help individual operators cleanup their resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would likely be deployed once per cluster, and then two scenarios would be in play:

  1. each operator could self-configure pruning by laying down one of the prune strategy objects, either as part of the operator deployment itself or associated with an operand
  2. a cluster admin could provide their own prune strategy objects to cover their specific needs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature already exists for jobs and TTL. Here is the KEP, which foresees that it can be extended to pods.

IMHO a drawback of a generic controller is that it is either limited to the greatest common factor, which basically means the resource metadata or you need to implement special logic for each resource.
A good example of that it the quota controller. Generic is quota on count of instances. And then you have specialized components for pods, services and PVCs with specific logic for resource requests and limits.

Having the specialized logic together with the controller that created the resources may offer more flexibility to the operator author.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fgiloux Yep, all good points. I'm not necessarily suggesting we should actually pursue the alternative, but it may be helpful to consider it in whatever design we come up with.

The main motivation for something like a separate controller is to make it extremely easy for an operator author to say, "here, prune this stuff in this way" without having to worry about the details and mechanics of how all that happens.

We could implement this as a library that operator authors plug into their controllers as well. For example, perhaps they would just add something in main.go like this:

maxAge := time.Minute * 60
if err := prune.Prune(ctx, mgr, prune.Config{
	Selectors: prune.Selectors{
		&myapi.OtherKind{}: prune.MaxAge(maxAge),
	},
    PruneInterval: 5 * time.Minute,
}); err != nll {
	setupLog.Error(err, "pruner failed")
	os.Exit(1)
}

where the prune package has something like:

package prune

type Selectors map[client.Object]Selector

type Selector interface {
	Select(ctx context.Context, in []client.Object) ([]client.Object, error)
}

type SelectorFunc func(ctx context.Context, in []client.Object) ([]client.Object, error)
func (f SelectorFunc) Select(ctx context.Context, in []client.Object) ([]client.Object, error) {
	return f(ctx, in)
}

func MaxAge(maxAge time.Duration) SelectorFunc {
	return SelectorFunc(func(ctx context.Context, in []client.Object) ([]client.Object, error) {
		var out []client.Object
		for _, obj := range in {
			if in.GetCreationTimestamp().Before(time.Now().Add(-maxAge)) {
				out = append(out, obj)	
			}
		}
		return out
	})
}

// IsPruneableFunc is a function that checks a the data of an object to see whether or not it is safe to prune it.
// It should return `nil` if it is safe to prune, `ErrUnpruneable` if it is unsafe, or another error.
// It should safely assert the object is the expected type, otherwise it might panic.
type IsPruneableFunc func(obj *runtime.Object) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be able to pass a logger. Either explicitly (my preference) or through the context as controller-runtime does. A logger should be passed to StrategyFunc and Prune which have context as parameter, in the same way.

@fgiloux
Copy link
Contributor

fgiloux commented Apr 5, 2022

@ryantking I have had a look at the latest version of the enhancement proposal and it looks good to me. The only things that I have slight concerns with are:

  • the tenancy aspect as mentioned previously. I don't see how different strategies can be configured by tenant for the same resource type.
  • getting a logger externally configured and passed to StrategyFunc and IsPrunableFunc

@ryantking
Copy link
Contributor Author

@fgiloux As far as the tenancy aspect goes, in the example of nightly pipelines vs release pipelines, I think that you could solve it with an IsPruneable func?

func pipelineIsPruneable(obj client.Object) error {
    pipeline, ok := obj.(*pipelinesv1.Pipeline)
    // check assertion
    if isRelease(pipeline) and age(pipeline) < 3 * time.Week {
        return ErrUnprunable
    }
    // additional logic
    return nil
}

Would that pseudo-code work in theory? Another option for multi-tenancy is to create multiple Pruner objects, maybe one per project? I'll integrate the registry type from appendix B into A so each pruner can have its own set of IsPruneable funcs if that is desirable. I'm imagining a pattern similar to http.ServeMux where new http.Server objects start with http.DefaultServeMux, but the user can swap that out to a custom one.

Re: logging, I'll make sure the proposal incorporates the logging changes you made to the current implementation.

@fgiloux
Copy link
Contributor

fgiloux commented Apr 6, 2022

@ryantking I think it is important to consider personas:

  • the operator author designs capabilities and leverages this library for pruning. It does not decide on whether the user is in a multi-tenant environment and how each namespace or custom resource is configured. There is no upfront knowledge of something like isRelease: in cluster A they may decide to have only CI pipelines and in cluster B only release pipelines. In a smaller company they may only have a single cluster and dev, CI, tests, release pipelines all on it. They may also want to have different options for product 1 and product 2.
  • All the scenarios described above need to be configurable by a user/cluster or namespace administrator.
    The question is how do we design this library to make it easy for the operator author to have different parameters (retention time or number of instances) taken in consideration during pruning. These parameters are only known at runtime so that something statically coded like "isRelease" is not an option. The approach of creating multiple Pruners, one by namespace could work but I am concerned by the cost when there are 10,000 namespaces. If the retention parameters is directly configured in the resource or retrievable for instance through an ownerRef the logic could be embedded into pipelineIsPruneable something like here.


// Pruner is an object that runs a prune job.
type Pruner struct {
// ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having had a closer look at the EP I would very much like to see what you intend to have in type Pruner struct to understand what can be configurable through type PrunerOption func(p *Pruner)

@ryantking
Copy link
Contributor Author

@fgiloux I'm still struggling to understand which specific design choices in the proposed API limit the use cases you are laying out. If you are giving me use cases that you want to make sure are covered, then can you present them as user stories so I can add them to the EP? If there are use cases that you think are not possible with the proposed API then I think we should talk offline in greater detail to figure out what changes I need to make. The way I look at it, there are two real ways the user implements prune logic:

  1. The IsPruneable function registry. This allows the operator author to define per-GVK when a resource is eligible for pruning. This function can know about namespaces or read configuration from a different resource or do basically anything that can be determined from the object itself, the type of the object, and whatever information it can retrieve from external sources including the cluster itself.
  2. The StrategyFunc: This allows the operator author to define how many pruneable objects should be removed, and how to select the specific objects from the set.

A concrete example would be an IsPruneable function that determines if a pipeline object is a release or nightly job then a strategy that keeps the latest 32 nightly runs. You could even create two pruners, one for releases that keeps only the supported versions' pipelines and one for nightly that keeps the last 30 days worth.

Let me know your thoughts or feel free to reach out if you want to talk in more detail.

@fgiloux
Copy link
Contributor

fgiloux commented May 3, 2022

@ryantking I am fine with the proposal. Mind that I have little karma on this repository as I am not part of the Operator Framework organisation.

@ryantking ryantking merged commit fe91f5e into operator-framework:master May 3, 2022
@ryantking ryantking deleted the autopruning branch May 3, 2022 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants