
Kepler Operator Requirement

Sam Yuan edited this page Sep 19, 2022 · 8 revisions

Requirement

The operator should be able to probe the cluster and nodes to ensure Kepler is running on a supported environment and starts up with the right configuration.

After Kepler is up, the operator should integrate with Prometheus and Grafana to create a ServiceMonitor and Grafana dashboard, in accordance with the CRD spec.

Cluster Probe

The operator probes each node, resolves dependencies, and installs the following packages if they are missing. If a package cannot be installed on a node, the operator avoids using that node.

  • Kernel-devel
  • Cgroup
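The detection half of this probe can be sketched as below. This is a minimal illustration, not the operator's implementation: `ProbeResult` and `probe_node` are hypothetical names, the kernel headers are assumed to be reachable through the conventional `/lib/modules/<release>/build` path, and cgroup v2 is detected by the presence of `cgroup.controllers` at the cgroup mount point. Installing missing packages is left out.

```python
import os
import platform
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    kernel_headers: bool   # True when a kernel build directory was found
    cgroup_version: int    # 2, 1, or 0 when no cgroup hierarchy was found

    @property
    def usable(self) -> bool:
        # A node is usable only when both dependencies are present.
        return self.kernel_headers and self.cgroup_version > 0


def probe_node(root: str = "/", kernel_release: Optional[str] = None) -> ProbeResult:
    """Inspect a node's filesystem (mounted at `root`) for Kepler's dependencies."""
    release = kernel_release or platform.release()

    # Assumption: kernel-devel exposes headers via /lib/modules/<release>/build.
    headers = os.path.isdir(os.path.join(root, "lib/modules", release, "build"))

    cgroup_root = os.path.join(root, "sys/fs/cgroup")
    if os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers")):
        version = 2  # unified hierarchy
    elif os.path.isdir(cgroup_root) and os.listdir(cgroup_root):
        version = 1  # per-controller directories
    else:
        version = 0

    return ProbeResult(kernel_headers=headers, cgroup_version=version)
```

Passing `root` explicitly lets the same check run against a host filesystem mounted into a probe pod, and makes the function easy to exercise against a temporary directory.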

CRD Spec

The CRD specifies the following:

  • Kepler deployment: RBAC, deployment configuration (including whether to use /proc (for cgroup v1), the model server endpoint, and whether to use the estimator), and the metrics Service
  • Kepler integration: ServiceMonitor, Grafana instance, datasource, and dashboard
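To make the shape of such a spec concrete, here is a rough model of the two groups as Python dataclasses. All class and field names are hypothetical, and the default metrics port is only a placeholder; the actual CRD schema is not defined on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeploymentSpec:
    use_proc_for_cgroup_v1: bool = False  # read from /proc on cgroup v1 nodes
    model_server_endpoint: str = ""       # empty string: no model server
    use_estimator: bool = False
    metrics_port: int = 9102              # placeholder default, an assumption


@dataclass
class IntegrationSpec:
    service_monitor: bool = False
    grafana: bool = False  # Grafana instance, datasource, and dashboard


@dataclass
class KeplerSpec:
    deployment: DeploymentSpec = field(default_factory=DeploymentSpec)
    integration: IntegrationSpec = field(default_factory=IntegrationSpec)


def resources_to_create(spec: KeplerSpec) -> List[str]:
    """List what the operator would reconcile for a given spec."""
    # The deployment group is always created.
    resources = ["RBAC", "Deployment", "Service"]
    # Integration resources are created only when requested.
    if spec.integration.service_monitor:
        resources.append("ServiceMonitor")
    if spec.integration.grafana:
        resources += ["Grafana instance", "Grafana datasource", "Grafana dashboard"]
    return resources
```

Keeping the integration group separate from the deployment group means a spec with all defaults yields only the core Kepler resources.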

Minimum scope:

  • Only the Service, Deployment, and similar resources for Kepler, so that users can set up Kepler on their own cluster by applying a sample manifest with kubectl apply -f.
  • Any ports and Kubernetes resources, such as cluster role permissions, should be defined in the manifests.
  • The minimum scope should not grant any permission through a cluster role binding.
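One way to keep the minimum scope honest is to check the generated manifests before they are applied. The sketch below assumes the manifests are already parsed into dictionaries; `minimum_scope_violations` is a hypothetical helper, not part of Kepler.

```python
from typing import Any, Dict, List

# Kinds that the minimum scope should not contain.
FORBIDDEN_KINDS = {"ClusterRoleBinding"}


def minimum_scope_violations(manifests: List[Dict[str, Any]]) -> List[str]:
    """Return "Kind/name" for every manifest that breaks the minimum scope."""
    violations = []
    for manifest in manifests:
        kind = manifest.get("kind", "")
        if kind in FORBIDDEN_KINDS:
            name = manifest.get("metadata", {}).get("name", "")
            violations.append(f"{kind}/{name}")
    return violations
```

An empty result means the manifest set stays within the minimum scope as described in the list above.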

The minimum-scope documentation will guide developers in developing and configuring the Kepler deployment (created by the operator) alongside other tools for observability, service mesh, disk, key management, and so on.

Extendable: To stay extensible with other tools, taking the Prometheus operator as an example, the CRD can define specific fields/properties for each extension. Any cluster role binding used to integrate with tools outside the scope of the Kepler code belongs here (for example, a cluster role binding for service monitoring).

  • monitoring: Prometheus
  • distributed tracing: Jaeger/OTEL
  • logging: ELK?

Optional:

  • cert management operator, for (m)TLS?
  • service mesh?

Any extension should build on the minimum scope, for example its port settings. Feel free to open a GitHub issue to request integration with a new tool.
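As an illustration of an extension reusing minimum-scope settings, the sketch below derives a Prometheus Operator ServiceMonitor from the base metrics Service instead of repeating the port. The helper name and the sample Service (names, labels, port) are hypothetical; only the ServiceMonitor fields shown (`selector.matchLabels`, `endpoints[].port`) come from the Prometheus Operator API.

```python
from typing import Any, Dict


def service_monitor_from_service(service: Dict[str, Any]) -> Dict[str, Any]:
    """Build a ServiceMonitor that scrapes the first named port of a Service."""
    meta = service["metadata"]
    # Reuse the port name defined by the minimum-scope Service.
    port_name = service["spec"]["ports"][0]["name"]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": meta["name"], "namespace": meta["namespace"]},
        "spec": {
            "selector": {"matchLabels": dict(meta.get("labels", {}))},
            "endpoints": [{"port": port_name}],
        },
    }


# Hypothetical minimum-scope Service used only for this example.
base_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "kepler-exporter",
        "namespace": "kepler",
        "labels": {"app": "kepler-exporter"},
    },
    "spec": {"ports": [{"name": "http", "port": 9102}]},
}

monitor = service_monitor_from_service(base_service)
```

Because the ServiceMonitor refers to the port by name, changing the port number in the base Service does not require touching the extension.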
