
Kepler Operator v1 Working Doc

Sunyanan Choochotkaew edited this page Nov 8, 2022 · 10 revisions

Design and spec discussion for Kepler Operator v1

Following the existing discussion here

CR and Controllers:

The proposed CRs

      flowchart TD;
          machine-config
          kepler-system
          kepler-collected-metric
          kepler-exported-power
  • Instead of using integrated-operator-install to install Prometheus and Grafana via the operator, setting up the monitoring stack should be left to the user.

  • Each component should be represented as a separate CR and managed by a separate controller.

kepler-system

apiVersion: sustainability-computing-io/v1alpha1
kind: Kepler
metadata:
 name: kepler-system
 namespace: kepler-system
spec:
 scrape-interval:
 daemon:
   exporter:
     image:
     port: (default: 9102)
   estimator-sidecar:
     enabled: (default: false)
     image:
     mnt-path: (default: /tmp)
 model-server:
   enabled: (default: false)
   storage:
     type: (default: local?; values: local, hostpath, nfs, external (such as via S3))
     path: (default: models)
   sampling-period:
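
For illustration, a filled-in instance of the spec above might look like the following. This is only a sketch: the image references, the scrape-interval and sampling-period values, and their units are placeholders assumed for the example, not confirmed defaults. Only the port, mnt-path, storage path, and the enabled flags follow the defaults listed in the spec.

```yaml
# Hypothetical Kepler CR instance; placeholder values are marked as such.
apiVersion: sustainability-computing-io/v1alpha1
kind: Kepler
metadata:
  name: kepler-system
  namespace: kepler-system
spec:
  scrape-interval: 30s          # assumed value and unit
  daemon:
    exporter:
      image: example.com/kepler/exporter:latest    # placeholder image
      port: 9102                # default from the spec
    estimator-sidecar:
      enabled: false            # default from the spec
      image: example.com/kepler/estimator:latest   # placeholder image
      mnt-path: /tmp            # default from the spec
  model-server:
    enabled: false              # default from the spec
    storage:
      type: local               # one of: local, hostpath, nfs, external
      path: models              # default from the spec
    sampling-period: 60s        # assumed value and unit
```

Leaving model-server.enabled at false means the storage and sampling-period fields would presumably be ignored; they are shown here only to make the shape of the CR concrete.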

Open Questions

  • kepler-collected-metric and kepler-exported-power: What are these components meant to do? They appear to set some configurations. Where does Kepler use these configs?
    • kepler-collected-metric: the list of metrics the collector pkg should collect, grouped by input source.

    • kepler-exported-power: the list of metrics to export to Prometheus for each level (node, package, pod).

      Currently, these configurations live in two locations: exporter.go and config.go. However, these sections are supposed to be refactored so that they are set as environment variables via a ConfigMap.

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kepler-cfm
        namespace: monitoring
      data:
        SOURCE.COUNTER: enabled
        SOURCE.CGROUP: enabled
        SOURCE.KUBELET: enabled
        SOURCE.GPU: enabled
        EXPORT_METRICS: cpu_cycles, cached_miss, cpu_time, ...

      Currently, the list of metrics from each source (COUNTER, CGROUP, etc.) is fixed for grouping power models. Should we change this? (low priority)

      Additionally, some configurations are still hard-coded, such as PodTotalPowerModelConfig in pod_power.go.

  • Should the Operator, for now, just expose model weights, with an option to also enable online training as long as energy metrics are supported? Or should the operator just use the model server for exposing the models? If we want to enable online training, how do we intend to store the new models? Should they just be stored in PVs?
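
If the answer to the last question is to store newly trained models in PVs, one possible shape is a PersistentVolumeClaim mounted into the model server pod. The sketch below is an assumption throughout: the claim name, pod name, image, storage size, access mode, and mount path are all hypothetical and not taken from any existing Kepler manifest.

```yaml
# Hypothetical claim for storing models produced by online training.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kepler-model-pvc        # hypothetical name
  namespace: kepler-system
spec:
  accessModes:
    - ReadWriteOnce             # assumed: a single model server writes models
  resources:
    requests:
      storage: 1Gi              # placeholder size
---
# Hypothetical model server pod that mounts the claim above.
apiVersion: v1
kind: Pod
metadata:
  name: kepler-model-server     # hypothetical name
  namespace: kepler-system
spec:
  containers:
    - name: model-server
      image: example.com/kepler/model-server:latest   # placeholder image
      volumeMounts:
        - name: models
          mountPath: /models    # assumed path; would correspond to storage.path
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: kepler-model-pvc
```

With this layout, the storage.type field of the CR would decide whether the controller creates such a claim or instead points the model server at a hostpath, NFS, or external location.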