Getting visibility into CMMS metrics using Google Cloud Monitoring.
We can use Google Managed Prometheus (GMP) on GKE clusters to get metrics into Google Cloud Monitoring. Using OpenTelemetry we can also collect more detailed metrics from CMMS component pods.
The configuration for the dev environment can be used as the basis for deploying CMMS components using Google Kubernetes Engine (GKE) on another Google Cloud project.
Many operations can be done either via the gcloud CLI or the Google Cloud web console. This guide picks whichever is most convenient for that operation. Feel free to use whichever you prefer.
- OpenTelemetryCollector
- OpenTelemetry Instrumentation
- GMP ClusterPodMonitoring
- GMP PodMonitoring
- NetworkPolicy
Deploy a Halo component. See the related guides: Create Kingdom Cluster, Create Duchy Cluster, or Create Reporting Cluster.
Managed Prometheus can be enabled on the cluster via the Google Cloud Console under "Features", or using the gcloud CLI. For example, assuming a cluster named "kingdom":
gcloud container clusters update kingdom --enable-managed-prometheus
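To confirm that the setting took effect, one option is to read it back from the cluster description. The following is a minimal sketch, not a required step: the helper name `gmp_enabled` is hypothetical, and it assumes the field path `monitoringConfig.managedPrometheusConfig.enabled` matches what your gcloud version reports. Any extra arguments (for example a location flag) are passed through to gcloud.

```shell
# Hypothetical helper: print the managed Prometheus setting for a cluster.
# Usage: gmp_enabled CLUSTER_NAME [extra gcloud flags...]
gmp_enabled() {
  cluster="$1"
  shift
  gcloud container clusters describe "$cluster" "$@" \
    --format='value(monitoringConfig.managedPrometheusConfig.enabled)'
}

# Example, using the cluster name from above:
#   gmp_enabled kingdom
```

If the value is empty or false, re-run the update command above and check that you are targeting the intended project and location.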
Make sure that the least-privilege service account you created for the cluster has permissions to access the Cloud Monitoring API. See Cluster Configuration.
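If the service account is missing that access, a role binding along the following lines may be what is needed. This is a sketch under assumptions: the helper name `grant_metric_writer` and the example values are hypothetical, and `roles/monitoring.metricWriter` is assumed to be the role your setup requires for writing metrics. Check Cluster Configuration for the roles your project actually uses.

```shell
# Hypothetical helper: grant a service account permission to write metrics.
# Usage: grant_metric_writer PROJECT_ID SERVICE_ACCOUNT_EMAIL
grant_metric_writer() {
  project="$1"
  service_account="$2"
  gcloud projects add-iam-policy-binding "$project" \
    --member="serviceAccount:${service_account}" \
    --role="roles/monitoring.metricWriter"
}

# Example with placeholder values:
#   grant_metric_writer my-project gke-cluster@my-project.iam.gserviceaccount.com
```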
Deploying to the cluster is generally done by applying a K8s object configuration file. You can use the dev configurations as a base to get started. The dev configurations are YAML files that are generated from files written in CUE using Bazel rules.
You can customize the generated object configuration as needed.
The default dev configuration for OpenTelemetry collection is in , which depends on open_telemetry.cue. The default build target is //src/main/k8s/dev:open_telemetry_gke
The dev configuration is in . The build target is //src/main/k8s/dev:prometheus_gke
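The two build targets named above can be built together. The sketch below assumes a plain `bazel build` is enough; your workspace may need extra flags, which the hypothetical helper `build_monitoring_configs` passes through. The second helper relies on `bazel cquery --output=files`, which is assumed to be available in your Bazel version, to print where the generated files end up.

```shell
# Hypothetical helper: build both monitoring configurations.
# Any arguments are passed through to bazel as extra flags.
build_monitoring_configs() {
  bazel build \
    //src/main/k8s/dev:open_telemetry_gke \
    //src/main/k8s/dev:prometheus_gke "$@"
}

# Hypothetical helper: list the output files of a build target.
# Usage: list_config_outputs //src/main/k8s/dev:open_telemetry_gke
list_config_outputs() {
  bazel cquery --output=files "$1"
}
```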
You must use a cert-manager, OpenTelemetry Operator, and collector image that are compatible with each other. See the Compatibility matrix and the collector image specified in
kubectl apply -f
kubectl apply -f
You can just use kubectl apply, specifying the configuration files you created in the previous step.
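If you want to preview changes before they reach the cluster, one possible approach is to run `kubectl diff` ahead of each apply. This is a sketch only: the helper name `apply_configs` is hypothetical, and the file names you pass are whatever you generated in the previous step.

```shell
# Hypothetical helper: show a diff for each configuration file, then apply it.
# Usage: apply_configs FILE [FILE...]
apply_configs() {
  for config in "$@"; do
    # kubectl diff exits non-zero when differences exist, so its status is ignored.
    kubectl diff -f "$config" || true
    # Stop at the first file that fails to apply.
    kubectl apply -f "$config" || return 1
  done
}
```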
You will need to restart all the Deployments to pick up the Java agent instrumentation.
for deployment in $(kubectl get deployments -o name); do kubectl rollout restart $deployment; done
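The restart command returns before the new Pods are ready. A minimal sketch for waiting on each Deployment is shown here; the helper name `wait_for_rollouts` and the default timeout of five minutes are assumptions, not values taken from the dev configuration.

```shell
# Hypothetical helper: wait for every Deployment in the current namespace
# to finish rolling out. Usage: wait_for_rollouts [TIMEOUT]
wait_for_rollouts() {
  timeout="${1:-5m}"
  for deployment in $(kubectl get deployments -o name); do
    # Return early if any rollout fails or times out.
    kubectl rollout status "$deployment" --timeout="$timeout" || return 1
  done
}
```

Run `wait_for_rollouts` after the restart loop above, optionally passing a different timeout such as `10m`.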
Visit the Managed Prometheus page in Cloud Console. Query up and scrape_samples_scraped.
The up query tells you which targets were found and whether they are up, and scrape_samples_scraped is a good way to check that scraping is occurring. If not enough time has passed, scrape_samples_scraped might show all zeros, but after a couple of minutes you should see results for every target that is up.
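If no targets show up at all, it can help to look at the cluster side as well. The sketch below assumes managed collection runs its collectors in the `gmp-system` namespace and that the PodMonitoring and ClusterPodMonitoring resources have been applied; the helper name `check_gmp_collection` is hypothetical.

```shell
# Hypothetical helper: inspect the managed collection components and
# the scrape configurations present on the cluster.
check_gmp_collection() {
  # Collector Pods for managed collection (assumed namespace: gmp-system).
  kubectl get pods --namespace gmp-system
  # Namespaced scrape configurations.
  kubectl get podmonitorings --all-namespaces
  # Cluster-wide scrape configurations.
  kubectl get clusterpodmonitorings
}
```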
The above adds OpenTelemetry JVM and RPC metrics. With the above as a base, it is possible to add other metrics that can be scraped.
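As an illustration of adding another scrape target, the following sketch writes a PodMonitoring manifest to a local file. Everything specific in it is hypothetical: the file name, the resource name, the `app: example-app` label, and the `metrics` port name must be replaced with values that match the Pods you want to scrape. It is not part of the dev configuration.

```shell
# Write a hypothetical PodMonitoring manifest; adjust the selector and port.
cat > extra-pod-monitoring.yaml <<'EOF'
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: extra-metrics-example   # hypothetical name
spec:
  selector:
    matchLabels:
      app: example-app          # hypothetical Pod label
  endpoints:
  - port: metrics               # hypothetical named container port
    interval: 30s
EOF

# Apply it once reviewed:
#   kubectl apply -f extra-pod-monitoring.yaml
```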
See kubelet
- rpc_client_duration_bucket
- rpc_client_duration_count
- rpc_client_duration_sum
- rpc_server_duration_bucket
- rpc_server_duration_count
- rpc_server_duration_sum
- process_runtime_jvm_buffer_count
- process_runtime_jvm_buffer_limit
- process_runtime_jvm_buffer_usage
- process_runtime_jvm_classes_current_loaded
- process_runtime_jvm_classes_loaded
- process_runtime_jvm_classes_unloaded
- process_runtime_jvm_cpu_utilization
- process_runtime_jvm_memory_committed
- process_runtime_jvm_memory_init
- process_runtime_jvm_memory_limit
- process_runtime_jvm_memory_usage
- process_runtime_jvm_system_cpu_load_1m
- process_runtime_jvm_system_cpu_utilization
- process_runtime_jvm_threads_count
- active_non_daemon_thread_count
- jni_wall_clock_duration_millis
- stage_wall_clock_duration_millis
- stage_cpu_time_duration_millis
- initialization_phase_crypto_cpu_time_duration_millis
- setup_phase_crypto_cpu_time_duration_millis
- execution_phase_one_crypto_cpu_time_duration_millis
- execution_phase_two_crypto_cpu_time_duration_millis
- execution_phase_three_crypto_cpu_time_duration_millis