Getting visibility into CMMS metrics using Google Cloud Monitoring.
We can use Google Managed Prometheus (GMP) on GKE clusters to get metrics into Google Cloud Monitoring. Using OpenTelemetry we can also collect more detailed metrics from CMMS component pods.
The configuration for the dev environment can be used as the basis for deploying CMMS components using Google Kubernetes Engine (GKE) on another Google Cloud project.
Many operations can be done either via the gcloud CLI or the Google Cloud web console. This guide picks whichever is most convenient for each operation; feel free to use whichever you prefer.
The metrics configuration deploys the following K8s objects:

- OpenTelemetryCollector: `default`
- OpenTelemetry Instrumentation: `open-telemetry-java-agent`
- GMP ClusterPodMonitoring: `opentelemetry-collector-pod-monitor`
- GMP PodMonitoring: `collector-pod-monitor`
- NetworkPolicy: `opentelemetry-collector-network-policy`
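Once everything is deployed (see the steps below), a quick way to confirm that these objects exist is to list them. This is a sketch; the resource kind names assume the OpenTelemetry Operator and GMP CRDs are installed as described later in this guide.

```shell
# Objects managed by the OpenTelemetry Operator CRDs.
kubectl get opentelemetrycollectors,instrumentations

# GMP monitoring objects (ClusterPodMonitoring is cluster-scoped).
kubectl get clusterpodmonitorings
kubectl get podmonitorings

# The NetworkPolicy for the collector.
kubectl get networkpolicy opentelemetry-collector-network-policy
```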
Deploy a Halo component. See the related guides: Create Kingdom Cluster, Create Duchy Cluster, or Create Reporting Cluster.
Enabling Managed Prometheus can be done via the Google Cloud Console under "Features", or using the gcloud CLI. For example, assuming a cluster named "kingdom":
gcloud container clusters update kingdom --enable-managed-prometheus
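To confirm that managed collection is enabled, you can check the cluster's monitoring configuration. This is a sketch; it assumes your default project and cluster location are already configured for gcloud.

```shell
# Prints "True" once managed collection is enabled on the cluster.
gcloud container clusters describe kingdom \
  --format="value(monitoringConfig.managedPrometheusConfig.enabled)"
```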
Make sure that the least-privilege service account you created for the cluster has permissions to access the Cloud Monitoring API. See Cluster Configuration.
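For example, granting metric-write access to the cluster's least-privilege service account might look like the following. PROJECT_ID and GSA_EMAIL are placeholders for your project and service account; adjust to match your setup.

```shell
# Allow the cluster's service account to write metrics to Cloud Monitoring.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:GSA_EMAIL" \
  --role="roles/monitoring.metricWriter"
```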
Deploying to the cluster is generally done by applying a K8s object configuration file. You can use the dev configurations as a base to get started. The dev configurations are YAML files that are generated from files written in CUE using Bazel rules. You can customize the generated object configuration as needed.
The default dev configuration for OpenTelemetry collection is in `open_telemetry_gke.cue`, which depends on `open_telemetry.cue`. The default build target is `//src/main/k8s/dev:open_telemetry_gke`.
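For example, the YAML can be generated with Bazel. This is a sketch; the exact output location under bazel-bin may differ.

```shell
# Generate the OpenTelemetry K8s object configuration from the CUE sources.
bazel build //src/main/k8s/dev:open_telemetry_gke

# The generated YAML is written under bazel-bin.
ls bazel-bin/src/main/k8s/dev/
```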
The dev configuration for Prometheus scraping is in `prometheus_gke.cue`. The build target is `//src/main/k8s/dev:prometheus_gke`.
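The Prometheus configuration can be generated the same way (again, the output location under bazel-bin may differ):

```shell
bazel build //src/main/k8s/dev:prometheus_gke
ls bazel-bin/src/main/k8s/dev/
```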
You must use a cert-manager, OpenTelemetry Operator, and collector image that are compatible with each other. See the Compatibility matrix and the collector image specified in `open_telemetry.cue`.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.2/cert-manager.yaml
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.77.0/opentelemetry-operator.yaml
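Before applying the CMMS configuration, it is worth checking that both are running. The namespaces shown are the defaults used by the upstream manifests.

```shell
# cert-manager pods should be Running; the operator needs it for its webhook certificate.
kubectl -n cert-manager get pods

# The OpenTelemetry Operator runs in its own namespace.
kubectl -n opentelemetry-operator-system get pods
```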
You can just use `kubectl apply`, specifying the configuration files you created in the previous step.
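For example, assuming the YAML files generated earlier (the paths are illustrative; adjust to wherever your generated or customized configuration lives):

```shell
kubectl apply -f bazel-bin/src/main/k8s/dev/open_telemetry_gke.yaml
kubectl apply -f bazel-bin/src/main/k8s/dev/prometheus_gke.yaml
```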
You will need to restart all the Deployments to pick up the Java agent instrumentation.
for deployment in $(kubectl get deployments -o name); do kubectl rollout restart $deployment; done
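To spot-check that the Java agent was injected after the restart, you can inspect one of the restarted pods for the init container added by the OpenTelemetry Operator. This is a sketch; POD_NAME is a placeholder, and the init container name may vary slightly by operator version.

```shell
# The operator injects an init container (named something like
# "opentelemetry-auto-instrumentation") into pods selected by the Instrumentation resource.
kubectl describe pod POD_NAME | grep -A 2 opentelemetry-auto-instrumentation
```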
Visit the Managed Prometheus page in the Cloud Console. Query `up` and `scrape_samples_scraped`. The former shows which targets have been discovered and whether they are up; the latter is a good way to confirm that scraping is actually occurring. Immediately after deployment the latter may show all zeros, but after a couple of minutes you should see results for every target that is up.
The configuration above adds OpenTelemetry JVM and RPC metrics. Using it as a base, you can add other metrics to be scraped.
See kubelet
- rpc_client_duration_bucket
- rpc_client_duration_count
- rpc_client_duration_sum
- rpc_server_duration_bucket
- rpc_server_duration_count
- rpc_server_duration_sum
- process_runtime_jvm_buffer_count
- process_runtime_jvm_buffer_limit
- process_runtime_jvm_buffer_usage
- process_runtime_jvm_classes_current_loaded
- process_runtime_jvm_classes_loaded
- process_runtime_jvm_classes_unloaded
- process_runtime_jvm_cpu_utilization
- process_runtime_jvm_memory_committed
- process_runtime_jvm_memory_init
- process_runtime_jvm_memory_limit
- process_runtime_jvm_memory_usage
- process_runtime_jvm_system_cpu_load_1m
- process_runtime_jvm_system_cpu_utilization
- process_runtime_jvm_threads_count
- active_non_daemon_thread_count
- jni_wall_clock_duration_millis
- stage_wall_clock_duration_millis
- stage_cpu_time_duration_millis
- initialization_phase_crypto_cpu_time_duration_millis
- setup_phase_crypto_cpu_time_duration_millis
- execution_phase_one_crypto_cpu_time_duration_millis
- execution_phase_two_crypto_cpu_time_duration_millis
- execution_phase_three_crypto_cpu_time_duration_millis