metrics: Add additional operational excellence metrics #156

Merged: juliev0 merged 8 commits into numaproj:main from additional-metrics on Aug 6, 2024

chandankumar4
Contributor

@chandankumar4 chandankumar4 commented Jul 30, 2024

Fixes #97

Modifications

Add additional operational excellence metrics

metrics data

# HELP isb_services_synced_total The total number of ISB service synced
# TYPE isb_services_synced_total counter
isb_services_synced_total{intuit_alert="true"} 1
# HELP leader_election_master_status Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
leader_election_master_status{name="numaplane-controller-lock"} 1
# HELP numaflow_controller_kube_request_total The total number of kubernetes request for numaflow controller
# TYPE numaflow_controller_kube_request_total counter
numaflow_controller_kube_request_total{intuit_alert="true"} 103
# HELP numaflow_controller_kubectl_execution_total The total number of kubectl execution for numaflow controller
# TYPE numaflow_controller_kubectl_execution_total counter
numaflow_controller_kubectl_execution_total{intuit_alert="true"} 16
# HELP numaflow_controller_running Number of Numaflow controller running
# TYPE numaflow_controller_running gauge
numaflow_controller_running{intuit_alert="true",version="0.0.2"} 1
# HELP numaflow_controller_synced_total The total number of Numaflow controller synced
# TYPE numaflow_controller_synced_total counter
numaflow_controller_synced_total{intuit_alert="true"} 1
# HELP numaflow_isb_services_running Number of Numaflow ISB Service running
# TYPE numaflow_isb_services_running gauge
numaflow_isb_services_running{intuit_alert="true"} 1
# HELP numaflow_kube_resource_cache Number of kubernetes resource object in cache
# TYPE numaflow_kube_resource_cache gauge
numaflow_kube_resource_cache{K8SVersion="1.26",intuit_alert="true"} 94
# HELP numaflow_kube_resource_monitored Number of monitored kubernetes resource object in cache
# TYPE numaflow_kube_resource_monitored gauge
numaflow_kube_resource_monitored{intuit_alert="true"} 0
# HELP numaflow_pipelines_running Number of Numaflow pipelines running
# TYPE numaflow_pipelines_running gauge
numaflow_pipelines_running{intuit_alert="true"} 1
# HELP numaplane_reconciliation_duration_seconds Duration of pipeline reconciliation
# TYPE numaplane_reconciliation_duration_seconds histogram
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.005"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.01"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.025"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.05"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.1"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.25"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="0.5"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="1"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="2.5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="10"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller",le="+Inf"} 1
numaplane_reconciliation_duration_seconds_sum{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller"} 1.832084167
numaplane_reconciliation_duration_seconds_count{intuit_alert="true",phase="create",type="numaflow-controller-rollout-controller"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.005"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.01"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.025"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.05"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.1"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.25"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="0.5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="1"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="2.5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="10"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="create",type="pipeline-rollout-controller",le="+Inf"} 1
numaplane_reconciliation_duration_seconds_sum{intuit_alert="true",phase="create",type="pipeline-rollout-controller"} 0.023695542
numaplane_reconciliation_duration_seconds_count{intuit_alert="true",phase="create",type="pipeline-rollout-controller"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.005"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.01"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.025"} 0
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.05"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.1"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.25"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="0.5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="1"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="2.5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="5"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="10"} 1
numaplane_reconciliation_duration_seconds_bucket{intuit_alert="true",phase="update",type="pipeline-rollout-controller",le="+Inf"} 1
numaplane_reconciliation_duration_seconds_sum{intuit_alert="true",phase="update",type="pipeline-rollout-controller"} 0.04209125
numaplane_reconciliation_duration_seconds_count{intuit_alert="true",phase="update",type="pipeline-rollout-controller"} 1
# HELP pipeline_rollout_queue_length The length of pipeline rollout queue
# TYPE pipeline_rollout_queue_length gauge
pipeline_rollout_queue_length{intuit_alert="true"} 0
# HELP pipeline_synced_total The total number of pipeline synced
# TYPE pipeline_synced_total counter
pipeline_synced_total{intuit_alert="true"} 4
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
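
A note on reading the `numaplane_reconciliation_duration_seconds` output above: Prometheus histogram buckets are cumulative, so a single observation is counted in every bucket whose `le` bound is at least as large as the observed value. The sketch below is a simplified, standard-library-only stand-in (the `histogram` type and `newHistogram` are hypothetical names, not the client library's API). It reproduces the `numaflow-controller-rollout-controller` series, where one observation of about 1.83 s leaves the buckets up to `le="1"` at 0 and the buckets from `le="2.5"` upward at 1.

```go
package main

import "fmt"

// defaultBuckets mirrors the `le` upper bounds visible in the output above.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// histogram is a hypothetical, simplified stand-in for a Prometheus histogram.
type histogram struct {
	bounds []float64
	counts []uint64 // counts[i] holds observations <= bounds[i]; the last entry is +Inf
	sum    float64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Observe increments every bucket whose upper bound is >= v (cumulative buckets).
func (h *histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket counts every observation
	h.sum += v
}

func main() {
	h := newHistogram(defaultBuckets)
	h.Observe(1.832084167) // the create duration reported in the output above
	for i, b := range h.bounds {
		fmt.Printf("le=%q %d\n", fmt.Sprint(b), h.counts[i])
	}
	fmt.Printf("le=\"+Inf\" %d\n", h.counts[len(h.bounds)])
	fmt.Printf("sum %v\n", h.sum)
}
```

Because the buckets are cumulative, the `+Inf` bucket always equals the `_count` series for the same label set.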

Verification

@chandankumar4 chandankumar4 force-pushed the additional-metrics branch 3 times, most recently from 9dd9b29 to 2d09a1f July 31, 2024 12:39
Signed-off-by: chandankumar4 <[email protected]>
@xdevxy
Contributor

xdevxy commented Aug 2, 2024

Fixes #97

Modifications

Add additional operational excellence metrics

metrics data

# HELP numaflow_controller_running Number of Numaflow controller running
# TYPE numaflow_controller_running gauge
numaflow_controller_running{name="numaflow-controller",namespace="example-namespace",version="1.2.1"} 1
# HELP numaflow_isb_service_running Number of Numaflow ISB Service running
# TYPE numaflow_isb_service_running gauge
numaflow_isb_service_running 1
# HELP numaflow_kube_request_total The total number of kubernetes request for numaflow controller
# TYPE numaflow_kube_request_total counter
numaflow_kube_request_total 235
# HELP numaflow_kube_resource_cache Number of kubernetes resource object in cache
# TYPE numaflow_kube_resource_cache gauge
numaflow_kube_resource_cache{K8SVersion="1.26"} 95
# HELP numaflow_kube_resource_monitored Number of monitored kubernetes resource object in cache
# TYPE numaflow_kube_resource_monitored gauge
numaflow_kube_resource_monitored 0
# HELP numaflow_kubectl_execution_total The total number of kubectl execution for numaflow controller
# TYPE numaflow_kubectl_execution_total counter
numaflow_kubectl_execution_total 32
# HELP numaflow_pipelines_running Number of Numaflow pipelines running
# TYPE numaflow_pipelines_running gauge
numaflow_pipelines_running 1
# HELP numaplane_reconciliation_duration_seconds Duration of pipeline reconciliation
# TYPE numaplane_reconciliation_duration_seconds histogram
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.005"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.01"} 11
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.025"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.05"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.1"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.25"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="0.5"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="1"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="2.5"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="5"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="10"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="isbsvc-rollout-controller",le="+Inf"} 14
numaplane_reconciliation_duration_seconds_sum{phase="create",type="isbsvc-rollout-controller"} 0.22133087599999995
numaplane_reconciliation_duration_seconds_count{phase="create",type="isbsvc-rollout-controller"} 14
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.005"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.01"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.025"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.05"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.1"} 0
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.25"} 1
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="0.5"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="1"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="2.5"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="5"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="10"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="numaflow-controller-rollout-controller",le="+Inf"} 2
numaplane_reconciliation_duration_seconds_sum{phase="create",type="numaflow-controller-rollout-controller"} 0.652376542
numaplane_reconciliation_duration_seconds_count{phase="create",type="numaflow-controller-rollout-controller"} 2
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.005"} 1
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.01"} 8
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.025"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.05"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.1"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.25"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="0.5"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="1"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="2.5"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="5"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="10"} 13
numaplane_reconciliation_duration_seconds_bucket{phase="create",type="pipeline-rollout-controller",le="+Inf"} 13
numaplane_reconciliation_duration_seconds_sum{phase="create",type="pipeline-rollout-controller"} 0.11895362499999998
numaplane_reconciliation_duration_seconds_count{phase="create",type="pipeline-rollout-controller"} 13
# HELP pipeline_rollout_queue_length The length of pipeline rollout queue
# TYPE pipeline_rollout_queue_length gauge
pipeline_rollout_queue_length 1
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3.6
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.05197568e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.7224031732e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 5.611327488e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.96.0.1:443",method="GET"} 1147
rest_client_requests_total{code="200",host="10.96.0.1:443",method="PATCH"} 12
rest_client_requests_total{code="200",host="10.96.0.1:443",method="PUT"} 869
rest_client_requests_total{code="201",host="10.96.0.1:443",method="POST"} 1

Verification

Why doesn't this include the total number of ISB services synced, etc.?

internal/util/metrics/metrics.go (outdated, resolved)
internal/util/metrics/metrics.go (outdated, resolved)
@xdevxy xdevxy marked this pull request as draft August 2, 2024 18:36
@@ -84,19 +85,24 @@ func NewNumaflowControllerRolloutReconciler(
 	kubectl kubeUtil.Kubectl,
 	customMetrics *metrics.CustomMetrics,
 ) (*NumaflowControllerRolloutReconciler, error) {
-	stateCache := sync.NewLiveStateCache(rawConfig)
+	stateCache := sync.NewLiveStateCache(rawConfig, customMetrics)
+	newRawConfig := metrics.AddMetricsTransportWrapper(customMetrics, rawConfig)
Collaborator

why do we call this function here as opposed to from main.go?

Collaborator

now I see in the issue it says we're counting these only for NumaflowControllerRollout, but I'm not sure why

Contributor

The Numaflow Controller is deployed through GitOps sync

Collaborator

Not that we need to hold up this PR for this, but why do we only want to count "Number of kubernetes requests executed during reconciliation" for Numaflow Controller and not for ISBService and Pipeline?

Contributor

yeah I think we would want all of them. @chandankumar4 can you create a follow-up issue to track this?

Contributor Author

I have covered it in this PR

@juliev0
Collaborator

juliev0 commented Aug 2, 2024

for any metrics that are particular to Numaflow Controller can we have "numaflow_controller" or "numaflow_ctlr" in the name?

@juliev0
Collaborator

juliev0 commented Aug 2, 2024

Why is numaflow_controller_running parameterized by namespace but numaflow_pipelines_running and numaflow_isb_service_running aren't? (this question is based on the PR description console output)

@chandankumar4
Contributor Author

chandankumar4 commented Aug 5, 2024

> Why doesn't this include the total number of ISB services synced, etc.?

Fixed. I had missed registering that variable. @xdevxy

@chandankumar4
Contributor Author

> Why is numaflow_controller_running parameterized by namespace but numaflow_pipelines_running and numaflow_isb_service_running aren't? (this question is based on the PR description console output)

numaflow_isb_service_running and numaflow_pipelines_running are also parameterized by namespace, but it is not added as a label. For numaflow_controller_running I have also removed the namespace label, but I forgot to update the PR description. I'll update it. Thanks

Signed-off-by: chandankumar4 <[email protected]>
@xdevxy
Contributor

xdevxy commented Aug 5, 2024

> Why is numaflow_controller_running parameterized by namespace but numaflow_pipelines_running and numaflow_isb_service_running aren't? (this question is based on the PR description console output)

I think we initially thought numaflow_controller_running should be parameterized by namespace because there is only one controller per namespace, but this may no longer hold true with no-downtime upgrade. @chandankumar4 Do you mind opening an issue to track this? You can use one issue to track all the remaining tasks in this PR. Thanks!

@@ -23,6 +23,7 @@ require (
)

require (
github.com/argoproj/pkg v0.13.7-0.20230626144333-d56162821bd1
Contributor

why do we need this dependency?

Contributor Author

@chandankumar4 chandankumar4 Aug 6, 2024

I'm using it here. If we don't want to use it, then we need to copy that function here along with its dependencies.

Contributor

@xdevxy xdevxy left a comment

We can do the open comments in follow ups.

return ctrl.Result{}, statusUpdateErr
}

r.customMetrics.ISBServicesSyncFailed.WithLabelValues().Inc()
Collaborator

nit: this could be done just once above if statusUpdateErr != nil

Contributor Author

@chandankumar4 chandankumar4 Aug 6, 2024

If we have it only once, how will it cover both the statusUpdateErr != nil and the err != nil cases? Shouldn't it be called twice?

@@ -317,6 +327,7 @@ func (r *ISBServiceRolloutReconciler) processExistingISBService(ctx context.Cont
}

isbServiceRollout.Status.MarkDeployed(isbServiceRollout.Generation)
r.customMetrics.ReconciliationDuration.WithLabelValues(ControllerISBSVCRollout, "update").Observe(time.Since(syncStartTime).Seconds())
Collaborator

unfortunately, we don't know that this is an "update" vs a "no-op", do we?

Contributor Author

similarly here, if isbServiceNeedsUpdating {...}

Collaborator

nit: sorry, but what do you think about enclosing it in the else block that's directly above it? I think then it's very clear that this statement goes with the dataLossPrevention=false case only, and there will be less chance of messing it up later. Otherwise, I had to look closely at the dataLossPrevention=true code above to confirm that it was in fact returning and wouldn't hit this.

@chandankumar4
Contributor Author

chandankumar4 commented Aug 6, 2024

> > Why is numaflow_controller_running parameterized by namespace but numaflow_pipelines_running and numaflow_isb_service_running aren't? (this question is based on the PR description console output)
>
> I think we initially thought numaflow_controller_running should be parameterized by namespace because there is only one controller per namespace, but this may no longer hold true with no-downtime upgrade. @chandankumar4 Do you mind opening an issue to track this? You can use one issue to track all the remaining tasks in this PR. Thanks!

btw I have already removed the namespace label from the metrics, but do we also want to revert this change and allow multiple numaplane-controller instances in one namespace?

@juliev0
Collaborator

juliev0 commented Aug 6, 2024

For a future PR, I wonder if we should consider avoiding recording metrics in the inner code and instead, as much as possible, recording them in one place, like in reconcile() or Reconcile(). I'm worried about future code in which we accidentally add some new code and miss adding the metric there. For example, in the case of the syncTime metrics, where we need to know whether we created, updated, or did nothing, what if we returned a value from the inner functions to indicate the operation that was performed? Would it make the code too ugly? It does seem like less opportunity for error at least.
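
A minimal sketch of that idea, under stated assumptions: the `operation` type, the `recorder` stand-in for the duration histogram, and the `processISBService` and `reconcile` functions below are all hypothetical and only illustrate the shape. The inner function reports which operation it performed, and the outer function is the only place that records the metric, skipping no-ops.

```go
package main

import (
	"fmt"
	"time"
)

// operation describes what an inner function actually did.
type operation string

const (
	opNone   operation = "none"
	opCreate operation = "create"
	opUpdate operation = "update"
)

// recorder is a hypothetical stand-in for the ReconciliationDuration histogram;
// it only counts how many observations were made per operation.
type recorder struct{ observed map[operation]int }

func (r *recorder) Observe(op operation, d time.Duration) { r.observed[op]++ }

// processISBService stands in for an inner function. It records no metrics
// itself and only reports the operation it performed.
func processISBService(exists, needsUpdating bool) (operation, error) {
	switch {
	case !exists:
		return opCreate, nil
	case needsUpdating:
		return opUpdate, nil
	default:
		return opNone, nil
	}
}

// reconcile is the single place where the duration metric is recorded.
func reconcile(r *recorder, exists, needsUpdating bool) error {
	start := time.Now()
	op, err := processISBService(exists, needsUpdating)
	if err != nil {
		return err
	}
	if op != opNone { // a no-op is not reported as an update
		r.Observe(op, time.Since(start))
	}
	return nil
}

func main() {
	r := &recorder{observed: map[operation]int{}}
	_ = reconcile(r, false, false) // create
	_ = reconcile(r, true, true)   // update
	_ = reconcile(r, true, false)  // no-op, not observed
	fmt.Println(r.observed)
}
```

The trade-off is an extra return value on the inner functions, in exchange for one metric call site per reconciler.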

Signed-off-by: chandankumar4 <[email protected]>
@juliev0 juliev0 marked this pull request as ready for review August 6, 2024 16:06
@juliev0 juliev0 merged commit f807c64 into numaproj:main Aug 6, 2024
5 checks passed
@chandankumar4 chandankumar4 deleted the additional-metrics branch August 6, 2024 17:20
@xdevxy
Contributor

xdevxy commented Aug 6, 2024

> > > Why is numaflow_controller_running parameterized by namespace but numaflow_pipelines_running and numaflow_isb_service_running aren't? (this question is based on the PR description console output)
> >
> > I think we initially thought numaflow_controller_running should be parameterized by namespace because there is only one controller per namespace, but this may no longer hold true with no-downtime upgrade. @chandankumar4 Do you mind opening an issue to track this? You can use one issue to track all the remaining tasks in this PR. Thanks!
>
> btw I have already removed the namespace label from the metrics, but do we also want to revert this change and allow multiple numaplane-controller instances in one namespace?

Right, we may need to revert that change in the future, but if you remove the namespace label in this PR then there is no need to open an issue for it. The revert will be tracked by the no-downtime upgrade feature. Thanks!

Development

Successfully merging this pull request may close these issues.

Add additional operational excellence metrics in NumaRollout
3 participants