feat: Definition of monitoring indicators #104

kaori-seasons · 2023-05-18T01:29:12Z

Background&Motivation

As summarized in the previous meetings, we have adopted a backpressure mechanism based on the PID algorithm used in Spark, and have verified through testing that events are delivered through the event bus. We now need to define monitoring indicators and collect them with OpenTelemetry.

What needs to be done

In the current startup process of rocketmq-eventbridge, when we set up a synchronization task and fill in the corresponding source, transform, and sink, the task is always isolated by its runnerName identifier. This means the indicator dimensions should use runnerName as a label.

After the program starts, the starting thread runs the following components in sequence: EventSubscriber, EventBusListener, EventRuleTransfer, EventTargetPusher. Assuming the user posts an event, I will explain the indicators we care about in the three threads (EventBusListener, EventRuleTransfer, EventTargetPusher) one by one:

The EventBusListener thread sits furthest upstream and pulls the messages. Under the PID algorithm, it must evaluate the oscillation formula against the downstream feedback indicators in order to apply backpressure under heavy traffic, so the following key indicators need to be set:

1. Consumption rate (messages/s), calculated from the number of elements in the blocking queue and the latency of each call to the poll method (sampling time measured from the start of EventSubscriber)
1. Production rate (messages/s) per runnerName, calculated from the number of elements pushed to the downstream blocking queue and the production time (sampling time measured from the start of EventSubscriber)
1. Traffic fluctuation when the synchronization task events of each runnerName fail and are retried (if the delay is too high, the number of retries needs to be adjusted)
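The rate indicators above can be sketched as a small per-runnerName meter. This is only an illustration under stated assumptions: `ThroughputMeter`, `record`, and `rate` are hypothetical names, not part of rocketmq-eventbridge, and the meter's creation time stands in for the start of EventSubscriber.

```python
import time
from collections import defaultdict


class ThroughputMeter:
    """Average messages/s per runnerName since a fixed start time (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # Assumption: the meter is created when EventSubscriber starts.
        self._start = clock()
        self._counts = defaultdict(int)

    def record(self, runner_name, n=1):
        """Count n messages consumed (or produced) for the given runner."""
        self._counts[runner_name] += n

    def rate(self, runner_name):
        """Return messages/s for the runner, averaged over the time since start."""
        elapsed = self._clock() - self._start
        if elapsed <= 0:
            return 0.0
        return self._counts[runner_name] / elapsed
```

One meter instance could be kept for the consumption side and another for the production side, so that the two rates can be compared per runnerName.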

The EventRuleTransfer thread subscribes to the relevant event rules and performs the transform operation for each of the tasks above. Users need to know whether each operation succeeded or failed, so the following indicators are defined:

1. Latency of each transform operation when called asynchronously
1. Consumption TPS from the upstream blocking queue, that is, the number of messages consumed per second
1. Percentile latency of retries, per runnerName
1. TPS delivered to the downstream blocking queue, that is, the number of messages produced per second

Since rule subscription is a CPU-intensive operation, we should also watch the number of asynchronous calls in progress, so that the asynchronous-to-synchronous transition is not blocked for a long time by a large number of asynchronous threads.
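As a rough sketch of the transform indicators, the following keeps per-runnerName latency samples and an in-flight call count. The class and method names are hypothetical, and the nearest-rank percentile is just one possible definition; the proposal does not say which one is intended.

```python
import math
import threading
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile of samples, with p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


class TransformStats:
    """Per-runnerName transform latency and in-flight async calls (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latencies_ms = defaultdict(list)
        self._in_flight = defaultdict(int)

    def started(self, runner_name):
        """Call when an asynchronous transform is submitted."""
        with self._lock:
            self._in_flight[runner_name] += 1

    def finished(self, runner_name, latency_ms):
        """Call when the transform completes, with its measured latency."""
        with self._lock:
            self._in_flight[runner_name] -= 1
            self._latencies_ms[runner_name].append(latency_ms)

    def in_flight(self, runner_name):
        with self._lock:
            return self._in_flight[runner_name]

    def latency_percentile(self, runner_name, p):
        with self._lock:
            samples = list(self._latencies_ms[runner_name])
        return percentile(samples, p)
```

Keeping raw samples is only workable for a sketch; a real collector would more likely use fixed histogram buckets to bound memory.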

EventTargetPusher is the last processing stage and delivers events to the target end. Its role has two parts:

1. Deliver events to the downstream target
1. When the downstream consumption capacity is lower than the upstream production capacity, let the upstream perceive the rate at which the downstream is currently delivering events to the target end, so that the feedback rate can be controlled
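To make the feedback idea concrete, here is a minimal PID-style rate estimator. It is loosely modeled on the general shape of the PID rate estimator used for backpressure in Spark Streaming, but every name, gain, and default below is an assumption for illustration and not the eventbridge implementation.

```python
class PidRateEstimator:
    """Suggests a new upstream rate (events/s) from downstream feedback (sketch)."""

    def __init__(self, kp=1.0, ki=0.2, kd=0.0, min_rate=1.0):
        # Gains and the rate floor are illustrative assumptions.
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_rate = min_rate
        self.latest_rate = None
        self.latest_error = 0.0
        self.latest_time = None

    def update(self, now_s, delivered, processing_s, queue_delay_s, interval_s):
        """Feed one sample of downstream behaviour and return the suggested rate."""
        if delivered <= 0 or processing_s <= 0 or interval_s <= 0:
            return self.latest_rate
        processing_rate = delivered / processing_s
        if self.latest_rate is None:
            # First sample: adopt the observed downstream rate as the baseline.
            self.latest_rate = processing_rate
            self.latest_time = now_s
            return self.latest_rate
        dt = now_s - self.latest_time
        if dt <= 0:
            return self.latest_rate
        # Proportional term: how far the current rate is above what downstream handles.
        error = self.latest_rate - processing_rate
        # Integral-like term: backlog implied by the time events waited in the queue.
        historical_error = queue_delay_s * processing_rate / interval_s
        # Derivative term: how quickly the error is changing.
        d_error = (error - self.latest_error) / dt
        new_rate = (self.latest_rate - self.kp * error
                    - self.ki * historical_error - self.kd * d_error)
        self.latest_rate = max(new_rate, self.min_rate)
        self.latest_error = error
        self.latest_time = now_s
        return self.latest_rate
```

The inputs correspond to indicators discussed in this proposal: the number of delivered events, the delivery latency, and the queueing delay between stages.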

Define the following indicators:

1. Delivery latency to the downstream target (sampling time measured from EventRuleTransfer), the cumulative total number of delivered events, and the instantaneous production rate
1. The number of threads currently running, per runnerName
1. Latency of consuming events from the upstream EventRuleTransfer (sampling time measured from EventRuleTransfer), the cumulative total number of delivered events, and the instantaneous consumption rate
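The delivery indicators above (latency, cumulative total, instantaneous rate) could be tracked along these lines. `DeliveryStats` and its methods are hypothetical, and timestamps are passed in explicitly so the sketch does not assume how EventRuleTransfer stamps events.

```python
class DeliveryStats:
    """Tracks delivery latency, cumulative count, and instantaneous rate (sketch)."""

    def __init__(self):
        self.total = 0                 # cumulative number of delivered events
        self.last_delay_s = 0.0        # latency of the most recent delivery
        self._last_total = 0
        self._last_sample_time = None

    def on_delivered(self, transfer_time_s, now_s):
        """Record one delivery; transfer_time_s is when EventRuleTransfer handled it."""
        self.total += 1
        self.last_delay_s = now_s - transfer_time_s

    def sample_speed(self, now_s):
        """Return events/s delivered since the previous call to sample_speed."""
        if self._last_sample_time is None:
            # The first call only establishes the baseline.
            self._last_total = self.total
            self._last_sample_time = now_s
            return 0.0
        dt = now_s - self._last_sample_time
        if dt <= 0:
            return 0.0
        speed = (self.total - self._last_total) / dt
        self._last_total = self.total
        self._last_sample_time = now_s
        return speed
```

Unlike an average since startup, this rate only covers the window between two samples, which is closer to the "instantaneous" speed the indicator asks for.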

Why do this

We need to give eventbridge observability so that we can monitor statistics of the event delivery process and tune parameters. It also prepares for future integration with Kubernetes CRD resources.

What benefits does it bring

It makes it easier to troubleshoot problems, optimize parameters, and adjust retry strategies.

How to achieve it (Alpha)

Metrics labels:

| Label | Description |
| --- | --- |
| account_id | Account ID |
| runnerName | Resource name |
| source | Source type |
| target | Target type |
| status | 0: failed, 1: succeeded |

Metrics details:

| Metrics Type | Metrics Name | Unit | Desc | Label |
| --- | --- | --- | --- | --- |
| counter | eventbridge_eventbus_in_events_total | count | Number of events put to the event bus | account_id, runnerName, status |
| gauge | eventbridge_eventrule_latency_seconds | second | Latency of event rule subscription | account_id, runnerName |
| counter | eventbridge_eventrule_filter_events_total | count | Number of events filtered by event rules | account_id, runnerName, status |
| histogram | eventbridge_eventrule_trigger_latency | millisecond | Trigger target latency, with buckets le_100_ms, le_300_ms, le_500_ms, le_1_s, le_3_s, le_5_s, le_overflow | account_id, runnerName, status |
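As a small illustration of the histogram buckets listed above, the sketch below maps a latency in milliseconds to one of the proposed bucket labels and counts observations per label set. It assumes each observation lands in exactly one bucket; the proposal does not say whether the real metric would use cumulative buckets, as Prometheus-style histograms do. `bucket_for` and `TriggerLatencyHistogram` are hypothetical names.

```python
import bisect
from collections import Counter

# Upper bounds in milliseconds, matching the bucket names in the proposal.
BUCKET_BOUNDS_MS = [100, 300, 500, 1000, 3000, 5000]
BUCKET_LABELS = ["le_100_ms", "le_300_ms", "le_500_ms",
                 "le_1_s", "le_3_s", "le_5_s", "le_overflow"]


def bucket_for(latency_ms):
    """Return the label of the first bucket whose bound is >= latency_ms."""
    return BUCKET_LABELS[bisect.bisect_left(BUCKET_BOUNDS_MS, latency_ms)]


class TriggerLatencyHistogram:
    """Counts observations per (account_id, runnerName, status, bucket)."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, latency_ms, account_id, runner_name, status):
        key = (account_id, runner_name, status, bucket_for(latency_ms))
        self.counts[key] += 1
```

A latency of exactly 100 ms falls into `le_100_ms`, and anything above 5000 ms falls into `le_overflow`.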

To be continued...

kaori-seasons · 2023-05-18T01:47:30Z

Due to work commitments, the proposal is not yet complete. I hope to get suggestions on it this Friday.

@2011shenlin @Jashinck

2011shenlin · 2023-05-18T06:48:33Z

You can refer to:

Metrics labels:

| Label | Description |
| --- | --- |
| account_id | Account ID |
| name | Resource name |
| source | Source type |
| target | Target type |
| status | 0: failed, 1: succeeded |

Metrics details:

| Metrics Type | Metrics Name | Unit | Desc | Label |
| --- | --- | --- | --- | --- |
| counter | eventbridge_eventbus_in_events_total | count | Number of events put to the event bus | account_id, name, status |
| gauge | eventbridge_eventrule_latency_seconds | second | Latency of event rule subscription | account_id, name |
| counter | eventbridge_eventrule_filter_events_total | count | Number of events filtered by event rules | account_id, name, status |
| histogram | eventbridge_eventrule_trigger_latency | millisecond | Trigger target latency, with buckets le_100_ms, le_300_ms, le_500_ms, le_1_s, le_3_s, le_5_s, le_overflow | account_id, name, status |

Jashinck · 2023-05-18T09:21:39Z

Metrics Name: eventbridge_eventrule_trigger_latency -> eventbridge_eventtarget_trigger_latency

kaori-seasons · 2023-05-19T02:06:22Z


Thank you very much for your reply! I will do my best to complete the proposal design and the related indicator definition table this week.

kaori-seasons mentioned this issue May 18, 2023

feat: support opentelemry collector and Define indicators #101

Closed

3 tasks

kaori-seasons mentioned this issue Jul 5, 2023

feat: support opentelemry collector and Define indicators #132

Open
