Channel mpsc_import_notification_stream #611

LukeWheeldon · 2023-06-15T14:56:37Z

I get those messages below on a regular basis. It's much less than before 0.9.42, but it's still there once in a while on most/all my nodes.

Labels
alertname = UnboundedChannelPersistentlyLarge
chain = polkadot
entity = mpsc_import_notification_stream
instance = server:9615
job = server
network = polkadot
severity = warning
Annotations
message = Channel mpsc_import_notification_stream on node server:9615 contains more than 200 items for more than 5 minutes. Node might be frozen.

_Originally posted by @LukeWheeldon in #679

The text was updated successfully, but these errors were encountered:

BulatSaif · 2023-09-08T13:06:22Z

I took a closer look at the issue, and it is caused by this alerting rule:

polkadot-sdk/substrate/scripts/ci/monitoring/alerting-rules/alerting-rules.yaml

Lines 148 to 159 in 92d7751

    
             - alert: UnboundedChannelPersistentlyLarge 
        
               expr: '( 
        
                   (substrate_unbounded_channel_len{action = "send"} - 
        
                       ignoring(action) substrate_unbounded_channel_len{action = "received"}) 
        
                   or on(instance) substrate_unbounded_channel_len{action = "send"} 
        
               ) >= 200' 
        
               for: 5m 
        
               labels: 
        
                 severity: warning 
        
               annotations: 
        
                 message: 'Channel {{ $labels.entity }} on node {{ $labels.instance }} contains 
        
                 more than 200 items for more than 5 minutes. Node might be frozen.'

After checking the historical data, I found that the first time mpsc_import_notification_stream started to increase was on 26.03.2023, right after we upgraded from v0.9.39 to v0.9.40. The alert did not cause issues initially because it took 10 days to go from 0 to 200.
I also examined the metric substrate_unbounded_channel_len, and it appears that its behavior has changed. Prior to v0.9.40, substrate_unbounded_channel_len had three actions: send|received|dropped, and the dropped action was counted twice, both as dropped and as received.

v0.9.39:  send - received = processing 
v0.9.40:  send - received = dropped + processing

Therefore, we need to update the alert as follows:

v0.9.40: send - (received + dropped) = processing

These changes should help address the issue.

altonen · 2023-09-08T13:16:36Z

@dmitry-markin you worked on the tracing-unbounded alert system, could you take a look at this?

dmitry-markin · 2023-09-11T07:58:57Z

@BulatSaif Thank you for the investigation! I see that I introduced a breaking change in paritytech/substrate#13504. Namely, I stopped counting dropped messages as received, and that broke alerts. There are three options to get it fixed:

Reintroduce the logic that counted dropped messages as received in substrate.
Update the alerting rules in prometheus to correctly take into account dropped messages when calculating the channel size.
Calculate the channel size in substrate and report the resulting number to prometheus.

I would vote for either option 2 or 3, because considering dropped messages to be received seems counterintuitive to me. Also, the alert is likely triggered only after the node is stopped, because the messages in this channel can be dropped only when the node is terminated. Taking this into account, I don't think calculating the channel size in prometheus alerts using sent, received and dropped is a good idea as it carries the numbers from the previously running node instances.

Option 2 is a little bit simpler, because we don't need to update substrate, but option 3 is more correct IMO.

@altonen What do you think?

altonen · 2023-09-11T08:10:39Z

I would go for option 3

BulatSaif · 2023-09-11T09:09:46Z

Calculate the channel size in substrate and report the resulting number to prometheus.

It is better to avoid any metrics calculation on application level, it may cause unnecessary load.

However, there is a catch when implementing the second option. Many of the substrate_unbounded_channel_len metrics lack dropped labels and only have send and received. I assume that a metric is not reported until the first message is dropped. The same assumption holds for the received label based on the current alert:

Current Alert:
(send - received) or send > 200

If received is null, then (send - received) is also null. Consequently, (null) or send will return send.

To address this, the new alert should be modified as follows:

New Alert:
(send - ((received + dropped) or received) or send > 200)

Here's the final expression:

expr: '(
  (
    substrate_unbounded_channel_len{action = "send"} -
    ignoring(action)(
      (
        substrate_unbounded_channel_len{action = "received"} +
        ignoring(action) substrate_unbounded_channel_len{action = "dropped"}
      ) or on(instance) substrate_unbounded_channel_len{action = "received"}
    )
  ) or on(instance) substrate_unbounded_channel_len{action = "send"}
) >= 200'

dmitry-markin · 2023-09-11T15:55:04Z

In this specific case channel size reporting is cheap. And it's more robust, because it might happen that the dropped messages are not reported on the node shut down (e.g., if the node shuts down fast enough for metrics to be not reported to prometheus).

dmitry-markin · 2023-09-12T11:44:06Z

@BulatSaif Could you reconfigure the alerts to use a new metric substrate_unbounded_channel_size, please, once the changes from PR #1489 are deployed? It contains the queue size (the number of messages sent but not yet received/dropped) labeled just by a channel name.

BulatSaif · 2023-09-14T13:32:47Z

@BulatSaif Could you reconfigure the alerts to use a new metric substrate_unbounded_channel_size, please, once the changes from PR #1489 are deployed? It contains the queue size (the number of messages sent but not yet received/dropped) labeled just by a channel name.

I created PR with alert fix: #1568

#1568) # Description Follow up for #1489. Closes #611 Before we calculated the channel size during alert expression but in #1489 a new metric was introduced that reports channel size. ## Changes: 1. updated alert rule to use new metric.

paritytech#1568) # Description Follow up for paritytech#1489. Closes paritytech#611 Before we calculated the channel size during alert expression but in paritytech#1489 a new metric was introduced that reports channel size. ## Changes: 1. updated alert rule to use new metric.

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

dmitry-markin mentioned this issue Sep 11, 2023

Report tracing_unbounded channel size to prometheus #1489

Merged

BulatSaif mentioned this issue Sep 14, 2023

Update the alerts to use a new metric substrate_unbounded_channel_size #1568

Merged

BulatSaif closed this as completed in #1568 Oct 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Channel mpsc_import_notification_stream #611

Channel mpsc_import_notification_stream #611

LukeWheeldon commented Jun 15, 2023

BulatSaif commented Sep 8, 2023 •

edited

Loading

altonen commented Sep 8, 2023

dmitry-markin commented Sep 11, 2023 •

edited

Loading

altonen commented Sep 11, 2023

BulatSaif commented Sep 11, 2023

dmitry-markin commented Sep 11, 2023

dmitry-markin commented Sep 12, 2023 •

edited

Loading

BulatSaif commented Sep 14, 2023

Channel mpsc_import_notification_stream #611

Channel mpsc_import_notification_stream #611

Comments

LukeWheeldon commented Jun 15, 2023

BulatSaif commented Sep 8, 2023 • edited Loading

altonen commented Sep 8, 2023

dmitry-markin commented Sep 11, 2023 • edited Loading

altonen commented Sep 11, 2023

BulatSaif commented Sep 11, 2023

dmitry-markin commented Sep 11, 2023

dmitry-markin commented Sep 12, 2023 • edited Loading

BulatSaif commented Sep 14, 2023

BulatSaif commented Sep 8, 2023 •

edited

Loading

dmitry-markin commented Sep 11, 2023 •

edited

Loading

dmitry-markin commented Sep 12, 2023 •

edited

Loading