[Bug] HStore JRaft Histograms Metrics Value NaN Error #2627

Closed
1 task done
JackyYangPassion opened this issue Aug 8, 2024 · 3 comments · Fixed by #2631
Labels
bug, raft, store

Comments

@JackyYangPassion
Contributor

Bug Type

None

Before submit

  • I have searched the existing Issues and FAQ and confirmed that there is no identical or duplicate problem.

Environment

  • Server Version: Master (Apache Release Version)
  • Backend: HStore

Expected & Actual behavior

Current problem

Values of Histogram-type metrics are returned as NaN:

jraft_append_logs_bytes_summary_count{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_mean  
# TYPE jraft_append_logs_bytes_mean gauge
jraft_append_logs_bytes_mean{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_min  
# TYPE jraft_append_logs_bytes_min gauge
jraft_append_logs_bytes_min{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_median  
# TYPE jraft_append_logs_bytes_median gauge
jraft_append_logs_bytes_median{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_max  
# TYPE jraft_append_logs_bytes_max gauge
jraft_append_logs_bytes_max{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_summary_sum  
# TYPE jraft_append_logs_bytes_summary_sum gauge
jraft_append_logs_bytes_summary_sum{group="0",handle="data",hg="store",} NaN
# HELP jraft_append_logs_bytes_summary  
# TYPE jraft_append_logs_bytes_summary gauge
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.95",} NaN
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.75",} NaN
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.999",} NaN
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.98",} NaN
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.99",} NaN
jraft_append_logs_bytes_summary{group="0",handle="data",hg="store",quantile="0.5",} NaN

Expected result

The correct JRaft monitoring metrics should be returned by the Spring Actuator endpoint:

curl http://ip:8620/actuator/prometheus |grep jraft_append_logs_bytes
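
To check a scrape programmatically rather than by eye, the output of the command above can be scanned for samples whose value is NaN. The following is a minimal sketch, not part of HugeGraph: the class name NaNSampleFinder is hypothetical, and it assumes each sample line has the form `name{labels} value` with no trailing timestamp, as in the output shown in this report.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds NaN-valued samples in Prometheus text-format output. */
final class NaNSampleFinder {

    /** Returns every sample line whose value field is exactly "NaN". */
    static List<String> findNaNSamples(String scrape) {
        List<String> result = new ArrayList<>();
        for (String raw : scrape.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and "# HELP" / "# TYPE" comments
            }
            // Assumption: no timestamp column, so the value is the last token.
            int idx = line.lastIndexOf(' ');
            if (idx < 0) {
                continue; // not a "name value" pair
            }
            if (line.substring(idx + 1).equals("NaN")) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String scrape = String.join("\n",
            "# TYPE jraft_append_logs_bytes_mean gauge",
            "jraft_append_logs_bytes_mean{group=\"0\",handle=\"data\",hg=\"store\",} NaN",
            // Illustrative healthy sample, not taken from the issue:
            "jraft_append_logs_bytes_max{group=\"0\",handle=\"data\",hg=\"store\",} 110.0");
        findNaNSamples(scrape).forEach(System.out::println);
    }
}
```

Running main prints only the jraft_append_logs_bytes_mean line, since it is the only sample whose value is NaN.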

Metric details

-- <hg_0/127.0.0.1:8510> -- Histograms ------------------------------------------------------------------
append-logs-bytes
             count = 67710
               min = 110
               max = 110
              mean = 110.00
            stddev = 0.00
            median = 110.00
              75% <= 110.00
              95% <= 110.00
              98% <= 110.00
              99% <= 110.00
            99.9% <= 110.00
append-logs-count
             count = 67710
               min = 1
               max = 1
              mean = 1.00
            stddev = 0.00
            median = 1.00
              75% <= 1.00
              95% <= 1.00
              98% <= 1.00
              99% <= 1.00
            99.9% <= 1.00
fsm-apply-tasks-count
             count = 67705
               min = 1
               max = 1
              mean = 1.00
            stddev = 0.00
            median = 1.00
              75% <= 1.00
              95% <= 1.00
              98% <= 1.00
              99% <= 1.00
            99.9% <= 1.00
handle-append-entries-count
             count = 20
               min = 0
               max = 0
              mean = 0.00
            stddev = 0.00
            median = 0.00
              75% <= 0.00
              95% <= 0.00
              98% <= 0.00
              99% <= 0.00
            99.9% <= 0.00
replicate-entries-bytes
             count = 135378
               min = 110
               max = 110
              mean = 110.00
            stddev = 0.00
            median = 110.00
              75% <= 110.00
              95% <= 110.00
              98% <= 110.00
              99% <= 110.00
            99.9% <= 110.00
replicate-entries-count
             count = 135378
               min = 1
               max = 1
              mean = 1.00
            stddev = 0.00
            median = 1.00
              75% <= 1.00
              95% <= 1.00
              98% <= 1.00
              99% <= 1.00
            99.9% <= 1.00
replicator-hg_0/127.0.0.1:8512::64.replicate-inflights-count
             count = 20365
               min = 1
               max = 1
              mean = 1.00
            stddev = 0.00
            median = 1.00
              75% <= 1.00
              95% <= 1.00
              98% <= 1.00
              99% <= 1.00
            99.9% <= 1.00
replicator-hg_0/127.0.0.1:8513::80.replicate-inflights-count
             count = 20365
               min = 1
               max = 1
              mean = 1.00
            stddev = 0.00
            median = 1.00
              75% <= 1.00
              95% <= 1.00
              98% <= 1.00
              99% <= 1.00
            99.9% <= 1.00

Affected metrics

append-logs-bytes
append-logs-count
fsm-apply-tasks-count
handle-append-entries-count
replicate-entries-bytes
replicate-entries-count
replicate-inflights-count
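
Judging only from the scrape output earlier in this report, where the JRaft histogram append-logs-bytes appears as jraft_append_logs_bytes_*, the names above seem to map to Prometheus names by adding a jraft_ prefix and replacing hyphens with underscores. The sketch below encodes that assumption to build grep patterns; the class JRaftMetricNames is hypothetical, and the rule is inferred from a single example rather than confirmed by the source.

```java
import java.util.List;

/** Hypothetical helper; the naming rule is an assumption inferred from one example. */
final class JRaftMetricNames {

    /** Assumed mapping: "append-logs-bytes" -> "jraft_append_logs_bytes". */
    static String toPrometheusName(String jraftName) {
        return "jraft_" + jraftName.replace('-', '_');
    }

    public static void main(String[] args) {
        List<String> affected = List.of(
            "append-logs-bytes",
            "append-logs-count",
            "fsm-apply-tasks-count",
            "handle-append-entries-count",
            "replicate-entries-bytes",
            "replicate-entries-count",
            "replicate-inflights-count");
        // Print one grep-ready prefix per affected metric.
        affected.forEach(name -> System.out.println(toPrometheusName(name)));
    }
}
```

Each printed prefix can replace jraft_append_logs_bytes in the curl and grep command from the expected-result section.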

Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

No response

JackyYangPassion added the bug label on Aug 8, 2024
dosubot bot added the raft and store labels on Aug 8, 2024

dosubot bot commented Aug 8, 2024

To resolve the NaN values reported for JRaft histogram metrics on the HStore backend (Master, Apache Release Version), check that the HistogramWrapper class is refreshing its snapshot correctly. The current implementation refreshes the snapshot only every 30 seconds, which might produce NaN values if the histogram is not being updated frequently enough.

Here is the relevant code snippet that shows how the HistogramWrapper class is implemented:

private static class HistogramWrapper {

    private final com.codahale.metrics.Histogram histogram;
    private Snapshot snapshot;
    private long ts = System.currentTimeMillis();

    HistogramWrapper(com.codahale.metrics.Histogram histogram) {
        this.histogram = histogram;
        this.snapshot = this.histogram.getSnapshot();
    }

    Snapshot getSnapshot() {
        if (System.currentTimeMillis() - this.ts > 30_000) {
            this.snapshot = this.histogram.getSnapshot();
            this.ts = System.currentTimeMillis();
        }
        return this.snapshot;
    }
}

To ensure that the histogram metrics are updated correctly, you might want to reduce the interval for updating the snapshot or ensure that the histogram is being updated frequently enough. You can adjust the interval by modifying the condition in the getSnapshot method:

Snapshot getSnapshot() {
    if (System.currentTimeMillis() - this.ts > 10_000) { // Update every 10 seconds
        this.snapshot = this.histogram.getSnapshot();
        this.ts = System.currentTimeMillis();
    }
    return this.snapshot;
}

Additionally, ensure that the histogram is being updated with new data points frequently enough to avoid NaN values. If the histogram is not receiving new data points, it might result in NaN values when the snapshot is taken [1].
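
The refresh-after-interval behaviour discussed above can be tried in isolation with a library-free sketch. TimedCache is a hypothetical class, not HugeGraph code: a generic Supplier stands in for histogram.getSnapshot(), the clock is injected so the 30-second window can be simulated without sleeping, and get() is synchronized, which the quoted HistogramWrapper is not. It only illustrates the staleness window and does not by itself establish the cause of the NaN values.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch of a value that is re-read only after an interval has elapsed. */
final class TimedCache<T> {

    private final Supplier<T> source;       // stands in for histogram::getSnapshot
    private final LongSupplier clockMillis; // injected clock (assumption, for testability)
    private final long intervalMillis;
    private T cached;
    private long lastRefresh;

    TimedCache(Supplier<T> source, long intervalMillis, LongSupplier clockMillis) {
        this.source = source;
        this.intervalMillis = intervalMillis;
        this.clockMillis = clockMillis;
        this.cached = source.get();          // first read happens at construction
        this.lastRefresh = clockMillis.getAsLong();
    }

    /** Returns the cached value, re-reading the source once the interval is exceeded. */
    synchronized T get() {
        long now = clockMillis.getAsLong();
        if (now - lastRefresh > intervalMillis) {
            cached = source.get();
            lastRefresh = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0L};
        int[] reads = {0};
        TimedCache<Integer> cache =
            new TimedCache<>(() -> ++reads[0], 30_000, () -> fakeNow[0]);
        System.out.println(cache.get()); // prints 1: value read at construction
        fakeNow[0] = 30_001;
        System.out.println(cache.get()); // prints 2: interval exceeded, source re-read
    }
}
```

With this shape, lowering the interval from 30_000 to 10_000 changes only how stale the returned value can be; the source is still read at most once per interval.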


@JackyYangPassion
Contributor Author

@dosu If the periodic refresh is dropped here and histogram.getSnapshot() is called directly, would that have a large performance impact?


dosubot bot commented Aug 8, 2024

Yes, calling histogram.getSnapshot() directly, without the 30-second interval, would have a significant performance impact. The getSnapshot() method in the HistogramWrapper class refreshes the snapshot only when more than 30 seconds have passed since the last refresh, because histogram.getSnapshot() can be an expensive operation. Without the interval check, histogram.getSnapshot() would run on every invocation of getSnapshot() [1][2].


Projects
Status: Done