*: add v3 snapshot metrics (fsync, network) #9997

gyuho · 2018-08-12T05:01:41Z

v3 snapshots are large and critical operations, often affecting cluster availability. Currently, we do not have any metrics around v3 snapshots. The only way to debug is to look at server logs after something bad happens.

etcd_snap_db_fsync_duration_seconds_count
etcd_snap_db_save_total_duration_seconds_bucket

to monitor v3 snapshot save operations on the local node.

etcd_network_snapshot_send_success
etcd_network_snapshot_send_failures
etcd_network_snapshot_send_total_duration_seconds
etcd_network_snapshot_receive_success
etcd_network_snapshot_receive_failures
etcd_network_snapshot_receive_total_duration_seconds

to monitor v3 snapshot operations between remote peers.

Distribution would be:

0.1 second or more
...
25.6 seconds or more
51.2 seconds or more
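
The bounds listed above are consistent with ten exponentially spaced buckets that start at 0.1 s and double each step. That parameterization is inferred from the list, not quoted from the diff, so treat the following as a minimal sketch. The helper `exponentialBuckets` is a hypothetical stdlib-only stand-in for the kind of helper a metrics library provides.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds. The first bound is start and
// each following bound is the previous one multiplied by factor.
// Hypothetical helper for illustration; not taken from the PR.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Assumed parameters: start 0.1 s, factor 2, 10 buckets.
	for _, upperBound := range exponentialBuckets(0.1, 2, 10) {
		fmt.Printf("%g\n", upperBound)
	}
}
```

With these assumed parameters the last two bounds are 25.6 and 51.2 seconds, matching the tail of the list.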

This records successful snapshot sends/receives as well, because frequent snapshots can hurt cluster availability as badly as spikes in etcd_network_snapshot_send/receive_failures.

Will update http://etcd.readthedocs.io/en/latest and the operation docs in follow-up PRs.

ref. #9438

/cc @wenjiaswe

codecov-io · 2018-08-12T06:05:24Z

Codecov Report

Merging #9997 into master will increase coverage by 0.4%.
The diff coverage is 75%.

@@            Coverage Diff            @@
##           master    #9997     +/-   ##
=========================================
+ Coverage   69.24%   69.65%   +0.4%     
=========================================
  Files         386      386             
  Lines       36030    36057     +27     
=========================================
+ Hits        24948    25114    +166     
+ Misses       9276     9158    -118     
+ Partials     1806     1785     -21

Impacted Files	Coverage Δ
etcdserver/api/snap/metrics.go	`100% <100%> (ø)`	⬆️
etcdserver/api/rafthttp/metrics.go	`100% <100%> (ø)`	⬆️
etcdserver/api/snap/db.go	`64% <100%> (+3.13%)`	⬆️
etcdserver/api/rafthttp/http.go	`65.58% <50%> (-0.48%)`	⬇️
etcdserver/api/rafthttp/snapshot_sender.go	`81.55% <90%> (+0.94%)`	⬆️
clientv3/namespace/watch.go	`72.72% <0%> (-6.07%)`	⬇️
integration/bridge.go	`70.22% <0%> (-1.53%)`	⬇️
etcdserver/server.go	`74.24% <0%> (+0.14%)`	⬆️
clientv3/watch.go	`91.93% <0%> (+0.42%)`	⬆️
pkg/transport/listener_tls.go	`66.22% <0%> (+0.66%)`	⬆️
... and 20 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e72730a...6f4c509. Read the comment docs.

jpbetz

Looks right. A couple minor suggestions for the names, help strings and bucket sizes of the metrics.

jpbetz · 2018-08-15T18:41:35Z

etcdserver/api/rafthttp/http.go

 	if r.Method != "POST" {
 		w.Header().Set("Allow", "POST")
 		http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed)
+		snapshotReceiveFailures.WithLabelValues("").Inc()


"" feels a bit vague. Should we add a constant we always use here that is more descriptive?

jpbetz · 2018-08-15T18:45:14Z

etcdserver/api/rafthttp/metrics.go

+		Namespace: "etcd",
+		Subsystem: "network",
+		Name:      "snapshot_send",
+		Help:      "Total number of snapshot sends",


Operators might not understand what this metric is for. They're more familiar with "snapshot save" and "snapshot restore" commands. Maybe clarify it a bit more? e.g. "Total number of snapshots sent from a member to a client or another member"?

jpbetz · 2018-08-15T18:48:27Z

etcdserver/api/rafthttp/metrics.go

+	snapshotSend = prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: "etcd",
+		Subsystem: "network",
+		Name:      "snapshot_send",


This is for successful sends only, right? Maybe just clarify that in the help string so that it's clear that the total sent count would be this + snapshot_send_failures? Alternatively we could name this "snapshot_send_success", but that seems a bit too verbose..

gyuho · Aug 15, 2018

Agree. Will add a "success" suffix.

jpbetz · 2018-08-15T18:50:33Z

etcdserver/api/rafthttp/metrics.go

+		Help:      "Total latency distributions of v3 snapshot sends",
+
+		// lowest bucket start of upper bound 0.001 sec (1 ms) with factor 2
+		// highest bucket start of 0.001 sec * 2^13 == 8.192 sec


Is 8 sec enough for the highest bucket? Would we spare a couple more metric entries to get this up to maybe 30 sec or so? Might be helpful if/when things go wrong.

gyuho · Aug 15, 2018

Good point. I will raise the highest bucket to around 30 seconds.
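
As a rough check on the sizing discussed in this thread: with a 1 ms start and factor 2, fourteen bounds top out at 8.192 s, and reaching 30 s takes two more doublings. The sketch below counts the bounds needed to cover a target latency. `bucketsNeeded` is a hypothetical helper, and the parameters are the ones from the quoted code comment, not necessarily what was merged (the PR description lists bounds from 0.1 s to 51.2 s).

```go
package main

import "fmt"

// bucketsNeeded returns how many exponentially spaced upper bounds are
// required so that the highest bound is at least target, along with that
// highest bound. The first bound is start; each next bound is multiplied
// by factor. Hypothetical helper for illustration only.
func bucketsNeeded(start, factor, target float64) (int, float64) {
	count := 1
	upperBound := start
	for upperBound < target {
		upperBound *= factor
		count++
	}
	return count, upperBound
}

func main() {
	// Parameters from the review comment: 1 ms start, factor 2, 30 s target.
	count, highest := bucketsNeeded(0.001, 2, 30)
	fmt.Printf("buckets=%d highest=%gs\n", count, highest)
}
```

Under these assumptions, covering 30 s costs two extra bounds over the original fourteen, which is the "couple more metric entries" the reviewer mentions.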

gyuho · 2018-08-15T19:57:51Z

@jpbetz All addressed. PTAL. Thanks.

jpbetz

lgtm

…econds" metric

Currently, only v2 metrics ("stats.FollowerStats") tracks Raft message send latencies. Add Prometheus histogram to track Raft messages for writes, since heartbeats are probed (see etcd-io#10022) and snapshots are already being tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <[email protected]>

gyuho requested review from jpbetz and xiang90 August 12, 2018 05:01

gyuho added the Release-Note label Aug 12, 2018

gyuho force-pushed the snap-metrics branch from b63790b to ee69351 Compare August 13, 2018 07:40

jpbetz suggested changes Aug 15, 2018

gyuho force-pushed the snap-metrics branch from ee69351 to 9adfd9c Compare August 15, 2018 19:53

gyuho added 2 commits August 15, 2018 12:56

etcdserver/api/snap: add v3 snapshot fsync metrics

c392cd2

etcd_snap_db_fsync_duration_seconds_count etcd_snap_db_save_total_duration_seconds_bucket Signed-off-by: Gyuho Lee <[email protected]>

gyuho force-pushed the snap-metrics branch from 9adfd9c to 6f4c509 Compare August 15, 2018 19:57

jpbetz approved these changes Aug 15, 2018

gyuho merged commit 2a6bc7d into etcd-io:master Aug 15, 2018

gyuho deleted the snap-metrics branch August 15, 2018 21:16

gyuho mentioned this pull request Aug 18, 2018

rafthttp: add Raft send latency metric for writes #10023

Closed

This was referenced Aug 28, 2018

Automated cherry pick of #9997 #10041

Merged

Automated cherry pick of #9997 #10042

Merged

Automated cherry pick of #9997 #10043

Merged

gyuho added a commit that referenced this pull request Aug 29, 2018

Merge pull request #10043 from wenjiaswe/automated-cherry-pick-of-#99…

14883ca

…97-upstream-release-3.1 Automated cherry pick of #9997

wenjiaswe added a commit that referenced this pull request Sep 4, 2018

Merge pull request #10042 from wenjiaswe/automated-cherry-pick-of-#99…

9452e5c

…97-upstream-release-3.2 Automated cherry pick of #9997

wenjiaswe added a commit that referenced this pull request Oct 3, 2018

Merge pull request #10041 from wenjiaswe/automated-cherry-pick-of-#99…

cb57901

…97-upstream-release-3.3 Automated cherry pick of #9997

thaJeztah mentioned this pull request Apr 12, 2019

bump github.com/coreos/etcd v3.3.12 moby/swarmkit#2851

Merged
