Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timing stats to publication process #76771

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
151 changes: 151 additions & 0 deletions docs/reference/cluster/nodes-stats.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -2077,6 +2077,157 @@ Number of incompatible differences between published cluster states.
(integer)
Number of compatible differences between published cluster states.
=======

`master_timing`::
(object)
Contains low-level statistics about how long various activities took while the
node was the elected master. Omitted if the node is not master-eligible. Every
field whose name ends in `_time` within this object is also represented as a
raw number of milliseconds in a field whose name ends in `_time_millis`. The
human-readable fields with a `_time` suffix are only returned if requested with
the `?human=true` query parameter.
+
.Properties of `master_timing`
[%collapsible]
=======
`unchanged`::
(object)
Contains statistics about cluster state update attempts that did not change the
cluster state.
+
.Properties of `unchanged`
[%collapsible]
========
`count`::
(long)
The number of cluster state update attempts that did not change the cluster
state since the node started.

`computation_time`::
(<<time-units,time value>>)
The cumulative amount of time spent computing no-op cluster state updates since
the node started.

`notification_time`::
(<<time-units,time value>>)
The cumulative amount of time spent notifying listeners of a no-op cluster
state update since the node started.

========

`success`::
(object)
Contains statistics about cluster state update attempts that did not change the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are update attempts that did change the cluster state successfully?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah yes. Copy/paste might have been involved a bit here.

cluster state.
+
.Properties of `success`
[%collapsible]
========
`count`::
(long)
The number of cluster state update attempts that successfully changed the
cluster state since the node started.

`computation_time`::
(<<time-units,time value>>)
The cumulative amount of time spent computing cluster state updates that were
ultimately successful since the node started.

`publication_time`::
(<<time-units,time value>>)
The cumulative amount of time spent publishing cluster state updates which
ultimately succeeded, which includes everything from the end of the computation
of the new cluster state until the publication has finished and the master node
is ready to start processing the next state update.

`context_construction_time`::
(<<time-units,time value>>)
The cumulative amount of time spent constructing a _publication context_ since
the node started for publications that ultimately succeeded. This statistic
includes the time spent computing the difference between cluster states and
preparing a serialized representation of the state.

`commit_time`::
(<<time-units,time value>>)
The cumulative amount of time spent waiting for a successful cluster state
update to _commit_, which happens when a majority of the master nodes have
written the state to disk.

`completion_time`::
(<<time-units,time value>>)
The cumulative amount of time spent waiting for a successful cluster state
update to _complete_, which happens when all the nodes have applied the cluster
state except for the elected master.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps clarify that these are from publication start, i.e., IIUC includes publication_time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, see bee970a.


`master_apply_time`::
(<<time-units,time value>>)
The cumulative amount of time spent successfully applying cluster state updates
on the elected master since the node started.

`notification_time`::
(<<time-units,time value>>)
The cumulative amount of time spent notifying listeners of a successful cluster
state update since the node started.

========

`failure`::
(object)
Contains statistics about cluster state update attempts that did not
successfully change the cluster state, typically because a new master node was
elected before completion.
+
.Properties of `failure`
[%collapsible]
========
`count`::
(long)
The number of cluster state update attempts that failed to change the cluster
state since the node started.

`computation_time`::
(<<time-units,time value>>)
The cumulative amount of time spent computing cluster state updates that were
ultimately unsuccessful since the node started.

`publication_time`::
(<<time-units,time value>>)
The cumulative amount of time spent publishing cluster state updates which
ultimately failed, which includes everything from the end of the computation of
the new cluster state until the publication has finished and the master node is
ready to start processing the next state update.

`context_construction_time`::
(<<time-units,time value>>)
The cumulative amount of time spent constructing a _publication context_ since
the node started for publications that ultimately failed. This statistic
includes the time spent computing the difference between cluster states and
preparing a serialized representation of the state.

`commit_time`::
(<<time-units,time value>>)
The cumulative amount of time spent waiting for an unsuccessful cluster state
update to _commit_, which happens when a majority of the master nodes have
written the state to disk.

`completion_time`::
(<<time-units,time value>>)
The cumulative amount of time spent waiting for an unsuccessful cluster state
update to _complete_, which happens when all the nodes have applied the cluster
state except for the elected master.

`master_apply_time`::
(<<time-units,time value>>)
The cumulative amount of time spent unsuccessfully applying cluster state
updates on the elected master since the node started.

`notification_time`::
(<<time-units,time value>>)
The cumulative amount of time spent notifying listeners of a failed cluster
state update since the node started.

========
=======
======

[[cluster-nodes-stats-api-response-body-ingest]]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -40,3 +40,99 @@
- gte: { nodes.$master.discovery.published_cluster_states.incompatible_diffs: 0 }
- gte: { nodes.$master.discovery.published_cluster_states.compatible_diffs: 0 }
- is_false: nodes.$master.roles

---
"Master timing stats":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really just asserting that the stats really do come out of the API and that we do the right thing with ?human.

- skip:
features: [arbitrary_key]
version: "- 7.99.99"
reason: "master timing stats added in 8.0.0"

- do:
nodes.info:
node_id: _master
- set:
nodes._arbitrary_key_: master

- do:
nodes.stats:
metric: [ discovery ]

- gte: { nodes.$master.discovery.master_timing.unchanged.count: 0 }
- gte: { nodes.$master.discovery.master_timing.unchanged.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.unchanged.notification_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.count: 0 }
- gte: { nodes.$master.discovery.master_timing.success.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.publication_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.context_construction_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.commit_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.completion_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.master_apply_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.notification_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.count: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.publication_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.context_construction_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.commit_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.completion_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.master_apply_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.notification_time_millis: 0 }

- is_false: nodes.$master.discovery.master_timing.unchanged.computation_time
- is_false: nodes.$master.discovery.master_timing.unchanged.notification_time
- is_false: nodes.$master.discovery.master_timing.success.computation_time
- is_false: nodes.$master.discovery.master_timing.success.publication_time
- is_false: nodes.$master.discovery.master_timing.success.context_construction_time
- is_false: nodes.$master.discovery.master_timing.success.commit_time
- is_false: nodes.$master.discovery.master_timing.success.completion_time
- is_false: nodes.$master.discovery.master_timing.success.master_apply_time
- is_false: nodes.$master.discovery.master_timing.success.notification_time
- is_false: nodes.$master.discovery.master_timing.failure.computation_time
- is_false: nodes.$master.discovery.master_timing.failure.publication_time
- is_false: nodes.$master.discovery.master_timing.failure.context_construction_time
- is_false: nodes.$master.discovery.master_timing.failure.commit_time
- is_false: nodes.$master.discovery.master_timing.failure.completion_time
- is_false: nodes.$master.discovery.master_timing.failure.master_apply_time
- is_false: nodes.$master.discovery.master_timing.failure.notification_time

- do:
nodes.stats:
metric: [ discovery ]
human: true

- gte: { nodes.$master.discovery.master_timing.unchanged.count: 0 }
- gte: { nodes.$master.discovery.master_timing.unchanged.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.unchanged.notification_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.count: 0 }
- gte: { nodes.$master.discovery.master_timing.success.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.publication_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.context_construction_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.commit_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.completion_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.master_apply_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.success.notification_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.count: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.computation_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.publication_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.context_construction_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.commit_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.completion_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.master_apply_time_millis: 0 }
- gte: { nodes.$master.discovery.master_timing.failure.notification_time_millis: 0 }

- is_true: nodes.$master.discovery.master_timing.unchanged.computation_time
- is_true: nodes.$master.discovery.master_timing.unchanged.notification_time
- is_true: nodes.$master.discovery.master_timing.success.computation_time
- is_true: nodes.$master.discovery.master_timing.success.publication_time
- is_true: nodes.$master.discovery.master_timing.success.context_construction_time
- is_true: nodes.$master.discovery.master_timing.success.commit_time
- is_true: nodes.$master.discovery.master_timing.success.completion_time
- is_true: nodes.$master.discovery.master_timing.success.master_apply_time
- is_true: nodes.$master.discovery.master_timing.success.notification_time
- is_true: nodes.$master.discovery.master_timing.failure.computation_time
- is_true: nodes.$master.discovery.master_timing.failure.publication_time
- is_true: nodes.$master.discovery.master_timing.failure.context_construction_time
- is_true: nodes.$master.discovery.master_timing.failure.commit_time
- is_true: nodes.$master.discovery.master_timing.failure.completion_time
- is_true: nodes.$master.discovery.master_timing.failure.master_apply_time
- is_true: nodes.$master.discovery.master_timing.failure.notification_time
Loading