-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stats to record how often the ClusterState diff mechanism is used successfully #26973
Stats to record how often the ClusterState diff mechanism is used successfully #26973
Conversation
… used successfully. It's believed that using diffs obsoletes the other mechanism for reusing the bits of the ClusterState that didn't change between updates, but in fact we don't know for sure how often the diff mechanism works successfully. The stats collected here will tell us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code wise, I left some minor comments. Looks good otherwise.
I have some higher level comments:
- This is currently implemented on the master side. I think it will have more value on the node side - we can see that if a node is struggling it falls back more to full cluster state publishing etc.
- We need a unit test.
- Can you also add a rest test - this will allow use to check we do the right thing in a mixed cluster version (if it's above the right version).
- think this can go into 6.1.
@@ -90,6 +94,10 @@ | |||
private final IncomingClusterStateListener incomingClusterStateListener; | |||
private final DiscoverySettings discoverySettings; | |||
|
|||
private long fullClusterStateSentCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should probably be atomic...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sneaky. 😇
/** | ||
* Class encapsulating stats about the PublishClusterStateAction | ||
*/ | ||
public class PublishClusterStateStats implements Writeable, ToXContentFragment { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want ToXContentObject here.
private final long clusterStateDiffSentCount; | ||
private final long incompatibleClusterStateDiffVersionCount; | ||
|
||
public PublishClusterStateStats(long fullClusterStateSentCount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you java doc the parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Params or getters? Or both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @bleskes comments, I left a few more.
this.incompatibleClusterStateDiffVersionCount = incompatibleClusterStateDiffVersionCount; | ||
} | ||
|
||
public PublishClusterStateStats(StreamInput streamInput) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We normally name these in
, it leaves for shorter code and it would be consistent with other places in the codebase.
out.writeVLong(incompatibleClusterStateDiffVersionCount); | ||
} | ||
|
||
static final class Fields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been moving away from these inner fields classes, we don't need them to encapsulate some strings.
Thanks. I think I've addressed all the comments except for the ones asking for more tests, for which I need a bit more guidance. Not sure I like the inconsistency between |
There is more work to be done in the 6.x branch before the version guard can be reduced like this. This reverts commit f004119.
@jasontedor @bleskes Could you have another look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @DaveCTurner . This looks great. I left some very minor feedback
@@ -90,6 +98,10 @@ | |||
private final IncomingClusterStateListener incomingClusterStateListener; | |||
private final DiscoverySettings discoverySettings; | |||
|
|||
private AtomicLong fullClusterStateReceivedCount = new AtomicLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please make these final?
@@ -90,6 +98,10 @@ | |||
private final IncomingClusterStateListener incomingClusterStateListener; | |||
private final DiscoverySettings discoverySettings; | |||
|
|||
private AtomicLong fullClusterStateReceivedCount = new AtomicLong(); | |||
private AtomicLong clusterStateDiffReceivedCount = new AtomicLong(); | |||
private AtomicLong compatibleClusterStateDiffReceivedCount = new AtomicLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find having a incompatibleDiff count more handy as it's what we'll be looking for. Maybe just have compatibleDiffCount and incompatibleDiffCount (instead of total)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. How picky do we want to be about concurrency here? I can do this by counting the cluster state diff as incompatible until successfully applied (i.e. no exceptions thrown) and then to decrement the incompatible
count and increment the compatible
one, but that does mean you could see incorrect values if you get the stats at the wrong moment.
On second thought the stats()
method is already a bit racy.
I have various other, more complicated, ideas:
- counting more things, to distinguish in-progress applications from rejected ones.
- catching an
IncompatibleClusterStateVersionException
on the way past and updating the stats in the handler. - keeping the accumulators as-is and calculating the difference in the
stats()
method (potentially also racy, not sure yet). - introducing locks around stats reads and updates.
Any particular preferences?
@@ -26,4 +28,6 @@ | |||
- is_false: nodes.$master.name | |||
- is_false: nodes.$master.jvm | |||
- is_true: nodes.$master.discovery | |||
- is_true: nodes.$master.discovery.cluster_state_queue | |||
- is_true: nodes.$master.discovery.published_cluster_states_received |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also check the sub elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is_true
fails on the subfields because 0 is falsey. I'm leaning towards saying gte: 0
as a proxy for "is present and numeric".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. If Ci is happy I’m happy.
|
||
@Override | ||
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException { | ||
builder.startObject("published_cluster_states_received"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - strictly speaking these are publishing stats, can we open the object with just published cluster stats (drop received). You can maybe received back in the keys, which can be shorted by dropping the cluster states from the key names - it’s implied from the object they’re in.
@bleskes squashed remaining nits. @jasontedor would you have a look too? |
@jasontedor is this ok with you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more comments.
@@ -64,4 +80,5 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws | |||
public PendingClusterStateStats getQueueStats() { | |||
return queueStats; | |||
} | |||
public PublishClusterStateStats getPublishStats() { return publishStats; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add a newline before this method, and wrap the method definition over multiple lines please?
return builder; | ||
} | ||
|
||
public long getFullClusterStateReceivedCount() { return fullClusterStateReceivedCount; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can be package-private.
|
||
public long getFullClusterStateReceivedCount() { return fullClusterStateReceivedCount; } | ||
|
||
public long getIncompatibleClusterStateDiffReceivedCount() { return incompatibleClusterStateDiffReceivedCount; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can be package-private.
|
||
public long getIncompatibleClusterStateDiffReceivedCount() { return incompatibleClusterStateDiffReceivedCount; } | ||
|
||
public long getCompatibleClusterStateDiffReceivedCount() { return compatibleClusterStateDiffReceivedCount; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can be package-private.
builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount); | ||
builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount); | ||
builder.endObject(); | ||
return builder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can do something like this:
diff --git a/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java b/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
index 356b9a29dc..36794e880f 100644
--- a/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
+++ b/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
@@ -65,9 +65,11 @@ public class PublishClusterStateStats implements Writeable, ToXContentObject {
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject("published_cluster_states");
- builder.field("full_states", fullClusterStateReceivedCount);
- builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount);
- builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount);
+ {
+ builder.field("full_states", fullClusterStateReceivedCount);
+ builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount);
+ builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount);
+ }
builder.endObject();
return builder;
}
which makes the JSON-structure clearer in the code.
@@ -413,7 +413,8 @@ public void onNewClusterStateFailed(Exception e) { | |||
@Override | |||
public DiscoveryStats stats() { | |||
PendingClusterStateStats queueStats = pendingStatesQueue.stats(); | |||
return new DiscoveryStats(queueStats); | |||
PublishClusterStateStats publishStats = publishClusterState.stats(); | |||
return new DiscoveryStats(queueStats, publishStats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think:
diff --git a/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java b/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
index 5f62ee6b97..d688a5d5cd 100644
--- a/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
+++ b/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -412,9 +412,7 @@ public class ZenDiscovery extends AbstractLifecycleComponent implements Discover
@Override
public DiscoveryStats stats() {
- PendingClusterStateStats queueStats = pendingStatesQueue.stats();
- PublishClusterStateStats publishStats = publishClusterState.stats();
- return new DiscoveryStats(queueStats, publishStats);
+ return new DiscoveryStats(pendingStatesQueue.stats(), publishClusterState.stats());
}
public DiscoverySettings getDiscoverySettings() {
is fine?
All addressed @jasontedor - could you review again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks both. The backport is on its way. |
…cessfully (elastic#26973) It's believed that using diffs obsoletes the other mechanism for reusing the bits of the ClusterState that didn't change between updates, but in fact we don't know for sure how often the diff mechanism works successfully. The stats collected here will tell us.
…cessfully (#27107) It's believed that using diffs obsoletes the other mechanism for reusing the bits of the ClusterState that didn't change between updates, but in fact we don't know for sure how often the diff mechanism works successfully. The stats collected here will tell us. Backport of #26973 to 6.x
* master: Ignore .DS_Store files on macOS Docs: Fix ingest geoip config location (elastic#27110) Adjust SHA-512 supported format on plugin install Make ShardSearchTarget optional when parsing ShardSearchFailure (elastic#27078) [Docs] Clarify mapping `index` option default (elastic#27104) Decouple BulkProcessor from ThreadPool (elastic#26727) Stats to record how often the ClusterState diff mechanism is used successfully (elastic#26973) Tie-break shard path decision based on total number of shards on path (elastic#27039)
It's believed that using diffs obsoletes the other mechanism for reusing the
bits of the ClusterState that didn't change between updates, but in fact we
don't know for sure how often the diff mechanism works successfully. The stats
collected here will tell us.
PR opened for comments on approach, and because I'm about to get on a plane so it's either now or much later. I think I'd like to add stats on how often these things are received as well as sent in a similar fashion.