Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stats to record how often the ClusterState diff mechanism is used successfully #26973

Merged
merged 33 commits into from
Oct 25, 2017

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Oct 11, 2017

It's believed that using diffs obsoletes the other mechanism for reusing the
bits of the ClusterState that didn't change between updates, but in fact we
don't know for sure how often the diff mechanism works successfully. The stats
collected here will tell us.

PR opened for comments on approach, and because I'm about to get on a plane so it's either now or much later. I think I'd like to add stats on how often these things are received as well as sent in a similar fashion.

… used successfully.

It's believed that using diffs obsoletes the other mechanism for reusing the
bits of the ClusterState that didn't change between updates, but in fact we
don't know for sure how often the diff mechanism works successfully. The stats
collected here will tell us.
@DaveCTurner DaveCTurner added :Data Management/Stats Statistics tracking and retrieval APIs v7.0.0 labels Oct 11, 2017
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code wise, I left some minor comments. Looks good otherwise.

I have some higher level comments:

  1. This is currently implemented on the master side. I think it will have more value on the node side - we can see that if a node is struggling it falls back more to full cluster state publishing etc.
  2. We need a unit test.
  3. Can you also add a rest test - this will allow use to check we do the right thing in a mixed cluster version (if it's above the right version).
  4. think this can go into 6.1.

@@ -90,6 +94,10 @@
private final IncomingClusterStateListener incomingClusterStateListener;
private final DiscoverySettings discoverySettings;

private long fullClusterStateSentCount = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably be atomic...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's sneaky. 😇

/**
* Class encapsulating stats about the PublishClusterStateAction
*/
public class PublishClusterStateStats implements Writeable, ToXContentFragment {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want ToXContentObject here.

private final long clusterStateDiffSentCount;
private final long incompatibleClusterStateDiffVersionCount;

public PublishClusterStateStats(long fullClusterStateSentCount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you java doc the parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Params or getters? Or both?

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @bleskes comments, I left a few more.

this.incompatibleClusterStateDiffVersionCount = incompatibleClusterStateDiffVersionCount;
}

public PublishClusterStateStats(StreamInput streamInput) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We normally name these in, it leaves for shorter code and it would be consistent with other places in the codebase.

out.writeVLong(incompatibleClusterStateDiffVersionCount);
}

static final class Fields {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been moving away from these inner fields classes, we don't need them to encapsulate some strings.

@DaveCTurner
Copy link
Contributor Author

Thanks. I think I've addressed all the comments except for the ones asking for more tests, for which I need a bit more guidance. Not sure I like the inconsistency between incompatibleClusterStateDiffsSentCount and compatibleClusterStateDiffReceivedCount but they were the cleanest things to measure given the existing code. I could do something more complicated to smooth that out if desired.

There is more work to be done in the 6.x branch before the version guard can be
reduced like this.

This reverts commit f004119.
@DaveCTurner
Copy link
Contributor Author

@jasontedor @bleskes Could you have another look at this?

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DaveCTurner . This looks great. I left some very minor feedback

@@ -90,6 +98,10 @@
private final IncomingClusterStateListener incomingClusterStateListener;
private final DiscoverySettings discoverySettings;

private AtomicLong fullClusterStateReceivedCount = new AtomicLong();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please make these final?

@@ -90,6 +98,10 @@
private final IncomingClusterStateListener incomingClusterStateListener;
private final DiscoverySettings discoverySettings;

private AtomicLong fullClusterStateReceivedCount = new AtomicLong();
private AtomicLong clusterStateDiffReceivedCount = new AtomicLong();
private AtomicLong compatibleClusterStateDiffReceivedCount = new AtomicLong();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find having a incompatibleDiff count more handy as it's what we'll be looking for. Maybe just have compatibleDiffCount and incompatibleDiffCount (instead of total)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. How picky do we want to be about concurrency here? I can do this by counting the cluster state diff as incompatible until successfully applied (i.e. no exceptions thrown) and then to decrement the incompatible count and increment the compatible one, but that does mean you could see incorrect values if you get the stats at the wrong moment.

On second thought the stats() method is already a bit racy.

I have various other, more complicated, ideas:

  • counting more things, to distinguish in-progress applications from rejected ones.
  • catching an IncompatibleClusterStateVersionException on the way past and updating the stats in the handler.
  • keeping the accumulators as-is and calculating the difference in the stats() method (potentially also racy, not sure yet).
  • introducing locks around stats reads and updates.

Any particular preferences?

@@ -26,4 +28,6 @@
- is_false: nodes.$master.name
- is_false: nodes.$master.jvm
- is_true: nodes.$master.discovery
- is_true: nodes.$master.discovery.cluster_state_queue
- is_true: nodes.$master.discovery.published_cluster_states_received
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check the sub elements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is_true fails on the subfields because 0 is falsey. I'm leaning towards saying gte: 0 as a proxy for "is present and numeric".

@DaveCTurner
Copy link
Contributor Author

@bleskes changed in b589f36 as requested. I think that's everything.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. If Ci is happy I’m happy.


@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject("published_cluster_states_received");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - strictly speaking these are publishing stats, can we open the object with just published cluster stats (drop received). You can maybe received back in the keys, which can be shorted by dropping the cluster states from the key names - it’s implied from the object they’re in.

@DaveCTurner
Copy link
Contributor Author

@bleskes squashed remaining nits. @jasontedor would you have a look too?

@DaveCTurner
Copy link
Contributor Author

@jasontedor is this ok with you?

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments.

@@ -64,4 +80,5 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
public PendingClusterStateStats getQueueStats() {
return queueStats;
}
public PublishClusterStateStats getPublishStats() { return publishStats; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a newline before this method, and wrap the method definition over multiple lines please?

return builder;
}

public long getFullClusterStateReceivedCount() { return fullClusterStateReceivedCount; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be package-private.


public long getFullClusterStateReceivedCount() { return fullClusterStateReceivedCount; }

public long getIncompatibleClusterStateDiffReceivedCount() { return incompatibleClusterStateDiffReceivedCount; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be package-private.


public long getIncompatibleClusterStateDiffReceivedCount() { return incompatibleClusterStateDiffReceivedCount; }

public long getCompatibleClusterStateDiffReceivedCount() { return compatibleClusterStateDiffReceivedCount; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be package-private.

builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount);
builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount);
builder.endObject();
return builder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can do something like this:

diff --git a/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java b/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
index 356b9a29dc..36794e880f 100644
--- a/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
+++ b/core/src/main/java/org/elasticsearch/discovery/zen/PublishClusterStateStats.java
@@ -65,9 +65,11 @@ public class PublishClusterStateStats implements Writeable, ToXContentObject {
     @Override
     public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
         builder.startObject("published_cluster_states");
-        builder.field("full_states", fullClusterStateReceivedCount);
-        builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount);
-        builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount);
+        {
+            builder.field("full_states", fullClusterStateReceivedCount);
+            builder.field("incompatible_diffs", incompatibleClusterStateDiffReceivedCount);
+            builder.field("compatible_diffs", compatibleClusterStateDiffReceivedCount);
+        }
         builder.endObject();
         return builder;
     }

which makes the JSON-structure clearer in the code.

@@ -413,7 +413,8 @@ public void onNewClusterStateFailed(Exception e) {
@Override
public DiscoveryStats stats() {
PendingClusterStateStats queueStats = pendingStatesQueue.stats();
return new DiscoveryStats(queueStats);
PublishClusterStateStats publishStats = publishClusterState.stats();
return new DiscoveryStats(queueStats, publishStats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think:

diff --git a/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java b/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
index 5f62ee6b97..d688a5d5cd 100644
--- a/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
+++ b/core/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
@@ -412,9 +412,7 @@ public class ZenDiscovery extends AbstractLifecycleComponent implements Discover
 
     @Override
     public DiscoveryStats stats() {
-        PendingClusterStateStats queueStats = pendingStatesQueue.stats();
-        PublishClusterStateStats publishStats = publishClusterState.stats();
-        return new DiscoveryStats(queueStats, publishStats);
+        return new DiscoveryStats(pendingStatesQueue.stats(), publishClusterState.stats());
     }
 
     public DiscoverySettings getDiscoverySettings() {

is fine?

@DaveCTurner
Copy link
Contributor Author

All addressed @jasontedor - could you review again?

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@DaveCTurner DaveCTurner merged commit cc3364e into elastic:master Oct 25, 2017
@DaveCTurner
Copy link
Contributor Author

Thanks both. The backport is on its way.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 25, 2017
…cessfully (elastic#26973)

It's believed that using diffs obsoletes the other mechanism for reusing the
bits of the ClusterState that didn't change between updates, but in fact we
don't know for sure how often the diff mechanism works successfully. The stats
collected here will tell us.
DaveCTurner added a commit that referenced this pull request Oct 25, 2017
…cessfully (#27107)

It's believed that using diffs obsoletes the other mechanism for reusing the bits of the ClusterState that didn't change between updates, but in fact we don't know for sure how often the diff mechanism works successfully. The stats collected here will tell us.

Backport of #26973 to 6.x
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 26, 2017
* master:
  Ignore .DS_Store files on macOS
  Docs: Fix ingest geoip config location (elastic#27110)
  Adjust SHA-512 supported format on plugin install
  Make ShardSearchTarget optional when parsing ShardSearchFailure (elastic#27078)
  [Docs] Clarify mapping `index` option default (elastic#27104)
  Decouple BulkProcessor from ThreadPool (elastic#26727)
  Stats to record how often the ClusterState diff mechanism is used successfully (elastic#26973)
  Tie-break shard path decision based on total number of shards on path (elastic#27039)
@DaveCTurner DaveCTurner deleted the 2017-10-11-discovery-stats branch July 23, 2022 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants